OpenRefine: “Library Book Acquisition in Osage, Iowa” (Spencer Chau)

OpenRefine is an extremely helpful tool for analyzing datasets, especially when a dataset is very large, incomplete, or disorganized.

Our group’s dataset is “Library Book Acquisition in Osage, Iowa,” which consists of around 40,000 book entries. Each book’s record is divided into fields that include the title, ID number, publisher, year of publication, and year of acquisition. However, many records throughout the dataset are incomplete, especially in the fields for age, race, and language. OpenRefine makes it quick and easy to clean up data (e.g. merging categories that are the same but spelled differently), trim stray whitespace, and edit many records at once. The application also makes navigating the large dataset by category much more convenient, because it lets users view only the records that share certain characteristics.
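
To make the merging idea concrete, here is a minimal Python sketch of key-collision clustering, written in the spirit of OpenRefine's "fingerprint" method rather than as a copy of its implementation. The publisher names are made-up sample values, not rows from our dataset, and the function names are my own.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a simplified normalization key for a cell value."""
    # Trim whitespace, lowercase, and fold accented letters to plain ASCII.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation so "Brothers," and "Brothers" match.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into words, drop duplicates, and sort so word order is ignored.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a key; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical sample values for illustration only.
publishers = [
    "Harper & Brothers",
    " harper & brothers",
    "Brothers, Harper &",
    "Macmillan",
    "MacMillan ",
    "Scribner",
]

for group in cluster(publishers):
    print(group)
```

Each printed group is a set of spellings that would be candidates for merging into one category. In OpenRefine itself, the user reviews the suggested clusters and chooses the final spelling.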

One of our preliminary research questions is “Why is there an influx of autobiographies and biographies compared to other genres such as science and psychology books?” With roughly 40,000 entries, the dataset very likely contains typing inconsistencies and spacing errors. With OpenRefine, it will be much easier for us to merge differently spelled publishers, authors, and book titles into fewer, clearer categories for analysis. In addition, by filtering down to particular category values (e.g. a certain year or publisher), we can compare trends in this influx much more clearly and easily, and these features may even lead us to notice new and surprising findings.
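
The filtering described above works roughly like a facet: count how often each value appears in a column, then narrow the rows to one value. Below is a small Python sketch of that idea. The records, field names, and numbers are invented for illustration and do not come from the Osage dataset.

```python
from collections import Counter

# Hypothetical records; the real dataset's column names may differ.
records = [
    {"genre": "Biography", "year_acquired": 1895},
    {"genre": "Biography", "year_acquired": 1895},
    {"genre": "Science", "year_acquired": 1895},
    {"genre": "Biography", "year_acquired": 1896},
    {"genre": "Psychology", "year_acquired": 1896},
]


def text_facet(rows, column):
    """Count how many rows hold each distinct value in the given column."""
    return Counter(row[column] for row in rows)


def filter_rows(rows, column, value):
    """Keep only the rows whose column equals the chosen value."""
    return [row for row in rows if row[column] == value]


# Overall counts per genre, similar to a facet panel.
print(text_facet(records, "genre"))

# Narrow to one genre, then facet by acquisition year to see the trend.
biographies = filter_rows(records, "genre", "Biography")
print(text_facet(biographies, "year_acquired"))
```

Combining the two steps (select a genre, then count by year) is the kind of comparison we hope will show when and how strongly the influx of biographies occurred.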

One comment

  1. Your blog post clearly illustrates what you plan to do with your data and the related research question. Let’s explore more about how to use this tool throughout the course. :)
