OpenRefine – Introduction to Digital Humanities

Our group’s data set (Osage Public Library Books) was long enough to cause some confusion as we tried to tackle the research questions. Some entries were missing values in certain categories, there were duplicate book titles, and we were unsure how to cut down the data: by year? By publisher? OpenRefine seems like a promising tool to clean up the data and help make more sense of the information we have.

For starters, OpenRefine would help us quickly find the duplicate book titles and see how many copies of each book the library has. This information could provide insight into the subject matter that was popular during the decade a book was acquired, and possibly into the political or cultural environment of that decade. A simpler feature that would also be useful is making each data entry follow a more uniform template (e.g. capitalizing each word in the title). OpenRefine also has a more intuitive and user-friendly approach to organizing the data into subcategories within each column, which would help my group see which publishers appear most often and which authors were popular.
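As a rough illustration of the same ideas outside OpenRefine, here is a minimal Python sketch of normalizing titles, counting duplicate titles, and tallying publishers. It is not how OpenRefine works internally, and the sample rows and the column names `title` and `publisher` are made up for the example; our actual spreadsheet may be laid out differently.

```python
from collections import Counter
from string import capwords

# Hypothetical rows standing in for the library records.
# The column names "title" and "publisher" are assumptions.
records = [
    {"title": "little women", "publisher": "Roberts Brothers"},
    {"title": "Little Women", "publisher": "Roberts Brothers"},
    {"title": "  the call of the wild", "publisher": "Macmillan"},
    {"title": "Treasure Island", "publisher": "Cassell"},
]


def normalize_title(title):
    """Trim extra whitespace and capitalize each word in the title."""
    return capwords(title)


# Count how many times each normalized title appears.
title_counts = Counter(normalize_title(row["title"]) for row in records)

# Keep only the titles that appear more than once.
duplicates = {title: n for title, n in title_counts.items() if n > 1}

# Tally how often each publisher appears, similar to grouping a column by value.
publisher_counts = Counter(row["publisher"] for row in records)

print(duplicates)
print(publisher_counts.most_common())
```

Normalizing the titles before counting matters here: without it, "little women" and "Little Women" would be treated as two different books.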

It would also be nice to group our data into categories based on each book’s topic. To do that, however, we would need to research the subject matter of every book in the dataset, which would take a long time. As far as I can tell, OpenRefine is a tool for organizing the data we already have, so it would not be able to analyze each title and pull in outside information about the book for us. I am not sure whether it has an option like that, but categorizing the data by topic or subject matter is something I would still like to be able to do.
