Applying OpenRefine to Science Fiction Dataset

After learning how to use some of the OpenRefine features, I found that these could be applied to our group’s dataset too. The dataset we got had a lot of columns which appeared to contain the same information in every cell (e.g. Source Type, Subject Topic). We wanted to get rid of many of these columns, as they did not provide us with any useful information, but it was hard to look through all of them to make sure they really said the same thing in every cell, because our dataset has a lot of records. So I think one really helpful tool in OpenRefine will be the Facet tool: we can use it to verify that a column has only one facet before deleting it, so that we do not accidentally delete any potentially useful information.
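To make the "only one facet" check concrete, here is a minimal sketch of the same idea in Python rather than in OpenRefine itself. It assumes the dataset has been exported to CSV and loaded as a list of dictionaries (for example with csv.DictReader). The function names and the sample rows are made up for illustration and are not taken from our real dataset.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value occurs in one column, like a text facet."""
    return Counter((row.get(column) or "").strip() for row in rows)


def is_constant(rows, column):
    """True when the column holds exactly one distinct value across all rows."""
    return len(text_facet(rows, column)) == 1


# Made-up sample rows for illustration only; not taken from the real dataset.
sample_rows = [
    {"Source Type": "Fanzine", "Location": "Chicago"},
    {"Source Type": "Fanzine", "Location": "New York"},
    {"Source Type": "Fanzine", "Location": "Chicago"},
]

# Columns that say the same thing in every cell are candidates for deletion.
constant_columns = [c for c in sample_rows[0] if is_constant(sample_rows, c)]
print(constant_columns)
```

Counting the values first, instead of deleting straight away, mirrors what the Facet tool shows: if a column turns out to have more than one value, the counts show what would have been lost.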

There are also a lot of columns whose values we will need to sort into groups, such as Location. Using the Facet tool again, we can break the data down into smaller groups to make it easier to visualise and analyze.

One function I would like to perform, but which I will need to look into further, is deleting columns which are completely blank. Our dataset has a lot of content types, and a lot of these content type columns are completely blank (possibly for recording future information which has not been collected yet). For our purposes these columns are not helpful, so I will look into whether there is a feature to delete all empty columns in one click, instead of having to use the Facet tool on each column one by one to see if it is blank. This would give us a much more concise dataset to work with, so that it will be easier to generate data visualizations.
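I have not confirmed whether OpenRefine can do this in one click, but as a fallback the same clean-up can be sketched in Python on an exported CSV. This is only a rough sketch under assumptions: the rows are a list of dictionaries (as csv.DictReader would produce), a cell counts as blank if it is missing or contains only whitespace, and the helper names and sample rows are hypothetical.

```python
def blank_columns(rows, columns):
    """Return the columns whose cells are missing or whitespace-only in every row."""
    return [
        c for c in columns
        if all(not (row.get(c) or "").strip() for row in rows)
    ]


def drop_columns(rows, to_drop):
    """Return copies of the rows without the given columns."""
    to_drop = set(to_drop)
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


# Made-up sample rows for illustration only; the column names are placeholders.
sample_rows = [
    {"Subject name": "A. Reader", "Content Type A": "", "Content Type B": "Letter"},
    {"Subject name": "B. Writer", "Content Type A": "  ", "Content Type B": ""},
]

empty = blank_columns(sample_rows, list(sample_rows[0]))
cleaned = drop_columns(sample_rows, empty)
print(empty)
print(list(cleaned[0]))
```

A column is only dropped when every single cell is blank, so a column like the placeholder "Content Type B", which is empty in some rows but not all, is kept.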

One other thing I would like to figure out is how to handle information in our dataset that is not easy to split into groups using just the Facet tool, such as Subject name. Our group is interested in splitting the names into gender categories so that we can analyze the proportion of female and male attendees at various conventions. However, I am not sure how we can do this with OpenRefine.

One comment

  1. Hi, I think it is really great that you explained your idea in this blog post. I think it is interesting that you brought up wishing there were a feature that could delete all the columns in one click instead of having to do it one by one. I think this will really help me analyze my dataset, now that you have mentioned these issues in the blog.
