Week 4: OpenRefine

Our dataset contains information about inmates at Eastern State Penitentiary and is split across two spreadsheets. It had a few basic issues. One column combined ethnicity, religion, and occupation in a single field. Not every row had a value for every column, so there were a lot of empty cells that we needed to get rid of. Some columns also held data that was hard to categorize: they contained a free-text description of each prisoner, so each cell read more like the answer to an open-ended question than a value from a fixed set.

To fix some of these issues, we would need to remove the blank cells, split the combined ethnicity, religion, and occupation column into separate columns, and find a way to code or organize the very detailed, open-ended columns.

The one obvious OpenRefine operation we would need to run is splitting the multi-valued column. To do that, we would open the column’s menu, select Edit column, then Split into several columns, and choose a separator. The issue is that I’m not sure how we would separate the values, because there isn’t a comma or any other delimiter between them. We might also consider using the cluster function to clean up some of the data in the ethnicity column.
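One possible way around the missing delimiter, sketched here in Python rather than in OpenRefine itself, is to match each cell against lists of values we expect to see. The vocabularies and the sample cell below are made up for illustration and are not taken from our dataset. The second function is only a simplified take on the fingerprint keying idea behind OpenRefine's clustering, not its exact implementation.

```python
import re
import string

# Hypothetical vocabularies; real ones would come from inspecting the column.
ETHNICITIES = ["Irish", "German", "English"]
RELIGIONS = ["Catholic", "Protestant", "Quaker"]


def split_combined(cell):
    """Split a cell like 'Irish Catholic Laborer' into three fields.

    The first known ethnicity and religion found are pulled out, and
    whatever text is left over is treated as the occupation.
    """
    rest = cell
    found = {"ethnicity": "", "religion": ""}
    for field, vocab in (("ethnicity", ETHNICITIES), ("religion", RELIGIONS)):
        for term in vocab:
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
            if pattern.search(rest):
                found[field] = term
                rest = pattern.sub("", rest, count=1)
                break
    found["occupation"] = " ".join(rest.split())
    return found


def fingerprint(value):
    """Simplified clustering key: lowercase, strip punctuation, sort unique tokens."""
    cleaned = value.strip().lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


print(split_combined("Irish Catholic Laborer"))
print(fingerprint("Irish.") == fingerprint(" irish"))
```

Values that produce the same fingerprint would be candidates for merging into one spelling. Cells that match none of the vocabularies keep all of their text in the occupation field, so they would still need a manual look.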

Something I’d like to be able to do is delete all the empty cells automatically instead of having to go in and delete every single one by hand. It would also be helpful to turn the more detailed columns into something simpler, for example by extracting specific words or phrases from them, to help us better understand our data.
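As a rough sketch of both wishes, again in Python and assuming the data has been exported as rows with column names: the first function drops rows that are blank in the columns we care about, and the second pulls chosen keywords out of free text. The column names, sample rows, and keyword list are invented for illustration.

```python
import re

# Made-up sample rows standing in for an exported spreadsheet.
rows = [
    {"name": "A. Smith", "occupation": "Laborer", "description": "Scar on left hand; tattoo of anchor"},
    {"name": "B. Jones", "occupation": "", "description": "No marks"},
    {"name": "C. Brown", "occupation": "Weaver", "description": "   "},
]

# Hypothetical keywords we might want to pull out of the descriptions.
KEYWORDS = ["scar", "tattoo"]


def drop_blank_rows(rows, required):
    """Keep only rows where every required column has non-whitespace text."""
    return [r for r in rows if all((r.get(col) or "").strip() for col in required)]


def extract_keywords(text, keywords):
    """Return the keywords that appear as whole words in text, in list order."""
    return [k for k in keywords if re.search(r"\b" + re.escape(k) + r"\b", text, re.IGNORECASE)]


complete = drop_blank_rows(rows, ["occupation", "description"])
for row in complete:
    # Store the matched keywords in a new, simpler column.
    row["marks"] = extract_keywords(row["description"], KEYWORDS)
    print(row["name"], row["marks"])
```

Dropping whole rows is a blunt choice; if a blank in one column should not cost us the rest of the record, the list of required columns can be kept short.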
