Blog Post 4: Exploring and Using OpenRefine

OpenRefine will be useful for cleaning up our dataset for the Nixon White House Recordings. I personally do not find it very intuitive; however, the tutorial was very helpful in showing how to use it. Because my data is extremely large (over 20,000 entries), it is bound to contain some errors. Our data is split into many different columns:

  • Conversation title
  • Tape number
  • Conversation number
  • Identifier
  • Start date time
  • End date time
  • Start date
  • End date
  • Start time
  • End time
  • And many more…

It feels like our dataset is too large, and we are struggling to find the trends we are looking for because of how vast it is. One idea we had for shrinking the number of calls we have to deal with is to ignore all the calls that were to the White House operator and only include the calls where Nixon was actually speaking.

Though our dataset is already in chronological order, it will be useful to use the isolating tool to separate out data that is less pertinent to our research question. The trends in phone calls are very interesting to us. The length of the conversation would also be an interesting thing to filter on, getting rid of all the calls that last one minute or less. Using the split multi-valued cells function in OpenRefine, we could isolate the calls that are with Richard Nixon only, thus limiting our dataset and allowing for clearer results.
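As a rough illustration of the filtering idea above, here is a minimal Python sketch of what the same logic could look like outside OpenRefine. The column names "Participants" and "Length (minutes)" are assumptions for the example (our real columns may be named differently), and the sample rows are made up, not taken from the recordings.

```python
def keep_call(row, min_minutes=1.0):
    """Decide whether a call is worth keeping for analysis.

    Assumes two hypothetical columns: "Participants" (free text) and
    "Length (minutes)" (a number, or None when it is unknown).
    """
    participants = row.get("Participants", "").lower()
    # Drop calls that went to the White House operator.
    if "white house operator" in participants:
        return False
    length = row.get("Length (minutes)")
    # Drop calls with no known length and calls of one minute or less.
    return length is not None and length > min_minutes


# Made-up sample rows, for illustration only.
sample_calls = [
    {"Conversation title": "Example A", "Participants": "Richard Nixon; White House operator", "Length (minutes)": 0.5},
    {"Conversation title": "Example B", "Participants": "Richard Nixon; Example Person", "Length (minutes)": 12.0},
    {"Conversation title": "Example C", "Participants": "Richard Nixon; Example Person", "Length (minutes)": 1.0},
    {"Conversation title": "Example D", "Participants": "Richard Nixon; Example Person", "Length (minutes)": None},
]

kept = [row for row in sample_calls if keep_call(row)]
kept_titles = [row["Conversation title"] for row in kept]
print(kept_titles)
```

The one-minute cutoff is a parameter, so it is easy to try a different threshold and see how much the dataset shrinks.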

OpenRefine will also be useful for making column values lowercase and for catching the errors that I expect there will be, given the size of the data. The one thing that I know my group is struggling with is how to add another column that combines the start time and the end time to get the length of the conversation. If anyone has figured this out in OpenRefine or some other data editing tool, I would love to hear from you!
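For anyone with the same question, here is a minimal Python sketch of one way the length column could be computed from an exported copy of the data. It assumes the "Start date time" and "End date time" values are ISO 8601 strings such as 1971-02-16T07:56:00, which may not match our actual export. The "Length (minutes)" column name and the sample rows are made up for illustration.

```python
from datetime import datetime


def call_length_minutes(start, end):
    """Return the length of a call in minutes from two ISO 8601 timestamps."""
    started = datetime.fromisoformat(start)
    ended = datetime.fromisoformat(end)
    return (ended - started).total_seconds() / 60


def add_length_column(rows, start_key="Start date time", end_key="End date time"):
    """Return copies of the rows with an added "Length (minutes)" column.

    Rows with a missing or unparseable timestamp get None, so they can be
    reviewed by hand instead of silently producing a wrong number.
    """
    result = []
    for row in rows:
        new_row = dict(row)
        try:
            new_row["Length (minutes)"] = call_length_minutes(row[start_key], row[end_key])
        except (KeyError, ValueError):
            new_row["Length (minutes)"] = None
        result.append(new_row)
    return result


# Made-up sample rows, for illustration only.
sample = [
    {"Conversation title": "Example A", "Start date time": "1971-02-16T07:56:00", "End date time": "1971-02-16T08:01:30"},
    {"Conversation title": "Example B", "Start date time": "1971-02-16T23:58:00", "End date time": "1971-02-17T00:03:00"},
    {"Conversation title": "Example C", "Start date time": "", "End date time": "1971-02-17T00:03:00"},
]

with_lengths = add_length_column(sample)
for row in with_lengths:
    print(row["Conversation title"], row["Length (minutes)"])
```

Subtracting the combined date-time values, rather than the separate time columns, keeps calls that run past midnight from coming out with a negative length.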

 

3 comments

  1. Hi, I believe you could try exporting the two columns, creating a third column in Excel with a subtraction formula, and then reuploading the columns. I am not sure what the equivalent method in OpenRefine is, but I’m sure there is one. I would suggest looking up equations in OpenRefine. Best of luck!

  2. Hi Samantha! I thought the use of the isolating tool mentioned here was very interesting. I agree with the comment above; I think that adding another column that reflects the length of the conversation would probably be best done in another program besides OpenRefine. For my group’s data, I realised that there were some things that OpenRefine likely could not do, and I would have to use Excel or another platform instead. So, as the comment above suggested, maybe you can try using Excel too. Good luck!

  3. I think OpenRefine would be a great tool to have when organizing such a large dataset as yours. It reduces the need for your group to manually go through and analyze each entry (20,000 entries is a lot). Perhaps you guys can find a way to use the start and end time of the recordings to get the length of calls, which would be much more digestible.
