Formatting Data – Nixon Audio Tapes

Inspired by the wise words of Uncle Ben from Spider-Man, with great datasets comes great responsibility. After asking around about other groups' datasets, I quickly realized that the Nixon Tapes are one of the largest datasets that any group has to work with. Not going to lie, just going through the CSV file of the records of the audiotapes, not even the audiotapes themselves, was so intimidating that my laptop gave up and shut down on me. While I was waiting for my laptop to restart, I thought a lot about how important it is to balance the humanities needs of a dataset with its business needs. The humanities need, with this dataset, is to show the emotional stress and paranoia that President Nixon often exhibited. The business need is to represent these human elements in a consumable manner.

After I restarted my laptop, closed a couple of applications, and opened up the CSV file again, I thought a lot about how to organize the data. One of the first things I noticed was that the duration of each conversation would be useful for our dataset. Additionally, separating out the details of the start and end date and time would be helpful. Currently the dataset is sorted by archival release date, and I believe it would be more useful to have the data organized by the date and time of the conversation. Furthermore, it would be useful to have the text of all of the audiotapes in order to do conversational and text analysis.

To accomplish these data transformations, we can use OpenRefine to clean and reorganize our data. First, I would use the cell-split functionality to split the date from the time. Second, I would do some arithmetic to create an additional duration column. Third, I would use OpenRefine's sorting functionality to reorganize the data by conversation date and time, rather than by archival release date. While the last task cannot be completed in OpenRefine, I plan on using software such as Cloud Converter and MALLET to convert the speech to text and do topic modeling on the transcripts.
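For anyone curious, here is roughly what those first three transformations look like in pandas instead of OpenRefine. This is just a sketch: the column names (`start`, `end`, `release_date`) are my own guesses, since the real CSV's headers will differ.

```python
import pandas as pd

# Two made-up rows standing in for the Nixon tapes CSV; the real file
# would be loaded with pd.read_csv() and has different column names.
df = pd.DataFrame({
    "start": ["1971-02-16 09:05", "1971-02-16 08:30"],
    "end":   ["1971-02-16 09:40", "1971-02-16 08:55"],
    "release_date": ["2013-08-21", "2013-08-21"],
})

# Step 1: split the combined date-time into separate date and time columns
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
df["date"] = df["start"].dt.date
df["time"] = df["start"].dt.time

# Step 2: a little arithmetic to derive a duration column (in minutes)
df["duration_min"] = (df["end"] - df["start"]).dt.total_seconds() / 60

# Step 3: sort by conversation date and time, not archival release date
df = df.sort_values("start").reset_index(drop=True)

print(df[["date", "time", "duration_min"]])
```

The same three steps map one-to-one onto OpenRefine's split, transform, and sort operations; pandas just makes the arithmetic for the duration column a single line.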

As my team and I go through this data, I constantly have to remind myself of the cultural, historical, and, more uniquely, the emotional records that we are handling. While it is always useful to have data in the format of rows and columns, date and time, and so forth, I constantly have to ask myself: how do we preserve the emotional power that these records hold in a way that is accessible to more than just a listener? This definitely is not an easy task, but I honestly look forward to it, and I think it is one of the last great steps that technology needs to take to reduce the distance between man and machine.


One comment

  1. Hey Dhuangg, I agree that applications such as OpenRefine make intimidating tasks, such as breaking down a file with (in my case) nearly 6,000 rows, much more manageable. The tutorial helped you and me both make better use of our time by cleaning up the information for us. Amazing how technology works; it goes to show that humans are the ones in control of it, not the other way around.
