My group’s dataset is the UCSB Cylinder Audio Archive, a large digitized collection of cylinders, the sound recording format that was commercially produced in the early 1900s, before vinyl records and cassette tapes. The dataset is very large: UCSB holds several cylinder collections, ranging in size from 1,000 to 6,000 cylinders. Altogether, the archive hosts over 10,000 cylinder recordings of popular music, vaudeville acts, comedic monologues, speeches, and more.
Fortunately, for a dataset this big, the column headings break the information down in a granular and concise manner. Some of the headings include ‘Main Talent’, ‘Marketing Genre’, ‘First Take Date’ and ‘Type’. However, loading the data into OpenRefine and looking at the facet for just one column (I chose to look at ‘Description’ first) shows that many of the values in this dataset could probably be combined, much as the ‘Vessel Type’ values in NJShipwrecks.csv could. For example, the following list includes every variation of the cylinder descriptions that begin with “Band”: Band; Band, with cornet solo; Band, with harmonica; Band, with male vocal solo; Band, with toy and trap effects; Band, with vocal chorus; Band, with whistling; Band, with xylophone solo. Just for this one type of cylinder recording, there are eight different values under ‘Description’. With 462 distinct values in the ‘Description’ column alone, there are surely numerous ways to combine them so that clear patterns in the dataset are easier to analyze. And, with 26 columns of information, the dataset will undoubtedly have other columns with near-duplicate values like those in ‘Description’, which could be cleaned up and combined as well. Luckily, some columns, like ‘Type’, have only 6 values, which will save our group from parsing through thousands of entries: Instrumental, Instrumental with vocal refrain, Spoken, Unknown, Vocal, and (blank).
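As a rough illustration of what combining the “Band” variations could look like outside OpenRefine, here is a minimal Python sketch. It assumes that everything after the first comma in a ‘Description’ value is a qualifier that can be dropped, which is my own simplification and may not hold for all 462 values. The function name `base_description` is made up for this example, and the list contains only the eight values quoted above, not the full column.

```python
from collections import Counter

# The eight 'Description' values quoted above; the full column is not reproduced here.
descriptions = [
    "Band",
    "Band, with cornet solo",
    "Band, with harmonica",
    "Band, with male vocal solo",
    "Band, with toy and trap effects",
    "Band, with vocal chorus",
    "Band, with whistling",
    "Band, with xylophone solo",
]


def base_description(value):
    """Return the text before the first comma, with surrounding whitespace removed.

    Assumption: the part after the first comma is only a qualifier.
    """
    return value.split(",", 1)[0].strip()


# Count how many original values fall under each combined label.
combined = Counter(base_description(d) for d in descriptions)
print(combined)
```

With these eight values, everything collapses into a single “Band” label. Before applying a rule like this to the whole column, it would be worth checking which distinctions (for example, vocal versus instrumental additions) our group actually wants to keep.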
One aspect of OpenRefine, and of dataset management in general, that I am unsure how to navigate is what to do with entries like the ones under ‘Type’ above that have no ‘Type’ at all and show up as (blank). I am sure those entries contribute to the overall message and understanding of the dataset, but they are not labelled with a category, and with 874 blank ‘Type’ entries alone, it will be impossible for us to listen to all of them and categorize them ourselves. I expect we will run into a similar problem in other columns, where we will see blank cylinder descriptions and only be able to listen to, and hand-categorize, a limited number. Is there a protocol within OpenRefine, or in dataset management in general, that dictates what to do with blank data entries? Do they form a category of their own, so that the fact of being left blank says something about the cylinder itself? Or should entries left blank in the columns we choose to analyze be ignored entirely?
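To make the first of those options concrete, here is a minimal Python sketch that treats blanks as their own explicit category, so they can be counted and then either kept or filtered out later. This is only one possible approach, not an established protocol. The rows are invented for illustration and are not real archive data, and `normalize_type` and `BLANK_LABEL` are hypothetical names.

```python
from collections import Counter

BLANK_LABEL = "(blank)"


def normalize_type(value):
    """Map None, empty, or whitespace-only values to an explicit blank label."""
    if value is None or not value.strip():
        return BLANK_LABEL
    return value.strip()


# Invented example rows; the real dataset has 874 blank 'Type' entries.
rows = [
    {"Type": "Vocal"},
    {"Type": ""},
    {"Type": "Instrumental"},
    {"Type": None},
    {"Type": "  "},
    {"Type": "Spoken"},
]

# Count every category, including the explicit blank one.
counts = Counter(normalize_type(row["Type"]) for row in rows)

# Share of rows with no 'Type', useful for judging how much ignoring them would hide.
blank_share = counts[BLANK_LABEL] / len(rows)

# The alternative option: drop blank rows before analysis.
labelled_rows = [row for row in rows if normalize_type(row["Type"]) != BLANK_LABEL]
```

Keeping the blank label around until the last step means the decision to ignore those entries stays visible and reversible, instead of being made silently during cleanup.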
Hi, I noticed that you mention that your group’s dataset is very large. My group also has a very large dataset, so it was good to see you point out some of the problems that you have encountered or may encounter in the future. Reading this blog gave me a better idea of how these tools can help with my own large dataset project.