Blog #4: OpenRefine

My group’s dataset is about classicists: their basic background information (such as date and place of birth) and information about their studies (such as main affiliated university and highest level of education obtained).

One stark problem that OpenRefine made visible is that the data was most likely collected without a controlled vocabulary. As a result, many of the same values are expressed in different ways. For instance, “Boston University” also appears as “Boston U” and “Boston U.” (with the period). Consistent values for the main affiliated institution would help answer our research questions, because we could then measure accurately where the classicists were concentrated and see whether they are segmented geographically. To solve this issue, OpenRefine’s text facet is useful: within it, I can merge differently worded entries that mean the same thing into one. However, doing this by hand alone would take an excruciatingly long time. To expedite the process, I can also use the cluster and trim whitespace functions. What I’m struggling with is how to make sure I don’t miss any of the numerous ways of writing “Boston University,” even after using these tools. In other words, I would be curious to know whether there is a way of verifying that the remaining facet values are all genuinely distinct, without having to scroll through and eyeball the list myself.
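One possible way to check this outside OpenRefine is sketched below in Python. It is only a rough heuristic, not OpenRefine's own code: `fingerprint` loosely imitates the idea behind key-collision clustering (lowercase, strip punctuation, sort the words), and `suspicious_pairs` flags values where one looks like an abbreviation of another. All function names here are hypothetical, and the sketch assumes the institution column has already been exported into a plain list of strings.

```python
import re
import unicodedata
from collections import defaultdict
from itertools import combinations


def fingerprint(value):
    """Normalize a value: trim, lowercase, drop accents and punctuation, sort unique words."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def group_by_fingerprint(values):
    """Map each fingerprint to the set of original spellings that share it."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return dict(groups)


def is_abbreviation(short_key, long_key):
    """Heuristic: same word count, and each short word is a prefix of the matching long word."""
    short_words, long_words = short_key.split(), long_key.split()
    return len(short_words) == len(long_words) and all(
        long_word.startswith(short_word)
        for short_word, long_word in zip(short_words, long_words)
    )


def suspicious_pairs(values):
    """Return pairs of distinct fingerprints that may still refer to the same thing."""
    keys = sorted(group_by_fingerprint(values))
    return [
        (a, b)
        for a, b in combinations(keys, 2)
        if is_abbreviation(a, b) or is_abbreviation(b, a)
    ]


# Made-up sample values for illustration, not taken from the real dataset.
institutions = [
    "Boston University",
    "Boston U",
    "Boston U.",
    " boston university ",
    "Harvard University",
]
for pair in suspicious_pairs(institutions):
    print(pair)
```

An empty result would suggest that no obvious abbreviations remain, but it would not prove the values are all distinct: the prefix test cannot catch nicknames, acronyms, or misspellings, so some eyeballing of the flagged and unflagged values is probably still needed.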
