W3 – Data-mining, Classification, and Research

How a Math Genius Hacked OkCupid to Find True Love

This week’s readings reminded me of an interesting article about Chris McKinlay, a UCLA grad student who “hacked OK Cupid to find the girl of his dreams.” Some friends shared it on Facebook months ago; apparently he was a TA in one of their lower-div math classes. It was interesting to read about his process, and the visualizations included in the article were striking as well.

Chris McKinlay used Python scripts to riffle through hundreds of OkCupid survey questions. He then sorted female daters into seven clusters, like “Diverse” and “Mindful,” each with distinct characteristics.
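
To make the idea of sorting survey answers into clusters more concrete, here is a rough sketch of one way it could be done. This is not McKinlay’s actual code: the k-modes-style approach (grouping rows of categorical answers around their most common values) is my assumption about a reasonable technique, and the questions, answers, and starting profiles below are all made up for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Count the questions on which two answer tuples disagree."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Return the most common answer to each question across the given rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, initial_modes, max_iter=10):
    """Group categorical rows around k 'modes' (most-common-answer profiles)."""
    modes = list(initial_modes)
    labels = []
    for _ in range(max_iter):
        # Assign each row to the mode it disagrees with least.
        new_labels = [
            min(range(len(modes)), key=lambda i: hamming(row, modes[i]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each mode from the rows currently assigned to it.
        for i in range(len(modes)):
            members = [r for r, label in zip(rows, labels) if label == i]
            if members:
                modes[i] = column_modes(members)
    return labels, modes


# Made-up answers to three hypothetical survey questions.
answers = [
    ("yes", "no", "often"),
    ("yes", "no", "sometimes"),
    ("yes", "yes", "often"),
    ("no", "yes", "never"),
    ("no", "yes", "sometimes"),
    ("no", "no", "never"),
]

# Start from two hand-picked profiles so the result is repeatable.
labels, modes = k_modes(answers, initial_modes=[answers[0], answers[3]])
print(labels)
print(modes)
```

Each resulting mode reads like a typical profile for its cluster, which is roughly what makes it possible to give clusters names and describe their characteristics.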

His mathematical approach to online dating reminds me of how Alexis C. Madrigal reverse engineered Netflix’s vocabulary and grammar in “How Netflix Reverse Engineered Hollywood.” Both McKinlay and Madrigal started their projects with data-mining scripts. Once they had a sizable data set, they looked for patterns, formed hypotheses, and then ran tests to (dis)prove them. However, before they could do this, they needed a classification system. In McKinlay’s case, this meant “seven statistically distinct clusters based on…[women’s] questions and answers.” Once the women were grouped into seven clusters such as “God,” “Tattoo,” and “Samantha” (nomenclature was nonstandard), each with distinct characteristics, McKinlay could target women from a specific cluster with a profile tailored to their interests. For Madrigal, classification meant organizing genre descriptors into categories such as “Region,” “About…,” and “Based on….” A Netflix genre was a combination of these components that followed specific grammar rules.
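
The idea of a genre as components arranged by grammar rules can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the slot order, the `noun_genre` slot name, and the sample vocabulary are hypothetical, and only “Region,” “About…,” and “Based on…” come from the categories mentioned above.

```python
# Hypothetical slot order. The real grammar has more slots than these four.
SLOT_ORDER = ["region", "noun_genre", "based_on", "about"]

# Fixed words that introduce certain slots.
PREFIXES = {"based_on": "Based on", "about": "About"}


def build_genre(**slots):
    """Assemble a genre string from named components in grammar order."""
    unknown = set(slots) - set(SLOT_ORDER)
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    if "noun_genre" not in slots:
        raise ValueError("a genre needs at least a noun_genre")
    parts = []
    for slot in SLOT_ORDER:
        if slot not in slots:
            continue  # every slot except noun_genre is optional
        prefix = PREFIXES.get(slot)
        parts.append(f"{prefix} {slots[slot]}" if prefix else slots[slot])
    return " ".join(parts)


# The order in which the components are supplied does not matter;
# the grammar decides where each one lands.
genre = build_genre(about="Royalty", noun_genre="Dramas",
                    region="British", based_on="Books")
print(genre)  # British Dramas Based on Books About Royalty
```

Because each slot is optional and draws from its own vocabulary, even a small grammar like this can generate a very large number of distinct genre names.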

McKinlay and Madrigal’s situations were unusual because both were hacking an established data set. Their data was pre-tagged, which made the process of classifying and pattern hunting much easier. In Madrigal’s case, Netflix’s movie taggers broke down movie content into “quanta” or “microtags” that could be fed into computer algorithms. The 76,897 altgenres scraped by Madrigal’s script were the product of these algorithms. In this way, Madrigal was working with “metametadata,” or data about data about data. In contrast, the authors of the Plateau Peoples’ Web Portal had to build their data set from the ground up. According to the “About” page of the site, they were faced with the daunting task of curating a diverse collection of Native peoples’ cultural materials with varying metadata. A classification standard was set that would allow for both consistency and flexibility throughout the collection: “There are nine main categories (users can use the browse section of the portal to view these) within the portal. Each tribe can then add their own subcategories refining the typology further to allow for greater precision and flexibility in searching.”
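
A two-level scheme like the one quoted above (a fixed set of main categories, plus subcategories that each tribe adds on its own) could be modeled roughly as follows. This is a sketch under my own assumptions, not a description of how the portal is built: the `Typology` class is hypothetical, and the category, tribe, and subcategory names are placeholders rather than the portal’s real ones.

```python
from collections import defaultdict


class Typology:
    """Shared main categories, with per-tribe subcategories beneath them."""

    def __init__(self, main_categories):
        # The main categories are fixed up front for consistency.
        self.main_categories = frozenset(main_categories)
        # tribe -> main category -> set of subcategories
        self._subs = defaultdict(lambda: defaultdict(set))

    def add_subcategory(self, tribe, main, sub):
        """Let a tribe refine one of the shared main categories."""
        if main not in self.main_categories:
            raise ValueError(f"unknown main category: {main!r}")
        self._subs[tribe][main].add(sub)

    def subcategories(self, tribe, main):
        """Return a tribe's subcategories for a main category, sorted."""
        return sorted(self._subs.get(tribe, {}).get(main, set()))


# Placeholder names; the portal itself has nine main categories.
typology = Typology(["Category 1", "Category 2"])
typology.add_subcategory("Tribe A", "Category 1", "Subcategory X")
typology.add_subcategory("Tribe B", "Category 1", "Subcategory Y")

print(typology.subcategories("Tribe A", "Category 1"))
print(typology.subcategories("Tribe B", "Category 2"))
```

Rejecting unknown main categories keeps the top level consistent across the collection, while the per-tribe subcategory sets provide the flexibility the quote describes.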

Although McKinlay and Madrigal’s classification processes may not have been as extensive as that of the authors of the Plateau Peoples’ Web Portal, their approach to metametadata was fascinating. I enjoyed reading about reverse-engineering large, cryptic data sets and using them in new ways.