Humanities Data: A Necessary Contradiction

This is a talk that I gave at the Harvard Purdue Data Management Symposium on June 17, 2015, in Cambridge, Massachusetts. The audience was mostly librarians and other data-management professionals. I was the only humanities person on the program, so I wanted to talk about the ways that humanists think about data differently from people in some other fields.

Two mosaics beside each other. The one on the left is made up of largely cool, blue images; the one on the right is composed of warmer, earthier tones.
Sometimes I start class discussions by comparing image quilts of Google searches for “digital” (left) and “humanities” (right).

Today I’d like to talk about the ways in which humanists think about data, and how that’s distinct from the ways in which scientists and social scientists think about it.

Even though I think our issues can be pretty different, I want to make the case that there are some very promising ways in which libraries could make meaningful interventions in the humanities research lifecycle, both for what we might call traditional humanists and for digital humanists. So I’ll start with what “traditional” humanists might need help with and then move on to the needs of what we call “digital humanists” (although I think in practice the distinction is a bit blurred).

I just want to say at the outset that there are people who specialize in humanities data curation, and I am not one of those people. A number of talented people, including Trevor Muñoz at the University of Maryland and Katie Rawson at the University of Pennsylvania, have started to take a very programmatic look at the data-curation needs of digital humanists. And I encourage you to check out their important work. But you don’t have Trevor or Katie; you have me! So what I can do is share my own perspective and experience on what it means to work with data as a humanist, and where libraries can help.

I’ll start with an anecdote, and I think that anyone who consults on digital humanities projects will be familiar with this scenario. Humanities scholars will sometimes describe elaborate visualizations to me, involving charts and graphs and change over time. “Great,” I respond. “Let’s see your data.” “Data?” they say. “Oh, I don’t have any data.”

This is not because we’re stupid or naïve; it’s that humanists have a very different way of engaging with evidence than most scientists or even social scientists. And we have different ways of knowing things than people in other fields. We can know something to be true without being able to point to a dataset, as it’s traditionally understood. We can know, to take just one example, that early silent film relied on the conventions of melodrama to create legible narratives, not because we have a spreadsheet somewhere, but because we’ve immersed ourselves so deeply in our source material that we’re attuned to its nuances.

That’s why humanists sometimes think you can make a visualization without data: they want to illustrate ideas and movement, not necessarily data points as we’ve been discussing them here.

Screenshot of LARB article called "Literature is Not Data"
Los Angeles Review of Books, October 28, 2012

In fact, very few traditional humanists would call their source material “data.” You may have seen this piece in the LA Review of Books in October 2012. While the language is pretty hyperbolic, I do think it helps to convey how uncongenial many humanists feel the notion of data is to the work that they actually do.

When you call something data, you imply that it exists in discrete, fungible units; that it is computationally tractable; that its meaningful qualities can be enumerated in a finite list; that someone else performing the same operations on the same data will come up with the same results. This is not how humanists think of the material they work with.

This is not a perfect analogy, but imagine that someone called your family photograph album a dataset. It’s not inaccurate per se, but it suggests that this person just fundamentally doesn’t understand why you value this artifact. And it’s the same with humanists. With a source, like a film or a work of literature, you’re not extracting features in order to analyze them; you’re trying to dive into it, like a pool, and understand it from within.

Let’s take my silent film example again. It would be possible to enumerate all of the filmic conventions that recall the conventions of melodrama. Is there a villain? Is there a heroine? Are good and evil depicted in stark, black-and-white terms? You could even build a dataset like this and use it to show how film changed over time.

A grid listing silent films, their dates, and melodramatic conventions.
My silent film dataset.

But, seriously, who cares? There’s just such a drastic difference between the richness of the actual film and the data we’re able to capture about it.

A dataset like this is so much less interesting than the trained judgment of someone who’s seen many of these films and can turn a nuanced observation of these changes into a real argument.

(Of course, the video itself constitutes data, but I’ll get to that in a second.)

And I would argue that the notion of reproducible research in the humanities just doesn’t have much currency, the way it does in the sciences, because humanists tend to believe that the scholar’s own subject position is inextricably linked to the scholarship she produces.

However. Things are changing, in ways both obvious and not. All of our stuff is on our computers now — all of it, from books to movies to archival documents. This is why, more than anything else, I think digital humanities is here to stay. If you can analyze something computationally, I think it’s going to be really hard to tell people that they shouldn’t.

This state of affairs has created some real problems for humanists, and, I would say, some real opportunities for libraries. If you speak to any historian who works in an archive, I guarantee you that they have hundreds, maybe even thousands of photos shot in an archive that look like this:

An undifferentiated, baffling list of photos labeled "2012-12-24 19:19:04.jpg" and the like.
Behold, the historical dataset.


This is it! This is how historians are organizing hundreds of archival photographs!! The best-organized among them are trying to manage these irreplaceable source documents in iPhoto. And incidentally, this is a big problem for me, as someone who works on the history of lobotomy.

It’s not just historians who have a problem. Literature scholars, film scholars, everyone’s dealing with lots of journal articles, video clips, and other sources, and is really struggling to organize them in order to produce scholarship.

So humanists — even those who aren’t digital humanists — desperately need some help managing their stuff, and libraries are in a great position to help them. I do feel that this is an underexplored opportunity space for libraries.

It’s just that if you advertise that help as “data management,” they’ll have no idea you’re trying to talk to them.

I used to offer a workshop on “managing research assets,” and even that felt way too clinical to describe humanists’ sources. But if you get a chance to look at the blog post that contains all the suggestions I used in the workshop, you’ll see that we’re cobbling together dozens of tools, none of which really do what we want them to do.

So all of that is to say that even if they don’t call their sources data, traditional humanists do have pretty pressing data-management needs. But the need becomes even greater when you’re talking about people who consider themselves digital humanists — that is, people who use digital tools to explore humanities questions.

In many ways, digital humanists will have similar data-management needs to scientists and social scientists — they’ll have spreadsheets, images, and video, and will probably at least know what metadata is. In addition, the NEH Office of Digital Humanities, like the NSF and other funding agencies, now requires a data-management plan; so you will very soon encounter, I’m sure, a humanist approaching you at the 11th hour with a request that you write their data-management plan for them.

Just to give you a sense of the kinds of things humanists might do with structured data, I’ll show you a project that my students just completed, as part of a collaboration with the Getty Research Institute. The GRI maintains what seems to me really big data — about 1.5 million records relating to the transmission and sale of works of art, which they call the provenance index. It’s a really complex and baffling database — a really great case study in humanities data, actually — because all of its records are derived from historical documents themselves, and so are eccentric, disparate, and historically and geographically uneven. But because it’s so big, you can do interesting, unexpected things with it.

For example, one of my students got really interested not in the paintings but the frames themselves, which are fairly understudied within art history. He gathered sales data for paintings sold between 1689 and 1787, and through a combination of text analysis and secondary reading, determined that two major factors made a frame valuable during this period: its beauty or its authenticity. With that information, he was able to show that there was indeed a market for frames described as “authentic” during this 100-year period.

So it’s quantitative evidence that seems to show something, but it’s the scholar’s knowledge of the surrounding debates and historiography that gives this data any meaning. It requires a lot of interpretive work.

These are two relatively simple examples, but I think they do show a little bit about how digital humanists are tending to work with data.

First, we often find ourselves in conflict with publishers because of the kind of work we want to do. I mentioned that my IP address had gotten blocked, but this is mild compared to what has happened to other scholars; I definitely know people who’ve been threatened with lawsuits and the like for excessive downloading.

Second, we’re not generally creating data through experimentation or observation — more often than not, we’re mining data from historical documents. You name it, we’ve tried to mine it, from whaling logs to menus to telephone directories. This means that we tend to want different tools than scientists, and also that we have some interesting data-wrangling problems. More often than not, the categories that our historical sources used to divide up our data are not the same ones we’re interested in analyzing. So we often have to do some very creative transformations and interpretation, as my student did with the frames data.

Third, it’s just awful trying to find a humanities dataset. There are various humanities data repositories or registries, but they’re terribly limited. And right now we’re starting to see museums and cultural institutions releasing their data, and there’s just no way to know who’s released what, unless you’re the kind of person who stays on top of these things. So we urgently need some help locating these datasets, aggregating them, and perhaps even linking them.

Fourth, we do need those web services Sayeed was describing yesterday, that are built on top of existing datasets. We are working a lot with APIs, and it’s really insufficient for us to download one record at a time. And even for people who aren’t going to work with APIs, if you could build visualizations of datasets on the fly, or even just access the data in quantity, it would be a big help.

Fifth, we have a desperate need for help with data-modeling — and here is another place where I think libraries could really play a big role. This July, I’ll be directing a summer institute on digital humanities and art history, and as I’ve been reading through the participants’ project ideas, I’ve been struck by how often it seems that what the scholar really needs is data-modeling advice. For example: the art historian who wants to show how and when art objects traveled across the Indian Ocean and relate that movement to corresponding changes in artistic practice.

What she really needs is a data model that can accommodate historical and artistic periods, geographic movement, and conventional time. Most humanities scholars are not trained to build these kinds of databases. But I think — I hope — that this is an area where the library could be a huge help.
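To make that a little more concrete, here is a minimal sketch in Python of what such a model might look like. Everything in it is an assumption made for illustration: the class names (`Period`, `Movement`), the field names, and the placeholder values ("Period A", "Port X") are invented and do not come from the scholar's project or from any real dataset. The idea is only that periods and journeys are both stored as spans of years, so that a journey can be related to every period it might fall within.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    """A historical or artistic period, bounded by conventional years."""
    name: str
    start_year: int
    end_year: int


@dataclass(frozen=True)
class Movement:
    """One journey of one object, dated as a span rather than a single day."""
    object_id: str
    origin: str
    destination: str
    earliest_year: int  # earliest year the journey could have happened
    latest_year: int    # latest year the journey could have happened


def periods_for(movement, periods):
    """Return the names of every period the journey could fall within."""
    return [
        p.name
        for p in periods
        if p.start_year <= movement.latest_year
        and movement.earliest_year <= p.end_year
    ]


# Placeholder values, invented purely to show the shape of the data.
periods = [
    Period("Period A", 1500, 1599),
    Period("Period B", 1600, 1699),
]
journey = Movement("object-1", "Port X", "Port Y", 1590, 1610)

print(periods_for(journey, periods))
```

A journey dated only to "sometime between 1590 and 1610" matches both periods here, which is the point: the model keeps the ambiguity visible and leaves the choice to the scholar. A real project would almost certainly need more than this (places with coordinates, sources for each claim, competing periodizations), and deciding what to include is the kind of conversation where a librarian's advice would help.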

Finally, we may even need new kinds of data specifications, because the currently existing standards for describing time and space, for example, are actually really inadequate for our needs. To give one example, many standards for specifying dates require time calculated down to the exact day, and sometimes even the minute or second. But humanists tend to deal in words like circa, spans of time, or things like “before” or “after” this event. Two technologists at Stanford actually took on this problem with a project called Topotime. By specifying that certain characters represent things like uncertainty, contingency, or approximation, they’ve shown how we could move from depicting time as a point or a line to a much broader canvas of shapes.
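As a rough illustration of the underlying idea, here is a small Python sketch that treats a date as a range of plausible years and keeps the source's own wording alongside it. This is not Topotime's notation or code. The names (`FuzzySpan`, `circa`, `before`, `after`) and the default tolerances of 10 and 50 years are assumptions chosen for the example, and a real project would set them per source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzySpan:
    """A date expressed as a range of plausible years, not a single point."""
    earliest: int  # earliest plausible year
    latest: int    # latest plausible year
    label: str     # the wording found in the source, kept for the reader

    def could_overlap(self, other):
        # Two spans may coincide if their ranges intersect at all.
        return self.earliest <= other.latest and other.earliest <= self.latest


def circa(year, tolerance=10):
    """'circa 1750' becomes 1740-1760 (the tolerance is an assumption)."""
    return FuzzySpan(year - tolerance, year + tolerance, f"circa {year}")


def before(year, window=50):
    """'before 1745' becomes the 50 years up to 1744 (window is an assumption)."""
    return FuzzySpan(year - window, year - 1, f"before {year}")


def after(year, window=50):
    """'after 1800' becomes the 50 years from 1801 (window is an assumption)."""
    return FuzzySpan(year + 1, year + window, f"after {year}")


painted = circa(1750)
print(painted.could_overlap(before(1745)))  # the ranges intersect
print(painted.could_overlap(after(1800)))   # the ranges do not intersect
```

Even this crude version can answer a question that a single exact date cannot: whether two loosely dated events could have happened at the same time.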

Just as we need more nuanced data models for time, we find ourselves faced with a pretty limited palette of options for depicting important structures of power, like gender and race. Take the Union List of Artist Names, which is an incredibly important resource that places like museums use to establish authorities — that is, to make sure they’re all using the same name to refer to an artist, and to associate that name with other data about the artist. It’s a tremendously important resource, and without it museums couldn’t share and network information; we’d never be able to figure out who holds what. But look how it deals with gender!

Now, the fact that it captures gender is crucial — otherwise we wouldn’t be able to say that women are underrepresented in a museum’s collection — but no self-respecting humanities scholar would ever get away with such a crude representation of gender in traditional work.

We find ourselves needing models for gender that can accommodate much more nuance than our current standards. For us, the proper mode of visualizing data may not be a pie chart; it may be a heat map.

So I don’t know about you, but I actually find these problems to be quite interesting and challenging: taking the datasets we’ve been given — which were not at all created for our purposes — and working against their grain or reinventing them to try and tease out the things we think are really interesting.

It requires some real soul-searching about what we think data actually is and its relationship to reality itself; where is it completely inadequate, and what about the world can be broken into pieces and turned into structured data? I think that’s why digital humanities is so challenging and fun, because you’re always holding in your head this tension between the power of computation and the inadequacy of data to truly represent reality.

23 Replies to “Humanities Data: A Necessary Contradiction”

  1. Love this! Your analogy “imagine that someone called your family photograph album a dataset” is spot-on. I think it does a great job illustrating why some humanities scholars have a tough time with the notion of “data.” I would love to cite this in my own talks!


  2. The more I read about those kind of things, the more anxious I am. All that “datasets” approach, data mining in context of people’s behaviour (big data?)… Too many times someone is trying to put the one’s way of acting and thinking into pure math, pure numbers, someone’s IQ… This is kind of dehumanisation to me, which goes somewhere – yet hard to say where – where the person is no longer a human, he’s rather an ID.

  3. Loved this, as it captures so much of the complex interface between data, interpretation, and reality. But I don’t find your silent film example convincing; I think you do damage to the concept of a generalization here.

    Referring to the spreadsheet of silent film characteristics, you note “a dataset like this is so much less interesting than the trained judgment of someone who’s seen many of these films and can turn a nuanced observation of these changes into a real argument.” No doubt this spreadsheet fails to capture most of what is interesting about melodrama! But if you’re going to argue for “these changes”, meaning less melodramatic films as time goes on, you are making a fundamentally quantitative claim. Unless I have badly misinterpreted the claim, you are making a statement about all silent films and saying that the number which are melodramatic decreases over time.

    My sense is that the right way to handle this is to quantify “melodramatic” directly. That is, you as the culturally-aware human viewer can watch each film and rate it as melodramatic or not, or perhaps use a numerical scale. This generates data for the concept of “melodrama” directly, and requires the full power of human subjectivity. I’m not sure what would be the point of reducing melodrama to component parts, as your example data does; that’s a different analytic exercise entirely. (If you are worried that different people would code films as melodramatic or not differently, then that is itself a fascinating research question about the variation in human perception and understanding of “melodrama”!)

    The problem with the silent film data you present is not that they are data, but that they are the wrong data to support the claim. I read the claim that “early silent film relied on the conventions of melodrama” to imply that “later silent film did not” and to me the only permissible evidence for this is quantitative: a decrease in the proportion of melodramas over time. But nobody said those quantities had to be “objective.” Humans can and do quantify all the time, and this is a common and valid way to create data.

  4. I actually love your silent-film example, Miriam; I think it nicely demonstrates why humanists haven’t tended to produce a lot of structured datasets. (We’re likely to distrust our ability to define an appropriate data model — and I would say we’re often right to feel that distrust.)

    But there’s also a lot of diversity in “the humanities.” Looking at your silent-film spreadsheet I’m reminded of stuff Janice Radway does in Reading the Romance, which is really quite close to this example. E.g., there’s a checklist of qualities that distinguish positive and negative male figures in romances, and she ticks them off for each title on her list. Of course, Radway ended up in a Communications dept, and those folks are halfway to social science already.

  5. “Of course, Radway ended up in a Communications dept, and those folks are halfway to social science already.”

    You make that sound like a bad thing! So what’s the difference, in this case, between humanities and the social sciences? How would humanities vs. social science treat a claim like “early silent films were melodramatic and later silent films less so” differently? Does such a statement in the humanities make any claim to truth, correspondence, empiricism, etc? I get that subjectivity is key in the humanities, and that continuing the conversation is a worthwhile goal, but are claims in the humanities ever subject to the critique of “that’s not true, or at least, you haven’t provided the right kind of evidence that it is”? And if so, should we expect different standards of evidence between the humanities and the social sciences?
