Data Packages for DH Beginners

The quarter is off and running again at lightning speed. At UCLA, we’re on the quarter system, and things move fast — just 10 weeks to get through all your material. I’m teaching DH101 again this year, and, as usual, it’s a race against the clock. The profile of my students changes a bit every year, but the typical student who enters my DH101 classroom has facility with Word, PowerPoint, maybe Excel, maybe some of the Adobe suite, but not a ton of other computer stuff. By the end of the quarter, my goal is to get them working with and thinking critically about structured data, data cleaning, data visualization, mapping, and web design.

I’ve written about this before: working in groups, my students are assigned a dataset at the beginning of the quarter. They learn how to work with it as the quarter progresses, doing a lot of secondary contextual research, interviewing an expert about it, manipulating the data, and finally building a website that makes a scholarly humanistic argument with the support of the data. You can see the mechanics of this on my course website.

People often ask me about the data I use, and indeed, that is a story in itself. I have 88 students this year, and since I don’t like any group to have more than eight people in it, I have 12 groups, each of which needs a dataset. (Really, some of them can share the same dataset; I don’t know why I get weird about this.) And they can’t just use any dataset. In fact, most of the data out there is inappropriate for them.

Here is what I look for in a dataset for my students:

  1. It has to be a CSV (or able to be wrangled into a CSV). My beginners want to be able to double-click on their dataset and see…something that they can work with. CSVs are great because they open in Excel, which is familiar to most students and allows them to immediately start doing things like filtering and simple manipulation. Plus, you can drop a CSV into almost any visualization tool. I can use a relational database, but I usually just give the students the spreadsheet that results from a query, since I just don’t have time in the quarter to teach them about more complicated data structures. Likewise, if a dataset is XML, I’ll just flatten it. But I prefer not to have to deal with this because, like I said, 12 datasets.
  2. Around 2,000 records is ideal. Here’s why: I want the dataset to be big enough that it’s too labor-intensive for the students to manipulate it by hand, but not so big that it breaks Excel. I can work with bigger sets, too, but students tend to get very anxious about datasets much larger than that. Any number of fields is fine (actually more is better) because students understand fairly quickly that they can choose which fields to examine.
  3. It has to be…humanities-ish. You and I probably know that one could make a humanities argument about municipal water data, or public health information, but it takes a little bit of sophistication to get there. The most “natural” kind of analysis for these kinds of datasets would be urban planning or public health kinds of questions, and it’s too difficult for me to push students toward the kind of open-ended humanities questions I want them to pursue. It’s far easier if the data is about art, books, movies — subjects that are the traditional province of the humanities.
  4. It’s nice if it’s something they care about. I have confidence that my students will eventually become interested in any subject, once they really dig into it, but I can forestall a lot of grumbling if I can give them a dataset that’s immediately compelling to them. Things they like: fashion, food, performance, books from their youth, cartoons, comic books, TV, movies.
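For anyone curious what "flattening" an XML dataset into a CSV might look like, here is a minimal Python sketch using only the standard library. It is one possible approach, not a record of my actual process: the function names, the `item` record tag, and the sample poster records are all made up for illustration, and it assumes each record's fields are simple child elements one level deep.

```python
import csv
import io
import xml.etree.ElementTree as ET


def flatten_xml(xml_text, record_tag):
    """Turn each <record_tag> element into a flat dict of child tag -> text."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(record_tag):
        row = {}
        for child in record:
            text = (child.text or "").strip()
            # Repeated child tags (e.g. several <subject> elements) are joined
            # with "; " so each record still fits on one spreadsheet row.
            row[child.tag] = f"{row[child.tag]}; {text}" if child.tag in row else text
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text, using the union of all keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # restval="" leaves a blank cell when a record lacks a field.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data, invented purely for this example.
SAMPLE_XML = """
<collection>
  <item><title>Poster A</title><year>1968</year><subject>film</subject><subject>protest</subject></item>
  <item><title>Poster B</title><year>1971</year></item>
</collection>
"""

rows = flatten_xml(SAMPLE_XML, "item")
csv_text = rows_to_csv(rows)
print(csv_text)
```

Joining repeated fields into a single cell loses some structure, but it keeps one record per row, which is what lets students double-click the file and start filtering in Excel.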

You can see this year’s datasets at the bottom of this page. I do not just give my students their datasets in raw form. I cut the sets down to an appropriate number of records, if necessary, and then I give them the dataset along with a “project brief,” which contains:

  1. Information about the provenance and composition of the data.
  2. The name and contact info of an expert on that subject who has agreed to allow my students to interview them.
  3. The names and contact info of librarians who can help them.
  4. The name and contact info of UCLA’s mapping specialist.
  5. Two or three secondary sources to get them going on their research. I also teach them how to citation-chain.
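As for cutting a set down to an appropriate number of records: there are many ways to do it, and which one makes sense depends on the dataset. The sketch below shows just one option, a reproducible random sample, and is an assumption on my part rather than a fixed recipe. The function name `trim_records`, the seed value, and the stand-in data are all hypothetical.

```python
import random


def trim_records(records, target=2000, seed=101):
    """Return at most `target` records, sampled at random but kept in original order."""
    if len(records) <= target:
        return list(records)
    rng = random.Random(seed)  # fixed seed so every run yields the same subset
    keep = sorted(rng.sample(range(len(records)), target))
    return [records[i] for i in keep]


# Made-up stand-in for a large dataset: 5,000 numbered rows.
big = [{"id": i} for i in range(5000)]
small = trim_records(big)
print(len(small))
```

A random sample preserves the overall shape of the data, but it can break up meaningful groupings (a complete run of a magazine, say), so filtering by date range or category is sometimes the better cut.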

Here is an example of a “data package,” with the contact info removed.

If you’re thinking this is kind of an absurd amount of work for the instructor, you’re right. I really feel the students need this apparatus around their dataset, but I end up spending a good chunk of my summer hunting down data, persuading friends (and strangers) to serve as subject experts, and researching secondary sources.

Even with all of this scaffolding, students get very anxious about the project assignment, just because it’s so new to them. I’ve learned to expect it, to warn them that they’ll feel anxious about it, and to reassure them that if they’re hitting project milestones, they’ll get to the finish line on time, even if they feel at sea.

Sorry for the dashed-off blog post; I’ve been meaning to write about this for some time and finally had a few (just a few!) minutes!