The quarter is off and running again at lightning speed. At UCLA, we’re on the quarter system, and things move fast — just 10 weeks to get through all your material. I’m teaching DH101 again this year, and, as usual, it’s a race against the clock. The profile of my students changes a bit every year, but the typical student who enters my DH101 classroom has facility with Word, PowerPoint, maybe Excel, maybe some of the Adobe suite, but not a ton of other computer stuff. By the end of the quarter, my goal is to get them working with and thinking critically about structured data, data cleaning, data visualization, mapping, and web design.
I’ve written about this before: working in groups, my students are assigned a dataset at the beginning of the quarter. They learn how to work with it as the quarter progresses, doing a lot of secondary contextual research, interviewing an expert about it, manipulating the data, and finally building a website that makes a scholarly humanistic argument with the support of the data. You can see the mechanics of this on my course website.
People often ask me about the data I use, and indeed, that is a story in itself. I have 88 students this year, and since I don’t like any group to have more than seven people in it, I have 12 groups, each of which needs a dataset. (Really, some of them can share the same dataset; I don’t know why I get weird about this.) And they can’t just use any dataset. In fact, most of the data out there is inappropriate for them.
Here is what I look for in a dataset for my students:
- It has to be a CSV (or able to be wrangled into a CSV). My beginners want to be able to double-click on their dataset and see…something that they can work with. CSVs are great because they open in Excel, which is familiar to most students and allows them to immediately start doing things like filtering and simple manipulation. Plus, you can drop a CSV into almost any visualization tool. I can use a relational database, but I usually just give the students the spreadsheet that results from a query, since I don’t have time in the quarter to teach them about more complicated data structures. Likewise, if a dataset is XML, I’ll just flatten it. But I prefer not to have to deal with this because, like I said, 12 datasets.
- Around 2,000 records is ideal. Here’s why: I want the dataset to be big enough that it’s too labor-intensive for the students to manipulate it by hand, but not so big that it breaks Excel. Really, I can work with bigger sets, too, but students do tend to get very anxious about working with datasets that big. Any number of fields is fine (actually more is better) because students understand fairly quickly that they can choose which fields to examine.
- It has to be…humanities-ish. You and I probably know that one could make a humanities argument about municipal water data, or public health information, but it takes a little bit of sophistication to get there. The most “natural” kind of analysis for these kinds of datasets would be urban planning or public health kinds of questions, and it’s too difficult for me to push students toward the kind of open-ended humanities questions I want them to pursue. It’s far easier if the data is about art, books, movies — subjects that are the traditional province of the humanities.
- It’s nice if it’s something they care about. I have confidence that my students will eventually become interested in any subject, once they really dig into it, but I can forestall a lot of grumbling if I can give them a dataset that’s immediately compelling to them. Things they like: fashion, food, performance, books from their youth, cartoons, comic books, TV, movies.
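For datasets that arrive as XML, the flattening step mentioned above can be sketched in a few lines of Python. This is a minimal sketch, not the author's actual workflow; the element names (`record` and its children) are placeholders for whatever tags the source file actually uses.

```python
import csv
import xml.etree.ElementTree as ET

def flatten_xml_to_csv(xml_path, csv_path, record_tag="record"):
    """Write one CSV row per <record> element, one column per child element.

    Assumes a shallow structure: each record's children hold plain text.
    """
    root = ET.parse(xml_path).getroot()
    records = [
        {child.tag: (child.text or "").strip() for child in rec}
        for rec in root.iter(record_tag)
    ]
    # Collect every field name that appears in any record, so rows with
    # missing fields still line up under the right headers.
    fieldnames = sorted({key for rec in records for key in rec})
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
```

The result opens directly in Excel, which is the whole point: students can double-click it and start filtering.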
You can see this year’s datasets at the bottom of this page. I do not just give my students their datasets in raw form. I cut the sets down to an appropriate number of records, if necessary, and then I give them the dataset along with a “project brief,” which contains:
- Information about the provenance and composition of the data.
- The name and contact info of an expert on that subject who has agreed to allow my students to interview them.
- The names and contact info of librarians who can help them.
- The name and contact info of UCLA’s mapping specialist.
- Two or three secondary sources to get them going on their research. I also teach them how to citation-chain.
Here is an example of a “data package,” with the contact info removed.
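Cutting a set down to an appropriate number of records, as described above, is easy to script. A minimal sketch with the standard library (not the author's actual process; the 2,000-row default just follows the rule of thumb from earlier in the post, and the fixed seed makes the cut reproducible):

```python
import csv
import random

def sample_csv(in_path, out_path, n=2000, seed=42):
    """Copy a random sample of at most n data rows (plus header) to out_path."""
    with open(in_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    random.seed(seed)  # same seed -> same sample, so the cut is repeatable
    sample = rows if len(rows) <= n else random.sample(rows, n)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(sample)
```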
If you’re thinking this is kind of an absurd amount of work for the instructor, you’re right. I really feel the students need this apparatus around their dataset, but I end up spending a good chunk of my summer hunting down data, persuading friends (and strangers) to serve as subject experts, and researching secondary sources.
Even with all of this scaffolding, students get very anxious about the project assignment, just because it’s so new to them. I’ve learned to expect it, to warn them that they’ll feel anxious about it, and to reassure them that if they’re hitting project milestones, they’ll get to the finish line on time, even if they feel at sea.
Sorry for the dashed-off blog post; I’ve been meaning to write about this for some time and finally had a few (just a few!) minutes!
5 Replies to “Data Packages for DH Beginners”
Could local GLAM-sector institutions provide you with dummy datasets? Or could the students go there to create datasets themselves?
Or are there students at other institutes who could work in remote tandem with your students?
Each group could provide data to the other and try to create visualisations for it. This could be self-driven or an external requirement, preferably with the donor group’s data being about their own locality or perspective.
I’m a humanist turned software developer, now turned back into a humanist since I’m no longer working for money. I saw a link to your post in the Digital Sinology group on Facebook. It seems very sensible to me, except for one thing that may be an issue for some of your students.
I work a lot with East Asian data in Unicode format and have found that Excel cannot load UTF-8 text files, with or without a BOM. This goes for both CSV files and tab-delimited 8-bit .txt files. Also, saving an in-memory Excel table, however it was loaded, that contains non-ASCII Unicode characters as a CSV will replace most of those characters with ‘?’ (I think the ones that cannot be converted to the Latin-1 character set). The one Unicode format that will load correctly without a manual import, and also save without loss of character information, is UTF-16 (either big-endian or little-endian) with a BOM.
I have to say that I’m not a fan of the CSV format in general. To do anything with the files beyond very simple editing, you need to either use a tool that knows how to load them (including dealing with UTF-8) or convert them to a simpler (generally escape-less) format like tab-delimited UTF-8 or UTF-16. I use Excel a lot, mostly for the rudimentary but very useful form of visualization (if you’re willing to call it that) via column filtering, but also for editing most easily done in a table context.
While the issue I describe is ubiquitous with East Asian character sets, it comes up in any dataset that uses non-ASCII characters and a Unicode encoding. This includes datasets in most non-English European languages (and even in English text with foreign words), and most datasets in Slavic languages, whether in the Latin or Cyrillic alphabet.
I’m not sure if you’ve encountered the CSV-Unicode problem, or, if you have, how you solved it. I can’t make a definite recommendation without knowing more about your toolkit, but a possibility with minimum impact on your “installed base” is to support other Excel-friendly file formats besides CSV, particularly UTF-16 tab-delimited .txt files (called Unicode Text in Excel’s Save As menu).
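The conversion this commenter suggests — UTF-8 CSV in, UTF-16 tab-delimited text out — can be sketched with Python's standard library. The file names are placeholders, and this is one possible implementation of the workaround, not something from the post itself:

```python
import csv

def csv_utf8_to_utf16_tsv(in_path, out_path):
    """Convert a UTF-8 CSV into a UTF-16 tab-delimited .txt file
    ("Unicode Text" in Excel's Save As menu).

    "utf-8-sig" tolerates an optional BOM on input; Python's "utf-16"
    codec writes a BOM on output, which is what Excel expects.
    """
    with open(in_path, newline="", encoding="utf-8-sig") as src, \
         open(out_path, "w", newline="", encoding="utf-16") as dst:
        writer = csv.writer(dst, dialect="excel-tab")
        for row in csv.reader(src):
            writer.writerow(row)
```

Because the output carries a BOM and uses UTF-16, it matches the one format the commenter found that Excel both loads and saves without mangling non-ASCII characters.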