Written with Andy Wallace, with methods and ideas borrowed from Zoe Borovsky
As Zoe Borovsky brilliantly demonstrated when she visited my DH grad class, topic modeling starts with the assumption that each document is made up of multiple topics — like lumps of Play-Doh. Photo: “Play-Doh” by dbrekke.
If you’re reading this, you may know that topic modeling is a method for finding and tracing clusters of words (called “topics” in shorthand) in large bodies of text. Topic modeling has achieved some popularity with digital humanities scholars, partly because it offers some meaningful improvements over simple word-frequency counts, and partly because of the arrival of some relatively easy-to-use tools for topic modeling.
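If it helps to see the difference between a word-frequency count and a "mixture of topics," here is a minimal Python sketch. It is a toy, not how MALLET works: real topic modeling infers the topics from the texts themselves, whereas the two topics here (`sea` and `law`), their word lists, and the sample sentence are all hypothetical and hand-picked purely for illustration.

```python
from collections import Counter

# Hypothetical, hand-picked topics. A real topic model would infer
# these word clusters from the corpus instead of being told them.
TOPICS = {
    "sea": {"ship", "whale", "captain", "ocean"},
    "law": {"court", "judge", "trial", "verdict"},
}


def word_counts(text):
    """Plain word-frequency count: the baseline approach."""
    return Counter(text.lower().split())


def topic_shares(text, topics=TOPICS):
    """Share of the document's topic words that falls under each topic."""
    counts = word_counts(text)
    totals = {
        name: sum(counts[word] for word in words)
        for name, words in topics.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No topic words at all: report zero for every topic.
        return {name: 0.0 for name in topics}
    return {name: total / grand_total for name, total in totals.items()}


# Made-up sample document that mixes both topics.
doc = "the captain saw the whale before the judge opened the trial of the ship"

print(word_counts(doc).most_common(3))
print(topic_shares(doc))
```

The word count mostly tells you that "the" is common. The topic shares say something closer to the Play-Doh picture: this one document is part "sea" and part "law."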
MALLET, a package of Java code, is one of those tools. It’s not hard to run, but you do need to use the command line. For those who aren’t quite ready for that, there’s the Topic Modeling Tool (TMT), which wraps MALLET in a graphical user interface (GUI), meaning you can plug files in and receive output without entering a line of code.
David Newman and Arun Balagopalan, who developed the TMT, have done us all a great service. But they may also have created a monster. The barrier for running the TMT is so low that it’s entirely possible to run a topic modeling test and produce results without having much idea what you’re doing or what the results mean.
So is it still worth doing? I think so. Playing with the results by altering variables and rerunning the test can be a useful way to get your head around what topic modeling is and isn’t. And, as I recently tried to convince my graduate DH class, screwing around with texts — even if you’re not totally sure what you’re doing — can be a surprisingly effective way of getting a new perspective on a body of work. Finally, seeing how many decisions need to be made about texts and variables is a great way to understand that topic modeling is not a way of revealing any objective “truth” about a text; instead, it’s a way of deriving a certain kind of meaning — which still needs to be interpreted and interrogated.
But in order to get any of these benefits from the Topic Modeling Tool, you need to be able to make some sense of your results, which is no easy task. The TMT generates some decidedly cryptic-looking files, and as far as I can tell, there aren’t many resources out there to help you make sense of them.
Once you survey the results of the Topic Modeling Tool, it becomes clear why topic modeling often goes hand-in-hand with visualization. The format of the results makes it difficult for a human being to discern patterns in them, and the files aren’t easy to visualize without doing some custom coding.
But say you’re a non-coder using the Topic Modeling Tool to screw around. You feed it some text, you get some files; now what?
What follows are some very basic ways you might begin looking at the results you’ve generated.
This content is published under the Attribution-Noncommercial-Share Alike 3.0 Unported license.