{"id":1335,"date":"2012-10-29T11:48:23","date_gmt":"2012-10-29T18:48:23","guid":{"rendered":"http:\/\/miriamposner.com\/blog\/?p=1335"},"modified":"2014-03-10T13:48:27","modified_gmt":"2014-03-10T20:48:27","slug":"very-basic-strategies-for-interpreting-results-from-the-topic-modeling-tool","status":"publish","type":"post","link":"https:\/\/miriamposner.com\/blog\/very-basic-strategies-for-interpreting-results-from-the-topic-modeling-tool\/","title":{"rendered":"Very basic strategies for interpreting results from the Topic Modeling Tool"},"content":{"rendered":"<p><em>Written with <a href=\"http:\/\/andrewbenedictwallace.com\/\">Andy Wallace<\/a><\/em><em>, with methods and ideas borrowed from <a href=\"http:\/\/ucla.academia.edu\/ZoeBorovsky\">Zoe Borovsky<\/a><\/em><\/p>\n<figure id=\"attachment_1339\" aria-describedby=\"caption-attachment-1339\" style=\"width: 300px\" class=\"wp-caption alignleft\"><a href=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/play-doh.jpeg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-1339\" title=\"play-doh\" alt=\"Many plastic tubs of Play-Doh, each a different color.\" src=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/play-doh-300x199.jpeg\" width=\"300\" height=\"199\" srcset=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/play-doh-300x199.jpeg 300w, https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/play-doh.jpeg 500w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><figcaption id=\"caption-attachment-1339\" class=\"wp-caption-text\">As Zoe Borovsky brilliantly demonstrated when she visited my DH grad class, topic modeling starts with the assumption that each document is made up of multiple topics \u2014 like lumps of Play-Doh. 
Photo: &#8220;Play-Doh&#8221; by dbrekke.<\/figcaption><\/figure>\n<p>If you&#8217;re reading this, you may know that topic modeling is a method for finding and tracing clusters of words (called &#8220;topics&#8221; in shorthand) in large bodies of texts. Topic modeling has achieved some popularity with digital humanities scholars, partly because it offers some meaningful improvements to simple word-frequency counts, and partly because of the arrival of some relatively easy-to-use tools for topic modeling.<\/p>\n<p><a href=\"http:\/\/mallet.cs.umass.edu\/topics.php\">MALLET<\/a>, a package of Java code, is one of those tools. It&#8217;s <a href=\"http:\/\/programminghistorian.org\/lessons\/topic-modeling-and-mallet\">not hard to run<\/a>, but you do need to use the command line. For those who aren&#8217;t quite ready for that, there&#8217;s the <a href=\"http:\/\/code.google.com\/p\/topic-modeling-tool\/\">Topic Modeling Tool<\/a>, which implements MALLET in a graphical user interface (GUI), meaning you can plug files in and receive output without entering a line of code.<\/p>\n<p><a href=\"http:\/\/www.ics.uci.edu\/~newman\/\">David Newman<\/a> and <a href=\"http:\/\/www.linkedin.com\/pub\/arun-balagopalan\/50\/34b\/672\">Arun Balagopalan<\/a>, who developed the TMT, have done us all a great service. But they may also have created a monster. The barrier for running the TMT is so low that it&#8217;s entirely possible to run a topic modeling test and produce results without having much idea what you&#8217;re doing or what the results mean.<\/p>\n<p>So is it still worth doing? I think so. Playing with the results by altering variables and rerunning the test can be a useful way to get your head around what topic modeling is and isn&#8217;t. 
And, as I recently tried to convince my graduate DH class, <a href=\"http:\/\/www.playingwithhistory.com\/wp-content\/uploads\/2010\/04\/hermeneutics.pdf\">screwing around with texts<\/a> \u2014 even if you&#8217;re not totally sure what you&#8217;re doing \u2014 can be a surprisingly effective way of getting a new perspective on a body of work. Finally, seeing how many decisions need to be made about texts and variables is a great way to understand that topic modeling is not a way of revealing any objective &#8220;truth&#8221; about a text; instead, it&#8217;s a way of deriving a certain kind of meaning \u2014 which still needs to be interpreted and interrogated.<\/p>\n<p>But in order to get any of these benefits from the Topic Modeling Tool, you need to be able to make some sense of your results, which is no easy task. The TMT generates some decidedly cryptic-looking files, and as far as I can tell, there aren&#8217;t many resources out there to help you make sense of them.<\/p>\n<p>Once you survey the results of the Topic Modeling Tool, it becomes clear why topic modeling often goes hand-in-hand with visualization. The format of the results makes it difficult for a human being to discern patterns in them, and the files aren&#8217;t easy to visualize without doing some custom coding.<\/p>\n<p><strong>But say you&#8217;re a non-coder using the Topic Modeling Tool to screw around. You feed it some text, you get some files; now what?<\/strong><\/p>\n<p>What follows are some very basic ways you might begin looking at the results you&#8217;ve generated.<\/p>\n<p><!--more-->For the purposes of demonstration, I&#8217;ve used a set of 3,584 emails that constitute the years 2008\u20132012 of the <a href=\"http:\/\/dhhumanist.org\/\">Humanist listserv<\/a>. We originally downloaded the emails <a href=\"http:\/\/dhhumanist.org\/Archives\/Current\/\">here<\/a> and then divided each volume into individual emails. 
You can find the dataset I used <a href=\"https:\/\/www.dropbox.com\/s\/rcdlh4fusllv6zq\/Humanist%20emails%202008-20012.zip\" target=\"_blank\">here,<\/a> and the files that comprise my TMT results <a href=\"https:\/\/www.dropbox.com\/s\/jawu1nhve667ime\/humanist%20new%2050%20topics.zip\">here<\/a>.<\/p>\n<p>For further reading (and viewing) on topic modeling, I&#8217;ve listed my favorite resources <a href=\"http:\/\/dh201.humanities.ucla.edu\/?page_id=183\">here<\/a>. For more on the Topic Modeling Tool in particular, I recommend <a href=\"http:\/\/clc.yale.edu\/2011\/10\/07\/how-to-do-your-own-topic-modeling\/\">this summary and video<\/a> of a talk by David Newman, along with the accompanying <a href=\"http:\/\/odai.yale.edu\/node\/362\/attachment\">slides<\/a> (PDF).<\/p>\n<div class=\"LessonStep top\">\n<h3 class=\"StepTitle\">What are these files?<\/h3>\n<div class=\"StepImage\"><img loading=\"lazy\" decoding=\"async\" alt=\"media_1351529128083.png\" src=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/media_1351529128083.png\" width=\"540\" height=\"80\" \/><\/div>\n<div class=\"StepInstructions\">\n<p>When you feed a body of text to the TMT, you get two folders: <strong>output_csv<\/strong> and <strong>output_html<\/strong>.<\/p>\n<p><strong>CSV<\/strong> stands for <strong>comma-separated values<\/strong>. The documents inside your <strong>output_csv<\/strong> folder are spreadsheet documents that usually open in something like Excel. 
<strong>HTML<\/strong> documents will open in a web browser.<\/p>\n<p>You have three CSV spreadsheets:<\/p>\n<ul>\n<li><strong>DocsinTopics.csv, <\/strong>which provides a list of topics and shows you which documents they&#8217;re likely to appear in;<\/li>\n<li><strong>Topics_Words.csv, <\/strong>which offers you a numbered list of &#8220;topics&#8221;; and<\/li>\n<li><strong>TopicsinDocs.csv, <\/strong>which provides a list of documents, along with the topics that appear most prominently in each.<\/li>\n<\/ul>\n<p>You have one main HTML document, called <strong>all_topics.html<\/strong>. As we&#8217;ll see, this offers a numbered list of topics, along with a way of drilling down into each topic&#8217;s associated documents. Also inside your <strong>output_html<\/strong> folder are a folder called <strong>Docs<\/strong>, which contains an HTML page for each document; a folder called <strong>Topics<\/strong>, which provides an HTML page for each topic you&#8217;ve generated; and a document called <strong>malletgui.css,<\/strong> which provides your web browser with some instructions for displaying each HTML page.<\/p>\n<\/div>\n<\/div>\n<div class=\"LessonStep top\">\n<h3 class=\"StepTitle\">Start with TopicsinDocs<\/h3>\n<div class=\"StepImage\"><img loading=\"lazy\" decoding=\"async\" alt=\"media_1351530126749.png\" src=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/media_1351530126749.png\" width=\"540\" height=\"93\" \/><\/div>\n<div class=\"StepInstructions\">\n<p>I find it useful to start with the TopicsinDocs.csv file. But the file requires a bit of explanation. For one thing, it&#8217;s helpful to know that each row is meant to be read <em>across<\/em>.<\/p>\n<ul>\n<li>In the image here, <strong>Column A<\/strong> gives each document a number.<\/li>\n<li><strong>Column B<\/strong> provides that document&#8217;s filename. 
(We&#8217;ll make that easier to read in the next step.)<\/li>\n<li><strong>Column C <\/strong>tells you which topic (from a separate numbered list, which we&#8217;ll find in a different document) is represented most prominently in that particular document.<\/li>\n<li><strong>Column D<\/strong> tells you what contribution that topic makes to the document in question. For example, for document 1, we can see that Topic 9 makes a contribution of 0.384 (or 38.4%) to the document&#8217;s contents.<\/li>\n<li><strong>Column E<\/strong> provides the next-most prominent topic in the document (in this case number 27).<\/li>\n<li><strong>Column F<\/strong> tells you what contribution that topic makes to the document&#8217;s contents.<\/li>\n<li><strong>Column G<\/strong> provides the next-most prominent topic &#8230;<\/li>\n<\/ul>\n<p>&#8230; and so on.<\/p>\n<\/div>\n<\/div>\n<div class=\"LessonStep top\">\n<h3 class=\"StepTitle\">Make filenames easier to read<\/h3>\n<div class=\"StepImage\"><img loading=\"lazy\" decoding=\"async\" alt=\"media_1351530387183.png\" src=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/media_1351530387183.png\" width=\"540\" height=\"147\" \/><\/div>\n<div class=\"StepInstructions\">\n<p>Your spreadsheet is trying to be helpful by providing you with the entire path to each document in Column B; that is, by showing you how to navigate to each document. But that&#8217;s hard to read. 
I find it helpful to get rid of most of the path information \u2014 in this case <strong><span style=\"color: #000000;\">\/Users\/miriamposner\/Dropbox\/UCLA\/DH 201\/Topic Modeling\/Datasets\/Humanist emails 2008-20012\/By individual email\/<\/span><\/strong><span style=\"color: #000000;\"> \u2014 by selecting Column B and using Excel&#8217;s <strong>Find and Replace<\/strong> function (in the <strong>Edit<\/strong> menu) to get rid of every occurrence of that path.<\/span><\/p>\n<p><span style=\"color: #000000;\">Once you do this, you should be left with just the filename of each document, which is much easier to read.<\/span><\/p>\n<\/div>\n<\/div>\n<div class=\"LessonStep top\">\n<h3 class=\"StepTitle\">Get a sense of what each row means by making a pie chart.<\/h3>\n<div class=\"StepImage\"><img loading=\"lazy\" decoding=\"async\" alt=\"media_1351531451258.png\" src=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/media_1351531451258.png\" width=\"540\" height=\"235\" \/><\/div>\n<div class=\"StepInstructions\">\n<p>The kind of topic modeling that we&#8217;re doing assumes that every document contains multiple topics. That&#8217;s why each row lists multiple topics for each document. To get our heads around this, let&#8217;s make a simple pie chart. Copy a row from your document \u2014 I&#8217;ll choose <strong>Row 2<\/strong> \u2014 and paste it into a new Excel sheet.<\/p>\n<p>Now you need to reformat this data a little bit so that Excel can make it into a pie chart. Delete the first two columns, which provide the document number and filename. Then, make one column called <strong>Topic<\/strong> and one column called <strong>Contribution<\/strong>. 
Put each topic in the <strong>Topic<\/strong> column, and put each topic&#8217;s contribution right next to it, in the <strong>Contribution<\/strong> column.<\/p>\n<p>The Contribution values may not add up to 1 (meaning 100%), because the TMT is only showing you those contributions beyond a certain threshold. So in the last row, create a topic called <strong>Other<\/strong>. For its contribution, type in this formula: <strong>=1-SUM(B2:B4)<\/strong>. (Replace B2 and B4 with the first and last cells that hold your contribution values.)<\/p>\n<p>Once you&#8217;ve got a grid that looks similar to the one above, highlight the cell values, and from Excel&#8217;s <strong>Charts<\/strong> menu, select <strong>Pie<\/strong>. Now you have a visual representation of the contribution of each topic to the document.<\/p>\n<p>I can&#8217;t do this for each of my documents, because I have more than 3,000 of them. But now I have a better understanding of what each row is telling me.<\/p>\n<\/div>\n<\/div>\n<div class=\"LessonStep top\">\n<h3 class=\"StepTitle\">But what are these topics? (1)<\/h3>\n<div class=\"StepImage\"><img loading=\"lazy\" decoding=\"async\" alt=\"media_1351532131705.png\" src=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/media_1351532131705.png\" width=\"540\" height=\"403\" \/><\/div>\n<div class=\"StepInstructions\">\n<p>Now we sort of understand that each document contains multiple topics in different proportions. But what do the numbered topics refer to?<\/p>\n<p>That information is contained in a different document, and this next step requires a lot of toggling back and forth between our <strong>TopicsinDocs.csv<\/strong> file and our <strong>all_topics.html <\/strong>file.<\/p>\n<p>Start by double-clicking <strong>all_topics.html<\/strong>. It should open in a web browser. 
You&#8217;ve got a list of topics (or, more properly, word clusters).<\/p>\n<p>It&#8217;s important to realize that these topics are listed in no particular order. But the number of each topic corresponds to the topic&#8217;s number in your <strong>TopicsinDocs.csv <\/strong>spreadsheet. It&#8217;s also important to realize that there are more words in each topic than we&#8217;re being shown. By default the Topic Modeling Tool just shows us the top 10 words associated with each topic.<\/p>\n<p>So the topic modeling algorithm thinks these word clusters are meaningful. We have to figure out why.<\/p>\n<p>In the document we were looking at in the previous step, we saw that Topic 9 made the most prominent contribution. So I&#8217;ll click on that topic \u2014 here, it&#8217;s <strong>uk ac london kcl www king http college research centre<\/strong> \u2014 and see what it&#8217;s all about.<\/p>\n<\/div>\n<\/div>\n<div class=\"LessonStep top\">\n<h3 class=\"StepTitle\">But what are these topics? (2)<\/h3>\n<div class=\"StepImage\"><img loading=\"lazy\" decoding=\"async\" alt=\"media_1351532466939.png\" src=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/media_1351532466939.png\" width=\"478\" height=\"362\" \/><\/div>\n<div class=\"StepInstructions\">\n<p>When we click on an individual topic, we get a webpage that shows us another way of looking at it: a list telling us which documents feature that particular topic most prominently. Here, we see that the file called <strong>3276.txt<\/strong> contains the most words associated with this particular topic.<\/p>\n<\/div>\n<\/div>\n<div class=\"LessonStep top\">\n<h3 class=\"StepTitle\">But what are these topics? 
(3)<\/h3>\n<div class=\"StepImage\"><img loading=\"lazy\" decoding=\"async\" alt=\"media_1351532843469.png\" src=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/media_1351532843469.png\" width=\"540\" height=\"278\" \/><\/div>\n<div class=\"StepInstructions\">\n<p>We know that we&#8217;re interested in Topic 9, and we know that it&#8217;s featured most prominently in this list of documents. Now we can try to start figuring out what it refers to. We might start by just taking a guess. I have a basic familiarity with the corpus in question, and my hunch is that this word cluster is associated with King&#8217;s College London, and particularly its research activity. To confirm my hunch, I&#8217;ll drill down into the top-ranked documents in our <strong>Topic<\/strong> list, as shown in the image above.<\/p>\n<p>Unfortunately, my data doesn&#8217;t tell me which words are associated with which topic, but by surveying a number of documents, I should be able to either confirm or question my hunch.<\/p>\n<\/div>\n<\/div>\n<div class=\"LessonStep top\">\n<h3 class=\"StepTitle\">Name your topics<\/h3>\n<div class=\"StepImage\"><img loading=\"lazy\" decoding=\"async\" alt=\"media_1351533044510.png\" src=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/media_1351533044510.png\" width=\"540\" height=\"223\" \/><\/div>\n<div class=\"StepInstructions\">\n<p>I find that it&#8217;s a useful exercise to try to name the topics in my list. Doing so requires me to alternate between reading individual documents and looking for patterns, and it&#8217;s an interesting way to look for clusters of meaning that surprise or confuse me. 
When a topic doesn&#8217;t make sense to me, it&#8217;s a good excuse to investigate!<\/p>\n<p>If I&#8217;m able to name my topics, I&#8217;ll have a quick-and-dirty concordance of sorts to the Humanist emails.<\/p>\n<p>By now it should be abundantly clear that no part of this process is &#8220;scientific&#8221;; it&#8217;s just one way of getting your head around a large body of text. So there&#8217;s no right or wrong topic name, just schemas that do and don&#8217;t help you find interesting features of the text you&#8217;re looking at.<\/p>\n<\/div>\n<\/div>\n<div class=\"LessonStep top\">\n<h3 class=\"StepTitle\">Look for patterns<\/h3>\n<div class=\"StepImage\"><img loading=\"lazy\" decoding=\"async\" alt=\"media_1351534010353.png\" src=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/media_1351534010353.png\" width=\"489\" height=\"245\" \/><\/div>\n<div class=\"StepInstructions\">\n<p>Topic modeling is generally very useful for, say, learning about change over time. And if you&#8217;re running the TMT on a small enough set of documents, you might be able to glean that information from your spreadsheet. But if you, like me, have a bunch of documents, it&#8217;s quite a bit harder for a human being to suss out patterns without visualization tools.<\/p>\n<p>But you can try. Using Excel&#8217;s Sort and Filter functions, you can, for example, show only those documents for which Topic 9 makes the greatest contribution. You can look for documents that seem to have a similar distribution of topics. 
It&#8217;s challenging, but it&#8217;s worth doing, if only to get a better sense of how topic modeling works.<\/p>\n<\/div>\n<\/div>\n<div class=\"LessonStep top\">\n<h3 class=\"StepTitle\">Alter variables and try again<\/h3>\n<div class=\"StepImage\"><img loading=\"lazy\" decoding=\"async\" alt=\"media_1351534826955.png\" src=\"https:\/\/miriamposner.com\/blog\/wp-content\/uploads\/2012\/10\/media_1351534826955.png\" width=\"540\" height=\"90\" \/><\/div>\n<div class=\"StepInstructions\">\n<p>The results of any topic modeling test will change a great deal depending on some important decisions that you make. For one thing, the way you divide up documents makes a big difference to your results. In the preceding example, I chose to divide the Humanist emails into 3,584 separate documents. So the TMT is determining topics by looking for clusters in each individual email.<\/p>\n<p>But instead of thousands of individual emails, I could have chosen to feed the TMT five separate chunks of emails, one chunk per year. In that case, the TMT would look for the clusters of words that characterize each year of emails. The image above is my <strong>TopicsinDocs<\/strong> spreadsheet for this scenario. You can see that it&#8217;s easier to discern gross patterns, but I lose a lot of the nuance I got in the preceding example. And why should a year be the interval at stake here? Might there be a more meaningful way to divide time \u2014 perhaps a series of events that I think might be watershed moments for DH? But then again, would that division just tend to confirm my bias?<\/p>\n<p>In the scenario I described here, I chose to ask the TMT for 50 topics. But I could have chosen to ask for fewer, which would tend to make each topic more broad, or more, which would tend to make each topic more specific.<\/p>\n<p>Which set of variables is best? I&#8217;m not sure. It depends on what interests you. 
For me at the moment, what&#8217;s &#8220;best&#8221; is trying multiple strategies, comparing them to each other, and making a list of things that surprise or confuse me.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Written with Andy Wallace, with methods and ideas borrowed from Zoe Borovsky If you&#8217;re reading this, you may know that topic modeling is a method for finding and tracing clusters of words (called &#8220;topics&#8221; in shorthand) in large bodies of texts. Topic modeling has achieved some popularity with digital humanities scholars, partly because it offers [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,17,5],"tags":[267,265,266,268],"class_list":["post-1335","post","type-post","status-publish","format-standard","hentry","category-digital-humanities","category-research","category-tools","tag-dh201","tag-topic-model","tag-topic-modeling","tag-tutorials"],"_links":{"self":[{"href":"https:\/\/miriamposner.com\/blog\/wp-json\/wp\/v2\/posts\/1335","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/miriamposner.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/miriamposner.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/miriamposner.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/miriamposner.com\/blog\/wp-json\/wp\/v2\/comments?post=1335"}],"version-history":[{"count":12,"href":"https:\/\/miriamposner.com\/blog\/wp-json\/wp\/v2\/posts\/1335\/revisions"}],"predecessor-version":[{"id":1686,"href":"https:\/\/miriamposner.com\/blog\/wp-json\/wp\/v2\/posts\/1335\/revisions\/1686"}],"wp:attachment":[{"href":"https:\/\/miriamposner.com\/blog\/wp-json\/wp\/v2\/media?parent=1335"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/miriamposner.com\/blog
\/wp-json\/wp\/v2\/categories?post=1335"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/miriamposner.com\/blog\/wp-json\/wp\/v2\/tags?post=1335"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}