## DH101

### Introduction to Digital Humanities

#### Month: October 2015

Scatterplot of Winning Numbers in Weekly Florida Lottery 1988-2008.

At first glance, the numbers drawn by the Florida Lottery may seem random, and that is to be expected. Florida certainly cannot allow patterns in its winning numbers, or people would discover the pattern and start winning the lottery constantly. However, despite the intention that the draws be completely random, a very obvious pattern emerges from this data set.

Here are the winning numbers separated by order:

Winning Number #1

Winning Number #2

Winning Number #3

Winning Number #4

Winning Number #5

Winning Number #6

Winning Number #7

While prior to the 21st century all the numbers seemed to be completely randomized, after the year 2000 each number slot (1–7) seems to gravitate around a distribution of approximately 10–15 numbers, much better odds than the original 60-number spread of the old lottery format. Each winning-number cluster is distinct from the others, meaning that you have a much higher probability of guessing each number if you break down the lottery by winning-number slots. This pattern effectively improves your odds of winning from the pre-21st-century odds of 1 in 2,799,360,000,000 (60^7) to a manageable 1 in 35,831,808 (12^7). It is unclear from my data whether this is due to a formatting change of the Lottery form itself, or if this is an interesting pattern that can help Floridians win the lottery. I will test my hypothesis by moving to Florida and playing the lottery every week; if I become a millionaire, I can confirm my hypothesis.
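The odds arithmetic quoted above can be checked with a few lines of Python. This is a sketch that assumes, as the post does, seven independently guessed slots:

```python
# Odds check for the figures quoted above: 7 slots, each drawn from
# 60 values in the old format versus ~12 values in the clustered pattern.
def odds_of_winning(values_per_slot, slots=7):
    """Total combinations when each slot is guessed independently."""
    return values_per_slot ** slots

old_odds = odds_of_winning(60)
new_odds = odds_of_winning(12)
print(f"Old format: 1 in {old_odds:,}")  # 1 in 2,799,360,000,000
print(f"New format: 1 in {new_odds:,}")  # 1 in 35,831,808
```

Both figures in the paragraph above check out.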

This week I attempted to observe patterns in the data FL_Lottery.xls, which contains the winning numbers from the Florida Lottery since 1988; a total of 1,558 rounds of the lottery were recorded. In order to visualize this data, I used RAW and plugged the data into the scatterplot tool. I set ‘date’ as my x-axis, so as to conveniently represent a timeline, and ‘winning numbers’ as my y-axis.

Before analyzing this data, I was sure that the spread of winning numbers would be more or less even within the range of possible numbers. What surprised me is that there is in fact a clustering of winning numbers at the lower values. In particular, after 2004 this clustering significantly increases; indeed, after 2005 there were no winning numbers above 30.
[Scatterplot: x-axis years 1990–2008; y-axis winning numbers 5–50]

I chose to examine the Body Fat dataset. I honestly struggled a lot in creating this graph and analyzing the data. I was confused about which graph would be the best way to display my data, and I found the visualization websites pretty tricky. Anyway, I chose to compare age and weight from the Body Fat dataset using Google Fusion Tables because it seemed the simplest and most efficient site. Before creating this chart, I thought that as you get older you gain weight until you reach a certain age, after which you begin to lose weight. From my data I saw that this was not true; body fat tends to grow with age, but there was no real direct correlation I could see between age and body fat. However, this may not be the full picture, because in working with so much data from the original dataset, a lot of data was lost, and so a complete picture is not painted. From doing this I saw how complex and complicated creating a data visualization is, and how easily one can be manipulated to show the interests of the person creating the graph. However, I found it hard to manipulate anything and will be practicing data visualization a lot more!

James Cameron’s famous movie Titanic (1997) has brought many people to tears when the “unsinkable ship” hit an iceberg and sank to the bottom of the Atlantic Ocean, ending the short-lived romance between Jack and Rose. In the film, many lost their lives, and those who were lucky enough to climb onto the lifeboats were mostly women and children. To examine the proportions of women and men who survived, I have chosen the Titanic Dataset. This dataset has 4 variables: class, age, sex, and survive, and it has 2,201 records. These variables are actually dummy variables, which means that the numerical values are simply codes for categorization. Thus, for class, 0 = crew member, 1 = first class, 2 = second class, and 3 = third class. For age, 1 refers to adult, while 0 refers to child. For sex, 1 = male and 0 = female, and for survive, 1 = yes and 0 = no. According to Nathan Yau, the ingredients of a data visualization are visual cues, coordinate system, scale, and context. As you can see from the data, we can only compare between categorical variables, since the only continuous measure is the number of records. Therefore, for visual cues, I combined the length and color aspects along with a Cartesian coordinate system to create a side-by-side stacked bar chart to best represent the data using Tableau Public.
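The dummy coding described above can be captured as a small lookup table. Here is a hypothetical sketch in Python; the field names are assumptions matching the description in the text, not necessarily the actual column headers in the file:

```python
# Hypothetical codebook for the Titanic dataset's dummy variables,
# as described in the post (field names are assumed).
CODEBOOK = {
    "class":   {0: "crew", 1: "first", 2: "second", 3: "third"},
    "age":     {0: "child", 1: "adult"},
    "sex":     {0: "female", 1: "male"},
    "survive": {0: "no", 1: "yes"},
}

def decode(record):
    """Map each coded field of one record to its human-readable label."""
    return {field: CODEBOOK[field][value] for field, value in record.items()}

print(decode({"class": 3, "age": 1, "sex": 0, "survive": 1}))
# {'class': 'third', 'age': 'adult', 'sex': 'female', 'survive': 'yes'}
```

A table like this is exactly what makes the binary codes readable before any chart is drawn.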

The data has been divided into class, gender, and survival. We can now compare the groups that survived the voyage on Titanic with those that did not in a side-by-side chart. And within these groups, we can also compare between the classes of the passengers. The classes have also been split by gender (blue for males, red for females).

There are more people who did not survive than people who did, as indicated by the average line in each pane. And among those who survived, more crew members survived than any other group of passengers, including the first class. Females seem to have survived at higher rates than males in each of their respective classes. We can also see that there were barely any females among the crew members, which explains the disproportionate number of males who survived in that group.

It is interesting to note that among those who did not survive (0), the crew members (0) and the third-class passengers (3) lost the most lives, while the first-class group lost the fewest people. And the third class lost the most female lives out of all the classes. Perhaps social class played a big part in determining the passengers’ survival.

Because the data is coded as dummy variables, it is hard to see any pattern or relationship in the data without seeing a visual representation of the data. It was helpful that the codes were defined in the dataset, but it is difficult to make meaning out of these binary codes without data visualization. Binary codes and dummy variables are useful when it comes to recording data quickly and efficiently, but data visualization puts context into the data, making it possible for humans to read and understand the data. Charts and graphs, such as this one, show us what the data is trying to tell us. And in this Titanic Dataset, the data shows us the proportions of passengers that have or have not survived based on their age, gender, and class.

I examined the U.S. Population Census dataset. This dataset marks the change in U.S. population from 1790 to now. I used Tableau Public to create a line chart with marked points to visually present the change, as follows:

By looking at the data visualization, it is apparent that there is a rise in population as the years progress. The counts are marked in ten-year increments on the x-axis, and the actual population number is marked on the y-axis. Looking more specifically, we can see that there is a slow, gradual increase in the first fifty or so years, and then the increase becomes more rapid. The overall shape of the graph resembles an exponential curve, in mathematical terms. This is something raw data in a spreadsheet would not have been able to tell us: we could see exactly how much each increase is, but the overall shape would not have been visible without a graph.
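One way to test the "exponential-looking" impression is to check whether the logarithm of the population grows linearly over time. Here is a small sketch; the rounded population figures below are illustrative stand-ins rather than values read from the dataset:

```python
import math

# Rounded decennial population figures (illustrative; the real dataset has
# every census year from 1790 to 2010).
years      = [1790, 1850, 1910, 1970, 2010]
population = [3.9e6, 23.2e6, 92.2e6, 203.2e6, 308.7e6]

# For exactly exponential growth, log(population) is linear in year,
# so the slope between consecutive points would be constant.
logs = [math.log(p) for p in population]
slopes = [(logs[i + 1] - logs[i]) / (years[i + 1] - years[i])
          for i in range(len(years) - 1)]
print(slopes)  # a shrinking slope means growth is slowing below exponential
```

On figures like these the slope shrinks over time: the absolute jumps get larger even as the percentage growth rate declines, which matches the "slow gradual increase, then more rapid" shape described above.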

One thing this visualization fails to tell is the context behind the numbers. It is the job of the viewer, equipped with humanities knowledge, to supply the reasoning behind the dataset. Surely there is a reason for the increasing trend in population. Some historical background is required to decipher the “why” behind the numbers. Why does the data start at 1790? How was the data collected?

One aspect I am personally very curious to know is whether the census counted slaves. If some states counted them and some didn’t, there would also be discrepancies in the numbers. The years around 1840 to 1860 are where we see the beginning of large jumps in population, which was around the time of the Civil War. Since then, the jumps in population seem to get larger and larger. In more recent years, I’d also like to know whether undocumented Americans are counted in the census. With census populations, it is very difficult, if not impossible, to get an exact number, because we cannot keep track of every single person in one place.

The dataset is on the weights of 20-ounce boxes of Chocolate Frosted Sugar Bombs breakfast cereal, with 10,000 data entries. Immediately, what came to mind is the range of the data and where the average falls within a sample of 10,000. With that said, the scatter plot helps to delineate where most of the entries fall, through the density of the plotted dots. It can be seen immediately that the range is between 19.924 and 21. Although the dataset only highlights one variable – weight – it is extremely useful for taking a glance at whether or not this dataset falls within certain requirements, like a range for weight.

This scatter plot puts 10,000 data entries into one graphic, which helps us understand where the majority of the data is centered. This cannot be done easily by navigating through 10,000 data entries, so a scatter plot helps to plot the individual weights as a whole to give a more immediate and complete picture. Additionally, hovering over the data points where they start to condense gives us the range where most of the data sits.
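The range-and-center reading described above is easy to reproduce in code. The sketch below simulates stand-in weights inside the reported range (a uniform distribution is assumed purely for illustration; the real data's distribution may differ):

```python
import random

# 10,000 simulated box weights inside the 19.924-21.0 oz range reported above.
random.seed(42)
weights = [random.uniform(19.924, 21.0) for _ in range(10_000)]

low, high = min(weights), max(weights)
mean = sum(weights) / len(weights)
print(f"range: {low:.3f}-{high:.3f} oz, mean: {mean:.3f} oz")
```

Three summary numbers replace 10,000 rows, which is essentially what the scatter plot does visually.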

The visualization can be seen here.

The use of data visualization aids the viewer in identifying patterns that may not be recognizable in dataset form. This week, I decided to look at the “Body Fat” dataset, which is a compilation of data, from the Journal of Statistics Education website, for 252 men regarding their percent bodyfat measurements and other body size measurements.

To visualize the data, I chose to use RAW. Using its scatter plot option, I set “weight” as the x-axis, “bodyfat” as the y-axis, and “IDNO” and “age” as labels (separated by a comma and denoted on the graph as “IDNO, age”).

The result looks like this:

Previously, I predicted that the more a person weighs, the higher their bodyfat percentage. This graph, which shows an upward, increasing trend, generally supports this hypothesis. The data visualization does, however, indicate outliers and other extremities that you may not be able to see in a spreadsheet of the data.

Next, keeping the y-axis the same (“bodyfat”), I set the x-axis to “height”:

Interestingly enough, although you can observe the proportion of body fat to height, you can also pull other interesting information off this visualization. For example, the heights for these men seem to fall mostly between 65 and 75 inches (5’5”–6’3”, with a median around 5’8”). This information would have been a bit more difficult to obtain from just looking at the dataset.

To make it more complicated, I decided to change the x-axis to “neck,” leave the y-axis as “bodyfat,” color-code by “age,” and leave “IDNO” as the label. The colors correspond to age groups of roughly five to ten years each: 22–29 (red), 30–39 (purple), 40–49 (green), 50–58 (hot pink), 60–69 (orange), and 70, 72, 74, and 81 (blue).

From this graph you can say that as the percentage of bodyfat increases, so do the neck measurements. Looking at the 40-49 yo block (green), I can see that this age group is more widespread, a bit more dynamic and larger in numbers compared to the other age groups.

If I continue to change the x-axis to the different body measurements, I can observe the different proportions of each body part to percent body fat. For the most part, the trend continues to be increasing, so the measurement of the body part increases as the percent body fat increases. This trend supports the purpose of this dataset as a reference for estimating body fat percentage from specific body measurements: “the goal is a regression model that will allow accurate estimation of percent body fat, given easily obtainable body measurements.”
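The dataset's stated goal, a regression model that estimates percent body fat from easy measurements, can be sketched with ordinary least squares. The rows below are made-up numbers for illustration, not values from the actual file:

```python
import numpy as np

# Made-up (weight in lb, neck in cm) measurements and bodyfat percentages.
X = np.array([
    [150.0, 36.0],
    [175.0, 38.2],
    [200.0, 39.7],
    [225.0, 42.1],
])
y = np.array([12.0, 18.0, 24.0, 30.0])  # percent body fat

# Add an intercept column and fit by ordinary least squares.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

predicted = A @ coef
print("coefficients (intercept, weight, neck):", coef)
print("predictions:", predicted)
```

With the real file, one would fit all of the measurement columns the same way and inspect the residuals to see which measurements actually help.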

(Apologies for the blurry snippets of the graphs; trying to embed the images kept causing my browser to crash.)

For this blog I chose to look at the U.S. population dataset. The data was taken from the United States decennial census. The dataset includes information on the U.S. population beginning in 1790 and ending in 2010. To build a visualization of this dataset I chose to work with RAW. It was the most interesting to me in class and seemed the most user-friendly. I still do not feel very confident in my data visualization skills, so I thought I would start somewhere more straightforward and simple to use.

Yau’s text begins by exploring the reasoning behind why you choose to make your visualization a certain way. Obviously the graphing method you choose is a huge factor in how your visualization turns out, but even smaller modifications, such as size or color, can have a great effect on your outcome. In order to make your visualization the best it can be, all of these elements should be appropriate for the data you are working with.

For my data I explored several different formats, including the Delaunay triangulation option, a scatter plot, and a cluster dendrogram. Though I used the same dataset for each visualization, each of the graphs was able to tell me different information about the data.

The Delaunay triangulation format was not helpful at all for the assignment. I could not figure out how to make any of the text appear, and I’m assuming that this template does not include any numerical information; it is simply a way to represent data purely visually.

The scatter plot did an adequate job of representing the data. With more practice I believe I could achieve a more appropriate sizing for the plot, which would improve the readability of the data.

The cluster dendrogram is also very simple to read; however, it had more room for data than the dataset I chose actually had. This format seems to be more effective with three or more different columns of data.

To be honest, the original Excel sheet that my data came on was a better way to visualize it than the graphs I created in RAW. The dataset I chose was very straightforward and did not really need to be complicated by graphs and charts to be understood easily. The only other graphic method that I believe would help visualize this information is a simple bar graph, and, depending on what you were using the data for, a tool that illustrated the percent increase of the population throughout the years.

From this exercise I have learned that sometimes you can overcomplicate your dataset by overusing visualization tools. There is a time and place for programs such as RAW, and you must be observant as to when it is appropriate to use them.

This week, I took Dr. John Rasp’s dataset on bodyfat percentage, sampled from 252 men, along with additional measurements of body size including chest, abdomen, arms, legs, neck, etc. The documentation states that body fat percentage is “normally measured by weighing the person underwater – a cumbersome procedure.” The dataset is aimed at estimating body fat percentage from these alternative body measurements, which are much easier to collect. The data were taken from the Journal of Statistics Education website.

I utilized the free online data visualization tool Plot.ly to create a scatter plot showing the relation between body fat percentage and body weight.

This visualization allows the viewer not only to see the data that was collected, but also to see a relationship between the selected variables – something the table alone does not show. The x-axis shows the body weight measurements of the sample of 252 men, and the y-axis shows their calculated body fat percentage, based on measurements collected and calculated by the researchers. I made this into a scatter plot, as no other plot type seemed effective in accurately representing the collected data. By adding a best-fit line, the plot indicates a positive relationship between body fat percentage and body weight. I also went ahead and made another scatter plot to visualize the relationship between the estimated body fat percentage and thigh circumference (in centimeters), one of the many body measurements.

Again, the plot shows a positive relationship between the variables.

All in all, this data visualization allows users to clearly see positive or negative relationships between selected variables, or whether a relationship is present at all. What I learned from this exercise is that one should be mindful of which variables to represent on each axis and carefully evaluate whether or not one’s selections are relevant to what one is trying to observe. I ran into this problem a few times and ended up with plots presenting contradictory relationships.

The dataset I selected for my visualization project is a set of two sheets containing daily closing prices and percent returns for Amazon and Coca Cola from 2005 until 2014.  Initially, I had considered using RAW to visualize my data interactively; however, after trying various approaches, I realized RAW is more suitable for visualizing datasets with multiple columns of attributes, whereas my dataset was more time-centered, featuring only two attributes. Thus, I decided that Tableau would be a better tool since it allows for better time-centered visualization and juxtaposition of multiple charts.

First, let me start by explaining my dataset: there are two sheets – one for Amazon and one for Coca Cola. Each has three columns: Date (by day), Closing Price (in dollars), and Daily Return (as a percentage). In terms of interpretation and analysis, we can either look at relationships within a single company, or at how these attributes compare between the two companies across time. I decided to examine whether there are any patterns, first between Amazon’s and Coca Cola’s daily Closing Prices, and then between their Daily Returns.

I changed the date from “year” to “month”, and set the Closing Prices to display the “average” value. The blue is AMZN, while red is KO. I also added the Label option to make it easier to read points on the graph. From this visualization, the average daily Closing Prices of both companies seem to increase from 2005 to 2014. Note, however, that each company has a different price scale: for Amazon the range is from about \$44 to \$407, while for Coca Cola it’s \$16 – \$45. This difference could influence us to see patterns that are not really there. In this case, however, it is valid to interpret the data in this way because the pattern is consistent for each company within its own price range. Also, it is noticeable that somewhere toward the end of 2009, prices for both had decreased.
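Tableau does the month-level averaging internally; a standard-library sketch of the same step looks like this (the dates and prices below are made up for illustration):

```python
from collections import defaultdict

# Daily closing prices keyed by ISO date (illustrative values).
rows = [
    ("2005-01-03", 44.52),
    ("2005-01-04", 42.14),
    ("2005-02-01", 41.80),
    ("2005-02-02", 43.01),
]

by_month = defaultdict(list)
for date, close in rows:
    by_month[date[:7]].append(close)  # group on the "YYYY-MM" prefix

monthly_avg = {month: sum(prices) / len(prices)
               for month, prices in by_month.items()}
print(monthly_avg)
```

Each month's point on the chart is just this mean of that month's daily closes.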

Next, I created a similar visualization to compare Daily Percentage Returns for Amazon and Coca Cola.

Again, I set the Date to display by month, and the percentages to show by average. Notice, again, that the percentage ranges are not the same for each company: for Amazon it’s between -2% and 3%, and for Coca Cola it’s between -0.7% and 0.7%. And, just like the previous chart, we can interpret this one relatively, in terms of general patterns. Here, it looks like both Amazon and Coca Cola experienced a significant decrease sometime near the end of 2008. Recalling that the first chart illustrated both companies’ prices decreasing near 2009, it is possible to infer some connection between the two events.

Since I am not knowledgeable about stock prices, I am aware that I may be making inaccurate inferences about relationships. However, by completing this visualization project, I realized that presenting your columns and rows visually allows you to see relationships that could hardly be spotted before. Furthermore, it allows you to make comparisons across multiple sheets, and to view each attribute more closely (e.g. day by day) or more broadly (year by year).