Course Reflection

This course was not at all what I expected. I was unfamiliar with what exactly was meant by “digital media,” and I had never used RStudio. However, I have always been someone who picks things up pretty quickly, so I wasn’t too worried. I was interested to see what this new software could teach me about literature, since it was an approach to examining texts that I had never thought about. The beginning was fairly easy: finding the files and converting them to text files was not hard to do. Downloading RStudio and understanding how to work it was the real challenge for me. It forced me to become better acquainted with the software on my Mac, such as learning how to use Automator, and in this sense I feel that I benefited from the class.

Learning the logic behind the methods we were using was interesting because a lot of the methods required us to analyze and hypothesize. However, I do wish we could have all worked on one standard corpus to learn the basic steps first and then gone on to do individual assignments, since during classwork it would have been easier to follow along and help each other. Still, I am glad that I took this class because it has helped me think of words as data, and that is something we can potentially use for research. It’s also important because, as our society develops, a lot of jobs are tech based, and it’s kind of a requirement, even for ENGLISH majors, to be comfortable with analyzing data.

Network Analysis

Of all the methods, network analysis is by far the most confusing. The graph is a shape that looks like nothing I have ever seen before, but within its shape it is saying something about the connections between the nodes. I think that for my corpus it might not be the most useful method because I already know who is in conversation with whom and how the files will more or less split along party lines. However, it is a method that is cool to look at.

I started out my network analysis journey using Stanford’s Palladio website to generate a graph. I did this by running stylo in RStudio to get a CSV file I could use on the website. The CSV file showed that all the edges were undirected. I figured out that undirected means two things are connected mutually, in both directions: the nodes are connected by edges that do not have a direction. The nodes are the text files, and the size of a node corresponds to how much importance we should give it, based on how connected it is to the other nodes.
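To make the idea of “how connected a node is” concrete, here is a minimal Python sketch that counts the neighbours of each node in an undirected edge list. The file names and edges are made up for illustration, and it is only an assumption that the node sizes in the graph follow this kind of count (often called degree).

```python
from collections import defaultdict


def degree_by_node(edges):
    """Count how many distinct neighbours each node has in an undirected graph."""
    neighbours = defaultdict(set)
    for source, target in edges:
        # Undirected: the connection counts for both endpoints.
        neighbours[source].add(target)
        neighbours[target].add(source)
    return {node: len(linked) for node, linked in neighbours.items()}


# Hypothetical edge list; these names do not come from the real CSV file.
edges = [
    ("Obama_2013", "Trump_2017"),
    ("Obama_2013", "Clinton_1996"),
    ("Trump_2017", "Bush_2001"),
    ("Clinton_1996", "Obama_2013"),  # the same edge written the other way round
]
degrees = degree_by_node(edges)
```

Because the edges have no direction, the last pair does not add a new connection: it is the same link between the same two files.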

The Palladio website gave me a network graph that looked a bit strange because it was split into two distinct groups. One held most of the text files, and the other was just a few select Bill Clinton files from 1996 to 2000, so that it looked like Bill Clinton was in conversation with himself.

The other network graph is a very straight, vertical-looking shape that has Republicans on one side and Democrats on the other, with a few exceptions. The two sides are bridged by Obama’s 2013 speech and Trump’s 2017 speech. These are the biggest nodes we see, and they have the most undirected edges on the graph. Between them are three speeches: Trump’s 2018 and 2019 speeches and Bush’s 2001 speech. I found it interesting that Obama’s 2013 and Trump’s 2017 speeches would be the ones with the most connections and that Trump’s speeches would be linking anything in the graph.

The next network analysis image came directly from RStudio, where I ran my corpus using stylo. The graph image I got from RStudio was different from the one I got from the Palladio website: there were not two separate networks. The shape also looked different, something along the lines of a puffy triangle. Clinton was on one side, Obama on another, and W. Bush on the last one. They kept mostly to themselves, and the space between them was taken up by Trump speeches and some H.W. Bush speeches.

The data I used to make both graphs was the same, but I got different results from the two methods. I do not know why, but I do see that they were similar in some ways. Obama’s 2013 speech was still a big node with many edges running to other nodes, as was Trump’s 2017 speech, and Clinton’s 1996-2000 speeches were connected to other nodes but were still grouped closely together.

Something else I noticed on the second graph is that Clinton’s speeches are closer to the Republican side of the graph than they are to Obama’s. Also, the Republican side and the Clinton side have a lot more dark edges than the Obama side of the triangle. I do not quite understand what this means, but I know that a good bet for me would be to look at Obama’s 2013 speech more closely because it is the one that seems to have a connection with the rest of the sides of the graph. Something else to look at would be Trump’s 2018 speech and H.W. Bush’s 1993 speech; in the second graph they also looked to be important nodes that connect outlying nodes to other parts of the network. At the center of the network in the second graph were two H.W. Bush speeches, one Clinton speech, and one Trump speech.


Contrastive Analysis

This was probably one of the harder things we had to do this semester. Once I got it, I didn’t find it so difficult, but at the beginning I just didn’t get it. I started off by realizing that I would probably need more text files for my data set. So I went from looking only at State of the Union speeches to adding one other speech from each year of each presidency. For example, for Barack Obama’s eight years as president I would have all his State of the Union speeches and one other speech from each of those eight years, for a total of 16 speeches. I did the same for all the other presidents in my data set (H.W. Bush, Clinton, W. Bush, Obama, Trump). Once that was done, I separated my new data set into two folders, primary_set and secondary_set: the primary set was made up of the Democratic presidents and the secondary set of the Republican presidents.

I ran oppose to get a list of the words the primary set preferred and avoided, and the Craig’s Zeta graph of those words was interesting. On the preferred side of the chart we saw words like businesses, change, college, health, why, what, job, know, do. On the avoided side we saw words like great, freedom, yet, confront, nations, friends, allies, nation, united, free, military. It was interesting to see that the Democratic side had more question words, like who, what, when, where, why. This leads me to think that perhaps they are more prone to asking questions before making a final decision. But then again, I could be reaching, and my analysis might be biased. The Republican side had words that could be related to war: the words september and terror were on the avoided side. The word men was also on the avoided side but not on the preferred side; this leads me to think that the Republican side was much more.
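The idea behind a preferred/avoided word list can be sketched in a few lines of Python. This is a simplified Zeta-style score, not the exact calculation that stylo’s oppose function performs: for each word it takes the share of primary segments that contain the word and subtracts the share of secondary segments that contain it. The function name and the tiny example segments are hypothetical.

```python
def zeta_style_scores(primary_segments, secondary_segments):
    """Score each word from -1 (avoided by the primary set) to 1 (preferred by it)."""
    primary_sets = [set(seg.lower().split()) for seg in primary_segments]
    secondary_sets = [set(seg.lower().split()) for seg in secondary_segments]
    vocabulary = set().union(*primary_sets, *secondary_sets)
    scores = {}
    for word in vocabulary:
        # Share of segments in each set that contain the word at least once.
        in_primary = sum(word in s for s in primary_sets) / len(primary_sets)
        in_secondary = sum(word in s for s in secondary_sets) / len(secondary_sets)
        scores[word] = in_primary - in_secondary
    return scores


# Made-up segments standing in for slices of the two folders.
primary = ["health college jobs", "health change why"]
secondary = ["freedom nation military", "freedom allies jobs"]
scores = zeta_style_scores(primary, secondary)
```

A word that appears in every primary segment and no secondary segment scores 1, the reverse scores -1, and a word used equally often by both sides lands at 0, which is why shared words do not show up on either list.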

Next I ran oppose one more time, but this time I changed it from words to markers. I got a graph with two groups: the Republicans and Democrats were in two different oval-looking shapes. Where they overlapped would be the similarities. Surprisingly enough, Trump was closer to the part of the chart where the two groups overlapped. W. Bush was closer to the end of the Republican sphere, and most of Obama’s speeches were at the end of the Democratic sphere.

Then I ran oppose once more, this time with a test set folder. The test set was made up entirely of Hillary Clinton speeches. I tried to have at least one for every year of her career extending back to the 90s, and I tried to pick speeches that she had given on diverse topics; however, I am human, so once again I’m probably biased towards some speeches more than others. I ran oppose with my new test set and got, as I’d predicted, a Craig’s Zeta graph showing that Hillary was more similar to the Democrats than to the Republican camp. However, her speeches were not as far left as some of Obama’s and her husband Bill Clinton’s. She stayed closer to the middle.

Finally, I wanted to make a PCA of the wordlists of preferred and avoided words alongside Hillary’s data set, to see how close her speeches would come to these wordlists. For the Democratic PCA her data was inconclusive: I saw some texts that were close to zero, but not all were. Her speeches were spread out from zero across the four quadrants, but never too far. The Republican wordlist showed me something very different. It’s funny because when I first renamed the preferred and avoided word files to wordlist, I got the two confused: I thought the preferred words were the avoided words, so I named the avoided_words as wordlist. The PCA showed me most of Hillary’s texts in the second quadrant and very close to zero, and a couple in the first and fourth, not as close to zero. Most of the speeches that clustered together were about national security, the war, and international relations. The outliers were on women’s topics, and this is when it hit me that I had mixed up the wordlists, because I remembered that one of the preferred Republican words had been men. Still, it brought up an interesting question: could Hillary’s views on the war possibly line up more with the Republicans than with the Democrats?


My data set consists of the State of the Union addresses of the last five presidents: George H.W. Bush, Bill Clinton, George W. Bush, Barack Obama, and Donald Trump. That makes two Democratic presidents and three Republican. I ran my data set through RStudio and used stylo to see what I’d get. What I got first was a cluster analysis, because that’s what I asked the stylo package for; under features, MFW was set to 100 for both the minimum and maximum, with an increment of 100. The cluster analysis came back with some interesting results. More or less, the texts were grouped by president, and they stayed in their own little groupings. However, Barack Obama’s 2009 and 2010 addresses, Bill Clinton’s 1993 address, and George H.W. Bush’s 1992 address are shown pretty close to each other. Obama’s 2009 and 2010 addresses are more similar to Bill Clinton’s 1993 and George H.W. Bush’s 1992 addresses than they are to Obama’s other State of the Union speeches. In fact, Obama’s next two addresses, 2011 and 2012, are the least similar to any of his other addresses, while the State of the Union addresses he gave between 2013 and 2016 have the most in common.
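What “MFW” means for a comparison like this can be shown with a small Python sketch: find the most frequent words across the whole corpus, then describe each text by how often it uses them. The texts below are invented, and plain Manhattan distance is used here only as a stand-in; stylo applies its own distance measure before it clusters the texts.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def mfw_table(texts, n_words):
    """Relative frequency of the corpus-wide top n_words in each text."""
    tokens = {name: tokenize(body) for name, body in texts.items()}
    corpus_counts = Counter(tok for toks in tokens.values() for tok in toks)
    mfw = [word for word, _ in corpus_counts.most_common(n_words)]
    table = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        table[name] = [counts[w] / len(toks) for w in mfw]
    return mfw, table


def manhattan(a, b):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical mini-corpus; real addresses are thousands of words long.
texts = {
    "A": "the economy and the jobs",
    "B": "the war and the terror",
    "C": "jobs jobs the economy",
}
mfw, table = mfw_table(texts, n_words=2)
distance_ab = manhattan(table["A"], table["B"])
distance_ac = manhattan(table["A"], table["C"])
```

Texts with a smaller distance between their frequency profiles end up on nearby branches, which is the sense in which two addresses are “close” in the cluster analysis.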

Another interesting thing to note is that the cluster analysis has two big groups. One is mostly Bill Clinton and Barack Obama, with the exception of two texts, Clinton’s 1993 and Bush’s 1992 addresses. The other is all George H.W. Bush, George W. Bush, and Donald Trump. The two groups look to be split between Republicans and Democrats. The two texts that have the least in common are Obama’s 2009 address and W. Bush’s 2001 address. Both speeches were their first addresses to a joint session of Congress, which are considered their unofficial first-term State of the Union addresses. Bush’s 2001 speech starts off with a message of hope for America, while Obama’s speech is centered around recovery from the 2008 recession and getting the economy back on track. Something else that is interesting is that Bush’s 2002 address, given only a few months after 9/11, is closest to Trump’s first address to Congress in 2017.

Next I did a bootstrap consensus analysis in stylo. The parameters were 1000 MFW at 100-word increments. It’s funny because nothing changed when I changed the MFW from 100 to 1000 or from 1000 to 1500. When I increased it to 2000 MFW, the only change was that Clinton’s 1994 and 1995 and Obama’s 2011 and 2012 addresses moved from the right side to the left side, but they were still branched off together.

This bootstrap consensus tree had what looked like three to four major branches, most of them made up mainly of the addresses of one or two presidents. One branch was all George W. Bush and Donald Trump. Another branch was a mix of Obama’s 2009 and 2010, Clinton’s 1993, George H.W. Bush’s 1992, and W. Bush’s 2001 addresses. In the bootstrap consensus, Bush’s 2001 address was not as dissimilar to Obama’s 2009 address. The farthest texts from each other were instead Clinton’s 1998 speech and Bush’s 2003 speech. Clinton’s 1998 address is his sixth annual address, while Bush’s is his third.

What mostly stayed the same in both the cluster analysis and the bootstrap consensus was that Obama’s 2009 and 2010, H.W. Bush’s 1992, and Clinton’s 1993 addresses stayed close to each other. Obama’s 2009 and 2010 addresses are his first and second, H.W. Bush’s is his fourth and last, and Clinton’s is his first. I have a feeling that the similarities between these texts could come from the topic of the economy. That would also explain why Bush’s 2001 branch is close to these four texts. However, this makes me wonder again why the 2001 and 2009 addresses are so far from each other in the cluster analysis chart.

The other branches contain Clinton and Obama close to each other and Trump and Bush close to each other. George H.W. Bush’s addresses branch off not too far from Bill Clinton’s and Barack Obama’s, but they are not as close to Trump and Bush as they are shown in the cluster analysis.

Overall, some things stayed the same and some things moved around from cluster analysis to bootstrap consensus. For me it was useful to look at both because they helped me see patterns in the topics talked about.




Topic Modeling

Topic modeling was not easy for me at first because I did not really understand what I was looking at. The pie charts coupled with the Excel data sheets helped, though, and it made more sense once I realized that the contributions corresponded to the topics and that I could use the Excel sheet with the topic words to figure out what each topic was. It also meant that when looking at a pie chart I now knew what I was looking at.

The contribution for topic 1 takes up the most space in the pie chart; it’s in a dark red color. Its topic words are america, world, economy, years, workers, home, end, americans, protect, states. From looking at just the topic words, I’m guessing this could be about ensuring that Americans can keep up with the world economy and have jobs back here at home. This is just a wild guess, of course. It’s interesting to note that in Barack Obama’s 2012 speech topic 1 takes up almost half the pie chart. Topic 1 is also a big part of Obama’s 2011 speech and Donald Trump’s 2019 speech.

Topic 2 is the second largest contributor to the pie charts. This one was harder to guess the topic for. The topic words for topic 2 are americans, year, system, family, states, president, put, small, set, stop. I’m guessing it’s a topic that has something to do with the president doing something for American families in the states, something that he will either stop or set in motion, or maybe set standards for.

Topic 3’s words were american, congress, support, today, tax, long, tonight, work, state, strong. Topic 3 was easier to decipher. I believe it pertains to the president talking directly to Congress on the night of the address, to convince them to support some tax law designed to potentially create more jobs for the states and, in doing so, make the state stronger.

Topic 4’s words were world, health, country, budget, people, children, good, plan, peace, crime. This topic could be about either world peace or keeping children safe and healthy. It could be referring to crime statistics or to our healthcare system as compared to other countries. Either way, it seems like there is a plan involved that’s supposed to be good for the country.

Topic 5’s words were years, make, children, tonight, schools, child, lives, opportunity, ago, parents. This topic seems to be pretty straightforward: it is talking about children, parents, and school. It’s an education-based topic, and it speaks about opportunity and lives. It could be referring to giving children a better education and more opportunities to succeed in their lives.

Topic 6’s words were jobs, ve, energy, businesses, country, make, companies, made, business, american. I was confused at first as to what “ve” was and thought that there might be something wrong with my data set, but when I checked the text files the word “ve” did not appear on its own. I then realized that the computer was reading contracted words as two separate words: we’ve was “we” and “ve,” they’ve was “they” and “ve,” and so on. I decided to treat “ve” as “have.” Have did not end up being very important in figuring out the topic, though. I can tell this topic is about business in the country, energy, and creating new jobs in America. The topic could be about creating new jobs in America through energy-based businesses.

Topic 7’s words are people, work, year, care, time, families, give, working, government, jobs. They seem to be talking about work and the government providing jobs for families.

Topic 8’s words are applause, american, great, tonight, country, time, united, congress, decades, states. This topic is just about recognizing how great America is, so let’s give it up for America. This would explain why “applause” is one of the words that shows up.

Topic 9 is referring to the Iraq war. Its words are nation, freedom, iraq, government, terrorists, terror, free, fight, citizens, tax. It’s interesting that the word “tax” would be included in this topic. I’d like to know what taxes have to do with the war; maybe it’s that our taxes are going towards funding the war.

Topic 10’s words are america, security, century, americans, hope, future, ve, social, act, continue. This one seems to be all about the future of America and keeping hope and safety for America.

The topics were all more or less what I’d expect the president to talk about. If we look at the pie charts, we can tell which president gave priority to which topics. The funny one, I find, is Donald Trump’s pie chart. He has almost every topic in there. If you look at his 2017 and 2018 pie charts, all the topics, with the exception of topic 1, take up about the same amount of space in the pie. I wonder what this says about him: is he simply talking about everything and anything?
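The stray “ve” that shows up in topics 6 and 10 can be reproduced with a short Python sketch. It assumes the topic modeling tool split words on anything that is not a letter, which is a guess about its behavior rather than something checked in its settings; the second function shows one hypothetical way to keep contractions whole.

```python
import re


def letters_only_tokens(text):
    """Split on anything that is not a letter, so apostrophes break words apart."""
    return re.findall(r"[a-z]+", text.lower())


def apostrophe_aware_tokens(text):
    """Keep straight or curly apostrophes inside a word, so contractions stay whole."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


sample = "We've made progress, and they’ve noticed."
split_tokens = letters_only_tokens(sample)
kept_tokens = apostrophe_aware_tokens(sample)
```

With the first approach every “we’ve” and “they’ve” adds one more “ve” to the word counts, which would explain how a word that never appears on its own in the text files can still become a topic word.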



My data set is composed of State of the Union addresses from 1989 to 2018. I felt that this would be tricky because it means that there are certain topics that will always be addressed. The president or whoever is giving the speech would have to talk about things like the budget, war, education, America, etc. I was afraid that because of this the data would more or less look the same. My prediction was not totally correct. The PCA showed me that the most frequently used words were American, people, new, year, work, make, world. It also showed me the proximity these words had to other words and to each other. It was interesting to note that at 125 terms, the words people and health were on top of each other in PCA #1, which suggests that whenever the American people were being talked about, their health was also being mentioned.

Also, in the graphs the word “tonight” kept appearing. I was confused as to why this was a frequently used word, but then I realized that the word “congress” appears just as many times and is relatively close to the word “tonight” on the correspondence analysis #1 graph, using 160 terms. It made sense that these words would be so close to each other, considering that the speaker of the State of the Union address is speaking to Congress during the speech. This address is not only for the people watching at home but also for the representatives. This helped me realize that this tool in Voyant could be useful in finding the audience of a text. Before this, I had been having trouble seeing the usefulness of Voyant and understanding how to read the data.

Next I looked at another PCA, this time at 122 terms. In PCA #2 the words government, social, security, freedom, and Iraq were all clustering together. I assume that this cluster is about the war in Iraq. Possibly it could be referring to the government needing to preserve our security and freedom in the war against Iraq. I found it interesting that these were clustering together; in a prior graph, weapons and freedom were also clustering together.

The words work, make, together, time, challenge also appear together in a cluster. This makes me think that the topic of discussion here is something relating to coming together to make something or to work towards something. At first I thought it had something to do with jobs, but jobs does not cluster together with these words. Instead, jobs is more towards the bottom of PCA #2, and it clusters together with energy, businesses, homes. This is the cluster that is concerned with jobs, and it’s looking to businesses and energy for these jobs. I do not know where homes comes in, but this is my guess for now as to what it could mean. Maybe homes reflects a concern for both homes and jobs in the economy.

Something I found particularly interesting was that the term education was not among the big terms. It had been used 163 times. I found it interesting that education wouldn’t be a more commonly used term, considering children had been used 307 times. It makes me think that maybe the State of the Union address doesn’t particularly concern itself with the education of children.

Lastly I looked at document similarity. On the y-axis the speeches were spread out everywhere. It looks like the y-axis pertains to the time period of these speeches. They were not perfectly organized by time period, but the older presidents were towards the bottom and the newer presidents were higher up. Obama was at the very top of the graph. On the x-axis the data grouped itself mostly to the right. With the exception of Obama’s 2010 speech, Bush’s 2007 and 2008 speeches, and Trump’s 2019 speech, they are all more towards the right of the graph. Even more interesting is that Obama’s 2010 speech and Trump’s 2019 speech are in the same quadrant. This leads me to think that there must be something similar in these speeches. As to what the component for the x-axis could be, I could not decipher that.

Overall I observed that the older presidents like Clinton, George W. Bush, and George H.W. Bush clustered close together in the document similarity graph. It could have to do with the fact that most of these speeches were written pretty close to each other in time and that the war on terror had not begun yet.


Data Set Reflection

For my data set I started out being unsure of what I would make it about. At first I wanted to do a data set based on short stories. I thought it would be interesting to see what they have in common or what pops out about short stories that I hadn’t noticed before. But upon further thinking and discussing the idea, I realized it would be better to narrow my range from short stories in general to short stories written during a certain time period. The time period I picked was the 1950s. So with this new limit on myself I set out to find short stories written in the 1950s. It was here that I ran into another problem. How was I to choose which short stories from the 1950s to include? I would want my data set to be representative of a wide range of authors and genres in the 1950s. But I came across mostly short stories that had to do with some kind of suburban lifestyle and a fear of otherness. I became disinterested. Short stories from the 1950s were not for me.

The next step was to think about the short stories I did enjoy. Those came mostly before the 1950s, which for copyright reasons worked out perfectly. But there were a lot that came in the late 1900s, which meant they were harder to find for free. I could not find some on Gutenberg, and the ones I did find were all very old. It would require a lot of work to find some of the short stories, and the dates ranged widely. With this in mind I decided to try something that wasn’t short stories. And I thought, what is something that people still read? Online articles.

I have a friend who has YouTube and Instagram accounts where she posts very generic things that are trending and that almost anyone can replicate if they follow a trend. She has a decent follower base, and none of her content is in any way mind-blowingly original (her words). I could not use YouTube or Instagram to build a data set, but this got me thinking about using the blogs that a lot of these YouTube or Instagram accounts link to. In particular, health/wellness/lifestyle/beauty accounts have blogs, and I thought that’d be something interesting to look into: these blogs versus their older counterparts, women’s magazines. Are blogs talking about the same issues? Or are magazines potentially out of date? Having found a new topic of interest for my data set, I set out to acquire articles. But once again that didn’t seem like the route for me. Some blog posts were short, and a lot of them had pictures to go along with them, or there was just too much going on in the page, and I was having a hard time figuring out how to get this into txt files.

But the social media idea did lead my mind somewhere else. I remembered that when Donald Trump was elected there had been a lot of talk about fake news and bots spreading fake news and false articles on Facebook. And then I also remembered that Obama’s speechwriter had been in his early twenties when he was hired and that a lot of Obama’s campaigning had also happened online. Social media led me to thinking about politics, so with the idea of presidents, technology, and speeches linking together I decided to do a data set based on the last five presidents’ State of the Union addresses, starting with George H.W. Bush and ending with Donald Trump.

Having finally settled on what to use for my data set, I went about looking for the speeches I would need. Thankfully the speeches are all readily available online and are public access. I copied and pasted each of them into TextEdit. I then used Calibre to mass convert them all into txt files. I used Automator to name my files a lot quicker, and my data set was done. I found that once I actually pinned down what I wanted my data set to be about, everything else became easier. I was worried that some of the more technical parts, like naming and converting the files, would be difficult, but I got the hang of it and it really wasn’t that hard at all. Making this data set actually calmed some of the anxiety I had about the digital part of this class. So far making the data set hasn’t been too difficult, and I’m excited to see what we find.