My data set consists of the State of the Union addresses of the last five presidents: George H.W. Bush, Bill Clinton, George W. Bush, Barack Obama and Donald Trump. That gives two Democratic presidents and three Republican. I ran the data set through RStudio using the stylo package to see what I’d get. The first thing I asked stylo for was a cluster analysis; under Features, MFW was set to 100 for both the minimum and the maximum, with an increment of 100. The cluster analysis came back with some interesting results. More or less, the texts grouped by president and stayed in their own little groupings. However, Barack Obama’s 2009 and 2010 addresses, Bill Clinton’s 1993 address and George H.W. Bush’s 1992 address are shown pretty close to each other. Obama’s 2009 and 2010 addresses are more similar to Clinton’s 1993 and George H.W. Bush’s 1992 addresses than they are to Obama’s other State of the Union speeches. In fact, Obama’s next two addresses, 2011 and 2012, are the least similar to any of his other addresses, while the addresses he gave between 2013 and 2016 have the most in common.
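To make the "100 MFW" setting more concrete, here is a minimal Python sketch of the kind of calculation a cluster analysis like this rests on. It is not stylo itself (which I ran in R), and it assumes a Burrows-style Delta distance on z-scored word frequencies; the function names, the simple tokenizer and the use of the population standard deviation are my own choices for illustration.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())


def mfw_list(texts, n_mfw):
    """Return the n_mfw most frequent words counted over the whole corpus."""
    total = Counter()
    for text in texts.values():
        total.update(tokenize(text))
    return [word for word, _ in total.most_common(n_mfw)]


def relative_freqs(text, vocab):
    """Relative frequency of each vocabulary word within one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocab]


def delta_matrix(texts, n_mfw):
    """Pairwise Burrows-style Delta between all texts on the top n_mfw words.

    Assumption: z-scores use the population standard deviation; a real
    stylometry package may scale slightly differently.
    """
    vocab = mfw_list(texts, n_mfw)
    names = list(texts)
    freqs = {name: relative_freqs(texts[name], vocab) for name in names}

    # z-score every word's frequency across the texts
    z = {name: [] for name in names}
    for i in range(len(vocab)):
        column = [freqs[name][i] for name in names]
        mu, sd = mean(column), pstdev(column)
        for name in names:
            z[name].append((freqs[name][i] - mu) / sd if sd > 0 else 0.0)

    # Delta = mean absolute difference of z-scores
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for a in names
        for b in names
    }
```

Calling `delta_matrix(texts, 100)` on a dictionary that maps a label such as `"Obama_2009"` to the full text of the speech would give the distances that a dendrogram is then built from; smaller values mean two speeches use the most frequent words in more similar proportions.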

Another interesting thing to note is that the cluster analysis has two big groups. One is almost all Bill Clinton and Barack Obama, with the exception of two texts, Clinton’s 1993 and H.W. Bush’s 1992 addresses. The other is all George H.W. Bush, George W. Bush and Donald Trump. The two groups look to be split between Republicans and Democrats. The two texts that have the least in common are Obama’s 2009 address and W. Bush’s 2001 address. Both speeches were their first addresses to a joint session of Congress, which are considered their unofficial first-term State of the Union addresses. Bush’s 2001 speech starts off with a message of hope for America, while Obama’s speech is centered on recovery from the 2008 recession and getting the economy back on track. Something else that is interesting is that Bush’s 2002 address, given only a few months after 9/11, is closest to Trump’s first address to Congress in 2017.

Next I ran a bootstrap consensus analysis in stylo. The parameters were 1000 MFW at 100-word increments. It’s funny, because nothing changed when I changed the MFW from 100 to 1000 or from 1000 to 1500. When I increased it to 2000 MFW, the only change was that Clinton’s 1994 and 1995 addresses and Obama’s 2011 and 2012 addresses moved from the right side to the left side, but they were still branched off together.
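The idea behind repeating the analysis over a range of MFW settings can be sketched roughly as follows. This is a simplified Python illustration, not what stylo actually does internally: instead of building and merging full trees, it only counts how often each text's nearest neighbour stays the same as the MFW count steps from a minimum to a maximum. The function names and the `strength` threshold of 0.5 are assumptions made for the example.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())


def delta_distances(texts, n_mfw):
    """Burrows-style Delta between every pair of different texts."""
    tokens = {name: tokenize(text) for name, text in texts.items()}
    total = Counter()
    for toks in tokens.values():
        total.update(toks)
    vocab = [word for word, _ in total.most_common(n_mfw)]

    z = {name: [] for name in texts}
    for word in vocab:
        column = [tokens[name].count(word) / len(tokens[name]) for name in texts]
        mu, sd = mean(column), pstdev(column)
        for name, freq in zip(texts, column):
            z[name].append((freq - mu) / sd if sd > 0 else 0.0)

    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for a in texts
        for b in texts
        if a != b
    }


def stable_neighbours(texts, mfw_min, mfw_max, mfw_incr, strength=0.5):
    """Share of MFW settings in which b is the nearest neighbour of a.

    Only pairs that recur in at least `strength` of the settings are kept,
    loosely mimicking the idea of a consensus across many runs.
    """
    settings = range(mfw_min, mfw_max + 1, mfw_incr)
    votes = Counter()
    for n_mfw in settings:
        dist = delta_distances(texts, n_mfw)
        for a in texts:
            nearest = min((b for b in texts if b != a), key=lambda b: dist[(a, b)])
            votes[(a, nearest)] += 1
    return {
        pair: count / len(settings)
        for pair, count in votes.items()
        if count / len(settings) >= strength
    }
```

With `stable_neighbours(texts, 100, 1000, 100)` the loop would run ten times (100, 200, ..., 1000 MFW). A pair that shows up with a share near 1.0 is close regardless of how many words are counted, which would fit the observation that the picture barely changed when the MFW setting was raised.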

This bootstrap consensus tree had what looked like three to four major branches, most of them made up mostly of the addresses of one or two presidents. One branch was all George W. Bush and Donald Trump. Another branch was a mix of Obama’s 2009/2010, Clinton’s 1993, George H.W. Bush’s 1992 and George W. Bush’s 2001 addresses. In the bootstrap consensus, Bush’s 2001 address was not as far from Obama’s 2009 address as it was in the cluster analysis. The farthest texts from each other were instead Clinton’s 1998 speech and Bush’s 2003 speech. Clinton’s 1998 address came in his second term, while Bush’s 2003 address was the third of his first term.

What mostly stayed the same in both the cluster analysis and the bootstrap consensus was that Obama’s 2009/2010, H.W. Bush’s 1992 and Clinton’s 1993 addresses stayed close to each other. Obama’s 2009 and 2010 addresses are his first and second, H.W. Bush’s is his fourth and last, and Clinton’s is his first. I have a feeling that the similarities between these texts could come from the topic of the economy. That would also explain why Bush’s 2001 branch is close to these four texts. However, this makes me wonder again why the 2001 and 2009 addresses are so far from each other in the cluster analysis chart.

The other branches contain Clinton and Obama close to each other, and Trump and George W. Bush close to each other. George H.W. Bush’s addresses branch off not too far from Bill Clinton’s and Barack Obama’s, but they are not as close to Trump’s and George W. Bush’s as they are shown in the cluster analysis.

Overall, some things stayed the same and some things moved around from the cluster analysis to the bootstrap consensus. For me it was useful to look at both, because they helped me see patterns in the topics talked about.




  1. I think this is an interesting data set. I’d be curious to see what effect the different speechwriters have, since the SOTUs are heavily “scripted” (as an aside, looking at transcripts of presidents’ unrehearsed comments would also be illuminating).

    I think our assumption would be that presidents should cluster together based on their party, but do you have any sense of what other “signals” might be impacting the clustering? Like why is Bush 1992 mixed with the Obamas?

    I’m actually not sure how many speechwriters the presidents had, but the difference between cluster analysis and bootstrap trees *may* be the difference between the MFW authorial style and the more political topical words that would start appearing after 100 MFW?

