Principal Component Analysis

My data set is composed of fairy tales, Christmas stories, and mythology. I did not know what to expect, but most of the process was understandable. I believe that what affected my results was exporting the images: each time I tried, the scatter plot would change. This became difficult because the version I was happy with was not quite the one I ended up with. Overall, the experience was great. I learned a lot, and most of the stories that I gathered supported my hypothesis in several ways.

What I noticed was that most of the fairy tales stuck together, forming a consistent, tightly grouped cluster. King Arthur was the most separated of the group. King Arthur and Robin Hood were closer to each other, but not fully part of the cluster with the fairy tales. One possible explanation is that King Arthur does not belong with the fairy tales: because it deals more with history than fiction, it separated itself. The older style of language used in King Arthur, such as 18th century language, could also have been a deciding factor in why it was separated from the other fairy tales. It is also clear that the mythologies, like The Iliad and Troy, are closer together because of their similarities.

In the PCA, you can see that the most common words were bunched together throughout the graph. The word “said” appeared often in the stories but was not a part of the cluster. There is a big gap between the words “King” and “said”. One possible explanation is that the two words are used in quite different kinds of texts: the word “King” is not used as much in fairy tales, especially when magic is involved and the heroes and villains are the ones being emphasized.

I found the entire project interesting because, when playing around with the number of terms and words used from each text, you get to see how some texts cluster with one another most of the time, while at other times other texts are distant. One issue that I had was telling which word was which: the words were so clustered together that I was unable to identify them.
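To make the idea of changing the number of terms concrete, here is a minimal sketch in Python with NumPy. It is not the tool I used for the project, and the four short texts are made-up placeholders, not my actual stories. The helper names `term_matrix` and `pca_2d` are hypothetical. The sketch counts the most frequent words, turns them into relative frequencies, and projects each text onto the first two principal components.

```python
import re
from collections import Counter

import numpy as np

# Made-up placeholder texts, NOT the corpus used in this post.
texts = {
    "fairy_tale_a": "once upon a time the bear said the porridge was too hot said the little bear",
    "fairy_tale_b": "once upon a time the wolf said little pig little pig let me come in",
    "legend_a": "sir lancelot rode to the king and the king said sir you are welcome",
    "legend_b": "the king sent sir gawain and sir kay to the castle of the king",
}


def term_matrix(docs, n_terms):
    """Relative frequencies of the n_terms most common words, one row per text."""
    tokens = [re.findall(r"[a-z']+", d.lower()) for d in docs]
    totals = Counter(w for doc in tokens for w in doc)
    vocab = [w for w, _ in totals.most_common(n_terms)]
    rows = []
    for doc in tokens:
        counts = Counter(doc)
        rows.append([counts[w] / max(len(doc), 1) for w in vocab])
    return np.array(rows), vocab


def pca_2d(X):
    """PCA via SVD of the column-centered matrix; returns the first two components."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :2] * S[:2]            # position of each text on PC1 and PC2
    loadings = Vt[:2]                    # weight of each word on PC1 and PC2
    explained = (S**2 / np.sum(S**2))[:2]
    return scores, loadings, explained


# Rerun the analysis with different numbers of terms and compare the layouts.
for n in (5, 20):
    X, vocab = term_matrix(list(texts.values()), n)
    scores, loadings, explained = pca_2d(X)
    print(n, "terms; variance explained by PC1, PC2:", np.round(explained, 2))
    for name, (pc1, pc2) in zip(texts, scores):
        print(f"  {name:14s} {pc1: .3f} {pc2: .3f}")
```

Comparing the printed coordinates for different values of `n` shows how the choice of terms can pull texts together or push them apart. With real stories the texts would be much longer, and the results would depend on how the words are counted.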


One Reply to “Principal Component Analysis”

  1. Did you try adding stop words in? I wonder if you would notice the same results at both levels, particularly the question of why King Arthur stands out (and what distinguishes fairy tales from history).

    I was hoping you could do a better job explaining what you see with the two principal components. Is principal component 1 capturing the difference between fairy tale (three bears) vs. more historical (Iliad, King Arthur)? What do we notice in the titles when we move along the x-axis? Similarly, why are so many texts in a similar area on the y-axis, but with four strong outliers? Is it simply the word “sir”? And what would happen if we took that word out of the calculation?
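One way to approach that last question is sketched below in Python with NumPy. The word list and the numbers in the matrix are invented for illustration and are not taken from your results. The function names `pca`, `strongest_word`, and `without` are hypothetical helpers. The idea is to look at which word has the largest weight (loading) on a component, then drop that word and recompute.

```python
import numpy as np

# Invented example: rows are texts, columns are word rates per 1000 words.
# These numbers are made up and do NOT come from the post's data.
vocab = ["said", "king", "sir", "little", "bear"]
X = np.array([
    [30.0, 1.0, 0.0, 12.0, 9.0],    # hypothetical fairy tale
    [28.0, 2.0, 0.0, 10.0, 0.0],    # hypothetical fairy tale
    [12.0, 15.0, 20.0, 1.0, 0.0],   # hypothetical legend
    [10.0, 14.0, 25.0, 0.0, 0.0],   # hypothetical legend
])


def pca(X, k=2):
    """PCA via SVD of the column-centered matrix."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T               # text coordinates on the first k components
    explained = (S**2 / np.sum(S**2))[:k]
    return scores, Vt[:k], explained


def strongest_word(loadings, vocab, component):
    """Word with the largest absolute loading on the given component (0-based)."""
    return vocab[int(np.argmax(np.abs(loadings[component])))]


def without(X, vocab, word):
    """Copy of the matrix and vocabulary with one word's column removed."""
    keep = [i for i, w in enumerate(vocab) if w != word]
    return X[:, keep], [vocab[i] for i in keep]


scores, loadings, explained = pca(X)
print("strongest word on PC1:", strongest_word(loadings, vocab, 0))
print("strongest word on PC2:", strongest_word(loadings, vocab, 1))

# Recompute with "sir" left out and compare the layout of the texts.
X_no_sir, vocab_no_sir = without(X, vocab, "sir")
scores_no_sir, _, explained_no_sir = pca(X_no_sir)
print("with 'sir':   ", np.round(scores, 2).tolist())
print("without 'sir':", np.round(scores_no_sir, 2).tolist())
```

If the four outliers move back toward the other texts once the word is dropped, that would suggest the word was driving the separation. If they stay apart, other words are probably contributing too. The same `without` step could be repeated for a list of stop words.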
