My data set is composed of fairy tales, Christmas stories, and mythology. I did not know what to expect, but most of the results were fairly easy to understand. I believe what threw off my data was the export step: when I tried to export the images, the scatter plot would change. This made things difficult because the results I ended up with were not exactly the ones I had been happy with. Overall, the experience was great. I learned a lot, and most of the stories I gathered supported my hypothesis in various ways.
What I noticed was that most of the fairy tales stuck together, forming tight and consistent clusters. King Arthur was the most separated of the group. King Arthur and Robin Hood were closer to each other, but neither was fully part of the fairy tale clusters. One possible explanation is that King Arthur is not a part of fairy tale land: because it deals more with history than fiction, it separated itself. Its language, such as the 18th-century language used in King Arthur, could also have been a deciding factor in why it was separated from the other fairy tales. It is also very clear that the mythologies, like The Iliad and Troy, sit closer together because of their similarities.
In the PCA, you can see that the most common words were bunched together throughout the graph. The word "said" appeared to be a large part of the stories but was not a part of the cluster. There is a big gap between the words "King" and "said." One possible explanation is that the two words are used in dissimilar ways. Also, the word "King" is not used as much in fairy tales, especially when magic is involved and heroes and villains are the ones being emphasized.
I found the entire project interesting because, when playing around with the number of terms and words used from each text, you get to see how most of the time some texts cluster with one another, while at other times other texts are distant. One issue I had was telling which words were which: they were so tightly clustered together that I was unable to identify them.
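
As a rough illustration of why changing the number of terms can pull texts together or push them apart, here is a minimal Python sketch. It is not the tool used for this project. The three short texts, the function names, and the choice of cosine distance over the top-N most frequent words are all assumptions made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in texts, invented for this sketch (not the project's data).
texts = {
    "tale_a": "the king said the princess said the witch said no",
    "tale_b": "the princess said the witch said the frog said yes",
    "legend": "the king and the knight rode to the castle of the king",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(counts_by_text, n):
    """Return the n most frequent words across all texts (ties broken alphabetically)."""
    total = Counter()
    for counts in counts_by_text.values():
        total.update(counts)
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def distances(texts, n):
    """Pairwise distances between texts, using only the top-n terms."""
    counts = {name: Counter(tokenize(text)) for name, text in texts.items()}
    vocab = top_terms(counts, n)
    vectors = {}
    for name, word_counts in counts.items():
        length = sum(word_counts.values())
        # Relative frequency, so longer texts do not dominate.
        vectors[name] = [word_counts[word] / length for word in vocab]
    names = sorted(vectors)
    return {
        (a, b): cosine_distance(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(f"top {n} terms:")
        for pair, dist in distances(texts, n).items():
            print(f"  {pair[0]} vs {pair[1]}: {dist:.3f}")
```

In this toy example, using only the single most frequent word makes every text look the same, while using more terms lets the two made-up tales sit closer to each other than either does to the made-up legend.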