PCA – Angelica Hidalgo

Working with my previous corpus of twenty-six texts on witchcraft, I generated three types of scatterplot in Voyant. As I understand it, PCA “condenses features into components”: it combines multiple features into “principal” components that capture most, but not all, of the variance in the data.
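As a rough illustration of that idea, here is a minimal Python sketch. It is not how Voyant computes its plots. The term frequencies are invented for the example, the helper names (center, covariance, first_component) are hypothetical, and it finds only the first component using simple power iteration rather than a statistics library.

```python
import math

# Toy document-term table: each row is a document, each column the relative
# frequency of one term. These numbers are invented for illustration only.
terms = ["devil", "spirits", "said"]
docs = [
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.2, 0.1, 0.9],
    [0.1, 0.2, 0.8],
]

def center(rows):
    """Subtract each column's mean so every feature is centered on zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def first_component(cov, steps=100):
    """Power iteration: repeatedly multiply by the covariance matrix.

    The vector settles on the direction of greatest variance, which is
    the first principal component. The start vector is arbitrary.
    """
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(steps):
        v = normalize(mat_vec(cov, v))
    return v

centered = center(docs)
cov = covariance(centered)
loadings = first_component(cov)

# Variance along the component, and its share of the total variance.
variance_along = sum(a * b for a, b in zip(loadings, mat_vec(cov, loadings)))
total_variance = sum(cov[i][i] for i in range(len(cov)))
share = variance_along / total_variance

# Each document's position along the first component (its x-coordinate
# if this component were drawn as the horizontal axis of a scatterplot).
scores = [sum(a * b for a, b in zip(row, loadings)) for row in centered]

for term, weight in zip(terms, loadings):
    print(f"{term}: {weight:+.3f}")
print(f"share of variance captured: {share:.3f}")
```

In this toy table, three term columns collapse into one axis that still captures most of the variation, but not all of it. The sign of a component is arbitrary, so a plot flipped left to right or top to bottom carries the same information.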

I found uploading the corpus to Voyant simple and self-explanatory. I like the layout of the site, and it is easy to navigate. I also like that Voyant has several features that can be used to create different models. On the left-hand side of the scatterplot tool you can switch between PCA, correspondence analysis, and document similarity. The principal component analysis view plots frequent terms, the correspondence analysis view plots both the document names and the frequent terms, and the document similarity view plots only the document names.

Looking at my models, I can see in the correspondence analysis that the words “devil” and “spirits” were very frequent in the texts, while “ye” was the least frequent. The models also make the data visually easier to understand. I am confused, though, about why the X and Y axes differ between some of the charts. For example, on the Y axis the numbers sometimes run from higher to lower and sometimes from lower to higher, and I am not sure how that distinction is made. Another thing I could do when uploading the corpus is remove the authors’ last names from the text titles, because they make the plot look messy and the labels lump together, which can be confusing. In the correspondence analysis, the terms and the document names cluster on the right side of the scatterplot, whereas in the document similarity view the documents all clump on the left side. When I view only the terms, they sit in the middle of the scatterplot. In that scatterplot “mind” appears to be the least frequent term on both plots, while “Mr.” ranks highest on the Y axis and “said” ranks highest on the X axis. “Mr.” and “God” were very far apart, and it is interesting that “God” was not among the most frequent words in this scatterplot. Still, my first observation, that “devil” and “spirits” are the most frequent, makes sense given the context and topic of witchcraft.

In the document similarity analysis it was reassuring to see that the “Mathers” texts were clustered together, which shows that they are similar. Two of Upham’s works were also close together, and similarly close to Mathers. However, one of Upham’s texts sat by itself at the far right. The historical texts were the ones clustered to the right, while two of the fictionalized texts in the corpus sit in the middle, on top of each other. I also appreciate that the models use different colors, which makes it easier to compare the points and spot the differences.

I need more practice to get a better understanding of how to use the PCA models, or rather how to read them more clearly, but they are more understandable to me than topic modeling. I have a partial grasp of these models, and I understand that PCA uses several components to combine different features of the texts.

 

One Reply to “PCA – Angelica Hidalgo”

  1. I’m not sure I understand the difference between the first and second visualization. What did you do differently there?

    Don’t forget that you should run these results both with and without stop words, to see if your argument holds at both levels.

    Could you say more about what you see along the two principal component axes? In your second visualization, it looks like most of the difference is explained along the x-axis, while the y-axis is fairly compressed. What does it look like if we remove the word “footnote”? The first visualization has more variety, but raises the same question of what your two principal components mainly seem to be capturing.
