Daniella Jimenez

Professor Ferguson

English 391 w

PCA Project


Voyant Tools is a website for analyzing texts, either texts that are already online or ones uploaded by users. Using it, I was able to get a much better understanding of the texts I chose for my corpus than I did in the last project. In Voyant Tools we used Principal Component Analysis, which reduces the data to a small number of dimensions so that it can be plotted. The ScatterPlot tool offers three analyses that I worked with: Principal Component Analysis, Correspondence Analysis, and Document Similarity. My corpus consists of fictional texts from different countries and continents. In the previous project I was not really able to understand why certain topics were picked. By using Correspondence Analysis in Voyant, I was able to better understand not only my texts but also my previous project.

This tool displays the results of a statistical analysis as a scatter plot visualization that shows how the words used in a corpus relate to one another. The analysis takes each word’s frequency in each document, and each document is used as a dimension. Correspondence Analysis handles the data in such a way that both the rows and the columns are analyzed. This means that, given a table of word frequencies, both the words themselves and the document segments will be shown in the picture.
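To make the idea of analyzing both rows and columns more concrete, here is a minimal sketch of a textbook correspondence analysis in Python with NumPy. It is not Voyant's own code, and I am only assuming that Voyant does something broadly similar. The words, document names, and counts are invented for illustration and do not come from my corpus.

```python
import numpy as np

def correspondence_analysis(counts, n_dims=2):
    """Textbook correspondence analysis of a word-by-document count table.

    Assumes every row and every column has at least one non-zero count.
    Returns (row_coords, col_coords): one point per word, one per document.
    """
    counts = np.asarray(counts, dtype=float)
    P = counts / counts.sum()          # table of relative frequencies
    r = P.sum(axis=1)                  # row (word) masses
    c = P.sum(axis=0)                  # column (document) masses
    expected = np.outer(r, c)          # frequencies expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * s) / np.sqrt(r)[:, None]      # principal coordinates of words
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]   # principal coordinates of documents
    return row_coords[:, :n_dims], col_coords[:, :n_dims]

# Hypothetical data: rows are words, columns are documents.
words = ["said", "eyes", "place", "came"]
documents = ["doc_A", "doc_B", "doc_C"]
counts = np.array([
    [40, 12, 25],
    [ 5, 18,  7],
    [ 9,  6, 14],
    [11, 10,  8],
])

row_coords, col_coords = correspondence_analysis(counts)

# Words and documents land in the same two-dimensional space.
for name, (x, y) in zip(words + documents, np.vstack([row_coords, col_coords])):
    print(f"{name:>6}: x={x: .3f}, y={y: .3f}")
```

Because the words and the documents both get coordinates from the same decomposition, they can be drawn on one scatter plot, which is why terms and texts appear together in the picture.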

In this first picture we see the countries in light blue and the terms in dark blue. Towards the left of the x-axis we see Canada, Australia, Africa, and Italy. Italy is more towards the top of the y-axis. The only Italian author lower on the y-axis is Davis, and they are close to Canada. The African texts all seem to be close together; however, they are also a little close to Australia. Towards the right of the y-axis there is one African text and one Canadian text that are pretty much the outsiders. In the second picture we are able to see the words “come” and “came” closer together towards the center of the visualization. The word “said,” all the way to the left, has a bigger circle than the other words displayed, which suggests to me that it is a very commonly used word.

I also used the same corpus to run the Document Similarity analysis offered in Voyant. Document Similarity is essentially the same as Correspondence Analysis, but terms aren’t shown in the graph.

In the Document Similarity visualization I see only words, but no text names. Towards the left I see more speech words, which shows me there is more dialogue. Towards the right I see more sight words. I see a similarity along the y-axis, but not along the x-axis.

At the bottom of the visualization I see more references to people. I see the name “anne,” which was also one of the words I saw in the last project with the Topic Modeling Tool. I see Miss, Mrs, Mr, and men. At the top of the y-axis and to the right of the x-axis, I see references to the body, words such as eyes, face, hand, and head. Close to those descriptive words I also see actions that can be completed with these body parts, for example heard, looked, saw, and think. In the middle I see references to location with the term place. Terms like day, night, moment, went, going, work, and came can all be associated with a location and with a person in relationship to a location. I also see regular terms that have a relationship with one another, words like right, good, and great, or think and know. I still see the word “said” off by itself on the left of the x-axis and towards the bottom of the y-axis, in the same dark blue color as in the previous visualization. I was able to get an idea of what type of dialogue is being used in these texts and what the priorities of the texts were. What I think of as the priority is the description of characters, location, and setting. The visualization helped me understand how similar all these countries actually are.


One Reply to “PCA”

  1. Don’t forget that you can also add the stop-words back in to see if that would make a difference in terms of the results you see. Along those lines, I bet “footnote” is something you could remove to try to get better results, since it is clearly skewing your y-axis.

    It seems like you’re seeing nations spread across the x-axis. Is there any other word-based evidence to support that idea?

    Did you use a different number of terms for each visualization? I’m not sure why your results look so different in each one.
