Principal Component Analysis

Principal Component Analysis (PCA) takes correlated variables and turns them into linearly uncorrelated variables called principal components. The first two components are then placed into a scatter plot that helps us visualize the data and begin to make inferences and draw conclusions. These scatter plots do not give us all of the answers, and they require some interpreting before the data comes to life.

Although PCA can help us understand certain things about our chosen corpus, I still feel somewhat at a loss when trying to draw conclusions from my scatter plots. I understand that two components are being represented and that we have to use what we know about our corpus to make more specific deductions, but without detailed knowledge of all of the texts in a corpus, it is difficult to make inferences that are strongly supported. More than likely, this is the issue I am having: although I am slightly familiar with the works in my corpus, I am not well enough acquainted with them to make the best deductions about what the scatter plots are telling me. My other issue may lie in my lack of understanding of the plots once we add more than two texts and are no longer told what the two components are. Regardless of these issues, I did my best to understand what was placed in front of me and applied it to the texts that I am quite familiar with.
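To make the opening claim about correlated and uncorrelated variables concrete, here is a minimal sketch in Python with NumPy. It is not what Voyant does internally. The data is randomly generated, and the function name `pca` and the variables `x` and `y` are made up for illustration.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto its top principal components."""
    # Center each column so the components describe variation around the mean.
    Xc = X - X.mean(axis=0)
    # The right singular vectors of the centered data are the principal axes.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    # Share of total variance captured by each component.
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Synthetic example: y is built from x, so the two columns are correlated.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + 0.2 * rng.normal(size=200)
X = np.column_stack([x, y])

scores, explained = pca(X)
corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]
print(f"correlation before: {corr_before:.3f}, after: {corr_after:.3e}")
```

The two original columns are strongly correlated, while the two component scores have a correlation of essentially zero. In a text corpus the columns would be word frequencies rather than random numbers, which is part of why the resulting axes are hard to name.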

One of the issues I had when completing this project was working with Voyant Tools itself. As the server is sensitive and prone to being overloaded quickly, it was difficult to create my scatter plots. I personally feel that, as useful as Voyant is, the sheer number of tools it offers makes each of them a bit finicky. A few things went wrong when I created my scatter plots. It took multiple tries, with multiple refreshes and resets, to add the stop words back into the graph. Then, once I finally had a scatter plot, the number of words it was analyzing began to change with no prompting at all. This skewed my scatter plots and forced me to try to reproduce the same results, or at least comparable results, for each plot so that I could better understand what components they were addressing and whether or not they aligned.

In the Document Similarity scatter plot, the novels in my corpus were arranged into three groups, each denoted by one of three colors: purple/blue, green, and pink. The purple and green novels were dispersed among each other, with little distinction between them. The pink novels, however, formed their own small group to the right of everything else. These novels, funnily enough, all focus on a female character. It makes sense in a way that these novels would be grouped together, as their protagonists share the same gender. I also thought it was interesting that three novels grouped together in the lower left corner all center somewhat on water and islands. While this might be an interesting correlation between these three novels, it doesn’t explain why Kingsley’s novel The Water Babies, which takes place almost exclusively in the water, does not align with the other seemingly water-centric novels, like Verne’s 20,000 Leagues Under the Sea.

In the PCA scatter plot, there was less that I could deduce about what it was showing me. The data was presented more linearly, and it seemed to run from the more general and usual stop words like “the”, “and”, “to”, and “of” to more specific pronouns like “she”, “he”, “you”, and “our”. I experimented with changing the number of words that the scatter plot was analyzing, but the roughly diagonal line from the top left corner to the bottom right did not change very much at all. The only other thing that I found interesting to note was that component one seems to account for over 90 percent of what is being analyzed. When I changed the number of words, this percentage did not change at all, regardless of whether I looked at 50 words or 1000 words.
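One possible reason the first component stays above 90 percent is that a few very frequent words carry almost all of the variation, so adding rarer words barely moves the total. I have not checked this against how Voyant actually computes its plot, so the sketch below is only an illustration of the idea. The counts are invented, and `first_component_share` is a hypothetical helper, not a Voyant function.

```python
import numpy as np

# Hypothetical raw counts, invented for illustration only.
# Rows are six imaginary novels; columns follow the order in `words`.
words = ["the", "and", "he", "she"]
counts = np.array([
    [5200, 2600, 500, 300],
    [6100, 3000, 700, 120],
    [4800, 2500, 350, 450],
    [7000, 3400, 800,  80],
    [5500, 2700, 420, 390],
    [6600, 3200, 760, 100],
], dtype=float)

def first_component_share(X):
    """Fraction of total variance captured by the first principal component."""
    Xc = X - X.mean(axis=0)
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variances = singular_values ** 2
    return variances[0] / variances.sum()

# Compare using only the two most frequent words with using all four.
share_top2 = first_component_share(counts[:, :2])
share_all = first_component_share(counts)
print(f"top 2 words: {share_top2:.3f}, all 4 words: {share_all:.3f}")
```

In this toy data the share stays above 90 percent in both cases, because the counts for “the” vary far more from novel to novel than the counts for the pronouns do. Rescaling each word before the analysis (for example with relative frequencies or z-scores) would weight the words differently, and I would expect the percentages to shift.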

The Correspondence Analysis scatter plot was the one that clarified things for me and confused me at the same time. I was under the impression that Correspondence Analysis would simply take the two previous scatter plots and more or less overlay them, to help us see the relationship between the individual words and the books as a whole. To some extent it did do this, although there are still differences among all three plots. The connection, I believe, lies in the location of “she” and “her” relative to some of the female-centered novels, which sit in roughly the bottom right corner of the plot. I also thought it was interesting that the words “man”, “his”, “he”, and “him” appear at the top, near novels that have more male-centered stories with male protagonists. Despite these small inferences and conclusions, I am still not entirely sure what components are being accounted for on either axis in any of the scatter plots. I hope to better understand these scatter plots and make stronger inferences and conclusions as we continue to work with our corpus.
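As far as I understand it, the textbook version of correspondence analysis does not overlay two separate plots. It computes coordinates for the documents and the words from the same table of counts, which is why they can share one set of axes. The sketch below follows that textbook formulation and may differ from Voyant's implementation. The novel names and counts are invented, and `correspondence_analysis` is a hypothetical helper.

```python
import numpy as np

# Hypothetical documents and counts, invented for illustration only.
docs = ["Novel A", "Novel B", "Novel C", "Novel D", "Novel E", "Novel F"]
words = ["the", "and", "he", "she"]
counts = np.array([
    [5200, 2600, 500, 300],
    [6100, 3000, 700, 120],
    [4800, 2500, 350, 450],
    [7000, 3400, 800,  80],
    [5500, 2700, 420, 390],
    [6600, 3200, 760, 100],
], dtype=float)

def correspondence_analysis(N, n_dims=2):
    """Return principal coordinates for rows and columns, plus their masses."""
    P = N / N.sum()        # table of proportions
    r = P.sum(axis=1)      # row masses (one per document)
    c = P.sum(axis=0)      # column masses (one per word)
    # Standardized residuals: observed minus expected under independence.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sigma) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return row_coords[:, :n_dims], col_coords[:, :n_dims], r, c

doc_xy, word_xy, r, c = correspondence_analysis(counts)
for name, (x, y) in zip(docs + words, np.vstack([doc_xy, word_xy])):
    print(f"{name:8s} {x: .4f} {y: .4f}")
```

Read this way, a word and a novel that lie in the same direction from the center of the plot tend to go together, which would fit “she” and “her” sitting near the female-centered novels.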

Document Similarity
Principal Component Analysis
Correspondence Analysis

One Reply to “Principal Component Analysis”

  1. I hope you find PCA in R easier! Not as many options, but much less fiddly with final results.

    The results you got here, though, look very compelling and ordered. The pink and green clusters are rather distinct, as is the more dominant upper left group. Looking at the most frequent words (MFW), clearly the use of “the” is a good predictor of how to separate the authors/texts (almost 92%), but on both x and y axes. This is why it appears more central in correspondence analysis: it’s a word that can explain the difference in texts, but it is also a word that every text uses frequently, so it’s equally important to all texts (unlike a distinguishing word like “sir,” which may appear heavily, but only in a handful of distinct texts).

    Isn’t The Water Babies a fairy tale or kid’s story? That might explain why even though it shares a watery theme with others, its genre is enough to push it apart. Moving down the y-axis, do the texts get more “realistic” or “historical” or something? Looking at the correspondence analysis, I wonder if we can’t explain the x-axis with something accounting for the gendered pronouns being so spread out: we/us/our on the left, her/she on the right, and my/I/he/him/they in the middle.
