Principal Component Analysis (PCA)

I didn’t know what to expect from the data because my corpus consists of African fiction novels. The first data I analyzed is the graph. PCA 1 consists of the authors, and PCA 2 is the relative frequency of the words; the words selected are “said,” “man,” “like,” “time” and “little.” I’m assuming these are the most frequently used words among the authors. Across the authors, the relative frequencies of these words range from 0.001 to 0.004. However, the word “said” changes significantly: it rises from close to 0.001 in Sabatini’s first novel to the maximum on the scale, 0.015. It stays high but starts to fall for Wallace. Even so, it remains higher there than for the other authors: Buchan, Collingwood, Conrad, Haggard and Kingston.
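
To make the relative frequency scale concrete, here is a minimal sketch. It assumes that relative frequency means a word’s count divided by the total number of words in a document. The sample sentence, the tokenizer and the function name are made up for illustration; this is not Voyant’s own code.

```python
import re
from collections import Counter


def relative_frequencies(text, words):
    """Return {word: count / total tokens} for the requested words."""
    # Rough tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in words}
    # Counter returns 0 for words that never occur, so absent words get 0.0.
    return {word: counts[word] / total for word in words}


# Hypothetical sample text, not taken from the corpus.
sample = "The old man said little. The man said it was time."
freqs = relative_frequencies(sample, ["said", "man", "like", "time", "little"])
for word, value in freqs.items():
    print(f"{word}: {value:.3f}")
```

A value such as 0.015 on the graph would then mean that roughly fifteen out of every thousand words in that novel are “said.”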

I then decided to use different words, but I chose only the generic version of each word. The words I chose are “great,” “old,” “God,” “father,” and “mother.” I chose these randomly. The scale for this graph ranges from 0.000 to 0.010. “Mother” is the lowest of them all, and it remains low, barely reaching 0.001 at its peak for Conrad. “Great” gives a steady reading across the authors. As with “said” in the first graph, “old” stays about level with “great” until it reaches Sabatini’s last novel and Wallace’s first novel, Bones. Its peak is just about 0.009. Then “old” falls back down to 0.001 and never again exceeds 0.003, which is the peak for the other words. Just to see the result for the two most frequent words, I also added the word “said.” I quickly viewed the Cirrus flavor and, by looking at the sizing of the words, I could tell that “said,” “man,” “like,” “time,” and “came” are the most frequently used words. But that is a different reading from the graph, so I’m not sure whether judging by the sizing of the words, rather than reading values off a scale, gives me an adequate reading.

The scatter plot is especially interesting because so much is occurring at once. The first flavor I used is Correspondence Analysis. I’m not sure why, but the first thing I checked was the frequency, which is one hundred and ten. The scale includes negative values, as we established in class. The y-axis runs from -1.2 to 0.5, with zero a little more than halfway along the scale. The x-axis runs from -0.5 to 2.9, with zero between -0.1 and 0.1. The second thing I noticed is that the graph consists of each author’s name and title (how I labeled the documents) and the frequently used words. One thing I found interesting is that “Mr” and “said” overlap, although “Mr” is at 2599 and “said” is at 9876. The other visible words that I haven’t listed are “captain,” “bones,” “young,” “having,” and “water.” When I say visible, I mean that they are not so tightly clustered with the other data presented. “Bones” and “captain” are the most interesting because they sit far away from the other words and authors, and they are also far away from each other. On the x-axis, “captain” is at about -0.4 while “bones” is at approximately 2.85. On the y-axis, “bones” is at -0.6, and “captain” is at about -1.15.

When I uncheck the documents label and check the terms, the author/title names are removed, and I’m left with the most frequent words, or the raw frequency. When the roles are reversed, the raw frequency words are removed, and the author/title names are present. When both documents and terms are unchecked, only the raw clusters remain. Dimensions one and two show the percentages for the x-axis and the y-axis, which are 35.84% and 12.57%. In the Principal Component Analysis flavor, the scale is changed, but the data is clustered in a similar area. One major difference is that “said” is very far away from the cluster, with “bones” following close behind. In this flavor, only the raw frequent words are available. PCA 1 (x-axis) is 55.92%, and PCA 2 (y-axis) is 17.32%. When terms is unchecked, the words are hidden and only the cluster remains; the same happens for documents. The number of terms also changed from one hundred and ten to eighty.
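
The percentage attached to each axis can be read as the share of the total variance that the component captures. Here is a minimal sketch of that idea for just two terms. The relative frequencies for “said” and “bones” in four documents are made up, and the function name is hypothetical. Voyant works with many more terms, and its own preprocessing may differ from what is assumed here.

```python
import math


def explained_variance_two_terms(xs, ys):
    """Share of total variance on each principal component for two variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance matrix [[a, b], [b, d]] of the two centered variables.
    a = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    d = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + d) / 2
    spread = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    first, second = mid + spread, mid - spread
    total = first + second
    if total == 0:
        raise ValueError("the data has no variance to explain")
    return first / total, second / total


# Made-up relative frequencies of two terms in four hypothetical documents.
said = [0.001, 0.004, 0.015, 0.010]
bones = [0.000, 0.001, 0.002, 0.009]
pc1, pc2 = explained_variance_two_terms(said, bones)
print(f"PCA 1: {pc1:.2%}, PCA 2: {pc2:.2%}")
```

With only two terms the two shares always add up to 100%. With eighty or more terms, the first two components capture only part of the variance, which is why figures such as 55.92% and 17.32% do not sum to 100%.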

For Document Similarity, dimension 1 (x-axis) is 28.76% and dimension 2 (y-axis) is 15.01%. I decided to change the number of terms for this flavor; I chose fifty randomly. Two of Edgar Wallace’s novels, The Keepers of the King’s Peace and Bones, are grouped on the opposite side of the cluster. Meanwhile, H. Rider Haggard’s The People of the Mist is at the bottom of the cluster but about five points away. When documents is unchecked, the author/title names are removed and only the clusters are available.

https://voyant-tools.org/?corpus=21c06af5f7f5cbe1bda219769f8b0a3c&query=said&query=man&query=like&query=time&query=little&view=Trends

https://voyant-tools.org/?corpus=21c06af5f7f5cbe1bda219769f8b0a3c&view=Trends&query=said*&query=great*&query=old*&query=father*&query=god*

https://voyant-tools.org/?corpus=21c06af5f7f5cbe1bda219769f8b0a3c&limit=80&view=ScatterPlot

 

One Reply to “Principal Component Analysis (PCA)”

  1. I’m not sure I understand what the two principal components are. In the third visualization, there is a strong effect on the Y dimension, but only a little variance on the x-axis. How do you explain that? I wonder why “bones” is so important in accounting for difference? You note how it and “captain” are unique outliers, but I want to know what explanation you might have for that. I also wonder what it looks like with this word removed, or stop words added back in.

    What made you pick the words you did? Why choose at random, instead of the words that PCA revealed as bearing extra weight? Why choose a number of terms randomly? I think you want to be more methodical about how you move between the larger visualizations and the specific aspects of your corpus.
