PCA (Principal Components Analysis)

            According to Jockers's Macroanalysis, PCA is a method of condensing multiple features into “principal components,” components that capture, closely but not perfectly, the variance in the data. Jockers uses it in authorship attribution, where a machine is “trained” on an author’s writing patterns so that it can classify an unknown text by how closely it matches the training data. In Voyant Tools, the ScatterPlot visualization can be used to graph a PCA. PCA can take data with many dimensions (for example, four, one for each word being counted) and plot it on a 2-dimensional graph that is easier to observe. It is basically a dimension-reduction tool: it reduces a large set of variables to a smaller set that still keeps most of the information in the larger set.
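
As a rough illustration of the dimension reduction described above, here is a minimal Python sketch that reduces four word-frequency columns to two principal components. The frequencies are made-up numbers, not values from my corpus, and the power-iteration approach is just one simple way to find the components; I am not claiming this is how Voyant computes its PCA.

```python
import math

def center(rows):
    """Subtract each column's mean so every feature is centered on zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_component(cov, iterations=500):
    """Power iteration: approximate the direction of greatest variance."""
    d = len(cov)
    v = [float(i + 1) for i in range(d)]          # arbitrary start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue

def pca(rows, k=2):
    """Return k-dimensional coordinates and the share of variance each axis keeps."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    total_variance = sum(cov[i][i] for i in range(d))
    components, explained = [], []
    for _ in range(k):
        v, lam = top_component(cov)
        components.append(v)
        explained.append(lam / total_variance)
        # Deflation: remove the variance already captured by this component.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    coords = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return coords, explained

# Hypothetical frequencies per 1,000 words for "the", "said", "mrs", "sir".
docs = [
    [60.0, 8.0, 6.0, 1.0],
    [58.0, 9.0, 7.0, 1.5],
    [62.0, 7.5, 6.5, 0.5],
    [50.0, 14.0, 2.0, 4.0],
    [48.0, 15.0, 1.5, 5.0],
    [52.0, 13.0, 2.5, 4.5],
]
coords, explained = pca(docs, k=2)
for point in coords:
    print([round(x, 2) for x in point])
print("share of variance kept:", round(sum(explained), 3))
```

Each document ends up as a pair of numbers that can be drawn on a scatter plot, and the last line reports how much of the original variance the two axes keep, which is why the reduced plot holds most, but not all, of the information.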

            Voyant’s ScatterPlot offers three kinds of analysis: principal components analysis, correspondence analysis, and document similarity. PCA graphs words according to the similarities and differences in how they are used across the corpus. Correspondence analysis is a technique for graphically displaying a two-way table by calculating coordinates for its rows and columns; here it plots both words and document titles. Last but not least, document similarity is a central theme in information retrieval: documents count as similar if they are semantically close and describe similar concepts. With this data, we can place an unknown text on the graph and see which part of the corpus it falls closest to. In my case, I could be given a random text, plot it on the scatter plot for my corpus, and see whether the author is Jane Austen, Charles Dickens or Mark Twain. The machine is able to figure this out based on the most frequent words used.
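
To make the idea of placing an unknown text concrete, here is a small Python sketch, under my own assumptions, that profiles each text by the relative frequency of a few common words and picks the closest known author. The word list, the author labels and the sample sentences are all invented for illustration, and plain Euclidean distance stands in for whatever measure a real tool would use.

```python
import math
import re
from collections import Counter

# Hypothetical feature list: a handful of very frequent words.
FEATURE_WORDS = ["the", "and", "of", "said", "mrs", "sir"]

def profile(text, words=FEATURE_WORDS):
    """Relative frequency of each feature word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in words]

def nearest_author(unknown, samples):
    """Return the author whose profile is closest to the unknown text."""
    target = profile(unknown)
    distances = {author: math.dist(target, profile(text))
                 for author, text in samples.items()}
    return min(distances, key=distances.get), distances

# Invented sentences, not quotations from any novel.
samples = {
    "Author A": "Mrs Grey said the ball was pleasant and Mrs Hale said the same of the supper",
    "Author B": "Yes sir said the boy and the raft went down the river sir",
}
unknown = "Mrs Lane said the visit of the morning was pleasant"

guess, distances = nearest_author(unknown, samples)
print(guess)
```

With real novels the profiles would be built from whole texts and many more words, but the principle is the same: the unknown text is assigned to whichever author it sits nearest to.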

            My data set is a corpus of 7 texts by Jane Austen, 9 texts by Charles Dickens and 10 texts by Mark Twain. I started with document similarity, which looks at the documents as a whole. I noticed one text that was far away from the large cluster in the middle: The Pickwick Papers by Charles Dickens, shown in green. Then I realized that the rest of Dickens’s texts were in the same cluster as Austen’s texts, all in purple, while Twain’s texts formed their own pink cluster a little further away from the purples. I only used 3 clusters, and it was interesting to see Dickens’s texts clustered with Austen’s, because I would have assumed Twain and Dickens would be clustered together since they were both men. I thought gender would make a huge difference among authors from fairly close time periods. But then I realized that Dickens and Austen are both English writers while Twain was an American writer, so I would assume that nationality and region make a bigger difference, given the differences in dialect between British and American writing.

            I tried to investigate further why The Pickwick Papers was so far apart from all three authors, including the rest of Dickens. It couldn’t be length, since it is only the 4th longest text I used. I looked at its distinctive words, which are measured against the rest of the corpus, and in that text Pickwick was used 2,174 times while every other word in the list was used fewer than 700 times. Looking for more information, I learned that The Pickwick Papers was Charles Dickens’s first novel. It revolves around law and politics, and marriage and love, while Dickens’s usual themes revolved around the suffering of the poor and hard times, as he was writing around the Industrial Revolution. I also played around with document similarity by changing the number of clusters to 4. Twain became divided into two clusters, blue and pink. Dickens became its own cluster, purple, with its one text still far away, and Austen became its own cluster, green, as well.
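
Here is a short Python sketch of how one word can dominate a text’s list of distinctive words. It scores each word by how much more often it appears in one document than in the rest of the corpus. The scoring formula is a simple ratio I chose for illustration, not necessarily the measure Voyant uses, and the three mini documents are invented.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def distinctive_words(corpus, title, top=3):
    """Rank words in one document by their rate relative to the rest of the corpus."""
    target = Counter(tokenize(corpus[title]))
    rest = Counter()
    for other, text in corpus.items():
        if other != title:
            rest.update(tokenize(text))
    target_total = sum(target.values())
    rest_total = sum(rest.values())
    scores = {}
    for word, count in target.items():
        target_rate = count / target_total
        # Add-one smoothing so words missing from the rest do not divide by zero.
        rest_rate = (rest[word] + 1) / (rest_total + 1)
        scores[word] = target_rate / rest_rate
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Invented mini documents, not text from the novels.
corpus = {
    "Doc A": "pickwick said the club met and pickwick smiled at the club and pickwick spoke",
    "Doc B": "the lady said the walk was long and the day was fine",
    "Doc C": "the boy said the river was wide and the raft was slow",
}
top_words = distinctive_words(corpus, "Doc A", top=2)
print(top_words)
```

A character name that appears constantly in one text and nowhere else gets a very high score, which fits with one text being pulled far away from the others on the graph.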

            Then I used correspondence analysis. The word Pickwick was again far away from everything else, which corresponds with the document similarity graph. In the correspondence analysis, gender became the focus. Words such as young, lady, Mrs. and miss were clustered around the Jane Austen novels, and the words gentleman, boy, old and sir were clustered around the Dickens and Twain texts. In PCA, there was a huge cluster in the middle, and the words time, Mrs., said and Mr. stood out the farthest. These are the most frequent words in the corpus, possibly because of gender and the way the texts are written, as in he said, she said, etc.

Correspondence Analysis 1
Correspondence Analysis 2
Document Analysis 1
Document Analysis 2

One Reply to “PCA (Principal Components Analysis)”

  1. Great observation about how the “nation signal” seems to take over the gender one in your corpus. I’m curious if you would have had the same results adding the stop words back in? That might also reduce the effect of “pickwick” somewhat.

    I was hoping you could say a little more about what was being expressed by the two principal components. You do this a little, but you could do more to answer that by connecting the documents and words that appear near each other in PCA space.

    I’m also not sure I understand the difference between the two sets of graphs you made; some look inverted, and some are repeats?
