Contrastive Analysis


The data sets I used for this project are not the ones I initially created. Because mine wouldn’t generate useful data, I used data sets available to us in the Dropbox folder. For the primary set, I used male modernist authors; for the secondary set, male contemporary authors; and for the test set, female contemporary authors. I had no expectations about what my corpora would show, except for a similarity between the male authors regardless of the era in which the texts were written. My only reason for expecting this was gender: I think there are gender markers regardless of when the novels were written. To my surprise, there is a major difference between the male modernist and male contemporary authors on the Craig’s Zeta preferred and avoided lists. The male modernists prefer words like shall, cried, rather, suddenly, Mrs, afraid, etc., while the male contemporary authors prefer words such as yeah, shit, fuck, kid(s), guy(s), phone, etc. The results were shocking because I didn’t expect the two sets to be so different, but after viewing the preferred and avoided words, the result makes sense. Modernist and contemporary authors would use different words because of the era, regardless of their gender. This is something I didn’t consider before I received the results.
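To make the preferred and avoided lists less of a black box, here is a minimal sketch of one common formulation of Craig’s Zeta. It is not the code that oppose() runs, and its scale may differ from the scores that tool reports. The function names, the segment size, and the two tiny "corpora" are all made up for illustration. Each set is cut into equal segments, and a word scores high when it appears in many primary segments and is missing from many secondary segments.

```python
from collections import Counter

def segments(tokens, size):
    """Split a token list into consecutive full segments of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

def segment_proportions(segs):
    """For each word, the share of segments in which it appears at least once."""
    counts = Counter()
    for seg in segs:
        counts.update(set(seg))  # count each word once per segment
    return {word: n / len(segs) for word, n in counts.items()}

def zeta_scores(primary_segs, secondary_segs):
    """Zeta = (share of primary segments with the word)
            + (share of secondary segments without the word).
    Scores run from 0 (avoided by the primary set) to 2 (preferred)."""
    p = segment_proportions(primary_segs)
    s = segment_proportions(secondary_segs)
    vocab = set(p) | set(s)
    return {w: p.get(w, 0.0) + (1.0 - s.get(w, 0.0)) for w in vocab}

# Toy data only: invented sentences, not taken from any of the novels.
primary_tokens = "i shall go she cried suddenly i shall stay she cried rather".split()
secondary_tokens = "yeah the kid called the guy yeah the guy took the phone".split()

scores = zeta_scores(segments(primary_tokens, 6), segments(secondary_tokens, 6))
preferred = sorted(scores, key=scores.get, reverse=True)[:3]
avoided = sorted(scores, key=scores.get)[:3]
```

In this toy run, "shall" is in every primary segment and no secondary segment, so it lands at the top of the preferred list, while "yeah" lands at the bottom.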

After running oppose() to generate a word list for the primary set, I used the words preferred txt file (I renamed it word list) and the test set (the female contemporary authors) to run a Principal Component Analysis (covariance) using the existing word list. I didn’t have any predictions for this result, and it was a bit hard to read since the results were all in the same color, red. To my understanding, to properly interpret the PCA results, the closer to zero the items appear, the more related they are to the word list, which contains the words that the primary set, the male modernists, prefers. Despite knowing this, I still didn’t know how to put the results into words. I had failed to realize that the primary set consists of different novels by different authors, not just different novels by the same author (I’m not sure why I thought this). The test set, the female contemporary authors, also contains different novels and authors. The PCA results read a bit differently as a result: the novels in the test set that are closer to zero are more related to the words preferred in the primary set, the male modernists.
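For anyone who wants to see what a covariance PCA over a word list does with the numbers, here is a rough sketch in plain Python. It is not how the class software computes it; the function names are hypothetical and the frequency table is toy data (each row stands for one novel, each column for the frequency of one word from the list). One thing the sketch makes visible is that the zero point of the plot is the average of the texts being plotted, and that each component comes with the share of variance it accounts for.

```python
import math

def center(rows):
    """Subtract each column's mean so the average text sits at zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def power_iteration(matrix, iters=200):
    """Approximate the largest eigenvalue and its eigenvector."""
    d = len(matrix)
    v = [float(i + 1) for i in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, mv)), v

def pca(rows, n_components=2):
    """Return per-text scores and the share of variance for each component.
    Assumes the texts are not all identical (total variance above zero)."""
    centered = center(rows)
    work = covariance(centered)
    total_var = sum(work[i][i] for i in range(len(work)))
    components, ratios = [], []
    for _ in range(n_components):
        eigval, vec = power_iteration(work)
        components.append(vec)
        ratios.append(eigval / total_var)
        # Deflate: remove the component just found before looking for the next.
        for i in range(len(work)):
            for j in range(len(work)):
                work[i][j] -= eigval * vec[i] * vec[j]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return scores, ratios

# Toy frequency table (invented values): 5 texts, 2 words from the word list.
toy_freqs = [[5.0, 2.0], [3.0, 2.0], [1.0, 2.0], [3.0, 3.0], [3.0, 1.0]]
scores, ratios = pca(toy_freqs)
```

Here the second toy text has exactly average frequencies, so its scores come out at zero on both components, and the first component carries most of the variance because the first word's frequency varies the most across the texts.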

The novels closest to zero are Alice Munro’s; the closest is Moons of Jupiter. This suggests that Munro’s style of writing is more like the male modernists’ than the male contemporary writers’. I am both surprised and not surprised: some women writers are more reserved, which is how I’d describe the male modernist writers. One interesting thing I see is that Love of a Good Woman is the second closest novel to zero. It wouldn’t have occurred to me that a novel with that title would be more related to male modernist authors than to male contemporary authors. However, I am not surprised to see Regeneration by Pat Barker the farthest from zero on Principal Component 2. This indicates that Regeneration is more like the male contemporary writers. Hilary Mantel’s Bring Up the Bodies is -0.01, while Regeneration is 0.2.5. Most of the novels in the test set are aligned. On Principal Component 1 (the x axis), all the novels are at about 0.01; to be exact, they’re all less than 0.02, except one. The novel that is at 1 on Principal Component 1 is Alice Munro’s Beggar Maid. However, it is among the five novels closest to zero on Principal Component 2. Principal Component 1 accounts for 37.4% of the variance, while Principal Component 2 accounts for 14.3%.

The Craig’s Zeta graph with markers and anti-markers is a bit confusing because I don’t remember how to read the data. I also forgot to include the key, but if I remember correctly, green represents the primary set (male modernist), red represents the secondary set (male contemporary), and blue represents the test set (female contemporary). There is an obvious overlap of the novels, but I can’t tell here which set (the primary or the secondary) the test set prefers. I’m not sure whether the cluster of blue (test set/female contemporary) and green (primary set/male modernist) outside of the red (secondary set/male contemporary) is showing that the female contemporary authors are more akin to the male modernists, which is what the Principal Component Analysis shows.


One Reply to “Contrastive Analysis”

  1. It’s interesting to see some of your assumptions be revealed around the “consistency” so to speak of a male authorial voice over time.
    You’re correct about interpreting the PCA space: the closer to zero, the more similar in usage of the preferred word list. One thing that would make it easier to interpret would be to remove the leading “f_” so that you color code by authors’ names.
    Your PCA space is interesting because, as you note, the majority are aligned in terms of the first principal component. That Munro is so far to the right by itself makes me wonder if the file was corrupted or something. Can you make any guesses about what is being expressed in principal component two? Perhaps Barker is pushing that axis?
    For the final visualization, it seems to me that your test set is primarily in the overlapping area, but about a third are with the green.

    Also, we should always keep in mind that we’re asking the program to find difference, so we have to compare these results with other tests to draw the best conclusions. This approach finds distinctive words, but we could also try MFW.
