Contrastive Analysis

This contrastive analysis project was more complicated than I had first anticipated. I started off fairly confident in my abilities, knowing how to complete and interpret each step individually. However, putting everything together proved more difficult. We had previously practiced small parts of this larger project using a pre-created corpus, primary set, secondary set, and test set. Although I started out confident, it soon became apparent that, as I began to separate my own corpus into a primary and a secondary set, I was going to have trouble deciding which distinguishing factor of my corpus to test.

I had originally wanted to divide my corpus according to the nationality of the author, namely European versus American, but as I started to divide it I found only one or two American authors against a very large number of European authors. I then attempted to divide the corpus by gender, female authors versus male authors. Here too I ran into an issue, as the majority of my corpus was written by male authors.

Grasping at straws, I then decided to focus on the protagonist of each text. I divided the stories based on whether the main character was female or male. In the few cases where the protagonist was not human, I went with whatever gender markers were used to denote that character. My primary set became the texts featuring female protagonists, and my secondary set became the texts featuring male protagonists. Since I was working with children’s literature, I chose J. K. Rowling’s Harry Potter novels as my test set.

Once all three of my sets were completed, I began the project. I started by running the oppose function in Stylo, which produced a list of preferred and avoided words (keeping in mind that my primary set was the texts with female protagonists) and the following visual.
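As an aside on where a preferred and avoided list comes from: as I understand it, oppose is built on Craig's Zeta, which compares how many text segments in each set contain a given word. The sketch below is a simplified Python illustration of that idea and not Stylo's actual code. The function names, the segment size, and the two toy sentences are all assumptions made up for the example, and the score used here (share of primary segments containing the word minus share of secondary segments containing it) is just one common way of writing Zeta.

```python
from collections import Counter


def segment(tokens, size):
    """Split a token list into consecutive segments of `size` tokens.

    A trailing remainder shorter than `size` is dropped.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def doc_proportions(segments):
    """For each word, the share of segments in which it appears at least once."""
    counts = Counter()
    for seg in segments:
        counts.update(set(seg))  # count each word once per segment
    return {word: n / len(segments) for word, n in counts.items()}


def zeta_scores(primary_segments, secondary_segments):
    """Simplified Zeta: primary share minus secondary share, from -1 to 1."""
    p = doc_proportions(primary_segments)
    s = doc_proportions(secondary_segments)
    return {w: p.get(w, 0.0) - s.get(w, 0.0) for w in set(p) | set(s)}


def preferred_and_avoided(scores, n):
    """Top-n preferred (highest score) and avoided (lowest score) words."""
    by_high = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    by_low = sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))
    return [w for w, _ in by_high[:n]], [w for w, _ in by_low[:n]]


# Hypothetical toy "sets"; a real run would use whole novels and
# segments of a few thousand words rather than four.
primary_tokens = "she said her mother was kind and she smiled at her sister".split()
secondary_tokens = "the men said the master was cruel and the man bowed to sir".split()

scores = zeta_scores(segment(primary_tokens, 4), segment(secondary_tokens, 4))
preferred, avoided = preferred_and_avoided(scores, 2)
print("preferred:", preferred)
print("avoided:", avoided)
```

Words that occur about equally often in both sets score near zero and drop out of both lists, which is why the lists are dominated by words that one set uses consistently and the other rarely uses at all.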

As I had separated my corpus into sets by protagonist gender, it wasn’t too surprising to see that the top avoided word was men, followed by words like man, master, sir, and blood. None of these are words that would typically be used in relation to female characters, especially in children’s literature. I then ran oppose again, this time including the test set, and created the following visual.

This second visual was a little alarming, as it did not show the distinction that I was hoping would appear. Although I had painstakingly looked for a factor by which to separate my original corpus, it seems that my male-protagonist texts and female-protagonist texts were not as different as I had originally believed. The overlap between the two was quite large, and the addition of Rowling’s novels did not make things much clearer. Rowling’s novels were interspersed with the other texts, with nothing to indicate that all seven of her novels center on the titular character, Harry Potter. Regardless of these results, I continued with the next part, running the test set against the words preferred list and creating the following PCA visuals.

I created both a PCA Classic and a PCA Symbols plot, as looking at the two side by side helps me better understand what exactly I am seeing. Although I already knew that there was a lot of overlap among the texts to begin with, the seven Harry Potter novels did differ in how closely they align with the preferred words of the female-protagonist texts or, conversely, the preferred words of the male-protagonist texts. The text that seems to be most aligned with the words preferred is the fourth Harry Potter book, Harry Potter and the Goblet of Fire. I thought this was interesting, as this book has a ball in it, and I wonder if this seemingly “girly” topic might be what made it most like my primary set. There is also the question of whether there are more female characters in this book in general, given the arrival of the delegation from Beauxbatons Academy of Magic (portrayed as an all-girls school in the film adaptation, although the school is co-ed in the novel), which may also be why this novel in particular skewed toward the female-centered texts rather than the male-centered ones. I also found it interesting that Harry Potter and the Sorcerer’s Stone and Harry Potter and the Deathly Hallows are two of the texts least like the female-centered texts and most like the male-centered ones, as they are a bit more action-centered than the other books in the series.

After getting the results from the PCA graphs, I decided to run a cluster analysis, at this point just looking for some further understanding of what I may have messed up or why the project didn’t produce the results I was originally expecting or hoping for. I ran Stylo on the 100, 500, and 1,000 most frequent words of all the texts I had been working with in my primary, secondary, and test sets. The results of the three cluster analyses were almost the same, with little variation in the grouping and in the distances within the different clusters. I have included the cluster analysis for the 1,000 most frequent words below. It was interesting to see that the Rowling novels formed their own cluster away from everything else, and yet in the oppose results the Rowling novels were thoroughly interspersed with the other texts.
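To make the "most frequent words" step more concrete, here is a minimal Python sketch of the kind of comparison a cluster analysis of this sort rests on: take the most frequent words across the whole corpus, turn each text's usage of them into z-scores, and measure how far apart two texts are. The distance used is Burrows's Delta, which I believe is Stylo's default, but that is an assumption about my own run rather than something I checked. The three toy texts and all function names are invented for illustration, and the population standard deviation is used for simplicity.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens):
    """Each word's count divided by the length of the text."""
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}


def most_frequent_words(texts, n):
    """The n most frequent words across all texts (ties broken alphabetically)."""
    totals = Counter()
    for tokens in texts.values():
        totals.update(tokens)
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def z_score_table(texts, mfw):
    """For every text, a list of z-scored relative frequencies, one per MFW."""
    freqs = {name: relative_frequencies(tokens) for name, tokens in texts.items()}
    table = {name: [] for name in texts}
    for word in mfw:
        column = [freqs[name].get(word, 0.0) for name in texts]
        mu, sigma = mean(column), pstdev(column)  # simplification: population sd
        for name, value in zip(texts, column):
            table[name].append((value - mu) / sigma if sigma > 0 else 0.0)
    return table


def delta(z_a, z_b):
    """Burrows's Delta: mean absolute difference between two z-score profiles."""
    return sum(abs(a - b) for a, b in zip(z_a, z_b)) / len(z_a)


# Hypothetical toy corpus; real texts would be full novels.
texts = {
    "text_a": "the girl and the cat went to the garden".split(),
    "text_b": "the girl and the dog went to the river".split(),
    "text_c": "a wizard and a wand and a spell and a duel".split(),
}

mfw = most_frequent_words(texts, 7)
table = z_score_table(texts, mfw)

for first, second in [("text_a", "text_b"), ("text_a", "text_c")]:
    print(first, second, round(delta(table[first], table[second]), 3))
```

A clustering step would then group texts with small distances together. Because the measure is driven by very common words, one group of texts with a consistently different everyday vocabulary (for example, much more recent novels) can end up in its own cluster even when a targeted word list does not set it apart.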

Overall, I started off with some pretty high expectations of what this project was going to show me. However, actually working with the corpus, and getting a bit mixed up with its different pieces, created quite a few road bumps, though they hindered me only slightly in completing the assignment. I came away understanding more about the overlap and similarities between the texts than about their differences. If I were to rework this project, I would need to find a different way to separate my corpus, one that creates a clearer distinction in what the visuals represent.

One Reply to “Contrastive Analysis”

  1. I think you’re absolutely right to reflect on how making decisions about dicing your corpus sort of determines what exactly you are hoping to test. It also is the moment where our preconceived ideas about literary style can sneak in . . . such as the relative importance of nationality, gender, etc. Looking at protagonist gender is interesting, and now I’m curious if/how it aligns with authorial gender on a large scale.

    That the top avoided words are “men” and “man” is kind of fascinating to me. Is it that female characters intentionally avoid these words? Or that male protagonists use them excessively? For instance, we don’t see “women” or “woman” atop the preferred words. To me it also raises a larger question of how gender is discussed in literature: it frequently appears not as a “real” thing in itself, but rather as something that exists in opposition to something/someone else. (And keep in mind that oppose will always find difference, so we have to reflect on whether or not this difference is meaningful.)

    I think your markers visualization suggests that while there is a difference in preferred/avoided words, it might not be super meaningful in separating the two sets from each other. I see, though, a dominant green cluster in the lower right, with a smattering moving to the upper left. I wonder what green segments are in the upper left, interspersed with the red? It’s almost as if those “belong” to the other set.
    And also, what are those text segments in the bottom left corner, that have both a lot of preferred and avoided words? Identifying those texts might help understand a bit more what’s going on.

    The PCA space for Rowling is very interesting (as is the cluster analysis): as you point out, from one visualization, it seems as if Rowling is square in the middle of these sets, but from another, her work looks very spread out and unclustered. Thinking about why the use of principal components introduces that dispersal (and what your two axes are capturing) is useful. My guess, working from the cluster analysis, is that you’re picking up a strong time signal…the Rowling is just so much more modern in its use of MFW.
