This contrastive analysis project was more complicated than I had first anticipated. I started off fairly confident in my abilities, since I knew how to complete and interpret each of the steps individually. Putting them all together, however, proved more difficult. We had previously practiced small parts of this larger project using a pre-created corpus with its own primary set, secondary set, and test set. Once I began to separate my own corpus into a primary and a secondary set, it became apparent that I was going to have trouble deciding which distinguishing factor of my corpus I would be testing.
I had originally wanted to divide my corpus according to the nationality of the author, namely European versus American, but as I started to divide it I found only one or two American authors against a very large number of European authors. I then attempted to divide the corpus by the gender of the author, female versus male. Here too I ran into an issue, as the majority of my corpus was written by male authors. Grasping at straws, I decided to focus on the protagonist of each text instead and divided the stories based on whether the main character was female or male. In the few cases where the protagonist was not human, I went with whatever gender markers were used to denote that character. My primary set became the texts that featured female protagonists, and my secondary set became the texts that featured male protagonists. Since I was working with children’s literature, I chose J. K. Rowling’s Harry Potter novels as my test set.
Once all three of my sets were complete, I began the analysis. I started by using Stylo to run oppose, which produced a list of preferred words and a list of avoided words (keeping in mind that my primary set was the texts with female protagonists), along with the following visual.
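For readers curious about how preferred and avoided word lists can be derived, here is a minimal Python sketch of a simplified Craig's Zeta, which, as far as I know, is the default method in oppose. It is not Stylo's actual code: the function names, the slice size, and the tokenizer are all assumptions made for illustration. Each text is cut into equal-sized slices, and a word's score is the share of primary-set slices that contain it minus the share of secondary-set slices that contain it.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into simple word tokens (ASCII letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def slices(tokens, size):
    """Cut a token list into consecutive slices of `size` words; a short final slice is dropped."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def slice_proportions(texts, size):
    """For each word, the share of all slices in this set that contain it at least once."""
    all_slices = [set(s) for text in texts for s in slices(tokenize(text), size)]
    if not all_slices:
        raise ValueError("every text is shorter than the slice size")
    counts = {}
    for s in all_slices:
        for word in s:
            counts[word] = counts.get(word, 0) + 1
    return {word: n / len(all_slices) for word, n in counts.items()}


def zeta_scores(primary_texts, secondary_texts, size=1000):
    """Simplified Zeta: primary slice share minus secondary slice share, from -1 to 1."""
    prim = slice_proportions(primary_texts, size)
    sec = slice_proportions(secondary_texts, size)
    return {w: prim.get(w, 0.0) - sec.get(w, 0.0) for w in set(prim) | set(sec)}


def preferred_and_avoided(scores, n=10):
    """Return the n highest-scoring (preferred) and n lowest-scoring (avoided) words."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n], ranked[::-1][:n]
```

In this sketch, a word that appears in every slice of the female-protagonist texts and in none of the male-protagonist slices scores 1, and the reverse scores -1. Words near 0 are used about equally often in both sets and say little about the difference between them.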
As I had separated my corpus into sets by the gender of the protagonist, it wasn’t too surprising to see that the top avoided word was men, followed by words like man, master, sir, and blood. None of these are words I would expect to be used often in relation to female characters, especially in children’s literature. I then ran oppose again, this time including the test set, and created the following visual.
This second visual was a little alarming, as it did not show the distinction that I had been hoping to see. Although I had painstakingly looked for a factor with which to separate my original corpus, it seems that my male protagonist texts and female protagonist texts were not as different as I had originally believed. The overlap between the two was quite large, and the addition of Rowling’s novels did not make things much clearer. Rowling’s novels were interspersed with the other texts, with nothing to indicate that all seven of them center on the titular character, Harry Potter. Regardless of these results, I continued with the next part, running the test set against the preferred words list and creating the following PCA visuals.
I created both a PCA Classic and a PCA Symbols visual, as looking at the two next to each other helps me to better understand what exactly I am seeing. Although I already knew that there was a lot of overlap among the texts to begin with, the seven Harry Potter novels did differ in how closely they aligned with the preferred words of the female protagonist texts or, conversely, the preferred words of the male protagonist texts. The text that seems to be most aligned with the preferred words is the fourth Harry Potter book, Harry Potter and the Goblet of Fire. I thought this was interesting, as this book has a ball in it, and I wonder if this seemingly “girly” topic might be what made it most like my primary set. There is also the question of whether there are simply more female characters in this book, given the appearance of Beauxbatons Academy of Magic, whose most prominent representatives in the story are women. That too may be a reason why this novel in particular seemed more like the female-centered texts than the male-centered ones. I also found it interesting that Harry Potter and the Sorcerer’s Stone and Harry Potter and the Deathly Hallows are two of the texts least like the female-centered texts and most like the male-centered ones, as they are a bit more action-centered than the other books in the series.
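To make the PCA step a little less of a black box, here is a minimal Python sketch of the general idea: measure how often each text uses the words on a chosen list, then find the single direction along which the texts differ most. This is only an illustration under stated assumptions, not what Stylo does internally. The function names are hypothetical, only the first principal component is computed (by power iteration on the centered data), and the frequencies are left unscaled.

```python
import math
import re


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_freqs(tokens, wordlist):
    """Relative frequency of each listed word in one (non-empty) tokenized text."""
    counts = {w: 0 for w in wordlist}
    for t in tokens:
        if t in counts:
            counts[t] += 1
    return [counts[w] / len(tokens) for w in wordlist]


def first_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component of `rows`.

    `rows` holds one list of word frequencies per text. Power iteration
    repeatedly multiplies a vector by X^T X (X = centered data) and
    renormalizes it, which converges on the direction of greatest variance.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Start from the centered row with the largest norm so the start is never zero
    # unless all texts are identical.
    v = max(centered, key=lambda r: sum(x * x for x in r))
    if not any(v):
        raise ValueError("all texts have identical frequencies")
    for _ in range(iterations):
        proj = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        v = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores
```

Each text's score is its position along that first axis, so texts with similar scores would sit close together horizontally on a PCA plot. The sign of the axis is arbitrary, so only the relative positions of the texts are meaningful.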
After getting the results from the PCA graphs, I decided to run a cluster analysis, at this point just looking for some further understanding of what I may have gotten wrong or why the project didn’t produce the results that I had originally expected or hoped for. I ran Stylo on all of the texts in my primary, secondary, and test sets, looking at the 100, 500, and 1,000 most frequent words. The results of the three cluster analyses were almost the same, with little variation in the groupings or in the distances within the clusters. I have included the cluster analysis for the 1,000 most frequent words below. It was interesting to see that the Rowling novels formed their own cluster away from everything else, even though in the oppose visuals they were thoroughly interspersed with the other texts.
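The comparison across 100, 500, and 1,000 most frequent words can be sketched in Python as follows. This is a hedged illustration rather than Stylo's implementation: it assumes a Classic (Burrows's) Delta distance, which I believe is Stylo's default, and instead of drawing a dendrogram it simply reports each text's nearest neighbour. All function names are hypothetical.

```python
import re
import statistics
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def most_frequent_words(token_lists, n):
    """The n most frequent words across the whole corpus."""
    totals = Counter()
    for tokens in token_lists:
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n)]


def z_score_table(token_lists, words):
    """Relative frequencies per text, standardized word by word across the corpus."""
    table = []
    for tokens in token_lists:
        counts = Counter(tokens)
        table.append([counts[w] / len(tokens) for w in words])
    columns = list(zip(*table))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    # A word used at exactly the same rate everywhere carries no signal, so it gets 0.
    return [[(row[j] - means[j]) / sds[j] if sds[j] else 0.0
             for j in range(len(words))] for row in table]


def delta(a, b):
    """Burrows's Delta: mean absolute difference between two rows of z-scores."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def nearest_neighbours(names, token_lists, mfw):
    """Map each text name to the name of the text closest to it at a given MFW setting."""
    words = most_frequent_words(token_lists, mfw)
    z = z_score_table(token_lists, words)
    result = {}
    for i, name in enumerate(names):
        others = [(delta(z[i], z[j]), names[j]) for j in range(len(names)) if j != i]
        result[name] = min(others)[1]
    return result
```

Calling nearest_neighbours once each with mfw set to 100, 500, and 1000 and comparing the three results is a quick way to check whether the groupings are stable. If every Rowling novel's nearest neighbour is another Rowling novel at all three settings, that would be consistent with the separate cluster described above.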
Overall, I started off with some pretty high expectations about what this project was going to show me. Actually working with the corpus, and getting a bit mixed up among its different pieces, created quite a few road bumps, though they only slightly hindered me in completing the assignment. In the end, I learned more about the overlap and similarities between the texts than about the differences between them. If I were to rework this project, I would need to find a different way to separate my corpus, one that creates a clearer distinction in what the visuals represent.