Project #3 – Topic Modeling and Visualization
March 28, 2019
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently. The “topics” produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is.
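As a rough illustration of that intuition, the sketch below counts how many words in a short text fall into each of two word clusters and reports the balance between them. It is only a toy under stated assumptions: the topic names, the word lists, and the example sentence are made up for illustration, and the topics are fixed by hand. A real topic model learns the clusters from the corpus itself, and this is not how the topic modeling tool works internally.

```python
from collections import Counter

# Hypothetical topics, fixed by hand for illustration only.
# A real topic model would discover these word clusters from the corpus.
TOPICS = {
    "people": {"anne", "diana", "marilla", "mrs", "mr", "miss"},
    "royalty": {"king", "queen", "madame", "don"},
}


def topic_balance(text):
    """Return, for each topic, its share of the topic words found in `text`."""
    # Lowercase each word and strip surrounding punctuation.
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]

    # Count how many words belong to each topic's vocabulary.
    counts = Counter()
    for word in words:
        for name, vocabulary in TOPICS.items():
            if word in vocabulary:
                counts[name] += 1

    # Convert raw counts to proportions; avoid dividing by zero.
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in TOPICS}
    return {name: counts[name] / total for name in TOPICS}


example = "Anne and Diana visited Mrs. Lynde, while the King waited."
print(topic_balance(example))
```

For the example sentence, three of the four matched words belong to the "people" cluster and one to the "royalty" cluster, so the text leans mostly toward "people."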
It was a little difficult for me to apply this tool to my corpus. My corpus is Comparing Countries, and some of the texts are written in different languages. As a result, some of the topics contained words in languages that I could not understand. My first thought was to Google the translations. I was able to find translations for some words but not for others. Some of the translations I did get were for languages I would not have expected, such as Arabic. When I read through the list of topics, I can understand how some of the words might be grouped together, but other times I really cannot. Honestly, I am not sure my corpus is the best example to use for topic modeling, since it involves other languages.
However, looking at the topics overall, I can understand why these words were chosen, since the genre I chose was fiction for all countries. For example, some words that I would expect to be more common in fiction texts are Madame, Don, King, and love. Some words made sense grouped together. For instance, one group included words like Anne, Mrs., miss, Mr., Marilla (Spanish version of her), and Diana. These words all tell me that we are talking about people, mostly women. But that exact same group also contains the word “good.” That was honestly driving me crazy. I did not understand how the word “good” could be in the same group as these other words.
I also tried the other graphs that were offered on the website. I stuck with the pie chart option because, honestly, it is the easiest for me to understand. On my laptop at home I had all the tools, but I was not able to make the graphs online, because when I logged on to the website not all of the contributions or topics showed up for me. I ended up having to wait until I could use the school’s laptop to complete the project. Apart from these complications, the topic modeling tool is super easy to navigate, and making the graphs was an even easier task for me. To be honest, I just feel that using my home laptop, together with my corpus as stated before, makes things a bit more complicated. None of the steps themselves are hard or complicated for me to do. However, the other graph visualizations were difficult for me to understand.
Another thing that I really could not figure out is why some words kept repeating in the topic modeling results. I went back to my corpus to see if maybe I had done something wrong, and I even went to the IT desk at school, where they said everything looked fine. So I honestly was not sure whether that was normal or not. Other than that, everything seemed pretty good. Compared with everything we learned previously, I know I grasped this topic faster than the others that were introduced to us. I thought the pie chart was in fact helpful because I was able to see why certain texts were grouped together.