Second Try: Sentiment Analysis in Python
Originally published on my old blog.
Introduction
After my first experiments with using R for sentiment analysis, I started talking with a friend here at school about my work. Jackson and I decided that we’d like to give it a better shot and really try to get some meaningful results. After a lot of research, we decided to shift languages to Python (even though we both know R). We made this shift because Python has a number of very useful libraries for text processing and sentiment analysis, plus it’s easy to code in. We launched right into tutorials and coding, and this post will be about that process and our results.
We also met with Christopher Potts, a professor of linguistics here at Stanford. Prior to meeting with him, we consulted his sentiment analysis guide extensively and found it incredibly useful. We had a fantastic chat with Professor Potts and he helped us grasp some of the concepts we were working on.
If you’d like to jump straight to seeing the full code, you can head over to the GitHub repository.
The Setup
One of the resources we got a lot of mileage out of was StreamHacker, especially the articles on basic techniques, precision and recall and eliminating features. More to follow about each of those elements.
Another great discovery was the Natural Language ToolKit (NLTK). This is an incredible library for Python that can do a huge amount of text processing and analysis. This would end up forming the basis for our program.
During our first attempt, we basically just tried to convert my program in R into Python. We quickly realized that not only did Python have more efficient ways to do some of the steps, but it also was missing some functionality that I used in the R version. So instead, we started based on StreamHacker’s code.
An important piece of sentiment analysis terminology: “features” are whatever you’re analyzing in an attempt to correlate to the labels. For example, in this code, the features will be the words in each review. Other algorithms could use different types of features — some algorithms use bigrams or trigrams (strings of two or three consecutive words, respectively) as the features.
An idea from StreamHacker that we really liked was writing a function to evaluate different feature selection mechanisms. That means that we would be able to write different methods to select different subsets of the features (read: words) in the reviews and then evaluate those methods.
As an aside, here are the imports we used for the project, so I won’t have to reference them again:
We began to write our feature evaluator (again, many thanks to StreamHacker.):
This is the same polarity data that was used in my previous post, so check that out if you’re curious about the data. Essentially what that block of code does is splits up the reviews by line and then builds a posFeatures variable which contains the output of our feature selection mechanism (we’ll see how that works in a minute) with ‘pos’ or ‘neg’ appended to it, depending on whether the review it is drawing from is positive or negative.
The next bit of code separates the data into training and testing data for a Naive Bayes classifier, which is the same type of classifier I used before.
Now, thanks to NLTK, I can very simply train my classifier:
Pretty cool, huh? The last thing this function needs to do is check how well the classifier does when it tries to classify the testing data. This code is a little challenging so I’ll walk through it thoroughly.
First, I have to initiate referenceSets and testSets, to be used shortly. referenceSets will contain the actual values for the testing data (which we know because the data is prelabeled) and testSets will contain the predicted output.
Next, for each one of the testFeatures (the reviews that need testing), I iterate through three things: an arbitrary ‘i’, so be used as an identifier, and then the features (or words) in the review, and the actual label (‘pos’ or ‘neg’).
I add the ‘i’ (the unique identifier) to the correct bin in referenceSets. I then predict the label based on the features using the trained classifier and put the unique identifier in the predicted bin in testSets.
This gives me a big list of identifiers in referenceSets[‘pos’], which are the reviews known to be positive (and the same for the negative reviews). It also gives me a list of identifiers in testSets[‘pos’], which are the reviews predicted to be positive (and similarly for predicted negatives). What this allows me to do is to compare these lists and see how well the predictor did. Here’s where one of the StreamHacker articles (on precision and recall) really helped.
The essence of those two terms is that precision is a measure of false positives — a higher precision means fewer reviews that aren’t in the desired label get labeled as being in there. A high recall means fewer reviews that are in the desired label get put in the wrong level. As you can imagine, these metrics correlate very closely. Here’s the code (again from the NLTK library) to print out the positive and negative recall and precision, as well as the accuracy (a less-specific measure just showing what percentage the classifier got right). NLTK also has a cool function that shows the features (words) that were most helpful to the classifier in determining whether a review was positive or negative. So here’s the code:
The Basic Method
So after all that, we can start to figure out our feature selection mechanism. This basically means the way we select which words to train the classifier on.
In my previous post, I didn’t actually train the classifier on words at all. The classifier was trained on the number of words in each category from very negative to very positive. With this Python program, Jackson and I chose to look at the individual words themselves rather than counting positive and negative words.
We did this because there is inherent error in picking positive and negative words — there’s a huge loss of information there: sentence-long reviews were reduced down to just a few digits. With this method, we’re keeping a lot more of the information in the review.Although we certainly don’t keep all the information in the review, since sentence structure and punctuation disappear and we just look at words in a vacuum. Improving that might be a direction for us to go in the future.
The most obvious feature selection mechanism is just to look at all the words in each review — it’s simple to code and provides a great base case. Here’s all we need:
This just builds a dictionary object (what we need for the evaluate_features method) that has each of the words in the review followed by ‘True’.
Then we can just run the testing method:
That might seem overwhelming, but we can go through it bit by bit!
First, we see that the accuracy is 77% — this is already seeming a lot better than my first attempt with R. Then we see that the precisions and recalls are all pretty close to each other, which means that it is classifying everything fairly evenly. No problems there, and we don’t need to read a whole lot more into it.
Then we see the most informative features — for example, if ‘engrossing’ is in a review, there’s a 17:1 chance the review is positive.
As a side note, there’s a couple interesting ones in there. For example, ‘flaws’ being in a review strongly indicates that the review is positive. Perhaps that’s because people rarely use “flaws” in a very negative sense, typically opting for stronger words. However, it is common to hear “it had some flaws, but…”
The Next Step
Moving on, the next way to select features is to only take the n most informative features — basically, the features that convey the most information. Again (for the millionth time), thanks to StreamHacker here.
We first need to find the information gain of each word. This is a big chunk of code, but we’ll break it up.
First, we broke up the words in a similar way to in the evaluate_features function and made them iterable (so that we could iterate through them):
Then we set up an overall frequency distribution of all the words, which can be visualized as a huge histogram with the number of each word in all the reviews combined. However, with just this line, all we do is initialize the frequency distribution — it’s actually empty:
We’ll also need a conditional frequency distribution — a distribution that takes into account whether the word is in a positive or negative review. This can be visualized as two different histograms, one with all the words in positive reviews, and one with all the words in negative reviews. Like above, this is just an empty conditional frequency distribution. Nothing is in there yet.
Then, we essentially fill out the frequency distributions, incrementing (with .inc) the counter of each word within the appropriate distribution.
The next thing we need to find the highest-information features is the count of words in positive reviews, words in negative reviews, and total words:
The last thing we need to do is use a chi-squared test test (also from NLTK) to score the words. We find each word’s positive information score and negative information score, add them up, and fill up a dictionary correlating the words and scores, which we then return out of the function. Chi-squared tests, as you can read in the Wikipedia article I just linked to, is a great way to see how much information a given input conveys.
We then make another function that finds the best n words, given a set of scores (which we’ll calculate using the function we just made) and an n:
Are you a little confused by that ‘lambda’? I certainly was when I saw it. Essentially what it does is allow you to temporarily make a function to return something. In this case, it’s helping to sort the words into the correct order. You can check this out for more information on lambda functions.
Finally, we can make a feature selection mechanism that returns ‘True’ for a word only if it is in the best words list:
Last, I ran it using the best 10, 100, 1000, 10000, and 15000 words. Here’s the code to do that:
Here’s my output (I’ve cut out the informative features list because it’s the same for all of them, including the one using all the features):
evaluating best 10 word features
train on 7998 instances, test on 2666 instances
accuracy: 0.574643660915
pos precision: 0.549379652605
pos recall: 0.830457614404
neg precision: 0.652841781874
neg recall: 0.318829707427
evaluating best 100 word features
train on 7998 instances, test on 2666 instances
accuracy: 0.682295573893
pos precision: 0.659868421053
pos recall: 0.752438109527
neg precision: 0.712041884817
neg recall: 0.61215303826
evaluating best 1000 word features
train on 7998 instances, test on 2666 instances
accuracy: 0.79632408102
pos precision: 0.817014446228
pos recall: 0.763690922731
neg precision: 0.778169014085
neg recall: 0.82895723931
evaluating best 10000 word features
train on 7998 instances, test on 2666 instances
accuracy: 0.846586646662
pos precision: 0.868421052632
pos recall: 0.81695423856
neg precision: 0.827195467422
neg recall: 0.876219054764
evaluating best 15000 word features
train on 7998 instances, test on 2666 instances
accuracy: 0.846961740435
pos precision: 0.862745098039
pos recall: 0.825206301575
neg precision: 0.832494608196
neg recall: 0.868717179295
The Conclusions
There’s lots to read into here. Obviously, using very few features didn’t do great in terms of accuracy, precision, and recall because there wasn’t enough data to build the model off of. Using 1000 features is about as good as using all the features, but at 10000 and 15000, there’s a pretty huge increase over the base case, getting up to ~85% accuracy and similar precision and recall statistics.
That means that using intelligent feature selection increased the accuracy by around 8 percentage points, which seems like quite a significant jump. Jackson and I were very happy about that.
We’re also happy about the results as a whole — classifying reviews with over 80% accuracy is pretty impressive and we can see lots of applications for this technology.
Of course, there are tons of ways to improve these results. These include (but are not limited to):
- Adding different feature selection mechanisms
- Pre-processing the text to get rid of unimportant words or punctuation
- Doing deeper analysis of the sentences as a whole
- Trying a different classifier than the Naive Bayes Classifier
A disclaimer applies: we’re just learning all of this, and fairly independently too. There’s a decent chance that there’s a mistake or an inappropriate conclusion somewhere. If there is, please don’t hesitate to email me. I’d also love to hear from you if you have any other input on the project!
For the full code, check out the GitHub repository.
Addendum
Pieter Sheth-Voss pointed out in the comments section that there’s a linear relationship between the accuracy and the log of the number of features. I decided to take a look at this in R. Here’s the code I used.
This takes advantage of the great ggplot2 library for R. It simply puts in the features, calculates the log of them, puts in the accuracy, and plots these two variables. Here’s the output:
I noticed here at the last 4 values seem a bit out of line and ran the script again with those removed to see how well the relationship performed through 5000 features:
This is quite a strong correlation. Very interesting how well that worked out. Thanks for the heads up, Pieter!
Looking for more to read?
Subscribe to my roughly-monthly newsletter of recommendations, experiments, and ideas:
And I'd love to hear from you: andy@andybromberg.com