## Introduction

One of the most common machine learning algorithms is the Support Vector Machine, or SVM. An SVM lets you make predictions about new data based on patterns in data you already have; we’ll see an example shortly.

In this post I’m going to walk you through the concept and intuition behind SVMs — to understand the content here, you need no technical background. I’ll be following up with another post containing the math behind SVMs, which will be substantially more advanced. Stay tuned!

## Example 1

Let’s imagine that we have a dataset of houses that have been on the market. We’ll imagine that we have three pieces of information for each: the house size in square feet, the list price of the house, and whether the house has been sold or not. Here’s a graph of that data (1):

Now let’s say that I give you the size and cost of a new house, as below:

If I asked you to predict whether you think the house has sold or not, you’d have a pretty clear answer: yes, it has probably sold. This simple bit of pattern recognition is exactly what we’re trying to get a computer to do when we talk about writing a machine learning algorithm!

Now, if I pushed you to describe the pattern you saw, you might say that there seems to be a sort of dividing line between the sold houses and the unsold houses. You might draw something like this and say that it seems like everything above the line is unsold and everything below the line is sold.

This is a fairly simple observation, but as it turns out, this is exactly what an SVM algorithm is doing on a basic level!

Effectively, what SVMs do is take your data and draw the line (or hyperplane, but we’ll get to that in a minute) that divides your dataset into groups of positive (sold) and negative (unsold) observations. Then, when you feed it a new data point, the algorithm figures out which side of the line the data point is on and spits back the predicted classification!
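As a toy sketch of that last step (the line’s numbers are invented, and this skips all of the actual SVM training), classifying a new point is just checking which side of the line it falls on:

```python
# Toy sketch: classify a house as sold/unsold by which side of a
# made-up dividing line it falls on. The line price = 200 * size + 50000
# is hypothetical, not learned from any real data.
def predict(size_sqft, price):
    boundary_price = 200 * size_sqft + 50000  # the dividing line
    return "sold" if price < boundary_price else "unsold"

print(predict(1500, 250000))  # below the line -> "sold"
print(predict(1500, 500000))  # above the line -> "unsold"
```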

Of course, I’m simplifying the algorithm a great deal here, but this is the basic idea of a Support Vector Machine algorithm.

## Example 2

The example above is all well and good, but it has one characteristic that makes it a bit too simple: the fact that I can draw a line that divides the dataset perfectly, with no observations falling on the wrong side of the line. That dataset is what we call a linearly separable dataset, though in reality, datasets are rarely linearly separable. Let’s take the same example but make the data just a touch more realistic.

Now we can’t draw a line that perfectly divides the dataset into sold and unsold houses! On the conceptual end of things, this actually doesn’t change much for us. If I asked you to predict the status of the “new” data point again, you’d probably still tell me that it has been sold.

If I asked you to show me the pattern again, you’d probably still draw a line that looks something like the one below and say that anything below the line has probably been sold and anything above the line has probably not been sold.

Maybe you’d be a little less confident with your predictions this time around, but it still seems like the obvious pattern.

It turns out that this is pretty much how an SVM works with non-linearly separable data, too. It uses what is called a regularization parameter to draw the best dividing line it can, and then it predicts new data points based on which side of the line they lie on.

## More Features

In the examples above, we only had two features: the house size and the house price. Because of this, we could draw a nice 2-dimensional plot and divide the data with a line. But SVMs work with any number of features, whether you have 1, 2, 3, or thousands. With our two features, we could draw a dividing line. But with 3 features, we need to draw a dividing plane. Here’s what that might look like:

Then, intuitively, we can classify new data points by determining which side of the plane they fall on.

Once we get beyond 3 features, we can no longer effectively visualize the data (or at least, I can’t see 4-dimensional plots!). With four or more features, the SVM creates what is called a hyperplane, a higher-dimensional analogue of a plane. But just like the line in 2-d or the plane in 3-d, the hyperplane divides the feature space in two and allows you to classify new points by determining which side of the hyperplane they fall on.
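In any number of dimensions, “which side of the hyperplane” is just the sign of a weighted sum: a hyperplane can be written as a weight vector w and an offset b, and a point x is classified by the sign of w·x + b. A minimal sketch with invented weights (again, not a trained model):

```python
# Hypothetical 4-feature hyperplane: the weights w and offset b below are
# invented for illustration, not fitted to any data.
def side_of_hyperplane(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "positive" if score > 0 else "negative"

w = [0.5, -1.0, 2.0, 0.25]
b = -1.0
print(side_of_hyperplane(w, b, [2.0, 1.0, 1.0, 4.0]))  # score = 2.0 -> "positive"
```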

## Conclusion

In the end, the concept behind a Support Vector Machine algorithm is pretty simple — draw something that divides your training data (2) into positive and negative samples as best as possible, then classify new data by determining which side of the hyperplane they lie on.

One note I want to make clear: it’s impossible to guarantee that you’ll predict correctly, of course! Even if your training data is perfectly linearly separable, a new data point could lie on one side of the hyperplane but actually belong to the class on the other side. Machine learning algorithms can only give you guesses for how your data should be classified; they can’t definitively tell you one way or the other.

There are a ton of directions to go from here, like using an SVM to predict multiple classes (rather than just classifying houses as sold or unsold, you could classify flowers as red, yellow, or blue) or optimizing an SVM in various ways to apply to certain problems. In an upcoming post I’m going to go over the math involved in what I discussed in this post, and then perhaps I’ll explore one of these directions.

I’ve attempted to explain SVMs in a conceptual way here, and in doing so, I’ve made a number of simplifications and skipped over parts of the algorithm. Please take this information for what it is: an intuitive way to think about Support Vector Machines, and not a rigorous examination.

I’d love to hear what you think about this post (or if I made any mistakes!) — feel free to email me or comment below!

1. This data isn’t real — I randomly generated it.

2. The “training data” is the data you have at the beginning, when you’re building your model. Then, you use the algorithm on your “test data” to get your results.

## Introduction

I’ve been interested in looking at the information contained in credit cards ever since I first heard about the Square card reader. The idea of such a tiny device being able to parse all the information on a credit card directly into your phone is fascinating to me, and I wanted to figure out how it worked and whether I could play with it a little bit. Unfortunately, Square moved from a very simple circuit design to a larger, encrypted version before I got mine. Coming into this weekend, I didn’t even have the expertise to figure out the simple version, much less the encrypted version, so I had mostly given up on working with credit card information.

However, one of my friends happened to have a few of the old Square readers that he gave me to work with. That gave me the opportunity to see what I could do and start hacking on this project.

## Software

My first goal was just to read data off a credit card (or any other card with a magstripe) with the Square reader. Jeffrey Malone has put together a couple excellent libraries for parsing data off the unencrypted Square reader, so I jumped right into those. I downloaded MagRead and built it, as below (from the command line):

```shell
git clone https://github.com/ieatlint/MagRead.git
cd MagRead
qmake    # MagRead is a Qt project, so qmake generates the Makefile
make
```


Then I plugged the Square reader into my computer, started scanning, and slid my card through. Note that this only works with combined microphone/headphone jacks, like the ones on MacBooks or ThinkPads. The app worked exactly as advertised and instantly displayed my credit card number and expiration date. I tried scanning some other cards (my student ID, insurance card, AAA card, etc.) and they all worked, displaying some of the information they contained.

After reading a bit more about how credit card magstripe encoding works, I learned that there are typically three tracks on the magstripe. Track 3 includes a variety of miscellaneous information, track 2 is the normal track with the card number and expiration date (along with a couple other pieces of information), and track 1 has everything on track 2 plus the cardholder’s name and the CVV (different from what most people consider the CVV on the back of the card). If you’re interested in learning more, I highly recommend checking out this website on credit card information.

I learned that the Square reader reads track 2 by default, but by putting a shim into the reader that’s about .11 inches thick, you could get it to read track 1 instead. I found that a zip tie did the trick and I could start reading the first track, which revealed some more information. MagRead continued to display this input perfectly, but it didn’t display the discretionary data on the track, which includes the CVV.

It turns out that the author of MagRead released another library called SWipe that dumps all the information read. I went through much the same process to get it to work, cloning it from GitHub, qmake’ing it, make’ing it, and then opening the resulting application. Now when I scanned the cards I could see everything on them, in the following format:

```
$B################^BROMBERG/ANDREW A^1408101##############?
```

Here’s a breakdown of what’s there:

- $ : Start-of-sequence marker
- B : Indicates a bank-issued card
- #s : Card number
- ^ : Separator
- BROMBERG/ANDREW A : My name
- ^ : Separator
- 1408 : The expiration date (August 2014)
- 101 : The card service code, indicating that the card can be used internationally and without restrictions
- #s : The discretionary data, including the CVV
- ? : The end-of-sequence marker


Similarly, I could see more information from other cards by scanning their first track as opposed to the second track.
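To make that format concrete, here’s a rough sketch of pulling the fields out of a track-1-style string with a regular expression. The sample string and the regex are my own invention (with a well-known fake test card number standing in for the masked digits), following the breakdown above rather than any official spec or MagRead’s actual code:

```python
import re

# Hypothetical track-1-style string. The post masks the digits with '#';
# here I substitute the standard fake test number 4111111111111111 and
# made-up discretionary digits so the regex has something to match.
sample = "$B4111111111111111^BROMBERG/ANDREW A^14081010000000000?"

# $B<card number>^<name>^<YY><MM><service code><discretionary>?
m = re.match(r"\$B(\d+)\^([^^]+)\^(\d{2})(\d{2})(\d{3})(\d*)\?", sample)
card_number, name, yy, mm, service_code, discretionary = m.groups()
print(name)             # BROMBERG/ANDREW A
print(mm + "/20" + yy)  # 08/2014
print(service_code)     # 101
```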

## Hardware

This was very cool, but I wasn’t satisfied yet. I wanted to know more about how and why the Square reader actually works. Thankfully I live in the same house as Jonathan Friedman, who runs CircuitHub (1) and has a PhD in Electrical Engineering. He spent a couple hours teaching me about electricity and magnetism from the ground up (which I’ll spare you in this blog post) and explained the principles behind magnetic card readers. If you’re interested in learning more about how they work, please email me and I’d be happy to talk to you, but there’s a large enough amount of background knowledge required that I don’t think it’s appropriate for this blog post.

Jonathan suggested that we try to dive into the electronics of the reader to help me better visualize what’s going on — so we did exactly that.

Our first step was to open the reader up so we could see the innards, which were actually really simple. The most prominent component is a magnetic head that is capable of detecting magnetic changes that are very close to it. This is then attached to two wires that are soldered onto the connector.

The wires in the reader connect to the ground and microphone segments of the connector. This means that the connector propagates a voltage differential (generated in the magnetic read head) into the computer or phone, which can then be parsed into the desired information. We decided to tap the internal part of the connector (where the original wires are connected) with our own wires to facilitate attachment to an oscilloscope to measure the output and visualize the results.

Jonathan soldered a couple wires on very delicately while I whittled the top half of the Square reader so that we could place it back on top without the new wires interfering with the placement.

Finally we plugged our wires into the oscilloscope and started swiping a card through the reader.

Unfortunately, no matter how much tuning we did, we couldn’t see anything. The Square reader doesn’t actually have an amplifier in it, just a raw magnetic transducer, so the signal is very weak. We measured the reader’s impedance at 5.375k ohms. With the oscilloscope set to its 1 million ohm input impedance setting, the wiring picked up the 60Hz airborne signal emanating from the power grid strongly enough to make decoding the card reader’s signal extremely difficult.

The oscilloscope at the 1 million ohm impedance setting. Only the 60Hz power grid signal is visible.

Switching to the oscilloscope’s low (50 ohm) impedance setting largely eliminated the influence of the airborne signal, but reduced the reader’s signal by more than 100x — to the point where we could no longer see it.

The oscilloscope at the 50 ohm impedance setting. No signal is visible at all.

Jonathan explained to me that Ohm’s law (voltage being generally current times impedance) implies that a large input impedance allows very weak induced currents to turn into comparatively large voltages. This is why the airborne power grid current turned into a very large displayed voltage in the oscilloscope at the 1 million ohm impedance setting.
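The arithmetic makes the effect concrete. Using a hypothetical induced current of 1 microamp (an invented figure for illustration; we never measured the current directly), V = I × Z gives wildly different voltages at the two settings:

```python
# Ohm's law: V = I * Z. The 1 uA induced current is an assumed value for
# illustration; the impedances are the oscilloscope's two input settings.
induced_current = 1e-6                 # amps (hypothetical)
v_high = induced_current * 1_000_000   # 1 MOhm setting -> about 1 V
v_low = induced_current * 50           # 50 Ohm setting -> about 50 microvolts

# The same tiny current produces a voltage 20,000x larger at the
# high-impedance setting, which is why the airborne 60Hz pickup dominated.
print(v_high, v_low, v_high / v_low)
```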

Jonathan suggested that we try an intermediate impedance value between 50 ohms and 1 million ohms, which we managed by putting a resistor in parallel with the circuit while setting the oscilloscope to the 50 ohm setting. We went with about 360k ohms. This was perfect, and suddenly we could see my credit card data displayed as waveforms on the screen as I swiped it! Here’s a picture of one of the digits of my credit card number:

What’s so amazing about this for me is that I could see the waveform representing the data on my card — the very same data that MagRead and SWipe and the Square app take in to parse and output as numbers. With some signal processing, we could actually refine the data on the oscilloscope and get the credit card information right out of it!

## Conclusion

There are a few ways I’d like to go with this project now that I have a solid baseline.

1. I want to read track 3 on credit cards and try to parse through that information to figure out what’s actually there. That will involve rigging the reader in such a way that I can offset the card correctly while reading the information.
2. I want to try to do some signal processing on the oscilloscope output to actually parse some data out myself, without the help of MagRead or SWipe.
3. Now that I know a tiny bit about electricity and magnetism, I want to learn more, especially about practical circuitry design and modification. I would really like to be able to hack electronics effectively as well as understand what’s going on in the things I use every day.

If you have any questions or comments on this post, please don’t hesitate to email me! I’d love to answer any questions you might have, and if you have any suggestions, please let me know.

1. If you’re interested in electronics and don’t know CircuitHub yet, you should. They’re building a free, universal parts library that makes it incredibly easy to pick out the parts you need.

## Introduction

After my first experiments with using R for sentiment analysis, I started talking with a friend here at school about my work. Jackson and I decided that we’d like to give it a better shot and really try to get some meaningful results. After a lot of research, we decided to shift languages to Python (even though we both know R). We made this shift because Python has a number of very useful libraries for text processing and sentiment analysis, plus it’s easy to code in. We launched right into tutorials and coding, and this post will be about that process and our results.

We also met with Christopher Potts, a professor of linguistics here at Stanford. Prior to meeting with him, we consulted his sentiment analysis guide extensively and found it incredibly useful. We had a fantastic chat with Professor Potts and he helped us grasp some of the concepts we were working on.

If you’d like to jump straight to seeing the full code, you can head over to the GitHub repository.

## The Setup

One of the resources we got a lot of mileage out of was StreamHacker, especially the articles on basic techniques, precision and recall, and eliminating features. More to follow about each of those elements.

Another great discovery was the Natural Language ToolKit (NLTK). This is an incredible library for Python that can do a huge amount of text processing and analysis. This would end up forming the basis for our program.

During our first attempt, we basically just tried to convert my R program into Python. We quickly realized not only that Python had more efficient ways to do some of the steps, but also that it was missing some functionality I had used in the R version. So instead, we started from StreamHacker’s code.

An important piece of sentiment analysis terminology: “features” are whatever you’re analyzing in an attempt to correlate to the labels. For example, in this code, the features will be the words in each review. Other algorithms could use different types of features — some algorithms use bigrams or trigrams (strings of two or three consecutive words, respectively) as the features.

An idea from StreamHacker that we really liked was writing a function to evaluate different feature selection mechanisms. That means that we would be able to write different methods to select different subsets of the features (read: words) in the reviews and then evaluate those methods.

As an aside, here are the imports we used for the project, so I won’t have to reference them again:

```python
import re, math, collections, itertools
import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist
```

We began to write our feature evaluator (again, many thanks to StreamHacker):

```python
def evaluate_features(feature_select):
    # reading pre-labeled input and splitting into lines
    posSentences = open('polarityData\\rt-polaritydata\\rt-polarity-pos.txt', 'r')
    negSentences = open('polarityData\\rt-polaritydata\\rt-polarity-neg.txt', 'r')
    posSentences = re.split(r'\n', posSentences.read())
    negSentences = re.split(r'\n', negSentences.read())

    posFeatures = []
    negFeatures = []
    # http://stackoverflow.com/questions/367155/splitting-a-string-into-words-and-punctuation
    # breaks up the sentences into lists of individual words (as selected by the
    # input mechanism) and appends 'pos' or 'neg' after each list
    for i in posSentences:
        posWords = re.findall(r"[\w']+|[.,!?;]", i)
        posWords = [feature_select(posWords), 'pos']
        posFeatures.append(posWords)
    for i in negSentences:
        negWords = re.findall(r"[\w']+|[.,!?;]", i)
        negWords = [feature_select(negWords), 'neg']
        negFeatures.append(negWords)
```

This is the same polarity data that was used in my previous post, so check that out if you’re curious about the data. Essentially, that block of code splits the reviews up by line and then builds a posFeatures variable containing the output of our feature selection mechanism (we’ll see how that works in a minute) with ‘pos’ or ‘neg’ appended to it, depending on whether the review it draws from is positive or negative.
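The `re.findall` pattern is what does the word-splitting: it matches either runs of word characters (keeping apostrophes inside words) or single punctuation marks. A quick demonstration on a made-up sentence:

```python
import re

# The same pattern used in evaluate_features: words (with apostrophes kept
# inside them) or individual punctuation marks.
tokens = re.findall(r"[\w']+|[.,!?;]", "This isn't a bad movie, honestly!")
print(tokens)  # ['This', "isn't", 'a', 'bad', 'movie', ',', 'honestly', '!']
```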

The next bit of code separates the data into training and testing data for a Naive Bayes classifier, which is the same type of classifier I used before.

```python
# selects 3/4 of the features to be used for training and 1/4 to be used for testing
posCutoff = int(math.floor(len(posFeatures)*3/4))
negCutoff = int(math.floor(len(negFeatures)*3/4))
trainFeatures = posFeatures[:posCutoff] + negFeatures[:negCutoff]
testFeatures = posFeatures[posCutoff:] + negFeatures[negCutoff:]
```

Now, thanks to NLTK, I can very simply train my classifier:

```python
classifier = NaiveBayesClassifier.train(trainFeatures)
```

Pretty cool, huh? The last thing this function needs to do is check how well the classifier does when it tries to classify the testing data. This code is a little challenging so I’ll walk through it thoroughly.

First, I have to initialize referenceSets and testSets, to be used shortly. referenceSets will contain the actual values for the testing data (which we know because the data is prelabeled) and testSets will contain the predicted output.

Next, for each one of the testFeatures (the reviews that need testing), I iterate through three things: an arbitrary ‘i’, to be used as an identifier; the features (or words) in the review; and the actual label (‘pos’ or ‘neg’).

I add the ‘i’ (the unique identifier) to the correct bin in referenceSets. I then predict the label based on the features using the trained classifier and put the unique identifier in the predicted bin in testSets.

```python
# initializes the sets of actual (reference) and predicted (test) labels
referenceSets = collections.defaultdict(set)
testSets = collections.defaultdict(set)

for i, (features, label) in enumerate(testFeatures):
    referenceSets[label].add(i)
    predicted = classifier.classify(features)
    testSets[predicted].add(i)
```

This gives me a big list of identifiers in referenceSets['pos'], which are the reviews known to be positive (and the same for the negative reviews). It also gives me a list of identifiers in testSets['pos'], which are the reviews predicted to be positive (and similarly for predicted negatives). What this allows me to do is to compare these lists and see how well the predictor did. Here’s where one of the StreamHacker articles (on precision and recall) really helped.

The essence of those two terms: precision is a measure of false positives, so higher precision means fewer reviews that don’t belong to a label get put into it. Recall is a measure of false negatives, so higher recall means fewer reviews that do belong to a label get left out of it. As you can imagine, the two metrics are closely related. Here’s the code (again using the NLTK library) to print out the positive and negative precision and recall, as well as the accuracy (a less specific measure that just shows what percentage of classifications were correct). NLTK also has a cool function that shows the features (words) that were most helpful to the classifier in determining whether a review was positive or negative. So here’s the code:

```python
print 'train on %d instances, test on %d instances' % (len(trainFeatures), len(testFeatures))
print 'accuracy:', nltk.classify.util.accuracy(classifier, testFeatures)
print 'pos precision:', nltk.metrics.precision(referenceSets['pos'], testSets['pos'])
print 'pos recall:', nltk.metrics.recall(referenceSets['pos'], testSets['pos'])
print 'neg precision:', nltk.metrics.precision(referenceSets['neg'], testSets['neg'])
print 'neg recall:', nltk.metrics.recall(referenceSets['neg'], testSets['neg'])
classifier.show_most_informative_features(10)
```
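In terms of referenceSets and testSets, those two metrics boil down to set-overlap ratios. Here’s a tiny hand-worked sketch, with made-up identifiers, of what the NLTK precision and recall calls compute:

```python
# Made-up identifiers: reviews 0-4 are actually positive, and the
# classifier predicted {0, 1, 2, 3, 5, 6} as positive.
reference = {0, 1, 2, 3, 4}     # actual positives (referenceSets['pos'])
predicted = {0, 1, 2, 3, 5, 6}  # predicted positives (testSets['pos'])

precision = len(reference & predicted) / len(predicted)  # 4/6, about 0.67
recall = len(reference & predicted) / len(reference)     # 4/5 = 0.8
print(precision, recall)
```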

## The Basic Method

So after all that, we can start to figure out our feature selection mechanism. This basically means the way we select which words to train the classifier on.

In my previous post, I didn’t actually train the classifier on words at all. The classifier was trained on the number of words in each category from very negative to very positive. With this Python program, Jackson and I chose to look at the individual words themselves rather than counting positive and negative words.

We did this because there is inherent error in picking positive and negative words — there’s a huge loss of information there: sentence-long reviews were reduced down to just a few digits. With this method, we’re keeping a lot more of the information in the review (1).

The most obvious feature selection mechanism is just to look at all the words in each review — it’s simple to code and provides a great base case. Here’s all we need:

```python
def make_full_dict(words):
    return dict([(word, True) for word in words])
```

This just builds a dictionary object (what we need for the evaluate_features method) that maps each word in the review to ‘True’.

Then we can just run the testing method:

```python
print 'using all words as features'
evaluate_features(make_full_dict)
```

Here’s the output:

```
using all words as features
train on 7998 instances, test on 2666 instances
accuracy: 0.773068267067
pos precision: 0.787066246057
pos recall: 0.748687171793
neg precision: 0.760371959943
neg recall: 0.797449362341
Most Informative Features
   engrossing = True    pos : neg  =  17.0 : 1.0
        quiet = True    pos : neg  =  15.7 : 1.0
     mediocre = True    neg : pos  =  13.7 : 1.0
    absorbing = True    pos : neg  =  13.0 : 1.0
     portrait = True    pos : neg  =  12.4 : 1.0
   refreshing = True    pos : neg  =  12.3 : 1.0
        flaws = True    pos : neg  =  12.3 : 1.0
    inventive = True    pos : neg  =  12.3 : 1.0
      triumph = True    pos : neg  =  11.7 : 1.0
 refreshingly = True    pos : neg  =  11.7 : 1.0
```


That might seem overwhelming, but we can go through it bit by bit!

First, we see that the accuracy is 77%, which already seems a lot better than my first attempt with R. Then we see that the precisions and recalls are all pretty close to each other, which means the classifier is treating the two classes fairly evenly. No problems there, and we don’t need to read a whole lot more into it.

Then we see the most informative features: for example, if ‘engrossing’ is in a review, the odds are 17:1 that the review is positive.

As a side note, there are a couple of interesting ones in there. For example, ‘flaws’ being in a review strongly indicates that the review is positive. Perhaps that’s because people rarely use “flaws” in a very negative sense, typically opting for stronger words; it is common, though, to hear “it had some flaws, but…”

## The Next Step

Moving on, the next way to select features is to only take the n most informative features — basically, the features that convey the most information. Again (for the millionth time), thanks to StreamHacker here.

We first need to find the information gain of each word. This is a big chunk of code, but we’ll break it up.

First, we broke up the words in a similar way to the evaluate_features function and made them iterable (so that we could iterate through them):

```python
def create_word_scores():
    # splits sentences into lines
    posSentences = open('polarityData\\rt-polaritydata\\rt-polarity-pos.txt', 'r')
    negSentences = open('polarityData\\rt-polaritydata\\rt-polarity-neg.txt', 'r')
    posSentences = re.split(r'\n', posSentences.read())
    negSentences = re.split(r'\n', negSentences.read())

    # creates lists of all positive and negative words
    posWords = []
    negWords = []
    for i in posSentences:
        posWord = re.findall(r"[\w']+|[.,!?;]", i)
        posWords.append(posWord)
    for i in negSentences:
        negWord = re.findall(r"[\w']+|[.,!?;]", i)
        negWords.append(negWord)
    posWords = list(itertools.chain(*posWords))
    negWords = list(itertools.chain(*negWords))
```

Then we set up an overall frequency distribution of all the words, which can be visualized as a huge histogram with the number of each word in all the reviews combined. However, with just this line, all we do is initialize the frequency distribution — it’s actually empty:

```python
word_fd = FreqDist()
```

We’ll also need a conditional frequency distribution — a distribution that takes into account whether the word is in a positive or negative review. This can be visualized as two different histograms, one with all the words in positive reviews, and one with all the words in negative reviews. Like above, this is just an empty conditional frequency distribution. Nothing is in there yet.

```python
cond_word_fd = ConditionalFreqDist()
```

Then, we essentially fill out the frequency distributions, incrementing (with .inc) the counter of each word within the appropriate distribution.

```python
for word in posWords:
    word_fd.inc(word.lower())
    cond_word_fd['pos'].inc(word.lower())
for word in negWords:
    word_fd.inc(word.lower())
    cond_word_fd['neg'].inc(word.lower())
```
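As an aside, `.inc` comes from the version of NLTK we were using; newer NLTK versions drop it in favor of Counter-style updates. To see the counting logic on its own, here’s a stand-in using plain `collections.Counter` objects (my own sketch with tiny made-up word lists, not the NLTK API):

```python
from collections import Counter, defaultdict

# Stand-ins for FreqDist / ConditionalFreqDist, with made-up word lists.
posWords = ['great', 'fun', 'great']
negWords = ['dull', 'fun']

word_fd = Counter()                 # overall word counts
cond_word_fd = defaultdict(Counter) # per-label word counts
for word in posWords:
    word_fd[word.lower()] += 1
    cond_word_fd['pos'][word.lower()] += 1
for word in negWords:
    word_fd[word.lower()] += 1
    cond_word_fd['neg'][word.lower()] += 1

print(word_fd['great'], word_fd['fun'])  # 2 2
print(cond_word_fd['pos']['fun'])        # 1
```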

To find the highest-information features, we next need the count of words in positive reviews, the count of words in negative reviews, and the total word count:

```python
pos_word_count = cond_word_fd['pos'].N()
neg_word_count = cond_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count
```

The last thing we need to do is use a chi-squared test (also from NLTK) to score the words. We find each word’s positive information score and negative information score, add them up, and fill up a dictionary correlating the words and scores, which we then return out of the function. A chi-squared test, as you can read in the Wikipedia article I just linked to, is a great way to see how much information a given input conveys.

```python
word_scores = {}
for word, freq in word_fd.iteritems():
    pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word],
                                           (freq, pos_word_count), total_word_count)
    neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word],
                                           (freq, neg_word_count), total_word_count)
    word_scores[word] = pos_score + neg_score

return word_scores
```

We then make another function that finds the best n words, given a set of scores (which we’ll calculate using the function we just made) and an n:

```python
def find_best_words(word_scores, number):
    best_vals = sorted(word_scores.iteritems(), key=lambda (w, s): s, reverse=True)[:number]
    best_words = set([w for w, s in best_vals])
    return best_words
```

Are you a little confused by that ‘lambda’? I certainly was when I first saw it. Essentially, it lets you define a small anonymous function inline; in this case, it tells sorted to order the words by their scores. You can check this out for more information on lambda functions.
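To see the sort in isolation: the `lambda (w, s): s` tuple-unpacking form only works in Python 2, so this sketch (with made-up scores) uses the equivalent Python 3 style:

```python
# Sorting a score dictionary from highest to lowest score, as in
# find_best_words. The scores here are invented for illustration.
word_scores = {'great': 9.1, 'the': 0.2, 'awful': 7.5}
best_vals = sorted(word_scores.items(), key=lambda pair: pair[1], reverse=True)[:2]
best_words = set(w for w, s in best_vals)
print(best_vals)   # [('great', 9.1), ('awful', 7.5)]
print(best_words)  # {'great', 'awful'} (set display order may vary)
```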

Finally, we can make a feature selection mechanism that returns ‘True’ for a word only if it is in the best words list:

```python
def best_word_features(words):
    return dict([(word, True) for word in words if word in best_words])
```
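With a made-up best_words set, you can see what that filtering does to a single review:

```python
# Hypothetical best_words set; in the real program it comes from
# find_best_words(word_scores, n).
best_words = {'engrossing', 'mediocre', 'triumph'}

def best_word_features(words):
    return dict([(word, True) for word in words if word in best_words])

print(best_word_features(['an', 'engrossing', 'triumph', 'of', 'cinema']))
# only the informative words survive: {'engrossing': True, 'triumph': True}
```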

Last, I ran it using the best 10, 100, 1000, 10000, and 15000 words. Here’s the code to do that:

```python
word_scores = create_word_scores()

numbers_to_test = [10, 100, 1000, 10000, 15000]
# tries the best_word_features mechanism with each number of features in numbers_to_test
for num in numbers_to_test:
    print 'evaluating best %d word features' % (num)
    best_words = find_best_words(word_scores, num)
    evaluate_features(best_word_features)
```

Here’s my output (I’ve cut out the informative features list because it’s the same for all of them, including the one using all the features):

```
evaluating best 10 word features
train on 7998 instances, test on 2666 instances
accuracy: 0.574643660915
pos precision: 0.549379652605
pos recall: 0.830457614404
neg precision: 0.652841781874
neg recall: 0.318829707427

evaluating best 100 word features
train on 7998 instances, test on 2666 instances
accuracy: 0.682295573893
pos precision: 0.659868421053
pos recall: 0.752438109527
neg precision: 0.712041884817
neg recall: 0.61215303826

evaluating best 1000 word features
train on 7998 instances, test on 2666 instances
accuracy: 0.79632408102
pos precision: 0.817014446228
pos recall: 0.763690922731
neg precision: 0.778169014085
neg recall: 0.82895723931

evaluating best 10000 word features
train on 7998 instances, test on 2666 instances
accuracy: 0.846586646662
pos precision: 0.868421052632
pos recall: 0.81695423856
neg precision: 0.827195467422
neg recall: 0.876219054764

evaluating best 15000 word features
train on 7998 instances, test on 2666 instances
accuracy: 0.846961740435
pos precision: 0.862745098039
pos recall: 0.825206301575
neg precision: 0.832494608196
neg recall: 0.868717179295
```


## The Conclusions

There’s lots to read into here. Obviously, using very few features didn’t do great in terms of accuracy, precision, and recall because there wasn’t enough data to build the model off of. Using 1000 features is about as good as using all the features, but at 10000 and 15000, there’s a pretty huge increase over the base case, getting up to ~85% accuracy and similar precision and recall statistics.

That means that using intelligent feature selection increased the accuracy by around 8 percentage points, which seems like quite a significant jump. Jackson and I were very happy about that.

We’re also happy about the results as a whole — classifying reviews with over 80% accuracy is pretty impressive and we can see lots of applications for this technology.

Of course, there are tons of ways to improve these results. These include (but are not limited to):

1. Adding different feature selection mechanisms
2. Pre-processing the text to get rid of unimportant words or punctuation
3. Doing deeper analysis of the sentences as a whole
4. Trying a classifier other than Naive Bayes

A disclaimer applies: we’re just learning all of this, and fairly independently too. There’s a decent chance that there’s a mistake or an inappropriate conclusion somewhere. If there is, please don’t hesitate to email me. I’d also love to hear from you if you have any other input on the project!

For the full code, check out the GitHub repository.

Pieter Sheth-Voss pointed out in the comments section that there’s a linear relationship between the accuracy and the log of the number of features. I decided to take a look at this in R. Here’s the code I used.

```r
library(ggplot2)

features <- c(10, 50, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000, 12500, 15000)
logFeatures <- log(features)
accuracy <- c(0.5746436609152288, 0.6406601650412603, 0.6822955738934734,
              0.7291822955738935, 0.7659414853713429, 0.7861965491372843,
              0.7963240810202551, 0.8304576144036009, 0.8510877719429858,
              0.8465866466616654, 0.8465866466616654, 0.8465866466616654,
              0.8469617404351087)
data <- data.frame(logFeatures, accuracy)
ggplot(data, aes(x=logFeatures, y=accuracy)) + geom_point(shape=1) + geom_smooth(method=lm)
```

This takes advantage of the great ggplot2 library for R. It simply puts in the feature counts, takes their logs, pairs them with the accuracies, and plots the two variables with a linear fit. Here’s the output:

I noticed here that the last four values seem a bit out of line, so I ran the script again with those removed to see how well the relationship held through 5000 features:

This is quite a strong correlation. Very interesting how well that worked out. Thanks for the heads up, Pieter!
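As a quick numeric sanity check on that observation, here’s a sketch in Python (rather than the R used above) that computes the Pearson correlation between the log of the feature counts and the accuracies from the script:

```python
import math

# Feature counts and accuracies from the R script above
features = [10, 50, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000, 12500, 15000]
accuracy = [0.5746436609152288, 0.6406601650412603, 0.6822955738934734,
            0.7291822955738935, 0.7659414853713429, 0.7861965491372843,
            0.7963240810202551, 0.8304576144036009, 0.8510877719429858,
            0.8465866466616654, 0.8465866466616654, 0.8465866466616654,
            0.8469617404351087]

log_features = [math.log(f) for f in features]

def pearson(xs, ys):
    # sample Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson(log_features, accuracy), 3))  # strongly positive, well above 0.9
```

Even with the plateau at the high end included, the correlation is strong, which matches what the plots show.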

1. Although we certainly don’t keep all the information in the review, since sentence structure and punctuation disappear and we just look at words in a vacuum. Improving that might be a direction for us to go in the future.

## Introduction

Over the past few weeks, I’ve become very interested in statistics and machine learning. During my first semester at school, I learned a huge amount about the R programming language (thanks to a couple of my classes and a very helpful TA — more on that in a later post), but I quickly ran out of ideas to code up.

About two weeks ago, I realized that I should really have more of a background in statistics (I’ve never taken a statistics class) if I wanted to keep advancing my knowledge of R. I watched a lot of Khan Academy videos and read tutorials and papers to begin getting a basic understanding of the field and some of the techniques used in it.

Then, a few days ago, I started looking again for projects to work on. I came upon Jeffrey Breen’s excellent discussion of mining Twitter for airline customer sentiment. This launched me into research of sentiment analysis using R.

Around the same time, I also came upon some of the basic concepts of machine learning, including classification algorithms. I then decided on a project to work on that would combine my learning so far in all four of these areas: R, statistics, sentiment analysis and data mining, and classification algorithms (the last two being closely related).

Here was my (very) basic workflow for the project:

1. Select content to analyze
2. Analyze content using dataset of positive and negative elements
3. Train a classification algorithm based on the data
4. Use the trained algorithm to attempt to re-classify the same content and see how well it does

This post will be peppered with the code I wrote (in R) for this project. If you’d like to take a look at all of it at once, either scroll to the bottom of the post or check it out on GitHub here.

## The Process

I quickly decided that for my first sentiment analysis project, I didn’t want to mine Twitter. Tweets are too varied, not only in intention but also in language (1). I found that sentiment analysts often use product and movie reviews to test their analyses, so I settled on those. I found Cornell professor Bo Pang’s page on movie review data and selected his sentence polarity dataset v1.0, which has 10,663 sentences from movie reviews, each classified as either positive or negative. This would be my base content.

Here’s the code to load up the positive and negative sentences and split them onto individual lines (using str_split() from the stringr package) (2):

```r
posText <- read.delim(file='polarityData/rt-polaritydata/rt-polarity-pos.txt', header=FALSE, stringsAsFactors=FALSE)
posText <- posText$V1
posText <- unlist(lapply(posText, function(x) { str_split(x, "\n") }))
negText <- read.delim(file='polarityData/rt-polaritydata/rt-polarity-neg.txt', header=FALSE, stringsAsFactors=FALSE)
negText <- negText$V1
negText <- unlist(lapply(negText, function(x) { str_split(x, "\n") }))
```

After a lot of searching around for a dataset to analyze the sentence content, I found the AFINN wordlist, which has 2477 words and phrases rated from -5 [very negative] to +5 [very positive]. I reclassified the AFINN words into four categories (3):

• Very Negative (rating -5 or -4)
• Negative (rating -3, -2, or -1)
• Positive (rating 1, 2, or 3)
• Very Positive (rating 4 or 5)

I also added in a few more words specific to movies (found here) to round out my wordlist. Here’s the code for all that:

```r
#load up word polarity list and format it
afinn_list <- read.delim(file='AFINN/AFINN-111.txt', header=FALSE, stringsAsFactors=FALSE)
names(afinn_list) <- c('word', 'score')
afinn_list$word <- tolower(afinn_list$word)

#categorize words as very negative to very positive and add some movie-specific words
vNegTerms <- afinn_list$word[afinn_list$score==-5 | afinn_list$score==-4]
negTerms <- c(afinn_list$word[afinn_list$score==-3 | afinn_list$score==-2 | afinn_list$score==-1], "second-rate", "moronic", "third-rate", "flawed", "juvenile", "boring", "distasteful", "ordinary", "disgusting", "senseless", "static", "brutal", "confused", "disappointing", "bloody", "silly", "tired", "predictable", "stupid", "uninteresting", "trite", "uneven", "outdated", "dreadful", "bland")
posTerms <- c(afinn_list$word[afinn_list$score==3 | afinn_list$score==2 | afinn_list$score==1], "first-rate", "insightful", "clever", "charming", "comical", "charismatic", "enjoyable", "absorbing", "sensitive", "intriguing", "powerful", "pleasant", "surprising", "thought-provoking", "imaginative", "unpretentious")
vPosTerms <- c(afinn_list$word[afinn_list$score==5 | afinn_list$score==4], "uproarious", "riveting", "fascinating", "dazzling", "legendary")
```

I chose to ignore neutral words because I didn’t believe these would help my classification. I then counted the number of words in each review that fit one of those four categories. This left me with a big data table (10,663 rows) of the form:

sentence | #vNeg | #neg | #pos | #vPos | sentiment

For example:

Though it is by no means his best work, laissez-passer is a distinguished and distinctive effort by a bona-fide master, a fascinating film replete with rewards to be had by all willing to make the effort to reap them. | 0 | 1 | 3 | 1 | positive

In this example, the sentence had 0 very negative words, 1 negative word, 3 positive words, and 1 very positive word. It was also classified as positive by the dataset’s creator.
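To make the counting step concrete before the full R code below, here’s a toy sketch in Python. The mini wordlists are made up for illustration (they are not the real AFINN-derived lists), but the mechanics are the same: strip punctuation and digits, lowercase, split into words, and count matches against each category.

```python
import re

# Toy stand-ins for the four AFINN-derived categories
# (made-up mini lists, not the actual wordlists used in the post)
vNegTerms = {"disgusting", "dreadful"}
negTerms  = {"boring", "flawed", "tired"}
posTerms  = {"charming", "clever", "enjoyable"}
vPosTerms = {"fascinating", "dazzling"}

def sentiment_counts(sentence):
    # remove punctuation and digits, lowercase, split on whitespace
    cleaned = re.sub(r"[^\w\s]|\d", "", sentence).lower()
    words = cleaned.split()
    # count how many words fall into each of the four categories
    return [sum(w in terms for w in words)
            for terms in (vNegTerms, negTerms, posTerms, vPosTerms)]

counts = sentiment_counts("A charming, clever, and fascinating film, if slightly flawed.")
print(counts)  # [0, 1, 2, 1]
```

Those four numbers, plus the known sentiment label, form one row of the table described above.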

I wrote a bunch of code (based heavily on Jeffrey Breen’s code) to accomplish all this. Here it is (I used the laply function from the plyr package):

```r
#function to calculate number of words in each category within a sentence
sentimentScore <- function(sentences, vNegTerms, negTerms, posTerms, vPosTerms){
  final_scores <- matrix('', 0, 5)
  scores <- laply(sentences, function(sentence, vNegTerms, negTerms, posTerms, vPosTerms){
    initial_sentence <- sentence
    #remove unnecessary characters and split up by word
    sentence <- gsub('[[:punct:]]', '', sentence)
    sentence <- gsub('[[:cntrl:]]', '', sentence)
    sentence <- gsub('\\d+', '', sentence)
    sentence <- tolower(sentence)
    wordList <- str_split(sentence, '\\s+')
    words <- unlist(wordList)
    #build vector with matches between sentence and each category
    vPosMatches <- match(words, vPosTerms)
    posMatches <- match(words, posTerms)
    vNegMatches <- match(words, vNegTerms)
    negMatches <- match(words, negTerms)
    #sum up number of words in each category
    vPosMatches <- sum(!is.na(vPosMatches))
    posMatches <- sum(!is.na(posMatches))
    vNegMatches <- sum(!is.na(vNegMatches))
    negMatches <- sum(!is.na(negMatches))
    score <- c(vNegMatches, negMatches, posMatches, vPosMatches)
    #add row to scores table
    newrow <- c(initial_sentence, score)
    final_scores <- rbind(final_scores, newrow)
    return(final_scores)
  }, vNegTerms, negTerms, posTerms, vPosTerms)
  return(scores)
}

#build tables of positive and negative sentences with scores
posResult <- as.data.frame(sentimentScore(posText, vNegTerms, negTerms, posTerms, vPosTerms))
negResult <- as.data.frame(sentimentScore(negText, vNegTerms, negTerms, posTerms, vPosTerms))
posResult <- cbind(posResult, 'positive')
colnames(posResult) <- c('sentence', 'vNeg', 'neg', 'pos', 'vPos', 'sentiment')
negResult <- cbind(negResult, 'negative')
colnames(negResult) <- c('sentence', 'vNeg', 'neg', 'pos', 'vPos', 'sentiment')

#combine the positive and negative tables
results <- rbind(posResult, negResult)
```

In Jeffrey Breen’s analysis of airline customer sentiment, he used a very simple algorithm to classify the Tweets: he simply took the number of negative terms and subtracted them from the number of positive terms. He then compared these negative and positive values to each other to visualize their sentiment. I decided to implement an algorithm I had learned while researching machine learning: the Naive Bayes classifier. I picked up the idea to run a classification algorithm from this post, which used a similar process to the one described by Breen but relied on a Bayes classifier instead of his additive method.

At this point, I used the naiveBayes classifier from the e1071 package to attempt to classify the sentences as positive or negative (of course, without looking at the sentiment column). As the name suggests, this works by implementing a Naive Bayes algorithm. I won’t go into great detail here as to how it works (check out the Wikipedia page I linked to if you want to learn more), but the essential idea is that it looks at how the number of words in each of the four categories relates to whether the sentence is positive or negative. It then tries to guess whether a sentence is positive or negative by examining how many words it has in each category and relating this to the probabilities of those numbers appearing in positive and negative sentences.
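To make that idea concrete, here’s a minimal hand-rolled sketch in Python of a Naive Bayes classifier over the four count features. This is not the e1071 implementation, and the training rows are made-up toy data purely to illustrate the mechanism; Laplace smoothing is added so unseen count values don’t zero out a probability.

```python
import math
from collections import Counter, defaultdict

# Toy training rows of (vNeg, neg, pos, vPos) counts with labels.
# These numbers are made up to illustrate the mechanism; they are
# not drawn from the actual movie-review data.
train = [((0, 0, 2, 1), "positive"), ((0, 1, 3, 0), "positive"),
         ((0, 0, 1, 0), "positive"), ((1, 2, 0, 0), "negative"),
         ((0, 3, 1, 0), "negative"), ((2, 1, 0, 0), "negative")]

labels = Counter(label for _, label in train)
# value_counts[label][i][v]: how often feature i took value v under this label
value_counts = defaultdict(lambda: defaultdict(Counter))
for feats, label in train:
    for i, v in enumerate(feats):
        value_counts[label][i][v] += 1

def classify(feats, smooth=1.0, n_values=5):
    # pick the label maximizing log P(label) + sum_i log P(feature_i = v | label),
    # with Laplace smoothing (n_values = assumed number of possible counts, 0-4)
    best, best_score = None, float("-inf")
    for label, n in labels.items():
        score = math.log(n / len(train))
        for i, v in enumerate(feats):
            score += math.log((value_counts[label][i][v] + smooth) / (n + smooth * n_values))
        if score > best_score:
            best, best_score = label, score
    return best

print(classify((0, 0, 2, 0)))  # "positive"
print(classify((1, 2, 0, 0)))  # "negative"
```

The "naive" part is the assumption that the four counts are independent given the label, which keeps the probability estimates simple.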

Here’s all I needed to do to run the classifier:

```r
classifier <- naiveBayes(results[,2:5], results[,6])
```

A confusion matrix can help visualize the results of a classification algorithm. The matrix for my result was:

```
                 actual
predicted   positive   negative
 positive       2847       1546
 negative       2484       3786
```


This was generated by this code:

```r
confTable <- table(predict(classifier, results), results[,6], dnn=list('predicted','actual'))
confTable
```
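The standard classification metrics fall straight out of a confusion matrix like this one. Here’s a small Python sketch (not part of the original R script) that derives accuracy, positive precision, and positive recall from the numbers above:

```python
# Confusion matrix values from the post (rows = predicted, columns = actual)
tp, fp = 2847, 1546   # predicted positive: actually positive / actually negative
fn, tn = 2484, 3786   # predicted negative: actually positive / actually negative

total = tp + fp + fn + tn
accuracy = (tp + tn) / total            # fraction of all sentences classified correctly
pos_precision = tp / (tp + fp)          # of the predicted positives, how many were right
pos_recall = tp / (tp + fn)             # of the actual positives, how many were found

print(total)                  # 10663
print(round(accuracy, 4))     # 0.6221
print(round(pos_precision, 4))
print(round(pos_recall, 4))
```

So the classifier got about 62% of the sentences right, which lines up with the confidence interval discussed next.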

Since each classification here is a Bernoulli trial (the prediction is either correct or incorrect), I was able to run a binomial test to compute a confidence interval for my results. I found that, with 95% confidence, the true proportion my program would classify correctly lies between 61.28% and 63.13%. I learned about the Bernoulli distribution and the binomial test from the excellent Khan Academy statistics videos I had been watching. Here’s what I wrote for the binomial test:

```r
binom.test(confTable[1,1] + confTable[2,2], nrow(results), p=0.5)
```
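binom.test computes an exact binomial interval. As a sanity check, a normal-approximation (Wald) interval, sketched here in Python with the correct/total counts from the confusion matrix, lands in nearly the same place:

```python
import math

correct = 2847 + 3786   # correctly classified sentences (diagonal of the confusion matrix)
n = 10663               # total sentences
p_hat = correct / n

# 95% normal-approximation confidence interval for the true proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(round(lo, 4), round(hi, 4))  # 0.6129 0.6313, close to binom.test's exact interval
```

With over ten thousand trials, the normal approximation and the exact test agree to within a hundredth of a percentage point.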

## Conclusion

I learned a lot from this project. I’m so glad that I was able to pull together what I’ve learned in several different areas to work on one unified program.

Taking a look at my results, though, I have a few comments.

1. My classification results aren’t that great. I had hoped to get much higher than a 60-65% correct classification rate. That’s quite alright with me, though; it means I have plenty of room to improve this program! I could try different classification algorithms, different wordlists, even different training data. There’s lots of space to work.
2. My code isn’t particularly pretty, and I would like to change that. I’ve just started to read Google’s Style Guide for R and I’ve noticed a few things I do wrong. I’ll set out to fix those in later versions of this program if I choose to move forward with it.
3. I’d like to try scraping my data in the future rather than relying on premade databases for analysis. Even though there are undoubtedly advantages to the premade ones (they’ve likely been cleaned up and/or additional attributes have been added to them), I’d like to get more practice with scraping if I can. I could also use this as an opportunity to try to interface between PHP or Python (for the scraping) and R (for the analysis).

If you’d like to grab the code from GitHub, please feel free. It’s right here.

As I said, I just started to learn all of this within the past couple weeks; I wouldn’t be surprised if I made a misstep somewhere. If you spot one — or just found this interesting and would like to chat — please email me!

I wrote a followup post to this in Python. You can check it out here.

As promised, here’s the full code

```r
#import libraries to work with
library(plyr)
library(stringr)
library(e1071)

#load up word polarity list and format it
afinn_list <- read.delim(file='AFINN/AFINN-111.txt', header=FALSE, stringsAsFactors=FALSE)
names(afinn_list) <- c('word', 'score')
afinn_list$word <- tolower(afinn_list$word)

#categorize words as very negative to very positive and add some movie-specific words
vNegTerms <- afinn_list$word[afinn_list$score==-5 | afinn_list$score==-4]
negTerms <- c(afinn_list$word[afinn_list$score==-3 | afinn_list$score==-2 | afinn_list$score==-1], "second-rate", "moronic", "third-rate", "flawed", "juvenile", "boring", "distasteful", "ordinary", "disgusting", "senseless", "static", "brutal", "confused", "disappointing", "bloody", "silly", "tired", "predictable", "stupid", "uninteresting", "trite", "uneven", "outdated", "dreadful", "bland")
posTerms <- c(afinn_list$word[afinn_list$score==3 | afinn_list$score==2 | afinn_list$score==1], "first-rate", "insightful", "clever", "charming", "comical", "charismatic", "enjoyable", "absorbing", "sensitive", "intriguing", "powerful", "pleasant", "surprising", "thought-provoking", "imaginative", "unpretentious")
vPosTerms <- c(afinn_list$word[afinn_list$score==5 | afinn_list$score==4], "uproarious", "riveting", "fascinating", "dazzling", "legendary")

#load up positive and negative sentences and format
posText <- read.delim(file='polarityData/rt-polaritydata/rt-polarity-pos.txt', header=FALSE, stringsAsFactors=FALSE)
posText <- posText$V1
posText <- unlist(lapply(posText, function(x) { str_split(x, "\n") }))
negText <- read.delim(file='polarityData/rt-polaritydata/rt-polarity-neg.txt', header=FALSE, stringsAsFactors=FALSE)
negText <- negText$V1
negText <- unlist(lapply(negText, function(x) { str_split(x, "\n") }))

#function to calculate number of words in each category within a sentence
sentimentScore <- function(sentences, vNegTerms, negTerms, posTerms, vPosTerms){
  final_scores <- matrix('', 0, 5)
  scores <- laply(sentences, function(sentence, vNegTerms, negTerms, posTerms, vPosTerms){
    initial_sentence <- sentence
    #remove unnecessary characters and split up by word
    sentence <- gsub('[[:punct:]]', '', sentence)
    sentence <- gsub('[[:cntrl:]]', '', sentence)
    sentence <- gsub('\\d+', '', sentence)
    sentence <- tolower(sentence)
    wordList <- str_split(sentence, '\\s+')
    words <- unlist(wordList)
    #build vector with matches between sentence and each category
    vPosMatches <- match(words, vPosTerms)
    posMatches <- match(words, posTerms)
    vNegMatches <- match(words, vNegTerms)
    negMatches <- match(words, negTerms)
    #sum up number of words in each category
    vPosMatches <- sum(!is.na(vPosMatches))
    posMatches <- sum(!is.na(posMatches))
    vNegMatches <- sum(!is.na(vNegMatches))
    negMatches <- sum(!is.na(negMatches))
    score <- c(vNegMatches, negMatches, posMatches, vPosMatches)
    #add row to scores table
    newrow <- c(initial_sentence, score)
    final_scores <- rbind(final_scores, newrow)
    return(final_scores)
  }, vNegTerms, negTerms, posTerms, vPosTerms)
  return(scores)
}

#build tables of positive and negative sentences with scores
posResult <- as.data.frame(sentimentScore(posText, vNegTerms, negTerms, posTerms, vPosTerms))
negResult <- as.data.frame(sentimentScore(negText, vNegTerms, negTerms, posTerms, vPosTerms))
posResult <- cbind(posResult, 'positive')
colnames(posResult) <- c('sentence', 'vNeg', 'neg', 'pos', 'vPos', 'sentiment')
negResult <- cbind(negResult, 'negative')
colnames(negResult) <- c('sentence', 'vNeg', 'neg', 'pos', 'vPos', 'sentiment')

#combine the positive and negative tables
results <- rbind(posResult, negResult)

#run the naive bayes algorithm using all four categories
classifier <- naiveBayes(results[,2:5], results[,6])

#display the confusion table for the classification run on the same data
confTable <- table(predict(classifier, results), results[,6], dnn=list('predicted','actual'))
confTable

#run a binomial test for confidence interval of results
binom.test(confTable[1,1] + confTable[2,2], nrow(results), p=0.5)
```

1. Not only did the test Tweets I downloaded have plenty of misspellings and abbreviations, but a decent portion of them were in different languages entirely!

2. read.delim() appeared to split most of the sentences onto individual lines, but not all of them (hence the use of str_split()). If you’re a reader who knows why and wouldn’t mind explaining it to me, I’d love to hear from you! You can email me here.

3. I recategorized the AFINN wordlist into 4 categories rather than 10 (11 if you count the 0s) because I was working with short content (just a sentence each), and with finer-grained categories some of them might not appear often enough to correlate strongly with sentiment.