Introduction

Over the past few weeks, I’ve become very interested in statistics and machine learning. During my first semester at school, I learned a huge amount about the R programming language (thanks to a couple of my classes and a very helpful TA — more on that in a later post), but I quickly ran out of ideas to code up.

About two weeks ago, I realized that I should really have more of a background in statistics (I’ve never taken a statistics class) if I wanted to keep advancing my knowledge of R. I watched a lot of Khan Academy videos and read tutorials and papers to begin getting a basic understanding of the field and some of the techniques used in it.

Then, a few days ago, I started looking again for projects to work on. I came upon Jeffrey Breen’s excellent discussion of mining Twitter for airline customer sentiment, which launched me into researching sentiment analysis in R.

Around the same time, I also came upon some of the basic concepts of machine learning, including classification algorithms. I then decided on a project to work on that would combine my learning so far in all four of these areas: R, statistics, sentiment analysis and data mining, and classification algorithms (the last two being closely related).

Here was my (very) basic workflow for the project:

1. Select content to analyze
2. Analyze content using dataset of positive and negative elements
3. Train a classification algorithm based on the data
4. Use a classification algorithm to attempt to re-classify the same content and see how well it does

This post will be peppered with the code I wrote (in R) for this project. If you’d like to take a look at all of it at once, either scroll to the bottom of the post or check it out on GitHub here.

The Process

I quickly decided that for my first sentiment analysis project, I didn’t want to mine Twitter. Tweets are too varied, not only in intention but also in language (1). I found that sentiment analysts often use product and movie reviews to test their analyses, so I settled on those. I found Cornell professor Bo Pang’s page on movie review data and selected his sentence polarity dataset v1.0, which has 10,663 sentences from movie reviews, each classified as either positive or negative. This would be my base content.

Here’s the code to load up the positive and negative sentences and split them onto individual lines (using str_split() from the stringr package) (2):

posText <- read.delim(file='polarityData/rt-polaritydata/rt-polarity-pos.txt', header=FALSE, stringsAsFactors=FALSE)
posText <- posText$V1
posText <- unlist(lapply(posText, function(x) { str_split(x, "\n") }))
negText <- read.delim(file='polarityData/rt-polaritydata/rt-polarity-neg.txt', header=FALSE, stringsAsFactors=FALSE)
negText <- negText$V1
negText <- unlist(lapply(negText, function(x) { str_split(x, "\n") }))

After a lot of searching around for a dataset to analyze the sentence content, I found the AFINN wordlist, which has 2477 words and phrases rated from -5 [very negative] to +5 [very positive]. I reclassified the AFINN words into four categories (3):

• Very Negative (rating -5 or -4)
• Negative (rating -3, -2, or -1)
• Positive (rating 1, 2, or 3)
• Very Positive (rating 4 or 5)
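As a side note, the same four-way binning can be expressed compactly with R’s cut() function. This is just a sketch of the idea; the miniature word/score data frame below is an illustrative stand-in (made-up scores, not the real AFINN values), and it omits the extra movie-specific words added in the code further down:

```r
# Hypothetical miniature of the AFINN word/score data frame, for illustration only
afinn_list <- data.frame(word  = c("catastrophic", "bad", "fine", "good", "breathtaking"),
                         score = c(-5, -2, 0, 2, 5),
                         stringsAsFactors = FALSE)

# Bin the -5..5 scores into the four categories (plus neutral, which gets dropped)
bins <- cut(afinn_list$score,
            breaks = c(-5.5, -3.5, -0.5, 0.5, 3.5, 5.5),
            labels = c("vNeg", "neg", "neutral", "pos", "vPos"))
vNegTerms <- afinn_list$word[bins == "vNeg"]
negTerms  <- afinn_list$word[bins == "neg"]
posTerms  <- afinn_list$word[bins == "pos"]
vPosTerms <- afinn_list$word[bins == "vPos"]
```

The half-integer breakpoints guarantee every integer score lands cleanly in exactly one bin.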

I also added in a few more words specific to movies (found here) to round out my wordlist. Here’s the code for all that:

#load up word polarity list and format it
afinn_list <- read.delim(file='AFINN/AFINN-111.txt', header=FALSE, stringsAsFactors=FALSE)
names(afinn_list) <- c('word', 'score')
afinn_list$word <- tolower(afinn_list$word)

#categorize words as very negative to very positive and add some movie-specific words
vNegTerms <- afinn_list$word[afinn_list$score==-5 | afinn_list$score==-4]
negTerms <- c(afinn_list$word[afinn_list$score==-3 | afinn_list$score==-2 | afinn_list$score==-1], "second-rate", "moronic", "third-rate", "flawed", "juvenile", "boring", "distasteful", "ordinary", "disgusting", "senseless", "static", "brutal", "confused", "disappointing", "bloody", "silly", "tired", "predictable", "stupid", "uninteresting", "trite", "uneven", "outdated", "dreadful", "bland")
posTerms <- c(afinn_list$word[afinn_list$score==3 | afinn_list$score==2 | afinn_list$score==1], "first-rate", "insightful", "clever", "charming", "comical", "charismatic", "enjoyable", "absorbing", "sensitive", "intriguing", "powerful", "pleasant", "surprising", "thought-provoking", "imaginative", "unpretentious")
vPosTerms <- c(afinn_list$word[afinn_list$score==5 | afinn_list$score==4], "uproarious", "riveting", "fascinating", "dazzling", "legendary")

I chose to ignore neutral words because I didn’t believe these would help my classification. I then counted the number of words in each review that fit one of those four categories. This left me with a big data table (10,663 rows) of the form:

sentence | #vNeg | #neg | #pos | #vPos | sentiment

For example:

Though it is by no means his best work, laissez-passer is a distinguished and distinctive effort by a bona-fide master, a fascinating film replete with rewards to be had by all willing to make the effort to reap them. | 0 | 1 | 3 | 1 | positive

In this example, the sentence had 1 negative word, 3 positive words, and 1 very positive word, and the dataset’s creator had classified it as positive.
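To make that counting concrete, here’s a toy version of the tally for a few words from the example sentence. The word lists here are illustrative stand-ins, not the actual AFINN categories:

```r
# Illustrative stand-in word lists (not the real AFINN categories)
posTerms  <- c("distinguished", "distinctive", "rewards")
vPosTerms <- c("fascinating")

words <- c("distinguished", "and", "distinctive", "fascinating", "rewards")

# match() returns NA for words not in the list; counting the non-NAs gives the tally
sum(!is.na(match(words, posTerms)))   # 3 positive matches
sum(!is.na(match(words, vPosTerms)))  # 1 very positive match
```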

I wrote a bunch of code (heavily based on Jeffrey Breen’s code) to accomplish all this. Here it is (I used the laply function from the plyr package in there):

#function to calculate number of words in each category within a sentence
sentimentScore <- function(sentences, vNegTerms, negTerms, posTerms, vPosTerms){
  final_scores <- matrix('', 0, 5)
  scores <- laply(sentences, function(sentence, vNegTerms, negTerms, posTerms, vPosTerms){
    initial_sentence <- sentence
    #remove unnecessary characters and split up by word
    sentence <- gsub('[[:punct:]]', '', sentence)
    sentence <- gsub('[[:cntrl:]]', '', sentence)
    sentence <- gsub('\\d+', '', sentence)
    sentence <- tolower(sentence)
    wordList <- str_split(sentence, '\\s+')
    words <- unlist(wordList)
    #build vector with matches between sentence and each category
    vPosMatches <- match(words, vPosTerms)
    posMatches <- match(words, posTerms)
    vNegMatches <- match(words, vNegTerms)
    negMatches <- match(words, negTerms)
    #sum up number of words in each category
    vPosMatches <- sum(!is.na(vPosMatches))
    posMatches <- sum(!is.na(posMatches))
    vNegMatches <- sum(!is.na(vNegMatches))
    negMatches <- sum(!is.na(negMatches))
    score <- c(vNegMatches, negMatches, posMatches, vPosMatches)
    #add row to scores table
    newrow <- c(initial_sentence, score)
    final_scores <- rbind(final_scores, newrow)
    return(final_scores)
  }, vNegTerms, negTerms, posTerms, vPosTerms)
  return(scores)
}

#build tables of positive and negative sentences with scores
posResult <- as.data.frame(sentimentScore(posText, vNegTerms, negTerms, posTerms, vPosTerms))
negResult <- as.data.frame(sentimentScore(negText, vNegTerms, negTerms, posTerms, vPosTerms))
posResult <- cbind(posResult, 'positive')
colnames(posResult) <- c('sentence', 'vNeg', 'neg', 'pos', 'vPos', 'sentiment')
negResult <- cbind(negResult, 'negative')
colnames(negResult) <- c('sentence', 'vNeg', 'neg', 'pos', 'vPos', 'sentiment')

#combine the positive and negative tables
results <- rbind(posResult, negResult)

In Jeffrey Breen’s analysis of airline customer sentiment, he used a very simple algorithm to classify the Tweets: he subtracted the number of negative terms from the number of positive terms, then compared these scores across Tweets to visualize sentiment. I decided instead to implement an algorithm I had learned about while researching machine learning: the Naive Bayes classifier. I picked up the idea of running a classification algorithm from this post, which used a process similar to the one described by Breen but relied on a Bayes classifier instead of his additive method.
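For contrast, Breen-style additive scoring is simple enough to sketch in a couple of lines. This is a hypothetical helper of my own, not his actual code:

```r
# Breen-style additive score: (# positive matches) - (# negative matches)
simpleScore <- function(words, posTerms, negTerms) {
  sum(!is.na(match(words, posTerms))) - sum(!is.na(match(words, negTerms)))
}

simpleScore(c("clever", "charming", "boring"),
            posTerms = c("clever", "charming"),
            negTerms = c("boring"))   # 2 - 1 = 1, i.e. net positive
```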

At this point, I used the naiveBayes classifier from the e1071 package to attempt to classify the sentences as positive or negative (of course, without looking at the sentiment column). As the name suggests, this works by implementing a Naive Bayes algorithm. I won’t go into great detail here as to how it works (check out the Wikipedia page I linked to if you want to learn more), but the essential idea is that it looks at how the number of words in each of the four categories relates to whether the sentence is positive or negative. It then tries to guess whether a sentence is positive or negative by examining how many words it has in each category and relating this to the probabilities of those numbers appearing in positive and negative sentences.
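To show the shape of the inputs, here’s a toy version of the same call on made-up counts (two word categories instead of my four, and hypothetical labels; requires the e1071 package):

```r
library(e1071)

# Hypothetical training data: negative/positive word counts per sentence
toy <- data.frame(neg = c(3, 0, 2, 1),
                  pos = c(0, 4, 1, 3))
labels <- factor(c("negative", "positive", "negative", "positive"))

# Fit the classifier on the counts, then classify an unseen count vector
toyClassifier <- naiveBayes(toy, labels)
predict(toyClassifier, data.frame(neg = 0, pos = 5))   # classified as positive
```

A sentence with zero negative-list words and five positive-list words looks far more like the “positive” rows it trained on, so the classifier labels it positive.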

Here’s all I needed to do to run the classifier:

classifier <- naiveBayes(results[,2:5], results[,6])

A confusion matrix can help visualize the results of a classification algorithm. The matrix for my result was:

            actual
predicted    positive   negative
  positive       2847       1546
  negative       2484       3786
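The headline number falls out of the diagonal of that matrix. Here’s a small sketch that rebuilds the table above as a plain matrix and computes the overall accuracy:

```r
# The confusion matrix above, rebuilt as a plain matrix (columns fill first)
confTable <- matrix(c(2847, 2484, 1546, 3786), nrow = 2,
                    dimnames = list(predicted = c("positive", "negative"),
                                    actual    = c("positive", "negative")))

# Accuracy = correctly classified (the diagonal) / total sentences
accuracy <- sum(diag(confTable)) / sum(confTable)
round(accuracy, 4)   # 0.6221
```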


This was generated by this code:

confTable <- table(predict(classifier, results), results[,6], dnn=list('predicted','actual'))
confTable

Since each classification is effectively a Bernoulli trial, I was able to run a binomial test to get a confidence interval for my results. I found that, at the 95% confidence level, the true proportion of sentences my program would classify correctly lies between 61.28% and 63.13%. I learned about the Bernoulli distribution and the binomial test from the excellent Khan Academy statistics videos I had been watching. Here’s what I wrote for the binomial test:

binom.test(confTable[1,1] + confTable[2,2], nrow(results), p=0.5)

Conclusion

I learned a lot from this project. I’m so glad that I was able to pull together what I’ve learned in several different areas to work on one unified program.

Taking a look at my results, though, I have a few comments.

1. My classification results aren’t that great. I had hoped to get much higher than a 60-65% correct classification rate. That’s quite alright with me, though; it means I have plenty of room to improve this program! I could try different classification algorithms, different wordlists, even different training data. There’s lots of space to work.
2. My code isn’t particularly pretty, and I would like to change that. I’ve just started to read Google’s Style Guide for R and I’ve noticed a few things I do wrong. I’ll set out to fix those in later versions of this program if I choose to move forward with it.
3. I’d like to try scraping my data in the future rather than relying on premade databases for analysis. Even though there are undoubtedly advantages to the premade ones (they’ve likely been cleaned up and/or additional attributes have been added to them), I’d like to get more practice with scraping if I can. I could also use this as an opportunity to try to interface between PHP or Python (for the scraping) and R (for the analysis).

If you’d like to grab the code from GitHub, please feel free. It’s right here.

As I said, I just started to learn all of this within the past couple weeks; I wouldn’t be surprised if I made a misstep somewhere. If you spot one — or just found this interesting and would like to chat — please email me!

I wrote a followup post to this in Python. You can check it out here.

As promised, here’s the full code:

#import libraries to work with
library(plyr)
library(stringr)
library(e1071)

#load up word polarity list and format it
afinn_list <- read.delim(file='AFINN/AFINN-111.txt', header=FALSE, stringsAsFactors=FALSE)
names(afinn_list) <- c('word', 'score')
afinn_list$word <- tolower(afinn_list$word)

#categorize words as very negative to very positive and add some movie-specific words
vNegTerms <- afinn_list$word[afinn_list$score==-5 | afinn_list$score==-4]
negTerms <- c(afinn_list$word[afinn_list$score==-3 | afinn_list$score==-2 | afinn_list$score==-1], "second-rate", "moronic", "third-rate", "flawed", "juvenile", "boring", "distasteful", "ordinary", "disgusting", "senseless", "static", "brutal", "confused", "disappointing", "bloody", "silly", "tired", "predictable", "stupid", "uninteresting", "trite", "uneven", "outdated", "dreadful", "bland")
posTerms <- c(afinn_list$word[afinn_list$score==3 | afinn_list$score==2 | afinn_list$score==1], "first-rate", "insightful", "clever", "charming", "comical", "charismatic", "enjoyable", "absorbing", "sensitive", "intriguing", "powerful", "pleasant", "surprising", "thought-provoking", "imaginative", "unpretentious")
vPosTerms <- c(afinn_list$word[afinn_list$score==5 | afinn_list$score==4], "uproarious", "riveting", "fascinating", "dazzling", "legendary")

#load up positive and negative sentences and format
posText <- read.delim(file='polarityData/rt-polaritydata/rt-polarity-pos.txt', header=FALSE, stringsAsFactors=FALSE)
posText <- posText$V1
posText <- unlist(lapply(posText, function(x) { str_split(x, "\n") }))
negText <- read.delim(file='polarityData/rt-polaritydata/rt-polarity-neg.txt', header=FALSE, stringsAsFactors=FALSE)
negText <- negText$V1
negText <- unlist(lapply(negText, function(x) { str_split(x, "\n") }))

#function to calculate number of words in each category within a sentence
sentimentScore <- function(sentences, vNegTerms, negTerms, posTerms, vPosTerms){
  final_scores <- matrix('', 0, 5)
  scores <- laply(sentences, function(sentence, vNegTerms, negTerms, posTerms, vPosTerms){
    initial_sentence <- sentence
    #remove unnecessary characters and split up by word
    sentence <- gsub('[[:punct:]]', '', sentence)
    sentence <- gsub('[[:cntrl:]]', '', sentence)
    sentence <- gsub('\\d+', '', sentence)
    sentence <- tolower(sentence)
    wordList <- str_split(sentence, '\\s+')
    words <- unlist(wordList)
    #build vector with matches between sentence and each category
    vPosMatches <- match(words, vPosTerms)
    posMatches <- match(words, posTerms)
    vNegMatches <- match(words, vNegTerms)
    negMatches <- match(words, negTerms)
    #sum up number of words in each category
    vPosMatches <- sum(!is.na(vPosMatches))
    posMatches <- sum(!is.na(posMatches))
    vNegMatches <- sum(!is.na(vNegMatches))
    negMatches <- sum(!is.na(negMatches))
    score <- c(vNegMatches, negMatches, posMatches, vPosMatches)
    #add row to scores table
    newrow <- c(initial_sentence, score)
    final_scores <- rbind(final_scores, newrow)
    return(final_scores)
  }, vNegTerms, negTerms, posTerms, vPosTerms)
  return(scores)
}

#build tables of positive and negative sentences with scores
posResult <- as.data.frame(sentimentScore(posText, vNegTerms, negTerms, posTerms, vPosTerms))
negResult <- as.data.frame(sentimentScore(negText, vNegTerms, negTerms, posTerms, vPosTerms))
posResult <- cbind(posResult, 'positive')
colnames(posResult) <- c('sentence', 'vNeg', 'neg', 'pos', 'vPos', 'sentiment')
negResult <- cbind(negResult, 'negative')
colnames(negResult) <- c('sentence', 'vNeg', 'neg', 'pos', 'vPos', 'sentiment')

#combine the positive and negative tables
results <- rbind(posResult, negResult)

#run the naive bayes algorithm using all four categories
classifier <- naiveBayes(results[,2:5], results[,6])

#display the confusion table for the classification run on the same data
confTable <- table(predict(classifier, results), results[,6], dnn=list('predicted','actual'))
confTable

#run a binomial test for confidence interval of results
binom.test(confTable[1,1] + confTable[2,2], nrow(results), p=0.5)

1. Not only did the test Tweets I downloaded have plenty of misspellings and abbreviations, but a decent portion of them were in different languages entirely!

2. read.delim() appeared to split most of the sentences onto individual lines, but not all of them (hence the use of str_split()). If you’re a reader who knows why and wouldn’t mind explaining it to me, I’d love to hear from you! You can email me here.
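(One way to sidestep the issue entirely, I suspect, would be readLines(), which reads a file line by line without any of read.delim()’s field or quote parsing. This is a sketch I haven’t tested against the actual polarity files:)

```r
# Read each sentence as one element of a character vector, with no delimiter parsing
posText <- readLines('polarityData/rt-polaritydata/rt-polarity-pos.txt')
negText <- readLines('polarityData/rt-polaritydata/rt-polarity-neg.txt')
```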

3. I recategorized the AFINN wordlist into 4 categories rather than 10 (11 if you count the 0’s) because I was working with short content (just a sentence each), and with ten narrow categories some ratings might not appear often enough in a single sentence to correlate strongly with its sentiment.