Summary

Sentiment analysis is the use of statistics, natural language processing, and machine learning to extract or categorise the sentiment expressed in a piece of text. In this project we use the multinomial Naive Bayes classification algorithm to classify movie reviews according to their overall sentiment (positive or negative). We use Pang and Lee's IMDB movie review data, which contains 2,000 reviews, each labelled as positive or negative. We represent each review as a bag of words, train a multinomial Naive Bayes classifier on the data, and test the model's performance on a hold-out set.


A sample review from the dataset:

"Warning: anyone offended by blatant, leering machismo had better avoid this film. Or lots of blood & guts, men against men and mano-et-mano stuff. In other words, it's a Walter Hill film! With a John Milius script! I always picture these guys getting together and producing a movie between arm-wrestling matches. These films always contain male characters I have a very hard time identifying with, probably due to the likelihood that any meeting between them and me would result in my arm being ripped off and then my subsequent death by beating with said limb. And we got tough guys galore, here; drug-running banditos by the dozens, all dirty and sweaty and pretty ill-tempered, overall; a secret task force of army commandos who are in the area to cover-up (supposedly) any connection between the government and the drug runners; and lots of shit-kicking Texas dirt farmers who'd as soon shoot you between the eyes as look at you. In particular, we got Nick Nolte as one hard-ass Texas Ranger, Powers Boothe as the drug kingpin, Michael Ironsides as the leader of the secret army, and Rip Torn as the local sheriff. Torn is the sympathetic figure of the group; he smiles before shooting anyone. As to women? Well, I've never seen Jane Fonda or Meryl Streep in a Walter Hill film, and at this rate, I doubt I ever will. Women exist here to look good, comfort the man, and get argued over. Gosh! Just like the old days… Frankly, this is a pretty good movie, if you can accept the premise and can take the macho stuff. The cinematography is excellent, the cast of characters is broad and has texture, the script is quite good, and the film lets you keep up with what's happening yourself, without spelling it out to you. I appreciate a film that makes me have to think to keep up. Finally, there's lots of Sam Peckinpah slow-motion shoot-ups."


Setting up the Environment

We first load all the required libraries (packages).

# Load required libraries

library(tm)
library(RTextTools)
library(e1071)
library(dplyr)
library(caret)
# Library for parallel processing
library(doMC)
registerDoMC(cores=detectCores())  # Use all available cores

Reading the Data

A CSV version of the original data can be downloaded from this link.

We download the file to our working directory in R/RStudio and read it in as a data frame.

df<- read.csv("movie-pang02.csv", stringsAsFactors = FALSE)
glimpse(df)
## Observations: 2,000
## Variables: 2
## $ class (chr) "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", ...
## $ text  (chr) " films adapted from comic books have had plenty of succ...

Randomize the Dataset

The rows of the CSV file are ordered by class (all positive reviews first), so we shuffle the dataframe before splitting it to ensure that the training and test sets each contain a mix of positive and negative reviews.

set.seed(1)
df <- df[sample(nrow(df)), ]
df <- df[sample(nrow(df)), ]
glimpse(df)
## Observations: 2,000
## Variables: 2
## $ class (chr) "Neg", "Pos", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", ...
## $ text  (chr) " frank detorri s   bill murray   a single dad who lives...
# Convert the 'class' variable from character to factor.
df$class <- as.factor(df$class)

Bag of Words Tokenisation

In this approach, we represent each word in a document as a token (or feature) and each document as a vector of features. For simplicity, we disregard word order and keep only the number of occurrences of each word, i.e., we represent each document as a multiset, or "bag", of words.
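
To make this concrete, here is a minimal toy sketch (not part of the original analysis) using the same "tm" functions on two short made-up documents:

# Toy bag-of-words example on two made-up documents
toy <- Corpus(VectorSource(c("a great great movie", "a terrible movie")))
toy.dtm <- DocumentTermMatrix(toy)
inspect(toy.dtm)
# Each row is a document, each column a term, and each cell a count;
# e.g. "great" counts twice in the first toy document and zero times in the second.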

We first prepare a corpus of all the documents in the dataframe.

corpus <- Corpus(VectorSource(df$text))
# Inspect the corpus
corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2000
inspect(corpus[1:3])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 2471
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 4198
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1272

Data Cleanup

Next, we clean up the corpus by removing numbers, punctuation, and extra white space, and by converting all text to lower case. In addition, we discard common stop words such as "his", "our", "hadn't", "couldn't", etc. We use the tm_map() function from the "tm" package for these transformations.

# Use dplyr's  %>% (pipe) utility to do this neatly.
corpus.clean <- corpus %>%
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords(kind="en")) %>%
  tm_map(stripWhitespace)

Matrix Representation of Bag of Words: The Document Term Matrix

We represent the bag-of-words tokens with a document term matrix (DTM). The rows of the DTM correspond to the documents in the collection, the columns correspond to terms, and its elements are the term frequencies. We use a built-in function from the "tm" package to create the DTM.

dtm <- DocumentTermMatrix(corpus.clean)
# Inspect the dtm
inspect(dtm[40:50, 10:15])
## <<DocumentTermMatrix (documents: 11, terms: 6)>>
## Non-/sparse entries: 0/66
## Sparsity           : 100%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs aardman aaron aatish aback abandon abandoned
##   40       0     0      0     0       0         0
##   41       0     0      0     0       0         0
##   42       0     0      0     0       0         0
##   43       0     0      0     0       0         0
##   44       0     0      0     0       0         0
##   45       0     0      0     0       0         0
##   46       0     0      0     0       0         0
##   47       0     0      0     0       0         0
##   48       0     0      0     0       0         0
##   49       0     0      0     0       0         0
##   50       0     0      0     0       0         0

Partitioning the Data

Next, we create 75:25 partitions of the dataframe, corpus and document term matrix for training and testing purposes.

df.train <- df[1:1500,]
df.test <- df[1501:2000,]

dtm.train <- dtm[1:1500,]
dtm.test <- dtm[1501:2000,]

corpus.clean.train <- corpus.clean[1:1500]
corpus.clean.test <- corpus.clean[1501:2000]

Feature Selection

dim(dtm.train)
## [1]  1500 38957

The DTM contains 38957 features, but not all of them will be useful for classification. We reduce the number of features by ignoring words that appear fewer than five times across the training documents. To do this, we use the findFreqTerms() function to identify the frequent words, and then restrict the DTM to those words using the "dictionary" option.

fivefreq <- findFreqTerms(dtm.train, 5)
length(fivefreq)
## [1] 12144

# Build the DTMs using only the frequent terms (fivefreq)

dtm.train.nb <- DocumentTermMatrix(corpus.clean.train, control=list(dictionary = fivefreq))

dim(dtm.train.nb)
## [1]  1500 12144

dtm.test.nb <- DocumentTermMatrix(corpus.clean.test, control=list(dictionary = fivefreq))

dim(dtm.test.nb)
## [1]   500 12144

The Naive Bayes algorithm

The Naive Bayes text classification algorithm is essentially an application of Bayes' theorem, with a strong (naive) assumption of conditional independence between features, to documents and classes. For a detailed account of the algorithm, refer to the course notes from the Stanford NLP course.
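
As a rough illustration of the decision rule (with made-up counts, not values from our data), the classifier picks the class that maximizes the class prior times the product of the per-word conditional probabilities; in log space this is a simple sum:

# Toy illustration of the Naive Bayes decision rule for a one-word document
# containing "excellent" (all counts below are made up)
log.prior <- log(c(Pos = 0.5, Neg = 0.5))          # P(class)
log.lik   <- log(c(Pos = 30/1000, Neg = 5/1000))   # P("excellent" | class)
log.post  <- log.prior + log.lik                   # unnormalized log posterior
names(which.max(log.post))                         # "Pos" wins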

Boolean feature Multinomial Naive Bayes

We use a variation of the multinomial Naive Bayes algorithm known as binarized (Boolean feature) Naive Bayes, due to Dan Jurafsky. In this method, the term frequencies are replaced by Boolean presence/absence features. The reasoning is that, for sentiment classification, whether a word occurs in a review matters more than how often it occurs.

# Function to convert the word frequencies to yes (presence) and no (absence) labels
convert_count <- function(x) {
  y <- ifelse(x > 0, 1,0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
# Apply the convert_count function to get final training and testing DTMs
trainNB <- apply(dtm.train.nb, 2, convert_count)
testNB <- apply(dtm.test.nb, 2, convert_count)

Training the Naive Bayes Model

To train the model we use the naiveBayes() function from the "e1071" package. Since Naive Bayes evaluates products of probabilities, we need some way of assigning non-zero probabilities to words that do not occur in a given class in the training sample. We use Laplace (add-one) smoothing for this.
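
As a sketch of the idea with hypothetical counts (none of these numbers come from our data), add-one smoothing adds 1 to every count before normalising, so a word that never appears in a class still gets a small non-zero probability:

# Hypothetical counts illustrating Laplace (add-one) smoothing for one word/class pair
word.count  <- 0       # the word never appears in this class in the training data
class.total <- 1000    # total word count observed for this class
vocab.size  <- 12144   # number of features in our model
word.count / class.total                       # unsmoothed: 0, which zeroes out the product
(word.count + 1) / (class.total + vocab.size)  # smoothed: small but non-zero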

# Train the classifier
system.time( classifier <- naiveBayes(trainNB, df.train$class, laplace = 1) )
##    user  system elapsed 
##  13.616   0.000  13.625

Testing the Predictions

# Use the NB classifier we built to make predictions on the test set.
system.time( pred <- predict(classifier, newdata=testNB) )
##    user  system elapsed 
## 218.548   0.036 218.692
# Create a truth table by tabulating the predicted class labels with the actual class labels 
table("Predictions"= pred,  "Actual" = df.test$class )
##            Actual
## Predictions Neg Pos
##         Neg 224  54
##         Pos  41 181

Confusion Matrix

# Prepare the confusion matrix
conf.mat <- confusionMatrix(pred, df.test$class)

conf.mat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Neg Pos
##        Neg 224  54
##        Pos  41 181
##                                           
##                Accuracy : 0.81            
##                  95% CI : (0.7728, 0.8435)
##     No Information Rate : 0.53            
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6174          
##  Mcnemar's Test P-Value : 0.2183          
##                                           
##             Sensitivity : 0.8453          
##             Specificity : 0.7702          
##          Pos Pred Value : 0.8058          
##          Neg Pred Value : 0.8153          
##              Prevalence : 0.5300          
##          Detection Rate : 0.4480          
##    Detection Prevalence : 0.5560          
##       Balanced Accuracy : 0.8077          
##                                           
##        'Positive' Class : Neg             
## 

conf.mat$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8452830            0.7702128            0.8057554 
##       Neg Pred Value           Prevalence       Detection Rate 
##            0.8153153            0.5300000            0.4480000 
## Detection Prevalence    Balanced Accuracy 
##            0.5560000            0.8077479

conf.mat$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   8.100000e-01   6.174291e-01   7.728180e-01   8.434678e-01   5.300000e-01 
## AccuracyPValue  McnemarPValue 
##   3.570547e-39   2.182578e-01

# Prediction Accuracy
conf.mat$overall['Accuracy']
## Accuracy 
##     0.81

Conclusion

The prediction accuracy of a classification model is the proportion of correct predictions out of the total number of predictions made. The accuracy of this model turns out to be 81%. We see that, despite its many simplifying assumptions, the Naive Bayes algorithm does reasonably well at predicting the correct sentiment class.
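
The same figure can be recovered directly from the predictions: (224 + 181) correct predictions out of the 500 test reviews.

# Accuracy computed directly from the predictions and the true labels
mean(pred == df.test$class)
## [1] 0.81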

References

  1. Lecture slides from the Stanford NLP Coursera course by Dan Jurafsky and Christopher Manning: https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html

  2. Machine Learning with R by Brett Lantz: https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r

