Thetazero Pubs
sreenez

Executive Summary

This report was created as part of the Coursera Practical Machine Learning course project. The goal of the project is to analyze data from devices such as Fitbit to predict the manner in which people exercised.


Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

Data Loading

library(rpart)
## Warning: package 'rpart' was built under R version 3.2.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Warning: package 'caret' was built under R version 3.2.3
## Loading required package: lattice
## Loading required package: ggplot2
pml_training <- read.csv("pml-training.csv", na.strings=c("NA",""), header=TRUE)
train_features <- colnames(pml_training)
length(train_features)
## [1] 160
pml_testing <- read.csv("pml-testing.csv", na.strings=c("NA",""), header=TRUE)
test_features <- colnames(pml_testing)
length(test_features)
## [1] 160

Data Cleansing

Many columns consist mostly of NA values, which can compromise the quality of the results. Hence, all columns with 60% or more NAs are omitted. We also remove columns that do not contribute to the predictions.

#Build a vector with column names that can be omitted from the analysis
remove_cols <- c()
for(i in 1:length(pml_training)) {
    if( sum(is.na(pml_training[ ,i])) /nrow(pml_training) >= .6 ) {
       remove_cols <- c(remove_cols, train_features[i])
    }
}
#Remove the NA-heavy columns and the first 7 columns, as they cannot be used as predictors
pml_training <- pml_training[ ,!(names(pml_training) %in% remove_cols)]
pml_training <- pml_training[,8:length(colnames(pml_training))]

pml_testing <- pml_testing[,!(names(pml_testing) %in% remove_cols)]
pml_testing <- pml_testing[,8:length(colnames(pml_testing))]

# Show remaining columns.
#colnames(pml_training)
#colnames(pml_testing)
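The loop above can also be written in vectorized form with colMeans(). A minimal sketch on a toy data frame (the data frame and its column names are invented purely for illustration):

```r
# Same 60%-NA filter as the loop above, vectorized.
# The toy data frame is made up for illustration only.
toy <- data.frame(a = c(1, 2, 3), b = c(NA, NA, 5), c = c(NA, NA, NA))
na_frac <- colMeans(is.na(toy))           # fraction of NAs in each column
toy_clean <- toy[, na_frac < 0.6, drop = FALSE]
colnames(toy_clean)                       # only "a" survives the threshold
```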

#Check for covariates to see if any of them have near zero variability
low_variance_cols <- nearZeroVar(pml_training, saveMetrics=TRUE)
low_variance_cols
##                      freqRatio percentUnique zeroVar   nzv
## roll_belt             1.101904     6.7781062   FALSE FALSE
## pitch_belt            1.036082     9.3772296   FALSE FALSE
## yaw_belt              1.058480     9.9734991   FALSE FALSE
## total_accel_belt      1.063160     0.1477933   FALSE FALSE
## gyros_belt_x          1.058651     0.7134849   FALSE FALSE
## gyros_belt_y          1.144000     0.3516461   FALSE FALSE
## gyros_belt_z          1.066214     0.8612782   FALSE FALSE
## accel_belt_x          1.055412     0.8357966   FALSE FALSE
## accel_belt_y          1.113725     0.7287738   FALSE FALSE
## accel_belt_z          1.078767     1.5237998   FALSE FALSE
## magnet_belt_x         1.090141     1.6664968   FALSE FALSE
## magnet_belt_y         1.099688     1.5187035   FALSE FALSE
## magnet_belt_z         1.006369     2.3290184   FALSE FALSE
## roll_arm             52.338462    13.5256345   FALSE FALSE
## pitch_arm            87.256410    15.7323412   FALSE FALSE
## yaw_arm              33.029126    14.6570176   FALSE FALSE
## total_accel_arm       1.024526     0.3363572   FALSE FALSE
## gyros_arm_x           1.015504     3.2769341   FALSE FALSE
## gyros_arm_y           1.454369     1.9162165   FALSE FALSE
## gyros_arm_z           1.110687     1.2638875   FALSE FALSE
## accel_arm_x           1.017341     3.9598410   FALSE FALSE
## accel_arm_y           1.140187     2.7367241   FALSE FALSE
## accel_arm_z           1.128000     4.0362858   FALSE FALSE
## magnet_arm_x          1.000000     6.8239731   FALSE FALSE
## magnet_arm_y          1.056818     4.4439914   FALSE FALSE
## magnet_arm_z          1.036364     6.4468454   FALSE FALSE
## roll_dumbbell         1.022388    84.2065029   FALSE FALSE
## pitch_dumbbell        2.277372    81.7449801   FALSE FALSE
## yaw_dumbbell          1.132231    83.4828254   FALSE FALSE
## total_accel_dumbbell  1.072634     0.2191418   FALSE FALSE
## gyros_dumbbell_x      1.003268     1.2282132   FALSE FALSE
## gyros_dumbbell_y      1.264957     1.4167771   FALSE FALSE
## gyros_dumbbell_z      1.060100     1.0498420   FALSE FALSE
## accel_dumbbell_x      1.018018     2.1659362   FALSE FALSE
## accel_dumbbell_y      1.053061     2.3748853   FALSE FALSE
## accel_dumbbell_z      1.133333     2.0894914   FALSE FALSE
## magnet_dumbbell_x     1.098266     5.7486495   FALSE FALSE
## magnet_dumbbell_y     1.197740     4.3012945   FALSE FALSE
## magnet_dumbbell_z     1.020833     3.4451126   FALSE FALSE
## roll_forearm         11.589286    11.0895933   FALSE FALSE
## pitch_forearm        65.983051    14.8557741   FALSE FALSE
## yaw_forearm          15.322835    10.1467740   FALSE FALSE
## total_accel_forearm   1.128928     0.3567424   FALSE FALSE
## gyros_forearm_x       1.059273     1.5187035   FALSE FALSE
## gyros_forearm_y       1.036554     3.7763735   FALSE FALSE
## gyros_forearm_z       1.122917     1.5645704   FALSE FALSE
## accel_forearm_x       1.126437     4.0464784   FALSE FALSE
## accel_forearm_y       1.059406     5.1116094   FALSE FALSE
## accel_forearm_z       1.006250     2.9558659   FALSE FALSE
## magnet_forearm_x      1.012346     7.7667924   FALSE FALSE
## magnet_forearm_y      1.246914     9.5403119   FALSE FALSE
## magnet_forearm_z      1.000000     8.5771073   FALSE FALSE
## classe                1.469581     0.0254816   FALSE FALSE

Since none of the remaining columns has near-zero variance, we do not omit any further columns.

The training data set is large enough to split into two sets, which lets us validate our models before touching the test set.

set.seed(10)
inTrain <- createDataPartition(y=pml_training$classe, p=0.7, list=F)
trainingset1 <- pml_training[inTrain, ]
trainingset2 <- pml_training[-inTrain, ]
dim(trainingset2)
## [1] 5885   53

Data Modelling and Predictions

We begin our modelling with a decision tree. A tree is fit to trainingset1 and plotted, and its predictions on trainingset2 are then validated with a confusion matrix.


#Fit the model on to trainingset1
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.2.3
dtFit <- rpart(classe ~ ., data=trainingset1, method="class")
rpart.plot(dtFit, main="Classification Tree", extra=102, under=TRUE, faclen=0)

#Use the model to predict using data from second training set
dtprediction <- predict(dtFit, trainingset2, type = "class")
confusionMatrix(dtprediction, trainingset2$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1527  231   57  158   41
##          B   27  616   91   45   73
##          C   46  150  789  147  143
##          D   41   87   61  518   36
##          E   33   55   28   96  789
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7203          
##                  95% CI : (0.7086, 0.7317)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6437          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9122   0.5408   0.7690  0.53734   0.7292
## Specificity            0.8844   0.9503   0.9000  0.95428   0.9559
## Pos Pred Value         0.7582   0.7230   0.6188  0.69717   0.7882
## Neg Pred Value         0.9620   0.8961   0.9486  0.91326   0.9400
## Prevalence             0.2845   0.1935   0.1743  0.16381   0.1839
## Detection Rate         0.2595   0.1047   0.1341  0.08802   0.1341
## Detection Prevalence   0.3422   0.1448   0.2167  0.12625   0.1701
## Balanced Accuracy      0.8983   0.7455   0.8345  0.74581   0.8425

The confusion matrix above shows a prediction accuracy of only about 72%. Hence we try a second model, a random forest, on trainingset1. We train the model using 3-fold cross-validation to select the optimal parameters.

fitControl <- trainControl(method="cv", number=3, verboseIter=F)

#Fit the RF model on training set 1
RFMfit <- train(classe ~ . , method="rf", data = trainingset1, trControl=fitControl)
RFMfit$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.7%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3899    4    1    0    2 0.001792115
## B   17 2634    6    1    0 0.009029345
## C    0   14 2376    6    0 0.008347245
## D    0    0   27 2223    2 0.012877442
## E    0    0    6   10 2509 0.006336634

Next, we use the fitted model for predictions on trainingset2.

fittedPrediction <- predict(RFMfit, newdata=trainingset2)

# show confusion matrix to get estimate of out-of-sample error
confusionMatrix(trainingset2$classe, fittedPrediction)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    6 1126    6    1    0
##          C    0    8 1015    3    0
##          D    0    1    8  955    0
##          E    0    0    2    2 1078
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9937          
##                  95% CI : (0.9913, 0.9956)
##     No Information Rate : 0.2855          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.992           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9921   0.9845   0.9938   1.0000
## Specificity            1.0000   0.9973   0.9977   0.9982   0.9992
## Pos Pred Value         1.0000   0.9886   0.9893   0.9907   0.9963
## Neg Pred Value         0.9986   0.9981   0.9967   0.9988   1.0000
## Prevalence             0.2855   0.1929   0.1752   0.1633   0.1832
## Detection Rate         0.2845   0.1913   0.1725   0.1623   0.1832
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9982   0.9947   0.9911   0.9960   0.9996

Cross Validation and Model Selection

With an accuracy of 99.37%, the random forest clearly yields better results than the decision tree model (72%). Hence the random forest is the model selected for the final predictions. The estimated out-of-sample error is only 0.63% (1 − 0.9937).
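The out-of-sample error can be checked directly from the confusion matrix counts above: one minus the trace divided by the total. A quick sketch (the counts are copied from the random forest confusion matrix reported above):

```r
# Recompute the out-of-sample error from the random forest confusion
# matrix above (rows = predictions, columns = reference classes A-E).
cm <- matrix(c(1674,    0,    0,    0,    0,
                  6, 1126,    6,    1,    0,
                  0,    8, 1015,    3,    0,
                  0,    1,    8,  955,    0,
                  0,    0,    2,    2, 1078),
             nrow = 5, byrow = TRUE)
oos_error <- 1 - sum(diag(cm)) / sum(cm)
round(oos_error, 4)  # 0.0063, matching the reported accuracy of 0.9937
```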

Re-training the Selected Model

Before predicting on the test set, it is important to train the model on the full training set, rather than using a model trained on a reduced training set, in order to produce the most accurate predictions. Therefore, we now re-train the selected random forest on the complete training set.

fitControl <- trainControl(method="cv", number=3, verboseIter=F)
fit <- train(classe ~ ., data=pml_training, method="rf", trControl=fitControl)

Test Set Predictions and Output Generation

# predict on test set
testDataPrediction <- predict(fit, newdata=pml_testing)

#Write predictions into files

testDataPrediction <- as.character(testDataPrediction)

#Function to write each prediction to its own file
pmlProjectResults <- function(x) {
    n <- length(x)
    for(i in 1:n) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=F, row.names=F, col.names=F)
    }
}

# create result files to submit
pmlProjectResults(testDataPrediction)