Thetazero Pubs
bernardo antani

1.

For this quiz we will be using several R packages. Package versions change over time, so the answers below were checked against the package versions shown in the loading messages.


Load the vowel.train and vowel.test data sets:

library(ElemStatLearn)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# load data sets
data(vowel.train)
data(vowel.test)

Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit (1) a random forest predictor relating the factor variable y to the remaining variables and (2) a boosted predictor using the "gbm" method. Fit both with the train() command in the caret package.

training <- vowel.train
testing <- vowel.test
set.seed(33833)
# random forest predictor algorithm 
rfFit <- train(as.factor(y) ~., method = "rf",
               data = training)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
# boosting predictor algorithm 
boostFit <- train(as.factor(y) ~., method = "gbm",
                  data = training, verbose = FALSE)
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## 
## The following object is masked from 'package:ElemStatLearn':
## 
##     ozone
testingRF <- predict(rfFit, testing)
testingBoost <- predict(boostFit, testing)

What are the accuracies for the two approaches on the test data set? What is the accuracy among the test set samples where the two methods agree?

confusionMatrix(testingRF, as.factor(testing$y))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4  5  6  7  8  9 10 11
##         1  35  1  0  0  0  0  0  0  0  3  0
##         2   6 27  3  0  0  0  2  0  1 14  1
##         3   1 10 30  3  0  0  0  0  0  3  0
##         4   0  2  0 30  3  0  0  0  0  0  1
##         5   0  0  0  0 19  6  9  0  0  0  0
##         6   0  0  6  8 16 24  2  0  0  0  7
##         7   0  0  0  0  3  0 26  7  5  0  3
##         8   0  0  0  0  0  0  0 30  5  0  0
##         9   0  2  0  0  0  0  3  5 24  1 12
##         10  0  0  0  0  0  0  0  0  2 21  0
##         11  0  0  3  1  1 12  0  0  5  0 18
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6147          
##                  95% CI : (0.5686, 0.6593)
##     No Information Rate : 0.0909          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5762          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.83333  0.64286  0.71429  0.71429  0.45238  0.57143
## Specificity           0.99048  0.93571  0.95952  0.98571  0.96429  0.90714
## Pos Pred Value        0.89744  0.50000  0.63830  0.83333  0.55882  0.38095
## Neg Pred Value        0.98345  0.96324  0.97108  0.97183  0.94626  0.95489
## Prevalence            0.09091  0.09091  0.09091  0.09091  0.09091  0.09091
## Detection Rate        0.07576  0.05844  0.06494  0.06494  0.04113  0.05195
## Detection Prevalence  0.08442  0.11688  0.10173  0.07792  0.07359  0.13636
## Balanced Accuracy     0.91190  0.78929  0.83690  0.85000  0.70833  0.73929
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           0.61905  0.71429  0.57143   0.50000   0.42857
## Specificity           0.95714  0.98810  0.94524   0.99524   0.94762
## Pos Pred Value        0.59091  0.85714  0.51064   0.91304   0.45000
## Neg Pred Value        0.96172  0.97190  0.95663   0.95216   0.94313
## Prevalence            0.09091  0.09091  0.09091   0.09091   0.09091
## Detection Rate        0.05628  0.06494  0.05195   0.04545   0.03896
## Detection Prevalence  0.09524  0.07576  0.10173   0.04978   0.08658
## Balanced Accuracy     0.78810  0.85119  0.75833   0.74762   0.68810
confusionMatrix(testingBoost, as.factor(testing$y))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4  5  6  7  8  9 10 11
##         1  30  0  0  0  0  0  0  0  0  2  0
##         2   9 21  1  0  0  0  1  0  0 12  0
##         3   1  8 11  3  0  0  0  0  0  0  0
##         4   0  1  9 21  3  0  1  0  0  0  0
##         5   0  0  0  3 19  4  1  0  0  0  0
##         6   0  1 14 14 11 30  1  0  1  0  6
##         7   0  0  0  0  7  2 36  6  3  1 14
##         8   0  0  0  0  0  0  2 29 10  2  0
##         9   0  5  0  0  0  0  0  7 28  6 18
##         10  2  0  0  0  0  0  0  0  0 19  0
##         11  0  6  7  1  2  6  0  0  0  0  4
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5368         
##                  95% CI : (0.4901, 0.583)
##     No Information Rate : 0.0909         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4905         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.71429  0.50000  0.26190  0.50000  0.45238  0.71429
## Specificity           0.99524  0.94524  0.97143  0.96667  0.98095  0.88571
## Pos Pred Value        0.93750  0.47727  0.47826  0.60000  0.70370  0.38462
## Neg Pred Value        0.97209  0.94976  0.92938  0.95082  0.94713  0.96875
## Prevalence            0.09091  0.09091  0.09091  0.09091  0.09091  0.09091
## Detection Rate        0.06494  0.04545  0.02381  0.04545  0.04113  0.06494
## Detection Prevalence  0.06926  0.09524  0.04978  0.07576  0.05844  0.16883
## Balanced Accuracy     0.85476  0.72262  0.61667  0.73333  0.71667  0.80000
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           0.85714  0.69048  0.66667   0.45238  0.095238
## Specificity           0.92143  0.96667  0.91429   0.99524  0.947619
## Pos Pred Value        0.52174  0.67442  0.43750   0.90476  0.153846
## Neg Pred Value        0.98473  0.96897  0.96482   0.94785  0.912844
## Prevalence            0.09091  0.09091  0.09091   0.09091  0.090909
## Detection Rate        0.07792  0.06277  0.06061   0.04113  0.008658
## Detection Prevalence  0.14935  0.09307  0.13853   0.04545  0.056277
## Balanced Accuracy     0.88929  0.82857  0.79048   0.72381  0.521429
confusionMatrix(testingRF, testingBoost)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4  5  6  7  8  9 10 11
##         1  31  3  1  0  0  0  0  0  0  4  0
##         2   0 38  3  1  0  1  2  0  5  0  4
##         3   1  1 19  6  0 10  0  0  0  2  8
##         4   0  0  0 24  1  9  0  0  2  0  0
##         5   0  0  0  0 17  3 11  2  0  0  1
##         6   0  0  0  3  8 45  7  0  0  0  0
##         7   0  1  0  0  1  0 39  1  2  0  0
##         8   0  0  0  0  0  0  0 33  2  0  0
##         9   0  0  0  1  0  0  2  4 40  0  0
##         10  0  1  0  0  0  0  1  3  3 15  0
##         11  0  0  0  0  0 10  7  0 10  0 13
## 
## Overall Statistics
##                                         
##                Accuracy : 0.6797        
##                  95% CI : (0.635, 0.722)
##     No Information Rate : 0.1688        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.6449        
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.96875  0.86364  0.82609  0.68571  0.62963   0.5769
## Specificity           0.98140  0.96172  0.93622  0.97190  0.96092   0.9531
## Pos Pred Value        0.79487  0.70370  0.40426  0.66667  0.50000   0.7143
## Neg Pred Value        0.99764  0.98529  0.99036  0.97418  0.97664   0.9173
## Prevalence            0.06926  0.09524  0.04978  0.07576  0.05844   0.1688
## Detection Rate        0.06710  0.08225  0.04113  0.05195  0.03680   0.0974
## Detection Prevalence  0.08442  0.11688  0.10173  0.07792  0.07359   0.1364
## Balanced Accuracy     0.97507  0.91268  0.88115  0.82881  0.79527   0.7650
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           0.56522  0.76744  0.62500   0.71429   0.50000
## Specificity           0.98728  0.99523  0.98241   0.98186   0.93807
## Pos Pred Value        0.88636  0.94286  0.85106   0.65217   0.32500
## Neg Pred Value        0.92823  0.97658  0.94217   0.98633   0.96919
## Prevalence            0.14935  0.09307  0.13853   0.04545   0.05628
## Detection Rate        0.08442  0.07143  0.08658   0.03247   0.02814
## Detection Prevalence  0.09524  0.07576  0.10173   0.04978   0.08658
## Balanced Accuracy     0.77625  0.88133  0.80371   0.84807   0.71904
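A note on that last confusion matrix: confusionMatrix(testingRF, testingBoost) reports how often the two models agree with each other (0.6797 here), not the accuracy among the samples where they agree, which is what the question asks. The latter subsets to the agreeing predictions and compares them against the truth; a sketch in base R, with a made-up toy check (the real call would use the testingRF, testingBoost and testing$y objects above):

```r
# Accuracy among the test samples where the two models agree:
# keep only the predictions the two models share, then score against truth.
agreement_accuracy <- function(pred1, pred2, truth) {
  agree <- pred1 == pred2
  mean(pred1[agree] == truth[agree])
}
# with the objects above: agreement_accuracy(testingRF, testingBoost, testing$y)
# toy check (made-up labels): models agree on 3 samples, correct on 2 of them
p1 <- c(1, 2, 3, 1)
p2 <- c(1, 2, 2, 1)
y  <- c(1, 2, 3, 2)
agreement_accuracy(p1, p2, y)   # 2/3
```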

2.

Load the Alzheimer's data using the following commands:

library(caret)
library(gbm)
set.seed(3433)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData = data.frame(diagnosis, predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[inTrain, ]
testing = adData[-inTrain, ]

Set the seed to 62433 and predict diagnosis with all the other variables using a random forest ("rf"), boosted trees ("gbm") and linear discriminant analysis ("lda") model. Stack the predictions together using random forests ("rf"). What is the resulting accuracy on the test set? Is it better or worse than each of the individual predictions?


set.seed(62433)
# fit a random forest model on the training data
rfFit <- train(diagnosis ~.,  data = training,
               method = "rf", prox = TRUE, ntree = 500)
head(getTree(rfFit$finalModel, k = 2))
##   left daughter right daughter split var split point status prediction
## 1             2              3       126   6.0315739      1          0
## 2             4              5        93   1.2644618      1          0
## 3             6              7        36   8.7225787      1          0
## 4             8              9        36   8.0342005      1          0
## 5             0              0         0   0.0000000     -1          1
## 6            10             11        94  -0.2815861      1          0
rf_pred <- predict(rfFit, testing)
# check the accuracy
confusionMatrix(rf_pred, testing$diagnosis)$overall[1]
##  Accuracy 
## 0.7682927
# fit a boosted trees model and predict on the test set
gbmFit <- train(diagnosis ~., data = training,
                method = "gbm", verbose = FALSE)
gbm_pred <- predict(gbmFit, testing)
confusionMatrix(gbm_pred, testing$diagnosis)$overall[1] 
##  Accuracy 
## 0.7926829
# fit a linear discriminant analysis model and predict on the test set
ldaFit <- train(diagnosis ~., data = training, 
                method = "lda")
## Loading required package: MASS
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
lda_pred <- predict(ldaFit, testing)
confusionMatrix(lda_pred, testing$diagnosis)$overall[1]
##  Accuracy 
## 0.7682927
# combine the prediction results and the true results into new data frame
combinedTestData <- data.frame(rf.pred = rf_pred, gbm.pred = gbm_pred, 
                               lda.pred = lda_pred, 
                               diagnosis = testing$diagnosis)
# run a Random Forest model on the combined test data
comb.fit <- train(diagnosis ~., method = "rf", data = combinedTestData)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
# use the resultant model to predict on the test set
comb.pred.test <- predict(comb.fit, combinedTestData)
confusionMatrix(comb.pred.test, testing$diagnosis)$overall[1]
## Accuracy 
## 0.804878
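The stacking step above builds a new data frame of the three models' class predictions and trains a second-level random forest on it. The idea can be shown with a simpler combiner; the toy sketch below (made-up labels, not the Alzheimer's data) uses a majority vote in place of the second random forest:

```r
# Toy illustration of stacking: combine three models' class predictions.
# Here the combiner is a simple majority vote; the quiz solution trains a
# second random forest on the stacked predictions instead.
rf.pred  <- c("A", "A", "B", "B", "A")
gbm.pred <- c("A", "B", "B", "B", "A")
lda.pred <- c("B", "A", "B", "A", "A")
truth    <- c("A", "A", "B", "B", "B")

majority_vote <- function(...) {
  votes <- c(...)
  names(sort(table(votes), decreasing = TRUE))[1]
}
stacked <- mapply(majority_vote, rf.pred, gbm.pred, lda.pred)
mean(stacked == truth)   # 0.8 on this toy example
```

One caveat worth noting: above, the combiner is trained on the test-set predictions themselves, which is what the quiz does but gives an optimistic accuracy estimate. The cleaner design trains the second-level model on predictions from a held-out validation set and scores it on a separate test set.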

3.

Load the concrete data with the commands:

set.seed(3523)
library(AppliedPredictiveModeling)
data("concrete")
inTrain = createDataPartition(concrete$CompressiveStrength, p = 3/4)[[1]]
training = concrete[inTrain, ]
testing  = concrete[-inTrain, ]

Set the seed to 233 and fit a lasso model to predict Compressive Strength. Which variable is the last coefficient to be set to zero as the penalty is increased?

set.seed(233)
lassoFit <- train(CompressiveStrength ~ ., data = training,
                  method = "lasso")
## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2
lassoFit$finalModel
## 
## Call:
## enet(x = as.matrix(x), y = y, lambda = 0)
## Cp statistics of the Lasso fit 
## Cp: 1201.603 1014.917  861.914  572.237  422.521  129.536  105.671   34.171    6.767    8.340    9.000 
## DF: 1 2 3 4 5 6 7 7 7 8 9 
## Sequence of  moves:
##      Cement Superplasticizer Age BlastFurnaceSlag Water FineAggregate
## Var       1                5   8                2     4             7
## Step      1                2   3                4     5             6
##      FlyAsh FineAggregate FineAggregate CoarseAggregate   
## Var       3            -7             7               6 11
## Step      7             8             9              10 11
# alternative: fit the lasso path directly with the lars package
#library(lars)
#covnames <- names(concrete)[-9]
#x <- concrete[, -9]
#y <- concrete$CompressiveStrength
#lasso.fit <- lars(as.matrix(x), y, type = "lasso", trace = TRUE)
# plot the lasso coefficient paths, with a legend for the predictors
#plot(lasso.fit, breaks = FALSE, cex = 0.75)
#legend("topleft", covnames, pch = 8, lty = 1:length(covnames),
#       col = 1:length(covnames), cex = 0.6)
plot.enet(lassoFit$finalModel, xvar = "penalty", use.color = TRUE)
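In the resulting path plot, coefficients hit zero one by one as the penalty grows; Cement, the first variable to enter in the move sequence above, is the last to be set to zero. For intuition: with standardized, uncorrelated predictors the lasso reduces to soft-thresholding, so the predictor with the largest least-squares coefficient survives the largest penalty. A toy sketch of that simplification (made-up coefficients, not the enet path algorithm used above):

```r
# Soft-thresholding: the lasso solution for standardized, uncorrelated
# predictors. Larger least-squares coefficients survive larger penalties,
# so the strongest predictor is the last one shrunk to zero.
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}
ols <- c(Cement = 0.9, Water = -0.4, Age = 0.2)
soft_threshold(ols, lambda = 0.5)   # only Cement remains nonzero
```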

4.

Load the data on the number of visitors to the instructor's blog from here:

library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following object is masked from 'package:plyr':
## 
##     here
filename_1 <- "gaData.csv"
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/gaData.csv"
download.file(fileUrl, destfile = filename_1, method = "curl")
dat = read.csv("gaData.csv")
training = dat[year(dat$date) < 2012, ]
testing  = dat[year(dat$date) > 2011, ]
tstrain  = ts(training$visitsTumblr)

Fit a model using the bats() function in the forecast package to the training time series.

library(forecast)
## Warning: package 'forecast' was built under R version 3.2.5
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: timeDate
## This is forecast 7.1
# fit a BATS model to the training time series
batsFit <- bats(tstrain, use.parallel = TRUE, num.cores = 4)

Then forecast this model for the remaining time points. For how many of the testing points is the true value within the 95% prediction interval bounds?

# forecast the remaining nrow(testing) time points
visits.forecast <- forecast(batsFit, h = nrow(testing))
plot(visits.forecast)
plot(visits.forecast)

# extract the 95% prediction interval bounds
visits.forecast.lower95 <- visits.forecast$lower[, 2]
visits.forecast.upper95 <- visits.forecast$upper[, 2]
# count how many of the test-set visit counts fall inside the interval
table((testing$visitsTumblr > visits.forecast.lower95) &
      (testing$visitsTumblr < visits.forecast.upper95))
## 
## FALSE  TRUE 
##     9   226
226/nrow(testing)
## [1] 0.9617021

5.

Load the concrete data with the commands:

set.seed(3523)
library(AppliedPredictiveModeling)
data("concrete")
inTrain = createDataPartition(concrete$CompressiveStrength, p = 3/4)[[1]]
training = concrete[ inTrain,]
testing  = concrete[ - inTrain,]

Set the seed to 325 and fit a support vector machine using the e1071 package to predict Compressive Strength using the default settings. Predict on the testing set. What is the RMSE?

library(e1071)
## 
## Attaching package: 'e1071'
## 
## The following objects are masked from 'package:timeDate':
## 
##     kurtosis, skewness
set.seed(325)
svm.model <- svm(CompressiveStrength ~., data = training)
svm.pred  <- predict(svm.model,testing)
error <- testing$CompressiveStrength - svm.pred
RMSE(svm.pred, testing$CompressiveStrength)  # equivalent to sqrt(mean(error^2))
## [1] 6.715009
plot(svm.pred, testing$CompressiveStrength,
     pch = 20, cex = 1,
     col = testing$Age,
     main = "Relationship between the SVM predictions and actual values")
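For reference, caret's RMSE() computes the root mean squared error; the same quantity in base R, checked on made-up numbers rather than the concrete data:

```r
# RMSE by hand: root of the mean squared difference between
# predictions and actual values.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
rmse(c(3, 5, 7), c(2, 5, 9))   # sqrt((1 + 0 + 4) / 3)
```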

Copyright © 2016 thetazero.com All Rights Reserved.