Exercise Efficacy

This report builds a machine learning model to predict how well someone performs a weight lifting exercise. Movements are tracked through accelerometers in wearable devices (such as the Jawbone Up, Nike FuelBand, and Fitbit). In the underlying study, measurements were collected and classified both when the exercise was performed properly (Class A) and when it was performed with common mistakes (Classes B to E).

The approach to building the predictive model is as follows:

  1. Retrieving and Preparing Data
  2. Cross Validation
  3. Building Models
  4. Evaluating Model Accuracies
  5. Predicting Results from a Testing Dataset

Retrieving & Preparing Data

The data and its description are available from http://groupware.les.inf.puc-rio.br/har. For this assignment, the data are downloaded and stored locally in two files: pml-training.csv, used to train the model, and pml-testing.csv, which contains the observations to be classified (predicted) by the model.
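
If the files are not already present locally, they can be fetched programmatically. A minimal sketch, assuming trainingURL and testingURL hold the course-provided download links (not listed here):

# Hypothetical URLs -- substitute the actual download links for the two CSV files
trainingURL <- "<url-to-pml-training.csv>"
testingURL  <- "<url-to-pml-testing.csv>"
if (!file.exists("pml-training.csv")) download.file(trainingURL, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv"))  download.file(testingURL,  destfile = "pml-testing.csv")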

# Load the caret package (used below for preprocessing, partitioning, and modeling)
library(caret)

# Load datasets and convert missing data into NA
training <- read.csv("pml-training.csv", na.strings = c("", "NA", "#DIV/0!"))
testing  <- read.csv("pml-testing.csv",  na.strings = c("", "NA", "#DIV/0!"))

# Remove columns that are not predictors:
# the first five columns hold non-predictive values (row index, names, timestamps, etc.)
trainingPs <- training[, -(1:5)]

# remove predictors with data that does not vary (all values are roughly the same)
trainingPs <- trainingPs[,-nearZeroVar(trainingPs, saveMetrics = FALSE)]

# remove columns that contain NAs
rem.columns <- names(which(colSums(is.na(trainingPs)) > 0))
trainingPs <- trainingPs[, !(names(trainingPs) %in% rem.columns)]
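
As a quick sanity check (not part of the original cleaning steps), the remaining predictors can be verified to be complete:

# verify that no missing values remain and count the surviving columns
sum(is.na(trainingPs))  # expected to be 0 after the NA-column removal
dim(trainingPs)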

Cross Validation

Split the cleaned data into a training set (70%) and a validation set (30%):

inTrain <- createDataPartition(y=trainingPs$classe, p=.7, list= FALSE)
trainingSet <- trainingPs[inTrain,]
validationSet <- trainingPs[-inTrain,]

Summary of the original, training, and validation datasets:

CrossValSummary <- rbind(Original_data     = dim(trainingPs),
                         training_subset   = dim(trainingSet),
                         validation_subset = dim(validationSet))
colnames(CrossValSummary) <- c("Observations", "Predictors")
CrossValSummary
##                   Observations Predictors
## Original_data            19622         54
## training_subset          13737         54
## validation_subset         5885         54

Building Models

Build two models using different methodologies: a random forest and gradient boosted trees (GBM). Both are tree-based classification methods, fit here through the caret package.

# Fit a random forest model and a GBM (both take a LONG time to train).
# The calls are left commented out; the knitted document uses the fitted objects
# already present in the global environment.
# modFit <- train(classe ~ ., data = trainingSet, method = "rf", prox = TRUE)
# modFit_gbm <- train(classe ~ ., data = trainingSet, method = "gbm", verbose = FALSE)
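
Because training is slow, one option is to cache the fitted objects between knits instead of relying on the global environment. A minimal sketch using saveRDS/readRDS (the .rds file names are assumptions):

# train once, then reuse the saved fits on subsequent knits
if (file.exists("modFit.rds") && file.exists("modFit_gbm.rds")) {
  modFit     <- readRDS("modFit.rds")
  modFit_gbm <- readRDS("modFit_gbm.rds")
} else {
  modFit     <- train(classe ~ ., data = trainingSet, method = "rf", prox = TRUE)
  modFit_gbm <- train(classe ~ ., data = trainingSet, method = "gbm", verbose = FALSE)
  saveRDS(modFit, "modFit.rds")
  saveRDS(modFit_gbm, "modFit_gbm.rds")
}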

Both methodologies converge on an optimal model with a resampled accuracy estimate: 0.996 for the random forest and 0.983 for the GBM. Additionally, the random forest model reports an out-of-bag error estimate of 0.18% (i.e., an estimate of the out-of-sample error rate).
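
These figures can be read directly from the fitted objects; for example (resampled accuracy from caret, out-of-bag estimate from the underlying randomForest fit):

# best resampled accuracy for each model
max(modFit$results$Accuracy)        # random forest, ~0.996
max(modFit_gbm$results$Accuracy)    # GBM, ~0.983

# the final random forest prints its out-of-bag error estimate
modFit$finalModel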

Evaluating Model Accuracies

To evaluate the two models, use the validation subset: predict the classification for each observation and compare the predictions with the true classes.

  1. Random Forest Model:
# First for the Random Forest Model
Predict_rf <- predict(modFit, validationSet)
CM_RF <- confusionMatrix(Predict_rf, validationSet$classe)
CM_RF$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9981308      0.9976357      0.9966580      0.9990666      0.2844520 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN

The random forest model has the better accuracy of the two.

  2. GBM Model:
# Second for the GBM model
Predict_gbm <- predict(modFit_gbm, validationSet)
CM_GBM <- confusionMatrix(Predict_gbm, validationSet$classe)
CM_GBM$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9908241      0.9883926      0.9880442      0.9930995      0.2844520 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN

The GBM model has good accuracy, but not quite as good as the random forest.

The tables below show how each model predicted the validation dataset across the classifications:

CM_RF$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    0 1139    1    0    1
##          C    0    0 1025    8    0
##          D    0    0    0  955    0
##          E    0    0    0    1 1081
CM_GBM$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    7    0    0    0
##          B    2 1121    9    2    1
##          C    0   11 1015   13    0
##          D    1    0    0  948    5
##          E    0    0    2    1 1076

The random forest model performed better across all classifications.
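
Per-class sensitivity (recall) from the two confusion matrices makes the comparison explicit:

# sensitivity by class for each model, side by side
round(cbind(RF  = CM_RF$byClass[, "Sensitivity"],
            GBM = CM_GBM$byClass[, "Sensitivity"]), 4)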

Predicting Results from a Testing Dataset

Finally, use the random forest model to predict the classifications for the testing dataset:

modelPredictions <- predict(modFit, testing)
cbind(testing[,1:2], classe = modelPredictions)
##     X user_name classe
## 1   1     pedro      B
## 2   2    jeremy      A
## 3   3    jeremy      B
## 4   4    adelmo      A
## 5   5    eurico      A
## 6   6    jeremy      E
## 7   7    jeremy      D
## 8   8    jeremy      B
## 9   9  carlitos      A
## 10 10   charles      A
## 11 11  carlitos      B
## 12 12    jeremy      C
## 13 13    eurico      B
## 14 14    jeremy      A
## 15 15    jeremy      E
## 16 16    eurico      E
## 17 17     pedro      A
## 18 18  carlitos      B
## 19 19     pedro      B
## 20 20    eurico      B
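
If the individual predictions need to be saved (for example, one file per test case for submission), a small helper like the following can be used; the file-naming scheme is an assumption:

# write one prediction per text file: problem_1.txt ... problem_20.txt
for (i in seq_along(modelPredictions)) {
  writeLines(as.character(modelPredictions[i]), sprintf("problem_%d.txt", i))
}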