This report builds a machine learning model to predict how well someone performs a weight-lifting exercise. Movements are tracked through accelerometers in wearable devices (such as the Jawbone Up, Nike FuelBand, and Fitbit). In the underlying study, measurements were collected and labeled while participants performed the exercise correctly (Class A) and while they made common mistakes (Classes B through E).
The following outlines the approach taken to build the predictive model:
The data and its description are available from http://groupware.les.inf.puc-rio.br/har. For this assignment, the data were downloaded and stored locally in two files: pml-training.csv, used to train the model, and pml-testing.csv, containing the observations to be classified (predicted) by the model.
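As a convenience, the files can be fetched directly. The URLs below are the course mirror commonly used for this dataset and are an assumption, not taken from the original write-up:
# Download the data files if not already present (URLs are an assumption)
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainURL, "pml-training.csv")
if (!file.exists("pml-testing.csv"))  download.file(testURL, "pml-testing.csv")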
# Load the caret package and the datasets, converting missing-value markers to NA
library(caret)
training <- read.csv("pml-training.csv", na.strings = c("", "NA", "#DIV/0!"))
testing  <- read.csv("pml-testing.csv",  na.strings = c("", "NA", "#DIV/0!"))
# Remove columns that are not predictors:
# the first five columns hold non-predictive values (row index, user name, timestamps)
trainingPs <- training[, -(1:5)]
# Remove predictors with near-zero variance (values that barely vary across observations)
trainingPs <- trainingPs[, -nearZeroVar(trainingPs, saveMetrics = FALSE)]
# Remove columns that contain NAs
rem.columns <- names(which(colSums(is.na(trainingPs)) > 0))
trainingPs <- trainingPs[, !(names(trainingPs) %in% rem.columns)]
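A quick sanity check (a sketch, not part of the original output) confirms the cleaning worked:
# Verify: no missing values remain, and note the reduced column count
sum(is.na(trainingPs))   # should be 0
dim(trainingPs)          # 19622 rows, 54 columns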
Split the cleaned data into a training set and a validation set:
inTrain <- createDataPartition(y = trainingPs$classe, p = 0.7, list = FALSE)
trainingSet <- trainingPs[inTrain,]
validationSet <- trainingPs[-inTrain,]
Summary of the cleaned dataset and the training and validation subsets:
CrossValSummary <- rbind(Original_data = dim(trainingPs),
                         training_subset = dim(trainingSet),
                         validation_subset = dim(validationSet))
colnames(CrossValSummary) <- c("Observations", "Columns")
CrossValSummary
##                   Observations Columns
## Original_data            19622      54
## training_subset          13737      54
## validation_subset         5885      54
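Because createDataPartition samples within each class, the class proportions should be nearly identical in the two subsets; a quick check (a sketch, not part of the original output):
# Compare class proportions between the training and validation subsets
round(rbind(training = prop.table(table(trainingSet$classe)),
            validation = prop.table(table(validationSet$classe))), 3)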
Build two models using different methodologies: a random forest and a generalized boosted model (GBM). Both are tree-based classification methods, fit here through the caret package.
# Fit a random forest model and a GBM (these take a LONG time to run!).
# The calls are commented out so that knitting the HTML reuses the models
# already fitted in the global environment.
# modFit <- train(classe ~ ., data = trainingSet, method = "rf", prox = TRUE)
# modFit_gbm <- train(classe ~ ., data = trainingSet, method = "gbm", verbose = FALSE)
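One way to shorten the run time (a sketch, not used in the original analysis) is to replace caret's default bootstrap resampling with k-fold cross-validation via trainControl:
# Sketch: 5-fold cross-validation is much cheaper than the default 25 bootstrap resamples
# ctrl <- trainControl(method = "cv", number = 5)
# modFit <- train(classe ~ ., data = trainingSet, method = "rf", trControl = ctrl)
# modFit_gbm <- train(classe ~ ., data = trainingSet, method = "gbm", trControl = ctrl, verbose = FALSE)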
Both methodologies tune to an optimal model and report a resampling accuracy: 0.996 for the random forest and 0.983 for the GBM. Additionally, the random forest reports an out-of-bag (OOB) error estimate of 0.18%, which serves as an estimate of the out-of-sample error rate.
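The OOB estimate is reported by the underlying random forest itself and can be inspected by printing the final model (the caret fit stores it in $finalModel):
# Print the underlying randomForest object, including its OOB error estimate
modFit$finalModel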
To evaluate the two models, we use the validation subset: generate predicted classes for each observation and compare them with the true classes.
# First, the random forest model
Predict_rf <- predict(modFit, validationSet)
CM_RF <- confusionMatrix(Predict_rf, validationSet$classe)
CM_RF$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9981308 0.9976357 0.9966580 0.9990666 0.2844520
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
The random forest model classifies the validation set with 99.8% accuracy.
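The corresponding out-of-sample error estimate follows directly from the validation accuracy (a quick sketch using the confusion matrix above):
# Estimated out-of-sample error rate for the random forest model
1 - CM_RF$overall["Accuracy"]   # roughly 0.0019, i.e. about 0.19%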
# Second, the GBM model
Predict_gbm <- predict(modFit_gbm, validationSet)
CM_GBM <- confusionMatrix(Predict_gbm, validationSet$classe)
CM_GBM$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9908241 0.9883926 0.9880442 0.9930995 0.2844520
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
The GBM model is also accurate (99.1%), though not quite as accurate as the random forest.
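For a direct comparison (a sketch built from the two confusion matrices above):
# Validation accuracy of the two models, side by side
round(c(RF = unname(CM_RF$overall["Accuracy"]),
        GBM = unname(CM_GBM$overall["Accuracy"])), 4)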
The tables below show how each model predicted the validation dataset across the five classes.
CM_RF$table
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 0 1139 1 0 1
## C 0 0 1025 8 0
## D 0 0 0 955 0
## E 0 0 0 1 1081
CM_GBM$table
## Reference
## Prediction A B C D E
## A 1671 7 0 0 0
## B 2 1121 9 2 1
## C 0 11 1015 13 0
## D 1 0 0 948 5
## E 0 0 2 1 1076
The random forest model made fewer errors in every class.
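Per-class sensitivity (recall) makes this explicit; confusionMatrix stores per-class statistics in $byClass (a sketch, not part of the original output):
# Sensitivity for each class under both models
round(rbind(RF = CM_RF$byClass[, "Sensitivity"],
            GBM = CM_GBM$byClass[, "Sensitivity"]), 4)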
Finally, predictions for the test data using the random forest model:
modelPredictions <- predict(modFit, testing)
cbind(testing[,1:2], classe = modelPredictions)
## X user_name classe
## 1 1 pedro B
## 2 2 jeremy A
## 3 3 jeremy B
## 4 4 adelmo A
## 5 5 eurico A
## 6 6 jeremy E
## 7 7 jeremy D
## 8 8 jeremy B
## 9 9 carlitos A
## 10 10 charles A
## 11 11 carlitos B
## 12 12 jeremy C
## 13 13 eurico B
## 14 14 jeremy A
## 15 15 jeremy E
## 16 16 eurico E
## 17 17 pedro A
## 18 18 carlitos B
## 19 19 pedro B
## 20 20 eurico B
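For the assignment submission, each prediction can be written to its own text file; the helper below is a hypothetical sketch, not part of the original script:
# Hypothetical helper: write one text file per test case for submission
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(modelPredictions))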