Machine Learning applied to Human Activity Recognition

The goal of this research is to explore a dataset of recorded values from life log systems for monitoring energy expenditure and for supporting weight-loss programs, and digital assistants for weight lifting exercises.

For more details please refer to: Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science. , pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

In this work see the paper we first define quality of execution and investigate three aspects that pertain to qualitative activity recognition:

We tried out an on-body sensing approach and six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).


Given a training set of 19622 recordings with 158 predictors and the correspondent outcome variable (classe), in this research we try to fit a model to predict the outcome with a great accuracy, using Machine Learning algorithms. One of the best method to achieve a great level of accuracy is the Random Forest and in this study it is applied with bootstrap, 25 resampling iterations and 25 cross validations. The training set has been divided at 75% to fit the model and 25% for validation. With the hypothesis set above, we expect a low out of sample error rate, less than 5%.

Source code used for the project

The following code has been used to achieve the model fitting with the Random Forest algorithm. The code is self-commented

training <- read.csv("./datasets/pml-training.csv", row.names = 1)
testing <- read.csv("./datasets/pml-testing.csv", row.names = 1)

# from exploratory analysis many of the predictor variables in the training set are sparse.
# they are removed from the training set as well as some other variables "num_window", "user_name" and timestamps.
to_remove <- c("cvtd_timestamp", "raw_timestamp_part_1")              
to_remove <- c(to_remove, grep("kurtosis", names(training), value=T))
to_remove <- c(to_remove, grep("skewness", names(training), value=T))
to_remove <- c(to_remove, grep("_yaw", names(training), value=T))
to_remove <- c(to_remove, grep("_pitch", names(training), value=T))
to_remove <- c(to_remove, grep("_picth", names(training), value=T))
to_remove <- c(to_remove, grep("_roll", names(training), value=T))
to_remove <- c(to_remove, grep("var_accel", names(training), value=T))
to_remove <- c(to_remove, "new_window", "var_total_accel_belt", "user_name")

# so clean out the train set!
training <- training[, setdiff(names(training), to_remove)]

# search for predictors with low variance to remove
other_predictor_to_discard <- nearZeroVar(training)
if (length(other_predictor_to_discard)>0) {
    training <- training[, -other_predictor_to_discard]
# classe is the outcome
classe <- training[, "classe"]
# remove the outcome from the train set of predictors
classe_col_idx <- which(colnames(training)=="classe")
training <- training[, -classe_col_idx]
# Fit the model
modelFit2 <- train(y=classe, x=training, method="rf")
# remove the same predictors from the testing data set too
testing <- testing[, setdiff(names(testing), to_remove)]
# predict the outcomes in the testing data set
pred2 <- predict(modelFit2, testing)

Expectation of the out of sample error

outOfSample <- function(obj) {
    sum(obj$trainingData$.outcome != obj$finalModel$predicted)/length(obj$trainingData$.outcome)

message("Out of sample error is ", round(outOfSample(modelFit2)*100, 2), " %")
## Out of sample error is 0.13 %

Out of sample error is very low and within the expected value.

Other interesting results from the model

## Random Forest 
## 19622 samples
##    54 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 19622, 19622, 19622, 19622, 19622, 19622, ... 
## Resampling results across tuning parameters:
##   mtry  Accuracy  Kappa   Accuracy SD  Kappa SD
##    2    0.9943    0.9928  0.002161     0.002732
##   28    0.9965    0.9955  0.002202     0.002783
##   54    0.9912    0.9889  0.005180     0.006548
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 28.

Plot section

In the following plots we can see which predictors have the main impact on the outcome values.

plot of chunk charts plot of chunk charts