-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.Rmd
135 lines (96 loc) · 4.61 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
---
title: "Weight Lifting Exercise Quality"
author: "Luis Talavera"
date: "August 9th 2022"
output:
html_document:
keep_md: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment= "", message = FALSE)
```
## Summary
The goal of this project is create a model to predict the quality of how an exercise is performed using data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: [http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har](http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har) (see the section on the Weight Lifting Exercise Dataset).
## The data
The data used in this report was recorded by accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. Participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
exactly according to the specification (Class A), throwing the elbows to the
front (Class B), lifting the dumbbell only halfway (Class C), lowering the
dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
Class A corresponds to the specified execution of the exercise, while the other 4
classes correspond to common mistakes.
### Load and clean data
```{r load_data}
library(readr)
url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
pml_training <- read_csv(url_train, na=c("NA","#DIV/0!",""))
pml_testing <- read_csv(url_test, na=c("NA","#DIV/0!",""))
dim(pml_training)
```
For the feature selection we will first remove the near zero variance variables since they are considered to have less predictive power, then remove the variables that are mostly NA and finally the identification variables.
```{r clean_data}
library(caret)
# Remove variables with Nearly Zero Variance
nzv <- nearZeroVar(pml_training, saveMetrics = TRUE)
pml_training <- pml_training[, nzv$nzv==FALSE]
pml_testing <- pml_testing[, nzv$nzv==FALSE]
# Remove variables that are mostly NA
pml_training <- Filter(function(x) mean(is.na(x)) < 0.6, pml_training)
pml_testing <- Filter(function(x) mean(is.na(x)) < 0.6, pml_testing)
# Remove identification variables
pml_training <- pml_training[-(1:5)]
pml_testing <- pml_testing[-(1:5)]
# Categorical variable
pml_training$classe <- factor(pml_training$classe)
```
```{r parallel, echo = FALSE}
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
```
## The model
The model will be built using the random forest method and the data will be divided in a ratio of 70-30 to perform training and testing.
Cross validation is done with K = 5.
```{r split_data}
set.seed(1234)
trainIndex <- createDataPartition(y=pml_training$classe, times=1,
p=0.7, list=FALSE)
pmltrain <- pml_training[trainIndex, ]
pmltest <- pml_training[-trainIndex, ]
train.names <- colnames(pml_training[, -ncol(pml_training)])
pml_testing <- pml_testing[train.names]
```
```{r model, cache = TRUE}
set.seed(1234)
fitControl <- trainControl(method = "cv",
number = 5,
allowParallel = TRUE)
fit <- train(classe ~ ., method="rf", data=pmltrain, trControl = fitControl)
fit$finalModel
```
## Prediction with Random Forest
Now, we will use our model to predict test data and plot a confusion matrix
```{r predict, cache = TRUE}
predictFit <- predict(fit, newdata = pmltest)
cm <- confusionMatrix(predictFit, pmltest$classe)
cm
```
```{r heatmap, echo=FALSE}
library(ggplot2)
df <- as.data.frame(cm$table)
ggplot(df, aes(Prediction, Reference, fill = Freq)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
labs(x = "Prediction", y = "Actual", title = "Confusion matrix of quality",
fill = "Select") +
geom_text(aes(label = Freq), color = "white", size = 4) +
coord_fixed()
```
```{r turnOffParallelism, echo = FALSE}
stopCluster(cluster)
registerDoSEQ()
```
## Bibliography
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13) . Stuttgart, Germany: ACM SIGCHI, 2013.