# caret - a machine learning package tutorial
Jiang Zhu and Xuechun Bai
Motivation: The motivation for this community contribution project is to introduce to classmates an R package that implements machine learning algorithms in a few lines of code. As Python users who commonly rely on `scikit-learn` to deploy machine learning models, we looked for an R package with similar functionality. While searching online, we discovered that `caret` is a great package whose neatly defined functions create an easy-to-use interface for machine learning algorithms; at the same time, it offers an abundance of machine learning models as well as preprocessing and model-selection functions. However, we did not find a tutorial or easy-to-read API reference for `caret` that suited our needs. Therefore, in this markdown file we illustrate how to use the `caret` package in an example-based manner. We avoid the mathematics behind each model as much as possible in order to focus on implementing the algorithms in R, and we demonstrate the results of our experiments with `caret` models using visualization tools.
```{r include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
```{r warning=FALSE}
library(caret)
library(tidyverse)
library(readr)
library(kernlab)
library(earth)
library(RANN)
```
## Introduction and description of the dataset
For this tutorial, we will use a data frame about orange juice sales that covers two brands, Citrus Hill (CH) and Minute Maid (MM), together with selling information for each brand. The dataset is available at https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv and contains 18 columns and 1070 rows.
The first column, "Purchase", records which brand of orange juice was bought. The data also contains the week of purchase (WeekofPurchase), the id of the store where the juice was sold (StoreID), and the price of each brand (PriceCH and PriceMM). It further includes the discounts (DiscCH and DiscMM), the sale prices (SalePriceCH and SalePriceMM), and other useful information about the two brands. The goal of this tutorial is to predict which of the two brands a customer will choose when buying orange juice, so we will focus on "Purchase" as the response.
Now let us first import the data and take a brief look at it:
```{r}
# Import dataset
df <- read.csv('https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv')
# First few rows of the data frame
head(df)
```
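Before splitting the data, it is worth checking the dimensions and how many values are missing in each column (a minimal check with base R; the exact counts depend on the hosted CSV):
```{r}
# Dimensions of the data frame and number of missing values per column
dim(df)
colSums(is.na(df))
```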
## Pre-processing
After importing the dataset, we need to split the data into training and testing sets. The training set provides the examples from which the machine learning models learn to generalize, and the test set is used to evaluate the performance of our models. We can achieve this with `createDataPartition()` from the caret package.
```{r eval=FALSE}
createDataPartition(
y,
times,
p,
list
)
```
1. `y` specifies the column on which we partition the data
2. `times` specifies how many times we partition the data
3. `p` specifies the proportion of the data that goes into the training set
4. `list` is a boolean indicating whether the result is returned as a list
Here we put 70% of the data in the training set and 30% in the testing set.
```{r}
# set a random seed so the partition is reproducible
set.seed(1)
# createDataPartition() returns the row indices of the training set. We partition on the
# "Purchase" column and set p = 0.7 so that 70% of the rows go to training; list = FALSE
# returns a matrix of indices instead of a list.
rowOfTrain <- createDataPartition(df$Purchase, p=0.7, list=FALSE)
# create datasets for training and testing
train_df <- df[rowOfTrain,]
test_df <- df[-rowOfTrain,]
```
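Because `createDataPartition()` samples within each class of `y`, the split should preserve the class balance of `Purchase`. A quick sanity check (a minimal sketch in base R, assuming `Purchase` itself has no missing values):
```{r}
# Proportion of each brand in the full data, the training set and the test set
round(rbind(
  full  = prop.table(table(df$Purchase)),
  train = prop.table(table(train_df$Purchase)),
  test  = prop.table(table(test_df$Purchase))
), 3)
```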
The next step is to handle the missing values (NAs) in the dataset. There are several possible transformations, such as filling a missing value with the mean or mode of its column, or simply deleting the rows that contain missing values. With the caret package we can replace the NAs with more realistic values by predicting them from the other values in the dataset: the `preProcess()` function builds a k-Nearest Neighbors imputation model from the training set.
```{r eval=FALSE}
preProcess(
x,
method
)
```
1. `x` specifies the data frame we want to preprocess
2. `method` is a string (or character vector) specifying the preprocessing method(s)
This gives us a k-Nearest Neighbors model for predicting the missing values. We then use `predict()` to fill in the missing values with its predictions.
```{r}
# Build a k-Nearest Neighbors imputation model on the training data using method = 'knnImpute'
knn_imputer <- preProcess(train_df, method='knnImpute')
knn_imputer
# Fill in the missing values with the model's predictions using predict()
train_df <- predict(knn_imputer, newdata = train_df)
# Check whether any NAs remain in the training set
anyNA(train_df)
```
From the output we can see that the imputation model centered and scaled 16 variables, ignored 2, and was built from 825 samples and 18 variables. After filling in the NAs with `predict()`, we check whether any missing values remain using the `anyNA()` function; the result of `FALSE` shows that all missing values have been replaced.
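Under the hood, `knnImpute` centers and scales the predictors before computing neighbor distances, which is why the output above reports centering and scaling. We can inspect which transformation was applied to which variables through the `method` element of the `preProcess` object (a quick sketch; recent caret versions expose this field):
```{r}
# Which preprocessing step was applied to which variables
knn_imputer$method
```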
Another task the caret package can perform is transforming categorical variables into one-hot vectors, giving them a numerical representation. A one-hot vector is constructed as follows: suppose a categorical variable has $n$ classes. Then we represent class $i$ by an $n$-dimensional vector $o_i$ such that $o_i[i]=1$ and $o_i[j]=0$ for all $j\neq i$.
We can achieve the one-hot transformation using `dummyVars()`:
```{r eval=FALSE}
dummyVars(
formula,
data
)
```
1. `formula` specifies which variables to encode (variables on the right-hand side are expanded into dummy columns)
2. `data` specifies the data frame
```{r warning=FALSE}
# define the one-hot encoding object
one_hot_encoder <- dummyVars(Purchase~., train_df)
# results after applying one-hot encoding
head(data.frame(predict(one_hot_encoder, train_df)))
```
As we can see, `Store7.No` and `Store7.Yes` are the one-hot encoding of the factor `Store7`. We will not actually apply this transformation to the dataset, because the formula interface of `train()` handles factor predictors automatically and accepts a factor as the label.
## Model training
Machine learning models are defined and trained in `caret` with the `train()` function. It is the core function of the package: it hides the tedious details of the individual machine learning algorithms and provides a convenient, concise interface. The `train()` function comes in versions with different parameters:
```{r eval=FALSE}
# first version of train
model <- train(
formula,
data,
method
)
# second version of train
model <- train(
x,
y,
method
)
```
1. `formula` specifies how the model is constructed from the data (the label on the left, the predictors on the right)
2. `data` specifies the training data used to fit the model
3. `method` is a string specifying the model we want to apply to the data
The first version of `train()` is recommended when all of the data lives in a single data frame; in other words, each column represents a feature $x_i$ and one column represents the label $y$. The second version is recommended when we have $x$ and $y$ as separate objects. In the following code, we will use the first version of `train()` for our prediction task.
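For illustration, the two calls below would fit the same model through the two interfaces (a non-evaluated sketch using the hypothetical names `m1` and `m2`):
```{r eval=FALSE}
# formula interface: the label and the predictors live in one data frame
m1 <- train(Purchase ~ ., data = train_df, method = "knn")

# x/y interface: the predictors and the label are supplied separately
m2 <- train(
  x = train_df[, setdiff(names(train_df), "Purchase")],
  y = train_df$Purchase,
  method = "knn"
)
```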
The method parameter is specified conveniently by a string. Currently there are 238 models supported by `caret`, listed at https://topepo.github.io/caret/available-models.html. In this tutorial, we illustrate and apply several common classification algorithms to our dataset.
One amazing feature of `caret` is that, for almost all methods, it automatically searches for good hyperparameter values on behalf of the user.
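By default, `train()` evaluates a small grid of candidate hyperparameter values using bootstrap resampling and keeps the best one. The search can also be controlled explicitly through `trainControl()`, `tuneLength`, or `tuneGrid`; the sketch below (not used in the rest of this tutorial) switches to 5-fold cross-validation and fixes the candidate values of $k$ for a knn model:
```{r eval=FALSE}
# 5-fold cross-validation instead of the default bootstrap resampling
ctrl <- trainControl(method = "cv", number = 5)

# tuneGrid fixes the exact candidate values of k to evaluate;
# alternatively, tuneLength = 10 lets caret pick 10 candidates itself
knn_tuned <- train(
  Purchase ~ .,
  data = train_df,
  method = "knn",
  trControl = ctrl,
  tuneGrid = expand.grid(k = seq(3, 21, by = 2))
)
```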
### K-nearest neighbors
k-nearest neighbors is the most intuitive classification algorithm. Given a positive integer $k$ and a data point $x$, k-nearest neighbors finds the $k$ nearest neighbors of $x$ in the training set based on the Euclidean distance
$$dist(x,x')=\sqrt{(x_1-x'_1)^2+(x_2-x'_2)^2+\cdots + (x_n-x'_n)^2}$$
and classifies $x$ as the class that occurs most often among those nearest neighbors.
To use this algorithm, we specify the parameter `method = "knn"`:
```{r}
# training the knn model
knn_model <- train(
Purchase ~.,
train_df,
method = "knn"
)
knn_model
```
### Logistic Regression
Logistic regression is a linear model for binary classification; in other words, given features $x_1,x_2,\cdots,x_n$, we want to predict $y\in\{0,1\}$. The model can be described by the equation
$$\hat{y}=\sigma(w_1\cdot x_1+w_2\cdot x_2+\cdots+w_n\cdot x_n)$$
where the $\sigma$ function is defined by
$$\sigma(x)=\frac{1}{1+e^{-x}}$$
We hope that the logistic regression algorithm finds the optimal $w_1,\cdots,w_n$ such that $\hat{y}$ is as close to the true label $y$ as possible.
To use this algorithm, we specify the parameter `method = "glm"`:
```{r warning=FALSE}
# training the logistic regression model
lr_model <- train(
Purchase ~.,
train_df,
method = "glm"
)
lr_model
```
### Support vector machine
A support vector machine is a hyperplane that separates the input space into two subspaces; in the case where we only have $x_1$ and $x_2$, we want to draw a line that separates the Cartesian plane. The criterion for choosing the hyperplane is to maximize the distance to the closest data points from both classes. This gives rise to the optimization objective
$$\max_{w,b}\frac{1}{\vert\vert w\vert\vert_2}\min_{x_i\in D}\vert w^Tx_i+b\vert,\quad \text{ s.t. }\forall i,\quad y_i(w^Tx_i+b)\ge 0 $$
To use this algorithm, we specify the parameter `method = "svmRadial"`:
```{r warning=FALSE}
# training the support vector machine model; the kernlab library is needed to perform the kernel transformation
svm_model <- train(
Purchase ~.,
train_df,
method = "svmRadial"
)
svm_model
```
### Random Forest
The decision tree algorithm learns to predict the value of a target variable from simple decision rules inferred from the data features. To construct the tree, we first select a root feature $x_i$ and a decision boundary $b_i$; we then recursively select child nodes and corresponding decision thresholds $b$ so as to maximize the information gain. Since different initializations produce different trees, a random forest builds multiple trees from different random initializations and averages the results produced by the individual trees.
A random forest can be requested with `method = "rf"` (a non-evaluated sketch is shown after the chunk below). The chunk below instead specifies `method = "earth"`, which fits a multivariate adaptive regression splines (MARS) model from the `earth` package; the remaining sections refer to this model as `rf_model`.
```{r warning=FALSE}
# training the model with method = 'earth'; the earth library is needed
rf_model <- train(
Purchase ~.,
train_df,
method = "earth"
)
rf_model
```
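For reference, a random forest proper can be trained as follows (a non-evaluated sketch; it assumes the `randomForest` package is installed and uses the hypothetical object name `rf_proper`):
```{r eval=FALSE}
# Random forest via the randomForest package; caret tunes mtry,
# the number of predictors sampled at each split
rf_proper <- train(
  Purchase ~ .,
  train_df,
  method = "rf",
  tuneLength = 3
)
rf_proper
```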
## Model prediction and evaluation
After we have successfully trained the models, it is time to evaluate them on the test set to check whether they generalize well. The key function here is again `predict()`:
```{r eval=FALSE}
predict(
object,
newdata
)
```
1. `object` is a trained model (or, as above, a `preProcess` object)
2. `newdata` is the data frame for which predictions are computed
To evaluate the models trained above, we call `predict()` on each model. First, the test data must be imputed with the same `preProcess` object that was fitted on the training data:
```{r}
# impute the missing values in the test data with the imputer fitted on the training set
test_df <- predict(knn_imputer, test_df)
```
Then we can compute the confusion matrix for each model.
Confusion matrix for the knn model:
```{r}
knn_cm <- confusionMatrix(reference = factor(test_df$Purchase), data = predict(knn_model, test_df), mode='everything', positive='MM')
knn_cm$table
```
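Beyond the raw counts, the `confusionMatrix` object also stores summary statistics. For example, the overall accuracy and the per-class precision, recall and F1 for the positive class `MM` can be pulled out directly (field names as returned by caret when `mode = 'everything'`):
```{r}
# Overall accuracy and per-class metrics for the positive class 'MM'
knn_cm$overall["Accuracy"]
knn_cm$byClass[c("Precision", "Recall", "F1")]
```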
```{r}
ggplot(data = data.frame(knn_cm$table), aes(x=Reference, y=Prediction))+
geom_tile(aes(fill=Freq)) +
scale_fill_gradient2(low="light blue", high="blue") +
geom_text(aes(label=Freq)) +
ggtitle("Confution matrix for KNN model")
```
Confusion matrix for the logistic regression model:
```{r}
lr_cm <- confusionMatrix(reference = factor(test_df$Purchase), data = predict(lr_model, test_df), mode='everything', positive='MM')
lr_cm$table
```
```{r}
ggplot(data = data.frame(lr_cm$table), aes(x=Reference, y=Prediction))+
geom_tile(aes(fill=Freq)) +
scale_fill_gradient2(low="light blue", high="blue") +
geom_text(aes(label=Freq)) +
ggtitle("Confution matrix for Logistics Regression model")
```
Confusion matrix for the svm model:
```{r}
svm_cm <- confusionMatrix(reference = factor(test_df$Purchase), data = predict(svm_model, test_df), mode='everything', positive='MM')
svm_cm$table
```
```{r}
ggplot(data = data.frame(svm_cm$table), aes(x=Reference, y=Prediction))+
geom_tile(aes(fill=Freq)) +
scale_fill_gradient2(low="light blue", high="blue") +
geom_text(aes(label=Freq)) +
ggtitle("Confution matrix for SVM model")
```
Confusion matrix for the random forest model:
```{r}
rf_cm <- confusionMatrix(reference = factor(test_df$Purchase), data = predict(rf_model, test_df), mode='everything', positive='MM')
rf_cm$table
```
```{r}
ggplot(data = data.frame(rf_cm$table), aes(x=Reference, y=Prediction))+
geom_tile(aes(fill=Freq)) +
scale_fill_gradient2(low="light blue", high="blue") +
geom_text(aes(label=Freq)) +
ggtitle("Confution matrix for random forest model")
```
The `caret` package also provides a convenient function, `resamples()`, to compare the resampling metrics across models.
```{r}
model_list <- list(KNN = knn_model, LR = lr_model, SVM = svm_model, RF = rf_model)
models_compare <- resamples(model_list)
summary(models_compare)
```
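The resampling distributions can also be compared visually with caret's lattice methods for `resamples` objects (a minimal sketch; `bwplot()` comes from the lattice package, which caret loads):
```{r}
# Box-and-whisker plot of the resampled accuracy for each model
bwplot(models_compare, metric = "Accuracy")
```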
As we can see, the random forest model has the best performance on the dataset.
## References
1. Selva Prabhakaran, *Caret Package – A Practical Guide to Machine Learning in R*: https://www.machinelearningplus.com/machine-learning/caret-package/#61howtotrainthemodelandinterprettheresults?
2. Rebecca Barter, *A basic tutorial of caret: the machine learning package in R*: https://www.rebeccabarter.com/blog/2017-11-17-caret_tutorial/
3. Max Kuhn, *The caret Package*: https://topepo.github.io/caret/index.html