# Evaluate machine learning algorithms with R
# Source: http://machinelearningmastery.com/evaluate-machine-learning-algorithms-with-r/
# Data: https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names
# Set the working directory to access the csv file
setwd("C:\\Users\\mdeleseleuc\\Documents")
mydata <- read.csv(file = "PID.csv", header = TRUE, sep = ";")
head(mydata, 5)
# Notes:
# Missing values seem to have been encoded as "0"; they need to be transformed (e.g. to NA) before calculations.
# Class: 1 = tested positive for diabetes.
# BMI: 30-35 = Obese class I (18.5-25 = Normal). Intuition: this variable should be predictive (source: http://www.calculator.net/bmi-calculator.html).
# Question: are BMI and blood pressure correlated?
# The data are skewed toward negative results (65% of the total).
# Class should be a factor, and possibly BMI too.
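# A minimal sketch of the zero-to-NA transformation suggested above, assuming
# the CSV uses the mlbench column names (glucose, pressure, triceps, insulin,
# mass, diabetes); adjust the names to match the actual headers in PID.csv.
zero_means_missing <- c("glucose", "pressure", "triceps", "insulin", "mass")
for (col in intersect(zero_means_missing, names(mydata))) {
  mydata[[col]][mydata[[col]] == 0] <- NA # a measured value of 0 is implausible here
}
mydata$diabetes <- as.factor(mydata$diabetes) # treat the class as a factor
summary(mydata) # check the NA counts per column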
# Objective:
# This case study is broken down into 3 sections:
# Defining a test harness.
# Building multiple predictive models from the data.
# Comparing models and selecting a short list.
# 1. Test Harness
# The dataset we use to spot-check algorithms should be representative of our problem, but it does not have to be all of our data.
# A large dataset could cause some of the more computationally intensive algorithms we want to check to take a long time to train.
# Rule of thumb: each algorithm should train within 1-2 minutes (ideally under 30 seconds).
# With a large dataset: take random samples, train one simple model (glm), and see how long it takes (see the sketch below).
# Select a sample size that falls within that sweet spot.
# The experiment can be repeated later with a larger sample, once a smaller subset of promising algorithms has been identified.
# In this example there are only 768 instances, so we will use everything.
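# A rough sketch of the timing check described above, assuming the data frame
# `mydata` loaded earlier has the outcome column `diabetes`; with a truly large
# dataset you would subsample first (the 10,000 rows here are an arbitrary example):
n <- min(10000, nrow(mydata))
idx <- sample(nrow(mydata), n)
timing <- system.time(
  glm(diabetes ~ ., data = mydata[idx, ], family = binomial)
)
print(timing) # elapsed time should sit well inside the 1-2 minute budget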
# load libraries
library(mlbench) # A collection of artificial and real-world machine learning benchmark problems
library(caret)
# load data
data(PimaIndiansDiabetes)
# rename dataset to keep code below generic
dataset <- PimaIndiansDiabetes
head(dataset)
# Test Options
# The test option refers to the technique used to evaluate the accuracy of a model on unseen data,
# often referred to as resampling methods in statistics.
# Recommendations:
# Train/Test split: if you have a lot of data and determine that you need a lot of data to build accurate models.
# Cross validation: 5 or 10 folds provide a commonly used tradeoff between compute time and the quality of the generalization error estimate.
# Repeated cross validation: 5- or 10-fold cross validation with 3 or more repeats gives a more robust estimate,
# but only if you have a small dataset and can afford the time.
# In this case study we will use 10-fold cross validation with 3 repeats.
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7 # lets us reset the random number generator before training each algorithm
# (ensures that each algorithm is evaluated on exactly the same splits of the data)
# Explanations about trainControl: https://topepo.github.io/caret/training.html
# The trainControl function generates the parameters that further control how models are created.
# Explanations about cross validation: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
# Test Metric
# Some good test metrics for different problem types include:
# Classification:
# Accuracy: the number of correct predictions divided by the total number of predictions made,
# multiplied by 100 to turn it into a percentage. Easy to understand and widely used.
# Kappa: easily understood as accuracy that takes the base distribution of classes into account.
# Regression:
# RMSE: root mean squared error. Again, easy to understand and widely used.
# Rsquared: the goodness of fit, or coefficient of determination.
# Other popular measures include ROC and LogLoss.
metric <- "Accuracy"
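# To make the two classification metrics concrete, a toy sketch with made-up
# labels; caret's confusionMatrix() reports both Accuracy and Kappa:
pred <- factor(c("pos", "pos", "neg", "neg", "neg", "pos"), levels = c("neg", "pos"))
obs  <- factor(c("pos", "neg", "neg", "neg", "pos", "pos"), levels = c("neg", "pos"))
confusionMatrix(pred, obs) # Accuracy = 4/6 correct; Kappa adjusts for the class balance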
# 2. Model Building
# Three concerns:
# What models to choose?
# How to configure their arguments?
# How to preprocess the data for the algorithms?
# Algorithms
# It is important to have a good mix of algorithm representations (lines, trees, instances, etc.) as well as algorithms for learning those representations.
# A good rule of thumb I use is "a few of each"; for example, in the case of binary classification:
# Linear methods: Linear Discriminant Analysis and Logistic Regression.
# Non-linear methods: Neural Network, SVM, kNN and Naive Bayes.
# Trees and rules: CART, J48 and PART.
# Ensembles of trees: C5.0, Bagged CART, Random Forest and Stochastic Gradient Boosting.
# We want some low-complexity, easy-to-interpret methods (LDA, kNN, ...) in case they do well and we can adopt them.
# We also want sophisticated methods (random forest, ...) to see whether the problem can be learned at all and to start building up an expectation of achievable accuracy.
# Ideally at least 10-to-20 different algorithms.
# Algorithm Configuration
# All machine learning algorithms are parameterized (they require arguments to be specified).
# Most algorithm parameters have heuristics that you can use to provide a first-pass configuration and get the ball rolling.
# When spot checking, we do not want to try many variations of algorithm parameters (that comes later).
# The caret package helps with tuning algorithm parameters, and can also estimate good defaults
# (auto-tuning functionality via the tuneLength argument to train(); see the sketch below).
# Recommend using the defaults for most if not all algorithms when spot checking.
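# A brief illustration of the auto-tuning mentioned above (a sketch, not part
# of the spot check itself): tuneLength tells train() how many candidate values
# to try per tuning parameter; here kNN is evaluated at 5 candidate values of k.
set.seed(seed)
fit.knn.tuned <- train(diabetes~., data=dataset, method="knn", metric=metric,
                       preProc=c("center", "scale"), tuneLength=5,
                       trControl=control)
print(fit.knn.tuned$bestTune) # the value of k selected among the 5 candidates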
# Data Preprocessing
# The most useful transform is to scale and center the data, via the preProcess argument to train(), e.g.:
# preProcess = c("center", "scale")
# Algorithm Spot Check
# Load the packages needed by the methods below
library(e1071)   # support functions required by several caret methods
library(MASS)    # lda
library(glmnet)  # glmnet (regularized regression)
library(kernlab) # svmRadial
library(klaR)    # nb (Naive Bayes)
library(rpart)   # rpart (CART)
library(C50)     # C5.0
library(plyr)    # data-manipulation helpers required by several methods
library(ipred)   # treebag (Bagged CART)
library(gbm)     # gbm (Stochastic Gradient Boosting)
# Linear Discriminant Analysis
set.seed(seed)
fit.lda <- train(diabetes~., data=dataset, method="lda", metric=metric, preProc=c("center", "scale"), trControl=control)
# Logistic Regression
set.seed(seed)
fit.glm <- train(diabetes~., data=dataset, method="glm", metric=metric, trControl=control)
# GLMNET
set.seed(seed)
fit.glmnet <- train(diabetes~., data=dataset, method="glmnet", metric=metric, preProc=c("center", "scale"), trControl=control)
# SVM Radial
set.seed(seed)
fit.svmRadial <- train(diabetes~., data=dataset, method="svmRadial", metric=metric, preProc=c("center", "scale"), trControl=control, fit=FALSE)
# kNN
set.seed(seed)
fit.knn <- train(diabetes~., data=dataset, method="knn", metric=metric, preProc=c("center", "scale"), trControl=control)
# Naive Bayes
set.seed(seed)
fit.nb <- train(diabetes~., data=dataset, method="nb", metric=metric, trControl=control)
# CART
set.seed(seed)
fit.cart <- train(diabetes~., data=dataset, method="rpart", metric=metric, trControl=control)
# C5.0
set.seed(seed)
fit.c50 <- train(diabetes~., data=dataset, method="C5.0", metric=metric, trControl=control)
# Bagged CART
set.seed(seed)
fit.treebag <- train(diabetes~., data=dataset, method="treebag", metric=metric, trControl=control)
# Random Forest
set.seed(seed)
fit.rf <- train(diabetes~., data=dataset, method="rf", metric=metric, trControl=control)
# Stochastic Gradient Boosting (Generalized Boosted Modeling)
set.seed(seed)
fit.gbm <- train(diabetes~., data=dataset, method="gbm", metric=metric, trControl=control, verbose=FALSE)
# 3. Model Selection
# We are not looking for a best model at this stage; the algorithms have not been tuned and can all likely do a lot better than the results seen here.
# Goal: compare the models and select a short list of the better-performing ones.
results <- resamples(list(lda=fit.lda, logistic=fit.glm, glmnet=fit.glmnet,
                          svm=fit.svmRadial, knn=fit.knn, nb=fit.nb, cart=fit.cart, c50=fit.c50,
                          bagging=fit.treebag, rf=fit.rf, gbm=fit.gbm))
# Table comparison
summary(results)
# The higher the mean, the better the accuracy.
# It is also useful to review the results using a few different visualization techniques
# to get an idea of the mean and spread of accuracies.
# boxplot comparison
bwplot(results)
# Dot-plot comparison
dotplot(results)
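# A further check worth knowing about: diff() on a resamples object tests the
# pairwise differences between the models' resampled accuracy distributions:
diffs <- diff(results)
summary(diffs) # estimates of the differences plus p-values for each model pair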
# From these results, it looks like linear methods do well on this problem.
# I would probably investigate logistic, lda, glmnet, and gbm further.
# With more data, I would repeat the experiment with a larger sample to see whether the larger dataset improves the performance
# of any of the tree methods (it often does).