---
output:
pdf_document: default
html_document: default
---
```{r setup, echo = FALSE}
knitr::opts_chunk$set(error = TRUE)
```
# Cross-Validation
In this lab, we explore the concepts of cross-validation.
## The Validation Set Approach
We explore the use of the validation set approach in order to estimate the test error rates that result from fitting various linear models on the `Auto` data set.
Before we begin, we use the `set.seed()` function in order to set a *seed* for `R`'s random number generator, so that the reader will obtain precisely the same results as those shown below. It is generally a good idea to set a random seed when performing an analysis such as cross-validation that contains an element of randomness, so that the results obtained can be reproduced precisely at a later time.
We begin by holding out a separate test set, `Auto_test`, and then using the `sample()` function to select a random subset of $196$ of the remaining observations. We refer to these observations as the training set. (The original ISLR lab simply draws $196$ out of all $392$ observations; we add the extra held-out test set so we can check the chosen model at the end.)
```{r chunk1}
###############################
# Lesson plan:
# 1. clarify concepts again
# 2. any doubts?
# 3. ISLR book
# 4. challenge questions
###############################
library(ISLR2)
set.seed(1)
nrow(Auto)
# View(Auto)

# explore the Auto dataset
plot(Auto$horsepower, Auto$mpg)
# how would we model this?

# hold out a test set with an (arbitrary) 75%-25% split
sampled_indices <- sample(c(TRUE, FALSE),
                          nrow(Auto),
                          replace = TRUE,
                          prob = c(0.75, 0.25))
# training portion (used for everything below)
Auto_training_cv <- Auto[sampled_indices, ]
# held-out test portion (used only at the very end)
Auto_test <- Auto[!sampled_indices, ]
# from here on, "Auto" refers to the training portion only
Auto <- Auto_training_cv
# train <- sample(392, 196)
train <- sample(nrow(Auto), 196)
```
(Here we use a shortcut in the sample command; see `?sample` for details.)
We then use the `subset` option in `lm()` to fit a linear regression using only the observations corresponding to the training set.
```{r chunk2}
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
```
We now use the `predict()` function to estimate the response for all observations in `Auto`, and the `mean()` function to calculate the MSE of the observations in the validation set. (In the original ISLR lab these are $392$ and $196$ observations; here the counts are smaller because we first held out `Auto_test`.) Note that the `-train` index below selects only the observations that are not in the training set.
```{r chunk3}
# attach so that mpg can be referenced directly below
attach(Auto)
mean((mpg - predict(lm.fit, Auto))[-train]^2)
```
Therefore, we obtain an estimate of the test MSE for the linear regression fit. (The original ISLR lab, which uses all $392$ observations, reports $23.27$; because we first held out `Auto_test`, the exact values printed throughout this document will differ slightly from the book's.) We can use the `poly()` function to estimate the test error for the quadratic and cubic regressions.
```{r chunk4}
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto,
              subset = train)
mean((mpg - predict(lm.fit2, Auto))[-train]^2)
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto,
              subset = train)
mean((mpg - predict(lm.fit3, Auto))[-train]^2)
```
These error rates are $18.72$ and $18.79$, respectively, in the original lab.
If we choose a different training set instead, then we will obtain somewhat different errors on the validation set.
```{r chunk5}
set.seed(2)
# train <- sample(392, 196)
train <- sample(nrow(Auto), 196)
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
mean((mpg - predict(lm.fit, Auto))[-train]^2)
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto,
              subset = train)
mean((mpg - predict(lm.fit2, Auto))[-train]^2)
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto,
              subset = train)
mean((mpg - predict(lm.fit3, Auto))[-train]^2)
```
Using this split of the observations into a training set and a validation set,
we find that the validation set error rates for the models with linear, quadratic, and cubic terms are $25.73$, $20.43$, and $20.39$, respectively, in the original lab.
These results are consistent with our previous findings: a model that predicts `mpg` using a quadratic function of `horsepower` performs better than a model that involves only a linear function of `horsepower`, and there is little evidence in favor of a model that uses a cubic function of `horsepower`.
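To see the main drawback of the validation set approach directly, we can repeat the split several times: each random split yields a somewhat different error estimate. The short sketch below is our own addition (not part of the original ISLR lab) and simply loops over ten seeds.
```{r chunk_split_variability}
# A minimal sketch: the validation-set MSE varies with the random split
val.mse <- rep(0, 10)
for (s in 1:10) {
  set.seed(s)
  train.s <- sample(nrow(Auto), 196)
  fit.s <- lm(mpg ~ horsepower, data = Auto, subset = train.s)
  val.mse[s] <- mean((Auto$mpg - predict(fit.s, Auto))[-train.s]^2)
}
val.mse  # ten different estimates of the same quantity
```
Cross-validation, which we turn to next, reduces this variability by averaging over many such splits.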
## Leave-One-Out Cross-Validation
The LOOCV estimate can be automatically computed for any generalized linear model using the `glm()` and `cv.glm()` functions. In the lab for Chapter 4, we used the `glm()` function to perform logistic regression by passing in the `family = "binomial"` argument.
But if we use `glm()` to fit a model without passing in the `family` argument, then it performs linear regression, just like the `lm()` function.
So for instance,
```{r chunk6}
glm.fit <- glm(mpg ~ horsepower, data = Auto)
coef(glm.fit)
```
and
```{r chunk7}
lm.fit <- lm(mpg ~ horsepower, data = Auto)
coef(lm.fit)
```
yield identical linear regression models. In this lab, we will perform linear regression using
the `glm()` function rather than the `lm()` function because the former can be used together with
`cv.glm()`. The `cv.glm()` function is part of the `boot` library.
```{r chunk8}
library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.err <- cv.glm(Auto, glm.fit)
# By default, K equals the number of observations in the data,
# which gives the usual leave-one-out cross-validation.
######################################
# EXPLAIN THE MAGIC OF cv.glm():
# show slides and explain what
# cross-validation is
######################################
cv.err$delta
```
The `cv.glm()` function produces a list with several components. The two numbers in the `delta` vector contain the cross-validation results. In this case the numbers are identical (up to two decimal places). Below, we discuss a situation in which the two numbers differ. Our cross-validation estimate for the test error is approximately $24.23$ in the original lab.
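To demystify what `cv.glm()` is doing, the sketch below (our own illustrative addition) writes LOOCV out by hand: for each observation we refit the model without it, predict the held-out point, and average the squared errors. It should agree with `cv.err$delta[1]` above, though it is much slower because it refits the model $n$ times.
```{r chunk_loocv_byhand}
# LOOCV by hand: n model fits, each leaving out one observation
n <- nrow(Auto)
loocv.errors <- rep(0, n)
for (i in 1:n) {
  fit.i <- glm(mpg ~ horsepower, data = Auto[-i, ])
  pred.i <- predict(fit.i, newdata = Auto[i, ])
  loocv.errors[i] <- (Auto$mpg[i] - pred.i)^2
}
mean(loocv.errors)
```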
We can repeat this procedure for increasingly complex polynomial fits.
To automate the process, we use the `for()` function to initiate a *for loop* which iteratively fits polynomial regressions for polynomials of order $i=1$ to $i=10$, computes the associated cross-validation error, and stores it in the $i$th element of the vector `cv.error`.
We begin by initializing the vector.
```{r chunk9}
cv.error <- rep(0, 10)
for (i in 1:10) {
  # try polynomials of increasing degree
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- cv.glm(Auto, glm.fit)$delta[1]
}
cv.error
plot(cv.error, ylab = "Cross-validation error",
     xlab = "Degree of polynomial")
# pick the best model (degree 2 or degree 3 polynomial)
glm_fit_degree2 <- glm(mpg ~ poly(horsepower, 2), data = Auto)
# predict on the held-out test set using the chosen model
fit_values_test_degree2 <- predict(glm_fit_degree2, newdata = Auto_test)
# squared error for each test observation
(fit_values_test_degree2 - Auto_test$mpg)^2
# the test MSE: the average of the squared errors
mean((fit_values_test_degree2 - Auto_test$mpg)^2)
```
As in Figure 5.4, we see a sharp drop in the estimated test MSE between the linear and quadratic fits, but then no clear improvement from using higher-order polynomials.
## $k$-Fold Cross-Validation
The `cv.glm()` function can also be used to implement $k$-fold CV. Below we use $k=10$, a common choice for $k$, on the `Auto` data set.
We once again set a random seed and initialize a vector in which we will store the CV errors corresponding to the polynomial fits of orders one to ten.
```{r chunk10}
set.seed(17)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error.10[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error.10
plot(cv.error.10, ylab = "Cross-validation error",
     xlab = "Degree of polynomial")
############################################
# EXERCISE: calculate the test error
############################################
```
EXERCISE: calculate the test error of the chosen model on the held-out `Auto_test`, as we did above for LOOCV.
Notice that the computation time is shorter than that of LOOCV.
(In principle, the computation time for LOOCV for a least squares linear model should be faster than for $k$-fold CV, due to the availability
of the formula (5.2) for LOOCV; however, unfortunately the `cv.glm()` function does not make use of this formula.)
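For the curious, formula (5.2) is easy to apply ourselves: for a least squares fit, LOOCV can be computed from a single model fit using the leverage values $h_i$. The sketch below is our own addition and should agree with the raw `delta` value from `cv.glm()` for the linear model.
```{r chunk_loocv_shortcut}
# LOOCV via formula (5.2): a single fit, using the leverages h_i
lm.full <- lm(mpg ~ horsepower, data = Auto)
h <- hatvalues(lm.full)
mean(((Auto$mpg - fitted(lm.full)) / (1 - h))^2)
```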
We still see little evidence that using cubic or higher-order polynomial terms leads to lower test error than simply using a quadratic fit.
We saw in Section 5.3.2 that the two numbers associated with `delta` are essentially the same when LOOCV is performed.
When we instead perform $k$-fold CV, then the two numbers associated with `delta` differ slightly. The first is the standard $k$-fold CV estimate,
as in (5.3). The second is a bias-corrected version. On this data set, the two estimates are very similar to each other.
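We can inspect both components of `delta` for a $k$-fold fit directly; this short check is our own addition.
```{r chunk_delta}
set.seed(17)
glm.fit2 <- glm(mpg ~ poly(horsepower, 2), data = Auto)
# first number: standard k-fold CV estimate; second: bias-corrected version
cv.glm(Auto, glm.fit2, K = 10)$delta
```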
## Concepts
The same concept of cross-validation can be used with other algorithms (neural networks, support vector machines, random forests, etc.).
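As one illustration, the `caret` package wraps this cross-validation workflow for many learners. The sketch below is a hedged example (not evaluated here): it assumes the `caret` and `randomForest` packages are installed, and the chosen predictors are arbitrary.
```{r chunk_caret, eval = FALSE}
# A sketch: 5-fold CV for a random forest via caret
library(caret)
ctrl <- trainControl(method = "cv", number = 5)
rf.cv <- train(mpg ~ horsepower + weight + year, data = Auto,
               method = "rf", trControl = ctrl)
rf.cv$results  # cross-validated RMSE for each candidate value of mtry
```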
## Exercises
Now try the exercise(s) below.
* Exercise 1
Download data from
https://github.com/neelsoumya/teaching_reproducible_science_R/blob/main/metagene_score.csv
and train a classifier to predict yes/no (`flag_yes_no`). This would be a binary classifier. Do this with cross-validation. Write your own R code to do this task (a starter sketch appears after the exercises). You can work in groups. Do this in class.
The data is also available here
https://github.com/neelsoumya/practical_supervised_machine_learning/blob/main/metagene_score.csv
* Exercise 2
Download data from
https://archive.ics.uci.edu/dataset/2/adult
and train a classifier to predict if income > 50K or < 50K (binary classifier). Do this with cross-validation. Write your own R code to do this task. You can work in groups. Do this in class.
The data is also available in the adult folder here
https://github.com/neelsoumya/practical_supervised_machine_learning/tree/main/adult
Hint: Do not forget to do the basics (like data cleaning, visualization, feature scaling, etc.)
* Challenge exercise 3
How would you select the features that go in the logistic regression model?
Think of a brute force approach.
Can you think of a more sophisticated approach?
For an advanced exercise use the `glmnet` package in R.
See
https://glmnet.stanford.edu/articles/glmnet.html#logistic-regression-family-binomial
Need more of a challenge? See the `additional_code` folder and see the caret programs.
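If you would like a scaffold for Exercise 1, here is a hedged starter sketch (not evaluated). Using all remaining columns as predictors and a $0/1$ coding of `flag_yes_no` are assumptions to verify against the actual CSV; the misclassification cost function follows the example in `?cv.glm`.
```{r chunk_starter, eval = FALSE}
library(boot)
df <- read.csv("metagene_score.csv")
# flag_yes_no is assumed to be coded 0/1; convert it first if it is text
fit <- glm(flag_yes_no ~ ., data = df, family = binomial)
# misclassification rate as the CV cost for a binary classifier
cost <- function(y, prob) mean(abs(y - prob) > 0.5)
cv.glm(df, fit, cost = cost, K = 10)$delta[1]
```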
<!-- ## The Bootstrap -->
<!-- We illustrate the use of the bootstrap in the simple example of Section 5.2, as well as on an example involving estimating the -->
<!-- accuracy of the linear regression model on the `Auto` data set. -->
<!-- ### Estimating the Accuracy of a Statistic of Interest -->
<!-- One of the great advantages of the bootstrap approach is that it can be -->
<!-- applied in almost all situations. No complicated mathematical calculations -->
<!-- are required. Performing a bootstrap analysis in `R` entails only two -->
<!-- steps. First, we must create a function that computes the statistic of -->
<!-- interest. Second, we use the `boot()` function, which is part of the `boot` library, to perform the bootstrap by repeatedly -->
<!-- sampling observations from the data set with replacement. -->
<!-- The `Portfolio` data set in the `ISLR2` package is simulated data of $100$ pairs of returns, generated in the fashion described in Section 5.2. -->
<!-- To illustrate the use of the bootstrap on this data, we must first -->
<!-- create a function, `alpha.fn()`, which takes as input the $(X,Y)$ data -->
<!-- as well as a vector indicating which observations should be used to -->
<!-- estimate $\alpha$. The function then outputs the estimate for $\alpha$ -->
<!-- based on the selected observations. -->
<!-- ```{r chunk11} -->
<!-- alpha.fn <- function(data, index) { -->
<!-- X <- data$X[index] -->
<!-- Y <- data$Y[index] -->
<!-- (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)) -->
<!-- } -->
<!-- ``` -->
<!-- This function *returns*, or outputs, an estimate for $\alpha$ based on applying (5.7) to the observations indexed by the argument `index`. -->
<!-- For instance, the following command tells `R` to estimate $\alpha$ using -->
<!-- all $100$ observations. -->
<!-- ```{r chunk12} -->
<!-- alpha.fn(Portfolio, 1:100) -->
<!-- ``` -->
<!-- The next command uses the `sample()` function to randomly select -->
<!-- $100$ observations from the range $1$ to $100$, with replacement. This is equivalent -->
<!-- to constructing a new bootstrap data set and recomputing $\hat{\alpha}$ -->
<!-- based on the new data set. -->
<!-- ```{r chunk13} -->
<!-- set.seed(7) -->
<!-- alpha.fn(Portfolio, sample(100, 100, replace = T)) -->
<!-- ``` -->
<!-- We can implement a bootstrap analysis by performing this command many times, recording all of -->
<!-- the corresponding estimates for $\alpha$, and computing the resulting -->
<!-- standard deviation. -->
<!-- However, the `boot()` function automates this approach. Below we produce $R=1,000$ bootstrap estimates for $\alpha$. -->
<!-- ```{r chunk14} -->
<!-- boot(Portfolio, alpha.fn, R = 1000) -->
<!-- ``` -->
<!-- The final output shows that using the original data, $\hat{\alpha}=0.5758$, -->
<!-- and that the bootstrap estimate for ${\rm SE}(\hat{\alpha})$ is $0.0897$. -->
<!-- ### Estimating the Accuracy of a Linear Regression Model -->
<!-- The bootstrap approach can be used to assess the -->
<!-- variability of the coefficient estimates and predictions from a statistical learning method. Here we use the bootstrap approach in order to assess the variability of -->
<!-- the estimates for $\beta_0$ and $\beta_1$, the intercept and slope terms for the linear regression model -->
<!-- that uses `horsepower` to predict `mpg` in the `Auto` data set. We will compare the estimates obtained using the bootstrap to those obtained using the formulas -->
<!-- for ${\rm SE}(\hat{\beta}_0)$ and ${\rm SE}(\hat{\beta}_1)$ described -->
<!-- in Section 3.1.2. -->
<!-- We first create a simple function, `boot.fn()`, which takes in the -->
<!-- `Auto` data set as well as a set of indices for the observations, and -->
<!-- returns the intercept and slope estimates for the linear regression model. We then apply this function -->
<!-- to the full set of $392$ observations in order to compute the estimates of $\beta_0$ and $\beta_1$ on the entire data set using the usual linear regression coefficient estimate -->
<!-- formulas from Chapter 3. Note that we do not need the `{` and `}` at the beginning and end of the function because it is only one line long. -->
<!-- ```{r chunk15} -->
<!-- boot.fn <- function(data, index) -->
<!-- coef(lm(mpg ~ horsepower, data = data, subset = index)) -->
<!-- boot.fn(Auto, 1:392) -->
<!-- ``` -->
<!-- The `boot.fn()` function can also be used in order to create -->
<!-- bootstrap estimates for the intercept and slope terms by randomly sampling from among the observations with replacement. Here we give two examples. -->
<!-- ```{r chunk16} -->
<!-- set.seed(1) -->
<!-- boot.fn(Auto, sample(392, 392, replace = T)) -->
<!-- boot.fn(Auto, sample(392, 392, replace = T)) -->
<!-- ``` -->
<!-- Next, we use the `boot()` function to compute the standard errors of 1,000 bootstrap estimates for the intercept and slope terms. -->
<!-- ```{r chunk17} -->
<!-- boot(Auto, boot.fn, 1000) -->
<!-- ``` -->
<!-- This indicates that the bootstrap estimate for ${\rm SE}(\hat{\beta}_0)$ is $0.84$, and that the bootstrap estimate for ${\rm SE}(\hat{\beta}_1)$ is $0.0073$. -->
<!-- As discussed in Section 3.1.2, standard formulas can be used to compute the standard errors for the regression coefficients in a linear model. These can be obtained using the `summary()` function. -->
<!-- ```{r chunk18} -->
<!-- summary(lm(mpg ~ horsepower, data = Auto))$coef -->
<!-- ``` -->
<!-- The standard error estimates for $\hat{\beta}_0$ and -->
<!-- $\hat{\beta}_1$ obtained using the formulas from -->
<!-- Section 3.1.2 are $0.717$ for the intercept and $0.0064$ -->
<!-- for the slope. Interestingly, these are somewhat different from the -->
<!-- estimates obtained using the bootstrap. Does this indicate a problem -->
<!-- with the bootstrap? In fact, it suggests the opposite. Recall that -->
<!-- the standard formulas given in Equation 3.8 on page 66 rely on certain assumptions. For example, they depend -->
<!-- on the unknown parameter $\sigma^2$, the noise variance. We then estimate $\sigma^2$ -->
<!-- using the RSS. Now although the formulas for the standard errors do not rely on the linear model -->
<!-- being correct, the estimate for $\sigma^2$ does. -->
<!-- We see in -->
<!-- Figure 3.8 on page 92 that there is a non-linear relationship in -->
<!-- the data, and so the residuals from a linear fit will be inflated, and so will $\hat{\sigma}^2$. -->
<!-- Secondly, the standard formulas assume (somewhat unrealistically) that the $x_i$ are fixed, and all the variability comes from the variation in the errors $\epsilon_i$. -->
<!-- The bootstrap approach does not rely on any of these assumptions, and so it is -->
<!-- likely giving a more accurate estimate of the standard errors of -->
<!-- $\hat{\beta}_0$ and $\hat{\beta}_1$ than is the `summary()` -->
<!-- function. -->
<!-- Below we compute the bootstrap standard error estimates and the standard -->
<!-- linear regression estimates that result from fitting the quadratic model to the data. Since this model provides a good fit to the data (Figure 3.8), there is now a better correspondence between the bootstrap estimates and the standard estimates of ${\rm SE}(\hat{\beta}_0)$, ${\rm SE}(\hat{\beta}_1)$ and ${\rm SE}(\hat{\beta}_2)$. -->
<!-- ```{r chunk19} -->
<!-- boot.fn <- function(data, index) -->
<!-- coef( -->
<!-- lm(mpg ~ horsepower + I(horsepower^2), -->
<!-- data = data, subset = index) -->
<!-- ) -->
<!-- set.seed(1) -->
<!-- boot(Auto, boot.fn, 1000) -->
<!-- summary( -->
<!-- lm(mpg ~ horsepower + I(horsepower^2), data = Auto) -->
<!-- )$coef -->
<!-- ``` -->