# Model Building
```{r, include = FALSE}
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE, fig.align = "center")
```
> "Statisticians, like artists, have the bad habit of falling in love with their models."
>
> --- **George Box**
Let's take a step back and consider the process of finding a model for data at a higher level. We are attempting to find a model for a response variable $y$ based on a number of predictors $x_1, x_2, x_3, \ldots, x_{p-1}$.
Essentially, we are trying to discover the functional relationship between $y$ and the predictors. In the previous chapter, we fit models for a car's fuel efficiency (`mpg`) as a function of its attributes (`wt`, `year`, `cyl`, `disp`, `hp`, `acc`). We also consider $y$ to be a function of some noise. Rarely, if ever, do we expect there to be an *exact* functional relationship between the predictors and the response.
\[
y = f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon
\]
We can think of this as
\[
\text{response} = \text{signal} + \text{noise}.
\]
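To make this concrete, here is a minimal simulation, not tied to any dataset in this chapter, where the signal is a simple linear function of one predictor and the noise is normally distributed.
```{r}
set.seed(1)
x      = runif(50, 0, 10)
signal = 2 + 0.75 * x                # the functional relationship, f(x)
noise  = rnorm(50, mean = 0, sd = 1) # random error
y      = signal + noise              # response = signal + noise
head(data.frame(x, signal, noise, y))
```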
We *could* consider all sorts of complicated functions for $f$. You will likely encounter several ways of doing this in future machine learning courses. So far in this course we have focused on (multiple) linear regression. That is
\[
\begin{aligned}
y &= f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon \\
&= \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \cdots + \beta_{p-1} x_{p-1} + \epsilon
\end{aligned}
\]
In the big picture of possible models that we could fit to this data, this is a rather restrictive model. What do we mean by a restrictive model?
## Family, Form, and Fit
When modeling data, there are a number of choices that need to be made.
- What **family** of models will be considered?
- What **form** of the model will be used?
- How will the model be **fit**?
Let's work backwards and discuss each of these.
### Fit
Consider one of the simplest models we could fit to data, simple linear regression.
\[
y = f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon = \beta_0 + \beta_1 x_{1} + \epsilon
\]
So here, despite having multiple predictors, we chose to use only one. How is this model **fit**? We will almost exclusively use the method of least squares, but recall that we have seen alternative methods of fitting this model.
\[
\underset{\beta_0, \beta_1}{\mathrm{argmin}} \max|y_i - (\beta_0 + \beta_1 x_i)|
\]
\[
\underset{\beta_0, \beta_1}{\mathrm{argmin}} \sum_{i = 1}^{n}|y_i - (\beta_0 + \beta_1 x_i)|
\]
\[
\underset{\beta_0, \beta_1}{\mathrm{argmin}} \sum_{i = 1}^{n}(y_i - (\beta_0 + \beta_1 x_i))^2
\]
Any of these methods (we will always use the last, least squares) will obtain estimates of the unknown parameters $\beta_0$ and $\beta_1$. Since those are the only unknowns of the specified model, we have then *fit* the model. The fitted model is then
\[
\hat{y} = \hat{f}(x_1, x_2, x_3, \ldots, x_{p-1}) = \hat{\beta}_0 + \hat{\beta}_1 x_{1}
\]
Note that we have now dropped the term for the noise. We don't make any effort to model the noise, only the signal.
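As a quick illustration, here is a small sketch using simulated data (none of these numbers come from the chapter's datasets): `lm()` performs the least squares fit, and directly minimizing the sum of squared errors with `optim()` recovers essentially the same estimates.
```{r}
set.seed(42)
sim_fit_data = data.frame(x1 = runif(40, 0, 10))
sim_fit_data$y = 3 + 2 * sim_fit_data$x1 + rnorm(40, sd = 2)

# least squares via lm()
coef(lm(y ~ x1, data = sim_fit_data))

# least squares "by hand": minimize the sum of squared errors over (beta_0, beta_1)
sse = function(beta) {
  sum((sim_fit_data$y - (beta[1] + beta[2] * sim_fit_data$x1)) ^ 2)
}
optim(c(0, 0), sse)$par
```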
### Form
What are the different **forms** a model could take? Currently, for the linear models we have considered, the only method for altering the form of the model is to control the predictors used. For example, one form of the multiple linear regression model is simple linear regression.
\[
y = f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon = \beta_0 + \beta_1 x_{1} + \epsilon
\]
We could also consider a SLR model with a different predictor, thus altering the form of the model.
\[
y = f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon = \beta_0 + \beta_2 x_{2} + \epsilon
\]
Often, we'll use multiple predictors in our model. Very often, we will at least try a model with all possible predictors.
\[
\begin{aligned}
y &= f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon \\
&= \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \cdots + \beta_{p-1} x_{p-1} + \epsilon
\end{aligned}
\]
We could also use some, but not all, of the predictors.
\[
\begin{aligned}
y &= f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon \\
&= \beta_0 + \beta_1 x_{1} + \beta_3 x_{3} + \beta_5 x_{5} + \epsilon
\end{aligned}
\]
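In `R`, each of these forms corresponds to a different model formula. As a sketch, assuming a data frame `some_data` with a response `y` and predictors `x1` through `x5` (this data frame is hypothetical, so the chunk is not evaluated):
```{r, eval = FALSE}
lm(y ~ x1, data = some_data)            # SLR using x1
lm(y ~ x2, data = some_data)            # SLR using a different predictor
lm(y ~ ., data = some_data)             # all available predictors
lm(y ~ x1 + x3 + x5, data = some_data)  # some, but not all, of the predictors
```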
These forms are **restrictive** in two senses. First, they only allow for linear relationships between the response and the predictors. This seems like an obvious restriction of linear models, but in fact, we will soon see how to use linear models for *non-linear* relationships. (It will involve transforming variables.) Second, how one variable affects the response is the same for **any** values of the other predictors. Soon we will see how to create models where the effect of $x_{1}$ can be different for different values of $x_{2}$. We will discuss the concept of *interaction*.
### Family
A **family** of models is a broader grouping of many possible *forms* of a model. For example, above we saw several forms of models from the family of linear models. We will only ever concern ourselves with linear models, which model a response as a linear combination of predictors. There are certainly other families of models.
For example, there are several families of *non-parametric* regression; smoothing is one broad family of models, as are regression trees.
In linear regression, we specified models with parameters, the $\beta_j$, and fit the model by finding the best values of these parameters. This is a *parametric* approach. A non-parametric approach skips the step of specifying a model with parameters and is often described more as an algorithm. Non-parametric models are often used in machine learning.
```{r, echo=FALSE, fig.height=6, fig.width=12}
set.seed(42)
n = 30
beta_0 = 1
beta_1 = 0.5
sigma = 3
x = runif(n, 0, 10)
xplot = seq(0, 10, by = 0.1)
sim_slr = function(n, x, beta_0 = 10, beta_1 = 5, sigma = 1) {
  epsilon = rnorm(n, mean = 0, sd = sigma)
  y = beta_0 + beta_1 * x + epsilon
  data.frame(predictor = x, response = y)
}
sim_data = sim_slr(n = n, x = x, beta_0 = beta_0, beta_1 = beta_1, sigma = sigma)
sim_fit_1 = lm(response ~ predictor, data = sim_data)
sim_fit_2 = loess(response ~ predictor, data = sim_data)
par(mfrow = c(1, 2))
plot(response ~ predictor, data = sim_data,
     xlab = "Predictor", ylab = "Response",
     main = "Linear Regression",
     pch = 20, cex = 3, col = "darkgrey",
     xlim = c(0, 10), ylim = c(-6, 11))
abline(sim_fit_1, lwd = 3, col = "dodgerblue")
plot(response ~ predictor, data = sim_data,
     xlab = "Predictor", ylab = "Response",
     main = "Smoothing",
     pch = 20, cex = 3, col = "darkgrey",
     xlim = c(0, 10), ylim = c(-6, 11))
lines(xplot, predict(sim_fit_2, newdata = data.frame(predictor = xplot)),
      lwd = 3, lty = 2, col = "darkorange")
```
Here, SLR (parametric) is used on the left, while smoothing (non-parametric) is used on the right. SLR finds the best slope and intercept. Smoothing produces the fitted $y$ value at a particular $x$ value by considering the $y$ values of the data in a neighborhood of the $x$ value considered. (Local smoothing.)
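To give a rough idea of how such a smoother is used in `R`, here is a small self-contained sketch with `loess()`. (This is not the code behind the figure above; the data are simulated purely for illustration.)
```{r}
set.seed(7)
toy = data.frame(x = runif(40, 0, 10))
toy$y = sin(toy$x) + rnorm(40, sd = 0.3)

# fit a local smoother; note that no beta parameters are specified
toy_smooth = loess(y ~ x, data = toy)

# the fitted value at x = 5 is driven by the observed y values near x = 5
predict(toy_smooth, newdata = data.frame(x = 5))
```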
Why the focus on **linear models**? Two big reasons:
- Linear models are **the** go-to model. Linear models have been around for a long time, and are computationally easy. A linear model may not be the final model you use, but often, it should be the first model you try.
- The ideas behind linear models can be easily transferred to other modeling techniques.
### Assumed Model, Fitted Model
When searching for a model, we often need to make assumptions. These assumptions are codified in the **family** and **form** of the model. For example
\[
y = \beta_0 + \beta_1 x_{1} + \beta_3 x_{3} + \beta_5 x_{5} + \epsilon
\]
assumes that $y$ is a linear combination of $x_{1}$, $x_{3}$, and $x_{5}$ as well as some noise. This assumes that the effect of $x_{1}$ on $y$ is $\beta_1$, which is the same for all values of $x_{3}$ and $x_{5}$. That is, we are using the *family* of linear models with a particular *form*.
Suppose we then *fit* this model to some data and obtain the **fitted model**. For example, in `R` we would use
```{r, eval = FALSE}
fit = lm(y ~ x1 + x3 + x5, data = some_data)
```
This is `R`'s way of saying the *family* is *linear* and specifying the *form* from above: an additive model with the specified predictors, as well as an intercept.
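The estimated coefficients can then be read from the fitted object; as a sketch (both `fit` and `some_data` here are hypothetical, so this chunk is not evaluated):
```{r, eval = FALSE}
coef(fit)  # the estimated values of beta_0, beta_1, beta_3, and beta_5
```
Suppose that, for one particular dataset, we then obtain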
\[
\hat{y} = 1.5 + 0.9 x_{1} + 1.1 x_{3} + 2.3 x_{5}.
\]
This is our best guess for the function $f$ in
\[
y = f(x_1, x_2, x_3, \ldots, x_{p-1}) + \epsilon
\]
for the assumed **family** and **form**. Fitting a model only gives us the best fit for the family and form that we specify. So the natural question is: how do we choose the correct family and form? We'll focus on *form*, since we have already restricted ourselves to the *family* of linear models.
## Explanation versus Prediction
What is the purpose of fitting a model to data? Usually it is to accomplish one of two goals. We can use a model to **explain** the relationship between the response and the predictors. Models can also be used to **predict** the response based on the predictors. Often, a good model will do both, but we'll discuss both goals separately, since the processes of finding models for explaining and for predicting have some differences.
For our purposes, since we are only considering linear models, searching for a good model is essentially searching for a good **form** of a model.
### Explanation
If the goal of a model is to explain the relationship between the response and the predictors, we are looking for a model that is **small** and **interpretable**, but still fits the data well. When discussing linear models, the **size** of a model is essentially the number of $\beta$ parameters used.
Suppose we would like to find a model that explains fuel efficiency (`mpg`) based on a car's attributes (`wt`, `year`, `cyl`, `disp`, `hp`, `acc`). Perhaps we are a car manufacturer trying to engineer a fuel efficient vehicle. If this is the case, we are interested both in which predictor variables are useful for explaining the car's fuel efficiency and in how those variables affect fuel efficiency. By understanding this relationship, we can use this knowledge to our advantage when designing a car.
To explain a relationship, we are interested in keeping models as small as possible, since smaller models are easier to interpret. The fewer the predictors, the fewer considerations we need to make in our design process.
Note that *linear* models of any size are rather interpretable to begin with. Later in your data analysis careers, you will see more complicated models that may fit data better, but are much harder, if not impossible, to interpret. These models aren't nearly as useful for explaining a relationship. This is another reason to always attempt a linear model. If it fits as well as more complicated methods, it will be the easiest to understand.
To find small and interpretable models, we will eventually use selection procedures, which search among many possible forms of a model. For now we will do this in a more ad hoc manner, using **inference** techniques we have already encountered. To use inference as we have seen it, we need one more assumption beyond the family and form of the model.
\[
y = \beta_0 + \beta_1 x_{1} + \beta_3 x_{3} + \beta_5 x_{5} + \epsilon
\]
Our additional assumption is about the error term.
\[
\epsilon \sim N(0, \sigma^2)
\]
This assumption, that the errors are normally distributed with some common variance, is the key to all of the inference we have done so far. We will discuss this in great detail later.
So, with our inference tools (ANOVA and the $t$-test), we have two potential strategies: start with a very small model (no predictors) and attempt to add predictors, or start with a big model (all predictors) and attempt to remove predictors.
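A sketch of what these strategies look like in `R`, again using the hypothetical `some_data` from above (so the chunk is not evaluated): individual predictors are assessed with $t$-tests from `summary()`, and nested models are compared with an $F$-test via `anova()`.
```{r, eval = FALSE}
# starting small: does adding x3 to a model that already contains x1 help?
fit_small  = lm(y ~ x1, data = some_data)
fit_bigger = lm(y ~ x1 + x3, data = some_data)
anova(fit_small, fit_bigger)  # F-test comparing the nested models

# starting big: which predictors look unnecessary in the full model?
fit_full = lm(y ~ ., data = some_data)
summary(fit_full)             # t-test for each individual beta
```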
#### Correlation and Causation
```{r, echo = FALSE}
autompg = read.table(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
  quote = "\"",
  comment.char = "",
  stringsAsFactors = FALSE)
colnames(autompg) = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name")
autompg = subset(autompg, autompg$hp != "?")
autompg = subset(autompg, autompg$name != "plymouth reliant")
rownames(autompg) = paste(autompg$cyl, "cylinder", autompg$year, autompg$name)
autompg = subset(autompg, select = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year"))
autompg$hp = as.numeric(autompg$hp)
```
A word of caution when using a model to *explain* a relationship. There are two terms often used to describe a relationship between two variables: *causation* and *correlation*. [Correlation](https://xkcd.com/552/){target="_blank"} is often also referred to as association.
Just because two variables are correlated does not necessarily mean that one causes the other. For example, consider modeling `mpg` as only a function of `hp`.
```{r}
plot(mpg ~ hp, data = autompg, col = "dodgerblue", pch = 20, cex = 1.5)
```
Does an increase in horsepower cause a drop in fuel efficiency? Or, perhaps the causality is reversed and an increase in fuel efficiency causes a decrease in horsepower. Or, perhaps there is a third variable that explains both!
The issue here is that we have **observational** data. With observational data, we can only detect *associations*. To speak with confidence about *causality*, we would need to run **experiments**. Often, this decision is made for us, before we ever see data, so we can only modify our interpretation.
This is a concept that you should encounter often in your statistics education. For some further reading, and some related fallacies, see: [Wikipedia: Correlation does not imply causation](https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation){target="_blank"}.
We'll discuss this further when we discuss experimental design and traditional ANOVA techniques. (All of which has recently been re-branded as A/B testing.)
### Prediction
If the goal of a model is to predict the response, then the **only** consideration is how well the model fits the data. For this, we will need a metric. In regression problems, this is most often RMSE.
\[
\text{RMSE}(\text{model, data}) = \sqrt{\frac{1}{n} \sum_{i = 1}^{n}(y_i - \hat{y}_i)^2}
\]
where
- $y_i$ are the actual values of the response for the given data
- $\hat{y}_i$ are the predicted values using the fitted model and the predictors from the data
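This translates directly into a small `R` helper function (essentially the same helper is used behind the scenes for the RMSE tables later in this chapter):
```{r}
# RMSE: the square root of the average squared difference
# between the observed and predicted responses
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}
```
For a fitted model `fit` and a data frame `some_data` (hypothetical names), we would then call something like `rmse(some_data$y, predict(fit, some_data))`.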
Correlation and causation are *not* an issue here. If a predictor is correlated with the response, it is useful for prediction. For example, among elementary school aged children, shoe size certainly doesn't *cause* a child to read at a higher level; however, we could very easily use shoe size to make a prediction about a child's reading ability. The larger their shoe size, the better they read. There's a lurking variable here though: their age! (Don't send your kids to school with size 14 shoes, it won't make them read better!)
Also, since we are not performing inference, the extra assumption about the errors is not needed. The only thing we care about is how close the fitted model is to the data. Least squares is least squares. For a specified model, it will find the values of the parameters which will minimize the squared error loss. Your results might be largely uninterpretable and useless for inference, but for prediction none of that matters.
Suppose instead of the manufacturer who would like to build a car, we are a consumer who wishes to purchase a new car. However this particular car is so new, it has not been rigorously tested, so we are unsure of what fuel efficiency to expect. (And, as skeptics, we don't trust what the manufacturer is telling us.) In this case, we would like to use the model to help *predict* the fuel efficiency of this car based on its attributes, which are the predictors of the model. The smaller the errors the model makes, the more confident we are in its prediction.
#### Test-Train Split
The trouble with using RMSE to identify how well a model fits data is that RMSE is **always** lower, or at least no higher, for a larger model. This would suggest that we should always use the largest model possible when looking for a model that predicts well. The problem with this is the potential to **overfit** to the data. So, we want a model that fits well, but does not overfit. To understand overfitting, we need to think about applying a model to seen and unseen data.
Suppose we fit a model using all data available and we evaluate RMSE on this fitted model and all of the seen data. We will call this data the **training** data, and this RMSE the **train** RMSE.
Now, suppose we magically encounter some additional data. To truly assess how well the model predicts, we should evaluate how well it predicts the response for this new data. We will call this data the **test** data and this RMSE the **test** RMSE.
- Train RMSE: model fit on seen data, evaluated on **seen** data
- Test RMSE: model fit on seen data, evaluated on **unseen** data
Below, we simulate some data and fit two models. We will call the solid blue line the "simple" model. The dashed orange line will be called the "complex" model, which was fit with methods we do not yet know.
```{r, echo=FALSE, fig.height=6, fig.width=12}
set.seed(42)
n = 30
beta_0 = 1
beta_1 = 0.5
sigma = 1.5
x = runif(n, 0, 10)
xplot = seq(-1, 11, by = 0.1)
sim_slr = function(n, x, beta_0 = 10, beta_1 = 5, sigma = 1) {
  epsilon = rnorm(n, mean = 0, sd = sigma)
  y = beta_0 + beta_1 * x + epsilon
  data.frame(predictor = x, response = y)
}
sim_data = sim_slr(n = n, x = x, beta_0 = beta_0, beta_1 = beta_1, sigma = sigma)
sim_fit_1 = lm(response ~ predictor, data = sim_data)
sim_fit_2 = lm(response ~ poly(predictor, 8), data = sim_data)
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}
# rmse(sim_data$response, predict(sim_fit_1))
# rmse(sim_data$response, predict(sim_fit_2))
set.seed(420)
n = 30
x = runif(n, 0, 10)
sim_data_new = sim_slr(n = n, x = x, beta_0 = beta_0, beta_1 = beta_1, sigma = sigma)
# rmse(sim_data_new$response, predict(sim_fit_1, sim_data_new))
# rmse(sim_data_new$response, predict(sim_fit_2, sim_data_new))
par(mfrow = c(1, 2))
plot(response ~ predictor, data = sim_data,
     xlab = "Predictor", ylab = "Response",
     main = "Simulated Train Data",
     pch = 20, cex = 3, col = "darkgrey",
     xlim = c(-1, 11), ylim = c(-4, 9))
abline(sim_fit_1, lwd = 3, col = "dodgerblue")
lines(xplot, predict(sim_fit_2, newdata = data.frame(predictor = xplot)),
      lwd = 3, lty = 2, col = "darkorange")
plot(response ~ predictor, data = sim_data_new,
     xlab = "Predictor", ylab = "Response",
     main = "Simulated Test Data",
     pch = 20, cex = 3, col = "darkgrey",
     xlim = c(-1, 11), ylim = c(-4, 9))
abline(sim_fit_1, lwd = 3, col = "dodgerblue")
lines(xplot, predict(sim_fit_2, newdata = data.frame(predictor = xplot)),
      lwd = 3, lty = 2, col = "darkorange")
```
The left panel shows the data that was used to fit the two models. Clearly the "complex" model fits the data much better. The right panel shows additional data that was simulated in the same manner as the original data. Here we see that the "simple" model fits much better. The dashed orange line almost seems random.
| Model | **Train** RMSE | **Test** RMSE |
|---------|-----------------------------------------------------------|-----------------------------------------------------------------------------|
| Simple | `r round(rmse(sim_data$response, predict(sim_fit_1)), 2)` | `r round(rmse(sim_data_new$response, predict(sim_fit_1, sim_data_new)), 2)` |
| Complex | `r round(rmse(sim_data$response, predict(sim_fit_2)), 2)` | `r round(rmse(sim_data_new$response, predict(sim_fit_2, sim_data_new)), 2)` |
The more "complex", wiggly, model fits the training data much better, as it has a much lower train RMSE. However, we see that the "simple" model fits the test data much better, with a much lower test RMSE. This means that the complex model has *overfit* the data, and we prefer the simple model. When choosing a model for prediction, we prefer a model that predicts unseen data well.
In practice, you can't simply generate more data to evaluate your models. Instead, we split the existing data into data used to fit the model (train) and data used to evaluate the model (test). Never fit a model with test data.
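A sketch of such a split, assuming the hypothetical `some_data` and model form from before (so the chunk is not evaluated): randomly choose roughly half of the rows for training, fit only on those rows, then compute RMSE separately on the train and test rows.
```{r, eval = FALSE}
set.seed(42)
train_idx  = sample(nrow(some_data), size = floor(0.5 * nrow(some_data)))
train_data = some_data[train_idx, ]
test_data  = some_data[-train_idx, ]

fit = lm(y ~ x1 + x3 + x5, data = train_data)  # fit using only the training data
rmse(train_data$y, predict(fit, train_data))   # train RMSE
rmse(test_data$y,  predict(fit, test_data))    # test RMSE, used to judge prediction
```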
## Summary
Models can be used to **explain** relationships and **predict** observations.
When using a model to
- **explain**, we prefer *small* and *interpretable* models.
- **predict**, we prefer models that make the smallest errors possible, without *overfitting*.
Linear models can accomplish both of these goals. Later, we will see that a linear model that accomplishes one of these goals usually accomplishes the other.
## `R` Markdown
The `R` Markdown file for this chapter can be found here:
- [`model-building.Rmd`](model-building.Rmd){target="_blank"}
The file was created using `R` version ``r paste0(version$major, ".", version$minor)``.