08-feature-engineering.Rmd (1 addition & 1 deletion)
@@ -317,7 +317,7 @@ _Order matters_. The gross living area is log transformed prior to the interact
When a predictor has a nonlinear relationship with the outcome, some types of predictive models can adaptively approximate this relationship during training. However, simpler is usually better and it is not uncommon to try to use a simple model, such as a linear fit, and add in specific non-linear features for predictors that may need them. One common method for doing this is to use _spline_ functions to represent the data. Splines replace the existing numeric predictor with a set of columns that allow a model to emulate a flexible, non-linear relationship. As more spline terms are added to the data, the capacity to non-linearly represent the relationship increases. Unfortunately, it may also increase the likelihood of picking up on data trends that occur by chance (i.e., over-fitting).
-If you have ever used `geom_smooth()` within a `ggplot`, you have probably used a spline representation of the data. For example, each panel in Figure \@ref(ames-latitude-splines) uses a different number of smooth splines for the latitude predictor:
+If you have ever used `geom_smooth()` within a `ggplot`, you have probably used a spline representation of the data. For example, each panel in Figure \@ref(fig:ames-latitude-splines) uses a different number of smooth splines for the latitude predictor:
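For readers following along, the sketch below is one way to draw such a spline smooth with `geom_smooth()`. It assumes an `ames_train` data frame with a `Latitude` column and a log-scaled `Sale_Price` outcome, and the degrees-of-freedom value is arbitrary; it is an illustration, not the book's exact plotting code.

```r
library(ggplot2)
library(splines)

# Natural-spline smooth for latitude; larger df values give a more flexible
# curve and a higher risk of over-fitting chance patterns in the data.
ggplot(ames_train, aes(x = Latitude, y = Sale_Price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = lm, formula = y ~ ns(x, df = 5), se = FALSE) +
  labs(y = "Sale Price (log-10 USD)")
```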
10-resampling.Rmd (73 additions & 20 deletions)
@@ -112,12 +112,16 @@ In this context, _bias_ is the difference between the true data pattern and the
For a low-bias model, the high degree of predictive capacity can sometimes result in the model nearly memorizing the training set data. As an obvious example, consider a 1-nearest neighbor model. It will always provide perfect predictions for the training set no matter how well it truly works for other data sets. Random forest models are similar; re-predicting the training set will always result in an artificially optimistic estimate of performance.
-For both models, this table summarizes the RMSE estimate for the training and test sets:
+For both models, Table \@ref(tab:rmse-results) summarizes the RMSE estimate for the training and test sets:
caption = "Performance statistics for training and test sets.",
+label = "rmse-results",
+escape = FALSE
+) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c(" ", "RMSE Estimates" = 2))
```
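To make the optimism of re-predicting the training set concrete, here is a hedged sketch of how the two RMSE values could be computed with tidymodels. The object names (`ames_train`, `ames_test`), the small predictor set, and the `ranger` engine are assumptions for illustration, not the book's exact code.

```r
library(tidymodels)

# A flexible, low-bias model fit to the training set
rf_fit <-
  rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression") %>%
  fit(Sale_Price ~ Gr_Liv_Area + Year_Built + Latitude + Longitude,
      data = ames_train)

# Re-predicting the training set yields an artificially optimistic RMSE;
# the test set gives a more realistic estimate.
bind_rows(
  augment(rf_fit, new_data = ames_train) %>%
    rmse(Sale_Price, .pred) %>%
    mutate(data = "training"),
  augment(rf_fit, new_data = ames_test) %>%
    rmse(Sale_Price, .pred) %>%
    mutate(data = "testing")
)
```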
@@ -133,9 +137,14 @@ If the test set should not be used immediately, and re-predicting the training s
## Resampling methods
-Resampling methods are empirical simulation systems that emulate the process of using some data for modeling and different data for evaluation. Most resampling methods are iterative, meaning that this process is repeated multiple times. This diagram illustrates how resampling methods generally operate:
+Resampling methods are empirical simulation systems that emulate the process of using some data for modeling and different data for evaluation. Most resampling methods are iterative, meaning that this process is repeated multiple times. The diagram in Figure \@ref(fig:resampling-scheme) illustrates how resampling methods generally operate.
#| fig.cap = "Data splitting scheme from the initial data split to resampling.",
+#| fig.alt = "A diagram of the data splitting scheme from the initial data split to resampling. The first level is the training/testing set partition. The second level of splitting takes the training set and splits it into multiple 'analysis' and 'assessment' sets (which are analogous to training and test)."
knitr::include_graphics("premade/resampling.svg")
```
@@ -153,17 +162,27 @@ The next section defines several commonly used methods and discusses their pros
### Cross-validation {#cv}
-Cross-validation is a well established resampling method. While there are a number of variations, the most common cross-validation method is _V_-fold cross-validation. The data are randomly partitioned into _V_ sets of roughly equal size (called the "folds"). For illustration, _V_ = 3 is shown below for a data set of thirty training set points with random fold allocations. The number inside the symbols is the sample number:
+Cross-validation is a well established resampling method. While there are a number of variations, the most common cross-validation method is _V_-fold cross-validation. The data are randomly partitioned into _V_ sets of roughly equal size (called the "folds"). For illustration, _V_ = 3 is shown in Figure \@ref(fig:cross-validation-allocation) for a data set of thirty training set points with random fold allocations. The number inside the symbols is the sample number.
#| fig.cap = "V-fold cross-validation randomly assigns data to folds. ",
+#| fig.alt = "A diagram of how V-fold cross-validation randomly assigns data to folds (where V equals three). A set of thirty data points are assigned to three groups of roughly the same size."
knitr::include_graphics("premade/three-CV.svg")
```
The color of the symbols represents their randomly assigned folds. Stratified sampling is also an option for assigning folds (previously discussed in Section \@ref(splitting-methods)).
-For 3-fold cross-validation, the three iterations of resampling are illustrated below. For each iteration, one fold is held out for assessment statistics and the remaining folds are substrate for the model. This process continues for each fold so that three models produce three sets of performance statistics.
+For 3-fold cross-validation, the three iterations of resampling are illustrated in Figure \@ref(fig:cross-validation). For each iteration, one fold is held out for assessment statistics and the remaining folds are substrate for the model. This process continues for each fold so that three models produce three sets of performance statistics.
#| fig.cap = "V-fold cross-validation data usage.",
+#| fig.alt = "A diagram of V-fold cross-validation data usage (where V equals three). For each of the three groups, the data for the fold are held out for performance while the other two are used for modeling."
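For reference, resamples like these are created with `rsample::vfold_cv()`. The sketch below assumes the `ames_train` data frame from earlier chapters and an arbitrary random seed; the book itself uses 10 folds for the Ames data.

```r
library(rsample)

set.seed(1001)
ames_folds <- vfold_cv(ames_train, v = 10)
ames_folds

# analysis() and assessment() return the data used to fit and to evaluate
# the model for a single resample:
first_fold <- ames_folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))
```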
@@ -203,9 +222,14 @@ There are a variety of variations on cross-validation. The most important is _re
To create _R_ repeats of _V_-fold cross-validation, the same fold generation process is done _R_ times to generate _R_ collections of _V_ partitions. Now, instead of averaging _V_ statistics, $V \times R$ statistics produce the final resampling estimate. Due to the Central Limit Theorem, the summary statistics from each model tend toward a normal distribution.
-Consider the Ames data. On average, 10-fold cross-validation uses assessment sets that contain roughly `r floor(nrow(ames_train) * .1)` properties. If RMSE is the statistic of choice, we can denote that estimate's standard deviation as $\sigma$. With simple 10-fold cross-validation, the standard error of the mean RMSE is $\sigma/\sqrt{10}$. If this is too noisy, repeats reduce the standard error to $\sigma/\sqrt{10R}$. For 10-fold cross-validation with $R$ replicates, the plot below shows how quickly the standard error^[These are _approximate_ standard errors. As will be discussed in the next chapter, there is a within-replicate correlation that is typical of resampled results. By ignoring this extra component of variation, the simple calculations shown in this plot are overestimates of the reduction in noise in the standard errors.] decreases with replicates:
+Consider the Ames data. On average, 10-fold cross-validation uses assessment sets that contain roughly `r floor(nrow(ames_train) * .1)` properties. If RMSE is the statistic of choice, we can denote that estimate's standard deviation as $\sigma$. With simple 10-fold cross-validation, the standard error of the mean RMSE is $\sigma/\sqrt{10}$. If this is too noisy, repeats reduce the standard error to $\sigma/\sqrt{10R}$. For 10-fold cross-validation with $R$ replicates, the plot in Figure \@ref(fig:variance-reduction) shows how quickly the standard error^[These are _approximate_ standard errors. As will be discussed in the next chapter, there is a within-replicate correlation that is typical of resampled results. By ignoring this extra component of variation, the simple calculations shown in this plot are overestimates of the reduction in noise in the standard errors.] decreases with replicates.
+
+```{r variance-reduction}
+#| echo = FALSE,
+#| fig.height = 4,
+#| fig.cap = "Relationship between the relative variance in performance estimates versus the number of cross-validation repeats.",
+#| fig.alt = "The relationship between the relative variance in performance estimates versus the number of cross-validation repeats. As the repeats increase, the variance is reduced in a harmonically decreasing pattern with diminishing returns for a large number of replicates."
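As a hedged illustration of this point, repeated cross-validation is requested through the `repeats` argument of `vfold_cv()`, and the $\sigma/\sqrt{10R}$ reduction can be tabulated directly. The `ames_train` object and the seed are assumptions carried over from earlier code.

```r
library(rsample)

# Five repeats of 10-fold cross-validation: 50 analysis/assessment pairs
set.seed(1001)
vfold_cv(ames_train, v = 10, repeats = 5)

# Approximate multiplier on the standard error for R repeats of 10-fold CV
R <- 1:10
tibble::tibble(repeats = R, se_multiplier = 1 / sqrt(10 * R))
```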
-Previously mentioned in Section \@ref(what-about-a-validation-set), a validation set is a single partition that is set aside to estimate performance, before using the test set:
+Previously mentioned in Section \@ref(what-about-a-validation-set), a validation set is a single partition that is set aside to estimate performance, before using the test set (see Figure \@ref(fig:three-way-split)).
#| fig.cap = "A three-way initial split into training, testing, and validation sets.",
+#| fig.alt = "A three-way initial split into training, testing, and validation sets."
knitr::include_graphics("premade/validation.svg")
```
Validation sets are often used when the original pool of data is very large. In this case, a single large partition may be adequate to characterize model performance without having to do multiple iterations of resampling.
-With `r pkg(rsample)`, a validation set is like any other resampling object; this type is different only in that it has a single iteration^[In essence, a validation set can be considered a single iteration of Monte Carlo cross-validation.]:
+With `r pkg(rsample)`, a validation set is like any other resampling object; this type is different only in that it has a single iteration^[In essence, a validation set can be considered a single iteration of Monte Carlo cross-validation.]. Figure \@ref(fig:validation-split) shows this scheme.
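A minimal sketch of creating such an object is shown below. It assumes `ames_train` and uses `validation_split()`, whose interface has changed across `rsample` versions (newer releases also provide `initial_validation_split()`), so treat the exact call as illustrative rather than definitive.

```r
library(rsample)

set.seed(1002)
# A single analysis/assessment partition of the training set; conceptually
# the same as one iteration of Monte Carlo cross-validation.
val_set <- validation_split(ames_train, prop = 3/4)
val_set
```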
@@ -276,11 +310,17 @@ Bootstrap resampling was originally invented as a method for approximating the s
A bootstrap sample of the training set is a sample that is the same size as the training set but is drawn _with replacement_. This means that some training set data points are selected multiple times for the analysis set. Each data point has a `r round((1-exp(-1)) * 100, 1)`% chance of inclusion in the training set _at least once_. The assessment set contains all of the training set samples that were not selected for the analysis set (on average, with `r round((exp(-1)) * 100, 1)`% of the training set). When bootstrapping, the assessment set is often called the "out-of-bag" sample.
-For a training set of 30 samples, a schematic of three bootstrap samples is:
+For a training set of 30 samples, a schematic of three bootstrap samples is shown in Figure \@ref(fig:bootstrapping).
#| fig.alt = "A diagram of bootstrapping data usage. For each bootstrap resample, the analysis set is the same size as the training set (due to sampling with replacement) and the assessment set consists of samples not in the analysis set."
knitr::include_graphics("premade/bootstraps.svg")
```
+
Note that the sizes of the assessment sets vary.
Using `r pkg(rsample)`:
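The chunk that follows this line is not included in the diff. For context, a sketch of a bootstrap resampling call, assuming `ames_train`, an arbitrary seed, and five resamples, looks like this:

```r
library(rsample)

set.seed(1003)
boot_samples <- bootstraps(ames_train, times = 5)
boot_samples

# Each analysis set has as many rows as the training set (sampling with
# replacement); the out-of-bag assessment set sizes vary across resamples.
nrow(analysis(boot_samples$splits[[1]]))
nrow(assessment(boot_samples$splits[[1]]))
```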
@@ -299,9 +339,14 @@ When the data have a strong time component, a resampling method should support m
Rolling forecast origin resampling [@hyndman2018forecasting] provides a method that emulates how time series data is often partitioned in practice, estimating the model with historical data and evaluating it with the most recent data. For this type of resampling, the size of the initial analysis and assessment sets are specified. The first iteration of resampling uses these sizes, starting from the beginning of the series. The second iteration uses the same data sizes but shifts over by a set number of samples.
-To illustrate, a training set of fifteen samples was resampled with an analysis size of eight samples and an assessment set size of three. The second iteration discards the first training set sample and both data sets shift forward by one. This configuration results in five resamples:
+To illustrate, a training set of fifteen samples was resampled with an analysis size of eight samples and an assessment set size of three. The second iteration discards the first training set sample and both data sets shift forward by one. This configuration results in five resamples, as shown in Figure \@ref(fig:rolling).
#| fig.cap = "Data usage for rolling forecasting origin resampling.",
+#| fig.alt = "The data usage for rolling forecasting origin resampling. For each split, earlier data are used for modeling and a few subsequent instances are used to measure performance."
knitr::include_graphics("premade/rolling.svg")
```
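A hedged sketch of that configuration with `rsample::rolling_origin()` follows; the 15-row data frame is hypothetical and only serves to reproduce the sizes described above.

```r
library(rsample)

# A hypothetical 15-sample series: 8 analysis rows, 3 assessment rows,
# non-cumulative so the window slides forward one sample per resample.
series <- data.frame(day = 1:15)
time_slices <- rolling_origin(series, initial = 8, assess = 3, cumulative = FALSE)
nrow(time_slices)  # five resamples, matching the figure

# Rows used by the first analysis/assessment pair
range(analysis(time_slices$splits[[1]])$day)
range(assessment(time_slices$splits[[1]])$day)
```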
@@ -415,9 +460,9 @@ The prediction column names follow the conventions discussed for `r pkg(parsnip)
For some resampling methods, such as the bootstrap or repeated cross-validation, there will be multiple predictions per row of the original training set. To obtain summarized values (averages of the replicate predictions) use `collect_predictions(object, summarize = TRUE)`.
:::
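As a quick reference, a sketch of retrieving those predictions is shown here. It assumes a resampling result such as `rf_res` that was produced with `control_resamples(save_pred = TRUE)`; the object name is carried over from earlier examples, not defined here.

```r
library(tune)

# Per-row held-out predictions from the resampled model
assess_res <- collect_predictions(rf_res)
assess_res

# For bootstraps or repeated CV, average the replicate predictions per row
collect_predictions(rf_res, summarize = TRUE)
```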
-Since this analysis used 10-fold cross-validation, there is one unique prediction for each training set sample. These data can generate helpful plots of the model to understand where it potentially failed. For example, let's compare the observed and predicted values:
+Since this analysis used 10-fold cross-validation, there is one unique prediction for each training set sample. These data can generate helpful plots of the model to understand where it potentially failed. For example, Figure \@ref(fig:ames-resampled-performance) compares the observed and held-out predicted values (analogous to Figure \@ref(fig:ames-performance-plot)):
#| fig.cap = "Out-of-sample observed versus predicted values for an Ames regression model, using log-10 units on both axes.",
+#| fig.alt = "Scatter plots of out-of-sample observed versus predicted values for an Ames regression model. Both axes use log-10 units. The model shows good concordance with two outlying data points that are significantly over-predicted."
+```
+
There was one house in the training set with a low observed sale price that is significantly overpredicted by the model. Which house was that?
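One hedged way to track that house down from the collected predictions, again assuming an `rf_res` object with saved predictions, is to sort by residual; the `.row` column in the result maps back to the corresponding training set row.

```r
library(dplyr)
library(tune)

collect_predictions(rf_res) %>%
  mutate(residual = Sale_Price - .pred) %>%  # negative residual = over-predicted
  arrange(residual) %>%
  slice(1)                                   # the most over-predicted house
```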
0 commit comments