Commit 23e7549

Merge pull request #225 from tidymodels/fig-tab-ch-10-15
Figure/table updates for Ch 10 to 15
2 parents 1a3855e + 87af1ce

7 files changed: +457 -136 lines changed

08-feature-engineering.Rmd

Lines changed: 1 addition & 1 deletion
@@ -317,7 +317,7 @@ _Order matters_. The gross living area is log transformed prior to the interact
 When a predictor has a nonlinear relationship with the outcome, some types of predictive models can adaptively approximate this relationship during training. However, simpler is usually better and it is not uncommon to try to use a simple model, such as a linear fit, and add in specific non-linear features for predictors that may need them. One common method for doing this is to use _spline_ functions to represent the data. Splines replace the existing numeric predictor with a set of columns that allow a model to emulate a flexible, non-linear relationship. As more spline terms are added to the data, the capacity to non-linearly represent the relationship increases. Unfortunately, it may also increase the likelihood of picking up on data trends that occur by chance (i.e., over-fitting).

-If you have ever used `geom_smooth()` within a `ggplot`, you have probably used a spline representation of the data. For example, each panel in Figure \@ref(ames-latitude-splines) uses a different number of smooth splines for the latitude predictor:
+If you have ever used `geom_smooth()` within a `ggplot`, you have probably used a spline representation of the data. For example, each panel in Figure \@ref(fig:ames-latitude-splines) uses a different number of smooth splines for the latitude predictor:

 ```{r engineering-ames-splines, eval=FALSE}
 library(patchwork)
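
Outside the commit itself, a minimal sketch of the spline idea above, assuming the `ames` data from the `modeldata` package and `ns()` from base R's `splines`:

```r
library(ggplot2)
library(splines)
data(ames, package = "modeldata")

# ns() builds a natural spline basis; df controls the flexibility of the fit
ggplot(ames, aes(x = Latitude, y = Sale_Price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = lm, formula = y ~ ns(x, df = 5), se = FALSE)
```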

10-resampling.Rmd

Lines changed: 73 additions & 20 deletions
@@ -112,12 +112,16 @@ In this context, _bias_ is the difference between the true data pattern and the
 For a low-bias model, the high degree of predictive capacity can sometimes result in the model nearly memorizing the training set data. As an obvious example, consider a 1-nearest neighbor model. It will always provide perfect predictions for the training set no matter how well it truly works for other data sets. Random forest models are similar; re-predicting the training set will always result in an artificially optimistic estimate of performance.

-For both models, this table summarizes the RMSE estimate for the training and test sets:
+For both models, Table \@ref(tab:rmse-results) summarizes the RMSE estimate for the training and test sets:

 ```{r resampling-rmse-table, echo = FALSE, results = "asis"}
 all_res %>%
   mutate(object = paste0("<tt>", object, "</tt>")) %>%
-  kable(escape = FALSE) %>%
+  kable(
+    caption = "Performance statistics for training and test sets.",
+    label = "rmse-results",
+    escape = FALSE
+  ) %>%
   kable_styling(full_width = FALSE) %>%
   add_header_above(c(" ", "RMSE Estimates" = 2))
 ```
@@ -133,9 +137,14 @@ If the test set should not be used immediately, and re-predicting the training s
 ## Resampling methods

-Resampling methods are empirical simulation systems that emulate the process of using some data for modeling and different data for evaluation. Most resampling methods are iterative, meaning that this process is repeated multiple times. This diagram illustrates how resampling methods generally operate:
+Resampling methods are empirical simulation systems that emulate the process of using some data for modeling and different data for evaluation. Most resampling methods are iterative, meaning that this process is repeated multiple times. The diagram in Figure \@ref(fig:resampling-scheme) illustrates how resampling methods generally operate.

-```{r resampling-scheme, echo = FALSE, out.width = '85%', warning = FALSE}
+```{r resampling-scheme}
+#| echo = FALSE,
+#| out.width = '85%',
+#| warning = FALSE,
+#| fig.cap = "Data splitting scheme from the initial data split to resampling.",
+#| fig.alt = "A diagram of the data splitting scheme from the initial data split to resampling. The first level is the training/testing set partition. The second level of splitting takes the training set and splits it into multiple 'analysis' and 'assessment' sets (which are analogous to training and test)."
 knitr::include_graphics("premade/resampling.svg")
 ```

@@ -153,17 +162,27 @@ The next section defines several commonly used methods and discusses their pros
 ### Cross-validation {#cv}

-Cross-validation is a well established resampling method. While there are a number of variations, the most common cross-validation method is _V_-fold cross-validation. The data are randomly partitioned into _V_ sets of roughly equal size (called the "folds"). For illustration, _V_ = 3 is shown below for a data set of thirty training set points with random fold allocations. The number inside the symbols is the sample number:
+Cross-validation is a well established resampling method. While there are a number of variations, the most common cross-validation method is _V_-fold cross-validation. The data are randomly partitioned into _V_ sets of roughly equal size (called the "folds"). For illustration, _V_ = 3 is shown in Figure \@ref(fig:cross-validation-allocation) for a data set of thirty training set points with random fold allocations. The number inside the symbols is the sample number.

-```{r resampling-three-cv, echo = FALSE, out.width = '50%', warning = FALSE}
+```{r cross-validation-allocation}
+#| echo = FALSE,
+#| out.width = '50%',
+#| warning = FALSE,
+#| fig.cap = "V-fold cross-validation randomly assigns data to folds.",
+#| fig.alt = "A diagram of how V-fold cross-validation randomly assigns data to folds (where V equals three). A set of thirty data points are assigned to three groups of roughly the same size."
 knitr::include_graphics("premade/three-CV.svg")
 ```

 The color of the symbols represents their randomly assigned folds. Stratified sampling is also an option for assigning folds (previously discussed in Section \@ref(splitting-methods)).

-For 3-fold cross-validation, the three iterations of resampling are illustrated below. For each iteration, one fold is held out for assessment statistics and the remaining folds are substrate for the model. This process continues for each fold so that three models produce three sets of performance statistics.
+For 3-fold cross-validation, the three iterations of resampling are illustrated in Figure \@ref(fig:cross-validation). For each iteration, one fold is held out for assessment statistics and the remaining folds are substrate for the model. This process continues for each fold so that three models produce three sets of performance statistics.

-```{r resampling-three-cv-iter, echo = FALSE, out.width = '70%', warning = FALSE}
+```{r cross-validation}
+#| echo = FALSE,
+#| out.width = '70%',
+#| warning = FALSE,
+#| fig.cap = "V-fold cross-validation data usage.",
+#| fig.alt = "A diagram of V-fold cross-validation data usage (where V equals three). For each of the three groups, the data for the fold are held out for performance while the other two are used for modeling."
 knitr::include_graphics("premade/three-CV-iter.svg")
 ```
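
As a sketch outside the commit, the 3-fold scheme above could be built with `rsample` (assuming the chapter's `ames_train` object):

```r
library(rsample)
set.seed(55)
ames_folds <- vfold_cv(ames_train, v = 3)  # assumes ames_train exists

# each split: two folds form the analysis set, one is held out for assessment
dim(analysis(ames_folds$splits[[1]]))
dim(assessment(ames_folds$splits[[1]]))
```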

@@ -203,9 +222,14 @@ There are a variety of variations on cross-validation. The most important is _re
 To create _R_ repeats of _V_-fold cross-validation, the same fold generation process is done _R_ times to generate _R_ collections of _V_ partitions. Now, instead of averaging _V_ statistics, $V \times R$ statistics produce the final resampling estimate. Due to the Central Limit Theorem, the summary statistics from each model tend toward a normal distribution.

-Consider the Ames data. On average, 10-fold cross-validation uses assessment sets that contain roughly `r floor(nrow(ames_train) * .1)` properties. If RMSE is the statistic of choice, we can denote that estimate's standard deviation as $\sigma$. With simple 10-fold cross-validation, the standard error of the mean RMSE is $\sigma/\sqrt{10}$. If this is too noisy, repeats reduce the standard error to $\sigma/\sqrt{10R}$. For 10-fold cross-validation with $R$ replicates, the plot below shows how quickly the standard error^[These are _approximate_ standard errors. As will be discussed in the next chapter, there is a within-replicate correlation that is typical of resampled results. By ignoring this extra component of variation, the simple calculations shown in this plot are overestimates of the reduction in noise in the standard errors.] decreases with replicates:
+Consider the Ames data. On average, 10-fold cross-validation uses assessment sets that contain roughly `r floor(nrow(ames_train) * .1)` properties. If RMSE is the statistic of choice, we can denote that estimate's standard deviation as $\sigma$. With simple 10-fold cross-validation, the standard error of the mean RMSE is $\sigma/\sqrt{10}$. If this is too noisy, repeats reduce the standard error to $\sigma/\sqrt{10R}$. For 10-fold cross-validation with $R$ replicates, the plot in Figure \@ref(fig:variance-reduction) shows how quickly the standard error^[These are _approximate_ standard errors. As will be discussed in the next chapter, there is a within-replicate correlation that is typical of resampled results. By ignoring this extra component of variation, the simple calculations shown in this plot are overestimates of the reduction in noise in the standard errors.] decreases with replicates.

-```{r resampling-cv-reduction, echo = FALSE, fig.height= 4}
+```{r variance-reduction}
+#| echo = FALSE,
+#| fig.height = 4,
+#| fig.cap = "Relationship between the relative variance in performance estimates versus the number of cross-validation repeats.",
+#| fig.alt = "The relationship between the relative variance in performance estimates versus the number of cross-validation repeats. As the repeats increase, the variance is reduced in a harmonically decreasing pattern with diminishing returns for a large number of replicates."
 cv_info <-
   tibble(replicates = rep(1:10, 2), V = 10) %>%
   mutate(B = V * replicates, reduction = 1/B, V = format(V))
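
A sketch outside the commit of the repeats and the $\sigma/\sqrt{10R}$ standard-error shrinkage described above (again assuming `ames_train`):

```r
library(rsample)
set.seed(55)
vfold_cv(ames_train, v = 10, repeats = 5)  # 10 folds x 5 repeats = 50 resamples

# relative standard error of the mean RMSE: sigma/sqrt(10R) versus sigma/sqrt(10)
R <- 1:10
round(1 / sqrt(R), 3)
```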
@@ -245,18 +269,28 @@ mc_cv(ames_train, prop = 9/10, times = 20)
 ### Validation sets {#validation}

-Previously mentioned in Section \@ref(what-about-a-validation-set), a validation set is a single partition that is set aside to estimate performance, before using the test set:
+Previously mentioned in Section \@ref(what-about-a-validation-set), a validation set is a single partition that is set aside to estimate performance, before using the test set (see Figure \@ref(fig:three-way-split)).

-```{r resampling-validation, echo = FALSE, out.width = '50%', warning = FALSE}
+```{r three-way-split}
+#| echo = FALSE,
+#| out.width = '50%',
+#| warning = FALSE,
+#| fig.cap = "A three-way initial split into training, testing, and validation sets.",
+#| fig.alt = "A three-way initial split into training, testing, and validation sets."
 knitr::include_graphics("premade/validation.svg")
 ```

 Validation sets are often used when the original pool of data is very large. In this case, a single large partition may be adequate to characterize model performance without having to do multiple iterations of resampling.

-With `r pkg(rsample)`, a validation set is like any other resampling object; this type is different only in that it has a single iteration^[In essence, a validation set can be considered a single iteration of Monte Carlo cross-validation.]:
+With `r pkg(rsample)`, a validation set is like any other resampling object; this type is different only in that it has a single iteration^[In essence, a validation set can be considered a single iteration of Monte Carlo cross-validation.]. Figure \@ref(fig:validation-split) shows this scheme.

-```{r resampling-validation-alt, echo = FALSE, out.width = '45%', warning = FALSE}
+```{r validation-split}
+#| echo = FALSE,
+#| out.width = '45%',
+#| warning = FALSE,
+#| fig.cap = "A two-way initial split into training and testing with an additional validation set split on the training set.",
+#| fig.alt = "A two-way initial split into training and testing with an additional validation set split on the training set."
 knitr::include_graphics("premade/validation-alt.svg")
 ```
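
Outside the commit, a single-iteration validation resample as described above might be created with `rsample` (assuming `ames_train`; `validation_split()` is one such helper):

```r
library(rsample)
set.seed(55)
val_set <- validation_split(ames_train, prop = 3/4)  # one analysis/assessment split
val_set
```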

@@ -276,11 +310,17 @@ Bootstrap resampling was originally invented as a method for approximating the s
 A bootstrap sample of the training set is a sample that is the same size as the training set but is drawn _with replacement_. This means that some training set data points are selected multiple times for the analysis set. Each data point has a `r round((1-exp(-1)) * 100, 1)`% chance of inclusion in the training set _at least once_. The assessment set contains all of the training set samples that were not selected for the analysis set (on average, with `r round((exp(-1)) * 100, 1)`% of the training set). When bootstrapping, the assessment set is often called the "out-of-bag" sample.

-For a training set of 30 samples, a schematic of three bootstrap samples is:
+For a training set of 30 samples, a schematic of three bootstrap samples is shown in Figure \@ref(fig:bootstrapping).

-```{r resampling-bootstraps, echo = FALSE, out.width = '80%', warning = FALSE}
+```{r bootstrapping}
+#| echo = FALSE,
+#| out.width = '80%',
+#| warning = FALSE,
+#| fig.cap = "Bootstrapping data usage.",
+#| fig.alt = "A diagram of bootstrapping data usage. For each bootstrap resample, the analysis set is the same size as the training set (due to sampling with replacement) and the assessment set consists of samples not in the analysis set."
 knitr::include_graphics("premade/bootstraps.svg")
 ```
+
 Note that the sizes of the assessment sets vary.

 Using `r pkg(rsample)`:
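
The hunk is truncated at this point; outside the commit, a minimal bootstrap sketch with `rsample` (assuming `ames_train`) could look like:

```r
library(rsample)
set.seed(55)
bootstraps(ames_train, times = 5)  # analysis sets drawn with replacement

# chance that a given row appears in a bootstrap sample at least once
1 - exp(-1)  # ~0.632
```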
@@ -299,9 +339,14 @@ When the data have a strong time component, a resampling method should support m
 Rolling forecast origin resampling [@hyndman2018forecasting] provides a method that emulates how time series data is often partitioned in practice, estimating the model with historical data and evaluating it with the most recent data. For this type of resampling, the size of the initial analysis and assessment sets are specified. The first iteration of resampling uses these sizes, starting from the beginning of the series. The second iteration uses the same data sizes but shifts over by a set number of samples.

-To illustrate, a training set of fifteen samples was resampled with an analysis size of eight samples and an assessment set size of three. The second iteration discards the first training set sample and both data sets shift forward by one. This configuration results in five resamples:
+To illustrate, a training set of fifteen samples was resampled with an analysis size of eight samples and an assessment set size of three. The second iteration discards the first training set sample and both data sets shift forward by one. This configuration results in five resamples, as shown in Figure \@ref(fig:rolling).

-```{r resampling-rolling, echo = FALSE, out.width = '65%', warning = FALSE}
+```{r rolling}
+#| echo = FALSE,
+#| out.width = '65%',
+#| warning = FALSE,
+#| fig.cap = "Data usage for rolling forecasting origin resampling.",
+#| fig.alt = "The data usage for rolling forecasting origin resampling. For each split, earlier data are used for modeling and a few subsequent instances are used to measure performance."
 knitr::include_graphics("premade/rolling.svg")
 ```
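
Outside the commit, the exact configuration described above can be sketched with `rsample::rolling_origin()`:

```r
library(rsample)
time_df <- data.frame(day = 1:15)
# analysis sets of 8, assessment sets of 3, sliding forward one row at a time
rolling_origin(time_df, initial = 8, assess = 3, cumulative = FALSE)
# five splits: rows 1-8/9-11, 2-9/10-12, ..., 5-12/13-15
```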

@@ -415,9 +460,9 @@ The prediction column names follow the conventions discussed for `r pkg(parsnip)
 For some resampling methods, such as the bootstrap or repeated cross-validation, there will be multiple predictions per row of the original training set. To obtain summarized values (averages of the replicate predictions) use `collect_predictions(object, summarize = TRUE)`.
 :::

-Since this analysis used 10-fold cross-validation, there is one unique prediction for each training set sample. These data can generate helpful plots of the model to understand where it potentially failed. For example, let's compare the observed and predicted values:
+Since this analysis used 10-fold cross-validation, there is one unique prediction for each training set sample. These data can generate helpful plots of the model to understand where it potentially failed. For example, Figure \@ref(fig:ames-resampled-performance) compares the observed and held-out predicted values (analogous to Figure \@ref(fig:ames-performance-plot)):

-```{r resampling-cv-pred-plot, fig.height=5, fig.width=5}
+```{r resampling-cv-pred-plot, eval=FALSE}
 assess_res %>%
   ggplot(aes(x = Sale_Price, y = .pred)) +
   geom_point(alpha = .15) +
@@ -426,6 +471,14 @@ assess_res %>%
   ylab("Predicted")
 ```

+```{r ames-resampled-performance, ref.label = "resampling-cv-pred-plot"}
+#| fig.height = 5,
+#| fig.width = 5,
+#| echo = FALSE,
+#| fig.cap = "Out-of-sample observed versus predicted values for an Ames regression model, using log-10 units on both axes.",
+#| fig.alt = "Scatter plots of out-of-sample observed versus predicted values for an Ames regression model. Both axes use log-10 units. The model shows good concordance with two outlying data points that are significantly over-predicted."
+```
+
 There was one house in the training set with a low observed sale price that is significantly overpredicted by the model. Which house was that?

 ```{r resampling-investigate}
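
The diff is truncated inside this final chunk. Outside the commit, a sketch of how held-out predictions such as `assess_res` are typically collected, assuming the chapter's `rf_wflow` workflow and `ames_folds` resamples:

```r
library(tune)
# save_pred = TRUE keeps each assessment set's predictions
ctrl <- control_resamples(save_pred = TRUE)
rf_res <- fit_resamples(rf_wflow, resamples = ames_folds, control = ctrl)
assess_res <- collect_predictions(rf_res)
```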
