
Commit c6ce4f3

Clean up after #228
1 parent 3912a44 commit c6ce4f3

3 files changed: +12 -11 lines changed

16-dimensionality-reduction.Rmd
Lines changed: 6 additions & 5 deletions

@@ -111,7 +111,7 @@ bean_val$splits[[1]]
 
 To visually assess how well different methods perform, we can estimate the methods on the training set (n = `r analysis(bean_val$splits[[1]]) %>% nrow()` beans) and display the results using the validation set (n = `r assessment(bean_val$splits[[1]]) %>% nrow()`).
 
-Before beginning, we can spend some time investigating our data. Since we know that many of these shape features are probably measuring similar concepts, let's take a look at the correlation structure of the data in Figure \@ref(fig:corr-plot) using this code.
+Before beginning, we can spend some time investigating our data. Since we know that many of these shape features are probably measuring similar concepts, let's take a look at the correlation structure of the data in Figure \@ref(fig:beans-corr-plot) using this code.
 
 ```{r dimensionality-corr-plot, eval = FALSE}
 library(corrplot)
@@ -122,7 +122,7 @@ bean_train %>%
   corrplot(col = tmwr_cols(200), tl.col = "black", method = "ellipse")
 ```
 
-```{r corr-plot, ref.label = "dimensionality-corr-plot"}
+```{r beans-corr-plot, ref.label = "dimensionality-corr-plot"}
 #| echo = FALSE,
 #| fig.height=6,
 #| fig.width=6,
@@ -372,7 +372,7 @@ Solidity (i.e., the density of the bean) drives the third PLS component, along w
 
 ICA is slightly different than PCA in that it finds components that are as statistically independent from one another as possible (as opposed to being uncorrelated). It can be thought of as maximizing the "non-Gaussianity" of the ICA components. Let's use `step_ica()` to produce Figure \@ref(fig:bean-ica).
 
-```{r dimensionality-ica}
+```{r dimensionality-ica, eval=FALSE}
 bean_rec_trained %>%
   step_ica(all_numeric_predictors(), num_comp = 4) %>%
   plot_validation_results() +
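
With `eval=FALSE`, this chunk now only echoes its code; the figure is presumably rendered by a separate unechoed chunk via `ref.label`, following the same pattern the diff shows for `beans-corr-plot`, and the same presumably holds for the supervised UMAP chunk below. A minimal sketch of such a companion chunk, assuming the name `bean-ica` taken from the figure reference:

```{r bean-ica, ref.label = "dimensionality-ica"}
#| echo = FALSE
```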
@@ -415,7 +415,7 @@ While the between-cluster space is pronounced, the clusters can contain a hetero
 
 There is also a supervised version of UMAP:
 
-```{r dimensionality-umap-supervised, dev = "png", fig.height=7}
+```{r dimensionality-umap-supervised, eval=FALSE}
 bean_rec_trained %>%
   step_umap(all_numeric_predictors(), outcome = "class", num_comp = 4) %>%
   plot_validation_results() +
@@ -534,7 +534,8 @@ rankings %>%
   geom_point(cex = 3.5) +
   theme(legend.position = "right") +
   labs(y = "ROC AUC") +
-  geom_text(aes(y = mean - 0.01, label = wflow_id), angle = 90, hjust = 1)
+  geom_text(aes(y = mean - 0.01, label = wflow_id), angle = 90, hjust = 1) +
+  lims(y = c(0.9, NA))
 ```
 
 It is clear from these results that most models give very good performance; there are few bad choices here. For demonstration, we'll use the RDA model with PLS features as the final model. We will finalize the workflow with the numerically best parameters, fit it to the training set, then evaluate with the test set:
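
The finalizing code itself falls outside this diff. As a rough sketch of that step with workflowsets, where the object names `bean_results` and `bean_split` and the workflow id `"pls_rda"` are assumptions rather than the book's actual code:

```r
library(tidymodels)
library(workflowsets)

# Numerically best parameters for the (assumed) RDA-with-PLS workflow
best_params <-
  bean_results %>%                        # a tuned workflow_set; name assumed
  extract_workflow_set_result("pls_rda") %>%
  select_best(metric = "roc_auc")

# Finalize the workflow, fit it to the training set, evaluate on the test set
final_fit <-
  bean_results %>%
  extract_workflow("pls_rda") %>%
  finalize_workflow(best_params) %>%
  last_fit(split = bean_split)            # bean_split: assumed initial split

collect_metrics(final_fit)
```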

18-explaining-models-and-predictions.Rmd
Lines changed: 2 additions & 2 deletions

@@ -49,6 +49,7 @@ In Chapters \@ref(resampling) and \@ref(compare), we trained and compared severa
 
 ```{r explain-obs-pred}
 #| echo = FALSE,
+#| fig.height = 4,
 #| fig.cap = "Comparing predicted prices for a linear model with interactions and a random forest model.",
 #| fig.alt = "Comparing predicted prices for a linear model with interactions and a random forest model. The random forest results in more accurate predictions."
 bind_rows(
@@ -59,7 +60,7 @@ bind_rows(
   geom_abline(col = "gray50", lty = 2) +
   geom_point(alpha = 0.3, show.legend = FALSE) +
   facet_wrap(vars(model)) +
-  scale_color_viridis_d(end = 0.7) +
+  scale_color_brewer(palette = "Paired") +
   labs(x = "true price", y = "predicted price")
 ```
 
@@ -427,7 +428,6 @@ Using our previously defined importance plotting function, `ggplot_imp(vip_beans
 ```{r bean-explainer}
 #| echo = FALSE,
 #| fig.width = 8,
-#| fig.height = 4.5,
 #| fig.cap = "Global explainer for the regularized discriminant analysis model on the beans data.",
 #| fig.alt = "Global explainer for the regularized discriminant analysis model on the beans data. Almost all predictors have a significant contribution with shape factors one and four contributing the most. "
 
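For context, `vip_beans` is the global importance object being plotted here. A hedged sketch of how such an explainer might be built with DALEXtra, where the fitted-model name `rda_wflow_fit` is an assumption:

```r
library(DALEXtra)   # loads DALEX as a dependency

# Wrap the fitted tidymodels workflow in an explainer (names assumed)
explainer_rda <- explain_tidymodels(
  rda_wflow_fit,
  data  = dplyr::select(bean_train, -class),
  y     = bean_train$class,
  label = "RDA"
)

# Permutation-based variable importance, the kind of object ggplot_imp() plots
vip_beans <- model_parts(explainer_rda, type = "variable_importance")
```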
19-when-should-you-trust-predictions.Rmd
Lines changed: 4 additions & 4 deletions

@@ -300,7 +300,7 @@ res_test %>%
   scale_color_brewer(palette = "Set2") +
   scale_shape_manual(values = 15:22) +
   scale_x_date(labels = date_format("%B %d, %Y")) +
-  labs(x = NULL, y = "Daily Ridership (x1000)", color = NULL)
+  labs(x = NULL, y = "Daily Ridership (x1000)", color = NULL, pch = NULL)
 ```
 
 Given the scale of the ridership numbers, these results look particularly good for such a simple model. If this model were deployed, how well would it have done a few years later in June of 2020? The model successfully makes a prediction, as a predictive model will when given input data:
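
The prediction code referenced here is not part of the diff; a minimal sketch of the idea, assuming a fitted parsnip model `lm_fit` and a data frame of June 2020 predictors `chicago_2020`:

```r
library(tidymodels)

# The model returns predictions for pandemic-era inputs without complaint,
# even though those inputs look nothing like the training data
res_2020 <-
  predict(lm_fit, new_data = chicago_2020) %>%
  bind_cols(chicago_2020)
```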
@@ -342,7 +342,7 @@ res_2020 %>%
   scale_shape_manual(values = 15:22) +
   scale_color_brewer(palette = "Set2") +
   scale_x_date(labels = date_format("%B %d, %Y")) +
-  labs(x = NULL, y = "Daily Ridership (x1000)", color = NULL)
+  labs(x = NULL, y = "Daily Ridership (x1000)", color = NULL, pch = NULL)
 ```
 
 Confidence and prediction intervals for linear regression expand as the data become more and more removed from the center of the training set. However, that effect is not dramatic enough to flag these predictions as being poor.
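
A sketch of how those intervals can be pulled from a parsnip linear regression fit, again assuming the names `lm_fit` and `chicago_2020`:

```r
# 95% intervals widen as new data moves away from the training set's center,
# but here not enough to mark the 2020 predictions as untrustworthy
predict(lm_fit, new_data = chicago_2020, type = "conf_int", level = 0.95)
predict(lm_fit, new_data = chicago_2020, type = "pred_int", level = 0.95)
```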
@@ -477,8 +477,8 @@ test_pca_dist <-
     aes(x = PC1_mean, y = PC2_mean, xend = PC1, yend = PC2),
     col = "red"
   ) +
-  geom_point(data = testing_pca, aes(x = PC1, y = PC2), col = "lightblue", pch = 17) +
-  geom_point(data = new_pca, aes(x = PC1, y = PC2), col = "red") +
+  geom_point(data = testing_pca, aes(x = PC1, y = PC2), col = "lightblue", size = 2, pch = 17) +
+  geom_point(data = new_pca, aes(x = PC1, y = PC2), size = 2, col = "red") +
   coord_obs_pred() +
   labs(x = "Component 1", y = "Component 2", title = "Distances to Training Set Center") +
   theme_bw() +
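
The distance-to-center idea plotted above can also be scored numerically. One way to sketch it is with the applicable package; whether the book computes the distances this way, and the names `training_vals` and `chicago_2020`, are assumptions:

```r
library(applicable)

# PCA-based applicability domain built from the training predictors
pca_stat <- apd_pca(~ ., data = training_vals, threshold = 0.99)

# score() reports each new sample's distance to the training center and
# the percentile of that distance relative to the training distribution
score(pca_stat, new_data = chicago_2020) %>%
  dplyr::select(dplyr::starts_with("distance"))
```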
