
Commit c6ce4f3

Clean up after #228
1 parent 3912a44 commit c6ce4f3

3 files changed: +12 -11 lines changed

16-dimensionality-reduction.Rmd
Lines changed: 6 additions & 5 deletions

@@ -111,7 +111,7 @@ bean_val$splits[[1]]
 
 To visually assess how well different methods perform, we can estimate the methods on the training set (n = `r analysis(bean_val$splits[[1]]) %>% nrow()` beans) and display the results using the validation set (n = `r assessment(bean_val$splits[[1]]) %>% nrow()`).
 
-Before beginning, we can spend some time investigating our data. Since we know that many of these shape features are probably measuring similar concepts, let's take a look at the correlation structure of the data in Figure \@ref(fig:corr-plot) using this code.
+Before beginning, we can spend some time investigating our data. Since we know that many of these shape features are probably measuring similar concepts, let's take a look at the correlation structure of the data in Figure \@ref(fig:beans-corr-plot) using this code.
 
 ```{r dimensionality-corr-plot, eval = FALSE}
 library(corrplot)
@@ -122,7 +122,7 @@ bean_train %>%
   corrplot(col = tmwr_cols(200), tl.col = "black", method = "ellipse")
 ```
 
-```{r corr-plot, ref.label = "dimensionality-corr-plot"}
+```{r beans-corr-plot, ref.label = "dimensionality-corr-plot"}
 #| echo = FALSE,
 #| fig.height=6,
 #| fig.width=6,
@@ -372,7 +372,7 @@ Solidity (i.e., the density of the bean) drives the third PLS component, along w
 
 ICA is slightly different than PCA in that it finds components that are as statistically independent from one another as possible (as opposed to being uncorrelated). It can be thought of as maximizing the "non-Gaussianity" of the ICA components. Let's use `step_ica()` to produce Figure \@ref(fig:bean-ica).
 
-```{r dimensionality-ica}
+```{r dimensionality-ica, eval=FALSE}
 bean_rec_trained %>%
   step_ica(all_numeric_predictors(), num_comp = 4) %>%
   plot_validation_results() +
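
With `eval=FALSE`, this chunk now only echoes its code; the figure is presumably rendered by a separate unechoed chunk via `ref.label`, following the same pattern the diff shows for `beans-corr-plot`, and the same presumably holds for the supervised UMAP chunk below. A minimal sketch of such a companion chunk, assuming the name `bean-ica` taken from the figure reference:

```{r bean-ica, ref.label = "dimensionality-ica"}
#| echo = FALSE
```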
@@ -415,7 +415,7 @@ While the between-cluster space is pronounced, the clusters can contain a hetero
 
 There is also a supervised version of UMAP:
 
-```{r dimensionality-umap-supervised, dev = "png", fig.height=7}
+```{r dimensionality-umap-supervised, eval=FALSE}
 bean_rec_trained %>%
   step_umap(all_numeric_predictors(), outcome = "class", num_comp = 4) %>%
   plot_validation_results() +
@@ -534,7 +534,8 @@ rankings %>%
   geom_point(cex = 3.5) +
   theme(legend.position = "right") +
   labs(y = "ROC AUC") +
-  geom_text(aes(y = mean - 0.01, label = wflow_id), angle = 90, hjust = 1)
+  geom_text(aes(y = mean - 0.01, label = wflow_id), angle = 90, hjust = 1) +
+  lims(y = c(0.9, NA))
 ```
 
 It is clear from these results that most models give very good performance; there are few bad choices here. For demonstration, we'll use the RDA model with PLS features as the final model. We will finalize the workflow with the numerically best parameters, fit it to the training set, then evaluate with the test set:
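
The finalizing code itself falls outside this diff. As a rough sketch of that step with workflowsets, where the object names `bean_results` and `bean_split` and the workflow id `"pls_rda"` are assumptions rather than the book's actual code:

```r
library(tidymodels)
library(workflowsets)

# Numerically best parameters for the (assumed) RDA-with-PLS workflow
best_params <-
  bean_results %>%                        # a tuned workflow_set; name assumed
  extract_workflow_set_result("pls_rda") %>%
  select_best(metric = "roc_auc")

# Finalize the workflow, fit it to the training set, evaluate on the test set
final_fit <-
  bean_results %>%
  extract_workflow("pls_rda") %>%
  finalize_workflow(best_params) %>%
  last_fit(split = bean_split)            # bean_split: assumed initial split

collect_metrics(final_fit)
```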

18-explaining-models-and-predictions.Rmd
Lines changed: 2 additions & 2 deletions

@@ -49,6 +49,7 @@ In Chapters \@ref(resampling) and \@ref(compare), we trained and compared severa
 
 ```{r explain-obs-pred}
 #| echo = FALSE,
+#| fig.height = 4,
 #| fig.cap = "Comparing predicted prices for a linear model with interactions and a random forest model.",
 #| fig.alt = "Comparing predicted prices for a linear model with interactions and a random forest model. The random forest results in more accurate predictions."
 bind_rows(
@@ -59,7 +60,7 @@ bind_rows(
   geom_abline(col = "gray50", lty = 2) +
   geom_point(alpha = 0.3, show.legend = FALSE) +
   facet_wrap(vars(model)) +
-  scale_color_viridis_d(end = 0.7) +
+  scale_color_brewer(palette = "Paired") +
   labs(x = "true price", y = "predicted price")
 ```
 
@@ -427,7 +428,6 @@ Using our previously defined importance plotting function, `ggplot_imp(vip_beans
 ```{r bean-explainer}
 #| echo = FALSE,
 #| fig.width = 8,
-#| fig.height = 4.5,
 #| fig.cap = "Global explainer for the regularized discriminant analysis model on the beans data.",
 #| fig.alt = "Global explainer for the regularized discriminant analysis model on the beans data. Almost all predictors have a significant contribution with shape factors one and four contributing the most. "
 
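For context, `vip_beans` is the global importance object being plotted here. A hedged sketch of how such an explainer might be built with DALEXtra, where the fitted-model name `rda_wflow_fit` is an assumption:

```r
library(DALEXtra)   # loads DALEX as a dependency

# Wrap the fitted tidymodels workflow in an explainer (names assumed)
explainer_rda <- explain_tidymodels(
  rda_wflow_fit,
  data  = dplyr::select(bean_train, -class),
  y     = bean_train$class,
  label = "RDA"
)

# Permutation-based variable importance, the kind of object ggplot_imp() plots
vip_beans <- model_parts(explainer_rda, type = "variable_importance")
```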
19-when-should-you-trust-predictions.Rmd
Lines changed: 4 additions & 4 deletions

@@ -300,7 +300,7 @@ res_test %>%
   scale_color_brewer(palette = "Set2") +
   scale_shape_manual(values = 15:22) +
   scale_x_date(labels = date_format("%B %d, %Y")) +
-  labs(x = NULL, y = "Daily Ridership (x1000)", color = NULL)
+  labs(x = NULL, y = "Daily Ridership (x1000)", color = NULL, pch = NULL)
 ```
 
 Given the scale of the ridership numbers, these results look particularly good for such a simple model. If this model were deployed, how well would it have done a few years later in June of 2020? The model successfully makes a prediction, as a predictive model will when given input data:
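
The prediction code referenced here is not part of the diff; a minimal sketch of the idea, assuming a fitted parsnip model `lm_fit` and a data frame of June 2020 predictors `chicago_2020`:

```r
library(tidymodels)

# The model returns predictions for pandemic-era inputs without complaint,
# even though those inputs look nothing like the training data
res_2020 <-
  predict(lm_fit, new_data = chicago_2020) %>%
  bind_cols(chicago_2020)
```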
@@ -342,7 +342,7 @@ res_2020 %>%
   scale_shape_manual(values = 15:22) +
   scale_color_brewer(palette = "Set2") +
   scale_x_date(labels = date_format("%B %d, %Y")) +
-  labs(x = NULL, y = "Daily Ridership (x1000)", color = NULL)
+  labs(x = NULL, y = "Daily Ridership (x1000)", color = NULL, pch = NULL)
 ```
 
 Confidence and prediction intervals for linear regression expand as the data become more and more removed from the center of the training set. However, that effect is not dramatic enough to flag these predictions as being poor.
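
A sketch of how those intervals can be pulled from a parsnip linear regression fit, again assuming the names `lm_fit` and `chicago_2020`:

```r
# 95% intervals widen as new data moves away from the training set's center,
# but here not enough to mark the 2020 predictions as untrustworthy
predict(lm_fit, new_data = chicago_2020, type = "conf_int", level = 0.95)
predict(lm_fit, new_data = chicago_2020, type = "pred_int", level = 0.95)
```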
@@ -477,8 +477,8 @@ test_pca_dist <-
     aes(x = PC1_mean, y = PC2_mean, xend = PC1, yend = PC2),
     col = "red"
   ) +
-  geom_point(data = testing_pca, aes(x = PC1, y = PC2), col = "lightblue", pch = 17) +
-  geom_point(data = new_pca, aes(x = PC1, y = PC2), col = "red") +
+  geom_point(data = testing_pca, aes(x = PC1, y = PC2), col = "lightblue", size = 2, pch = 17) +
+  geom_point(data = new_pca, aes(x = PC1, y = PC2), size = 2, col = "red") +
   coord_obs_pred() +
   labs(x = "Component 1", y = "Component 2", title = "Distances to Training Set Center") +
   theme_bw() +
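
The distance-to-center idea plotted above can also be scored numerically. One way to sketch it is with the applicable package; whether the book computes the distances this way, and the names `training_vals` and `chicago_2020`, are assumptions:

```r
library(applicable)

# PCA-based applicability domain built from the training predictors
pca_stat <- apd_pca(~ ., data = training_vals, threshold = 0.99)

# score() reports each new sample's distance to the training center and
# the percentile of that distance relative to the training distribution
score(pca_stat, new_data = chicago_2020) %>%
  dplyr::select(dplyr::starts_with("distance"))
```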
