Clean up after #223

juliasilge · juliasilge · commit 1a3855e139e6 · 2022-01-29T11:33:37.000-07:00
diff --git a/04-ames.Rmd b/04-ames.Rmd
@@ -50,7 +50,7 @@ library(tidymodels)
 tidymodels_prefer()
 
 ggplot(ames, aes(x = Sale_Price)) + 
-  geom_histogram(bins = 50)
+  geom_histogram(bins = 50, col= "white")
 ```
 
 ```{r ames-sale-price-hist, ref.label = "ames-sale-price-code"}
@@ -107,7 +107,7 @@ knitr::include_graphics("premade/ames.png")
 
 We can see a few noticeable patterns. First, there is a void of data points in the center of Ames. This corresponds to Iowa State University. Second, while there are a number of neighborhoods that are geographically isolated, there are others that are adjacent to each other. For example, as Figure \@ref(fig:ames-timberland) shows, Timberland is located apart from almost all other neighborhoods.
 
-```{r ames-timberland , out.width = "80%", echo = FALSE, warning = FALSE}
+```{r ames-timberland}
 #| out.width = "80%", 
 #| echo = FALSE, 
 #| warning = FALSE,
@@ -119,7 +119,7 @@ knitr::include_graphics("premade/timberland.png")
 
 Figure \@ref(fig:ames-mitchell) visualizes how the Meadow Village neighborhood in Southwest Ames is like an island of properties ensconced inside the sea of properties that make up the Mitchell neighborhood. 
 
-```{r ames-mitchell , out.width = "60%", echo = FALSE, warning = FALSE}
+```{r ames-mitchell}
 #| out.width = "60%", 
 #| echo = FALSE, 
 #| warning = FALSE,
@@ -131,7 +131,7 @@ knitr::include_graphics("premade/mitchell.png")
  
 A detailed inspection of the map also shows that the neighborhood labels are not completely reliable. For example, Figure \@ref(fig:ames-northridge) shows there are some properties labeled as being in Northridge that are surrounded by homes in the adjacent Somerset neighborhood. 
 
-```{r ames-northridge , out.width = "90%", echo = FALSE, warning = FALSE}
+```{r ames-northridge}
 #| out.width = "90%", 
 #| echo = FALSE, 
 #| warning = FALSE,
@@ -143,7 +143,7 @@ knitr::include_graphics("premade/northridge.png")
 
 Also, there are ten isolated homes labeled as being in Crawford that you can see in Figure \@ref(fig:ames-crawford) but are not close to the majority of the other homes in that neighborhood:
 
-```{r ames-crawford , out.width = "80%", echo = FALSE, warning = FALSE}
+```{r ames-crawford}
 #| out.width = "80%", 
 #| echo = FALSE, 
 #| warning = FALSE,
@@ -155,7 +155,7 @@ knitr::include_graphics("premade/crawford.png")
 
 Also notable is the "Iowa Department of Transportation (DOT) and Rail Road" neighborhood adjacent to the main road on the east side of Ames, shown in Figure \@ref(fig:ames-dot_rr). There are several clusters of homes within this neighborhood as well as some longitudinal outliers; the two homes furthest east are isolated from the other locations. 
 
-```{r ames-dot_rr , out.width = "100%", echo = FALSE, warning = FALSE}
+```{r ames-dot_rr}
 #| out.width = "100%", 
 #| echo = FALSE, 
 #| warning = FALSE,
diff --git a/06-fitting-models.Rmd b/06-fitting-models.Rmd
@@ -174,7 +174,7 @@ arg_info %>%
     caption = "Random forest argument names used by parsnip.",
     label = "parsnip-args",
     escape = FALSE
-    ) %>% 
+  ) %>% 
   kable_styling(full_width = FALSE) %>%
   column_spec(2, monospace = TRUE)
 ```
@@ -291,13 +291,13 @@ ames_test_small %>%
 The motivation for the first rule comes from some R packages producing dissimilar data types from prediction functions. For example, the `r pkg(ranger)` package is an excellent tool for computing random forest models. However, instead of returning a data frame or vector as output, a specialized object is returned that has multiple values embedded within it (including the predicted values). This is just one more step for the data analyst to work around in their scripts. As another example, the `r pkg(glmnet)` package can return at least four different output types for predictions, depending on the model and characteristics of the data. These are shown in Table \@ref(tab:predict-types).
 
 ```{r model-pred-types, echo = FALSE, results = "asis"}
-  tribble(
-    ~ `Type of Prediction`, ~ `Returns a:`,
-    "numeric",                 "numeric matrix",
-    "class",                   "character matrix",
-    "probability (2 classes)", "numeric matrix (2nd level only)",
-    "probability (3+ classes)", "3D numeric array (all levels)", 
-  ) %>% 
+tribble(
+  ~ `Type of Prediction`, ~ `Returns a:`,
+  "numeric",                 "numeric matrix",
+  "class",                   "character matrix",
+  "probability (2 classes)", "numeric matrix (2nd level only)",
+  "probability (3+ classes)", "3D numeric array (all levels)", 
+) %>% 
   kable(
     caption = "Different return values for glmnet prediction types.",
     label = "predict-types"
@@ -318,7 +318,10 @@ tribble(
   "conf_int", ".pred_lower, .pred_upper",
   "pred_int", ".pred_lower, .pred_upper"
 ) %>% 
-  kable() %>% 
+  kable(
+    caption = "The tidymodels mapping of prediction types and column names.",
+    label = "predictable-column-names",
+  ) %>% 
   kable_styling(full_width = FALSE)  %>%
   column_spec(1:2, monospace = TRUE)
 ```
diff --git a/08-feature-engineering.Rmd b/08-feature-engineering.Rmd
@@ -226,7 +226,7 @@ recipe(~Bldg_Type, data = ames_train) %>%
   slice(show_rows) %>% 
   arrange(`Raw Data`) %>% 
   kable(
-    caption = "Illustration of binary encodings (i.e., 'dummy variables') for a qualitative predictor.",
+    caption = 'Illustration of binary encodings (i.e., "dummy variables") for a qualitative predictor.',
     label = "dummy-vars"
   ) %>% 
   kable_styling(full_width = FALSE)
diff --git a/09-judging-model-effectiveness.Rmd b/09-judging-model-effectiveness.Rmd
@@ -19,7 +19,7 @@ Once we have a model, we need to know how well it works. A quantitative approach
 The best approach to empirical validation involves using _resampling_ methods that will be introduced in Chapter \@ref(resampling). In this chapter, we will use the test set for illustration purposes and to motivate the need for empirical validation. Keep in mind that the test set can only be used once, as explained in Section \@ref(splitting-methods).
 :::
 
-The choice of which metrics to examine can be critical. In later chapters, certain model parameters will be empirically optimized and a primary performance metric will be used to choose the best _sub-model_. Choosing the wrong method can easily result in unintended consequences. For example, two common metrics for regression models are the root mean squared error (RMSE) and the coefficient of determination (a.k.a. $R^2$). The former measures _accuracy_ while the latter measures _correlation_. These are not necessarily the same thing. Figure \@ref(fig: performance-reg-metrics) demonstrates the difference between the two. 
+The choice of which metrics to examine can be critical. In later chapters, certain model parameters will be empirically optimized and a primary performance metric will be used to choose the best _sub-model_. Choosing the wrong method can easily result in unintended consequences. For example, two common metrics for regression models are the root mean squared error (RMSE) and the coefficient of determination (a.k.a. $R^2$). The former measures _accuracy_ while the latter measures _correlation_. These are not necessarily the same thing. Figure \@ref(fig:performance-reg-metrics) demonstrates the difference between the two. 
 
 ```{r performance-reg-metrics, echo = FALSE}
 #| fig.cap = "Observed versus predicted values for models that are optimized using the RMSE compared to the coefficient of determination.",
@@ -208,9 +208,9 @@ f_meas(two_class_example, truth, predicted)
 
 For binary classification data sets, these functions have a standard argument called `event_level`. The _default_ is that the **first** level of the outcome factor is the event of interest. 
 
-```{block, type = "rmdnote"}
+:::rmdnote
 There is some heterogeneity in R functions in this regard; some use the first level and others the second to denote the event of interest. We consider it more intuitive that the first level is the most important. The second level logic is borne of encoding the outcome as 0/1 (in which case the second value is the event) and unfortunately remains in some packages. However, tidymodels (along with many other R packages) _require_ a categorical outcome to be encoded as a factor and, for this reason, the legacy justification for the second level as the event becomes irrelevant.  
-```
+:::
 
 As an example where the second class is the event: