Commit 1a3855e

Clean up after #223
1 parent 5640b02 commit 1a3855e

4 files changed: +22 -19 lines changed


04-ames.Rmd

Lines changed: 6 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -50,7 +50,7 @@ library(tidymodels)
5050
tidymodels_prefer()
5151
5252
ggplot(ames, aes(x = Sale_Price)) +
53-
geom_histogram(bins = 50)
53+
geom_histogram(bins = 50, col= "white")
5454
```
5555

5656
```{r ames-sale-price-hist, ref.label = "ames-sale-price-code"}
@@ -107,7 +107,7 @@ knitr::include_graphics("premade/ames.png")
107107

108108
We can see a few noticeable patterns. First, there is a void of data points in the center of Ames. This corresponds to Iowa State University. Second, while there are a number of neighborhoods that are geographically isolated, there are others that are adjacent to each other. For example, as Figure \@ref(fig:ames-timberland) shows, Timberland is located apart from almost all other neighborhoods.
109109

110-
```{r ames-timberland , out.width = "80%", echo = FALSE, warning = FALSE}
110+
```{r ames-timberland}
111111
#| out.width = "80%",
112112
#| echo = FALSE,
113113
#| warning = FALSE,
@@ -119,7 +119,7 @@ knitr::include_graphics("premade/timberland.png")
119119

120120
Figure \@ref(fig:ames-mitchell) visualizes how the Meadow Village neighborhood in Southwest Ames is like an island of properties ensconced inside the sea of properties that make up the Mitchell neighborhood.
121121

122-
```{r ames-mitchell , out.width = "60%", echo = FALSE, warning = FALSE}
122+
```{r ames-mitchell}
123123
#| out.width = "60%",
124124
#| echo = FALSE,
125125
#| warning = FALSE,
@@ -131,7 +131,7 @@ knitr::include_graphics("premade/mitchell.png")
131131

132132
A detailed inspection of the map also shows that the neighborhood labels are not completely reliable. For example, Figure \@ref(fig:ames-northridge) shows there are some properties labeled as being in Northridge that are surrounded by homes in the adjacent Somerset neighborhood.
133133

134-
```{r ames-northridge , out.width = "90%", echo = FALSE, warning = FALSE}
134+
```{r ames-northridge}
135135
#| out.width = "90%",
136136
#| echo = FALSE,
137137
#| warning = FALSE,
@@ -143,7 +143,7 @@ knitr::include_graphics("premade/northridge.png")
143143

144144
Also, there are ten isolated homes labeled as being in Crawford that you can see in Figure \@ref(fig:ames-crawford) but are not close to the majority of the other homes in that neighborhood:
145145

146-
```{r ames-crawford , out.width = "80%", echo = FALSE, warning = FALSE}
146+
```{r ames-crawford}
147147
#| out.width = "80%",
148148
#| echo = FALSE,
149149
#| warning = FALSE,
@@ -155,7 +155,7 @@ knitr::include_graphics("premade/crawford.png")
155155

156156
Also notable is the "Iowa Department of Transportation (DOT) and Rail Road" neighborhood adjacent to the main road on the east side of Ames, shown in Figure \@ref(fig:ames-dot_rr). There are several clusters of homes within this neighborhood as well as some longitudinal outliers; the two homes furthest east are isolated from the other locations.
157157

158-
```{r ames-dot_rr , out.width = "100%", echo = FALSE, warning = FALSE}
158+
```{r ames-dot_rr}
159159
#| out.width = "100%",
160160
#| echo = FALSE,
161161
#| warning = FALSE,

06-fitting-models.Rmd

Lines changed: 12 additions & 9 deletions
@@ -174,7 +174,7 @@ arg_info %>%
174174
caption = "Random forest argument names used by parsnip.",
175175
label = "parsnip-args",
176176
escape = FALSE
177-
) %>%
177+
) %>%
178178
kable_styling(full_width = FALSE) %>%
179179
column_spec(2, monospace = TRUE)
180180
```
@@ -291,13 +291,13 @@ ames_test_small %>%
291291
The motivation for the first rule comes from some R packages producing dissimilar data types from prediction functions. For example, the `r pkg(ranger)` package is an excellent tool for computing random forest models. However, instead of returning a data frame or vector as output, a specialized object is returned that has multiple values embedded within it (including the predicted values). This is just one more step for the data analyst to work around in their scripts. As another example, the `r pkg(glmnet)` package can return at least four different output types for predictions, depending on the model and characteristics of the data. These are shown in Table \@ref(tab:predict-types).
292292

293293
```{r model-pred-types, echo = FALSE, results = "asis"}
294-
tribble(
295-
~ `Type of Prediction`, ~ `Returns a:`,
296-
"numeric", "numeric matrix",
297-
"class", "character matrix",
298-
"probability (2 classes)", "numeric matrix (2nd level only)",
299-
"probability (3+ classes)", "3D numeric array (all levels)",
300-
) %>%
294+
tribble(
295+
~ `Type of Prediction`, ~ `Returns a:`,
296+
"numeric", "numeric matrix",
297+
"class", "character matrix",
298+
"probability (2 classes)", "numeric matrix (2nd level only)",
299+
"probability (3+ classes)", "3D numeric array (all levels)",
300+
) %>%
301301
kable(
302302
caption = "Different return values for glmnet prediction types.",
303303
label = "predict-types"
@@ -318,7 +318,10 @@ tribble(
318318
"conf_int", ".pred_lower, .pred_upper",
319319
"pred_int", ".pred_lower, .pred_upper"
320320
) %>%
321-
kable() %>%
321+
kable(
322+
caption = "The tidymodels mapping of prediction types and column names.",
323+
label = "predictable-column-names",
324+
) %>%
322325
kable_styling(full_width = FALSE) %>%
323326
column_spec(1:2, monospace = TRUE)
324327
```
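
The column-name mapping captioned in the hunk above can be checked directly. The following is a minimal sketch, assuming the parsnip package is installed; the built-in `mtcars` data, the `lm` engine, and the object names (`lm_fit`, `new_cars`) are illustrative choices and do not come from the chapter. It only inspects the column names that `predict()` returns for two prediction types.

```r
library(parsnip)

# Illustrative model (an assumption): linear regression on a built-in data set,
# not the Ames model used in the chapter.
lm_spec <- set_engine(linear_reg(), "lm")
lm_fit <- fit(lm_spec, mpg ~ wt, data = mtcars)

new_cars <- mtcars[1:3, ]

# Numeric predictions come back as a tibble with a single `.pred` column.
num_pred <- predict(lm_fit, new_data = new_cars)
names(num_pred)

# Confidence intervals use the `.pred_lower` and `.pred_upper` columns.
ci_pred <- predict(lm_fit, new_data = new_cars, type = "conf_int")
names(ci_pred)
```

Because the result is always a tibble with one row per row of `new_data`, it can be column-bound to the original data without checking the return type first.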

08-feature-engineering.Rmd

Lines changed: 1 addition & 1 deletion
@@ -226,7 +226,7 @@ recipe(~Bldg_Type, data = ames_train) %>%
226226
slice(show_rows) %>%
227227
arrange(`Raw Data`) %>%
228228
kable(
229-
caption = "Illustration of binary encodings (i.e., 'dummy variables') for a qualitative predictor.",
229+
caption = 'Illustration of binary encodings (i.e., "dummy variables") for a qualitative predictor.',
230230
label = "dummy-vars"
231231
) %>%
232232
kable_styling(full_width = FALSE)

09-judging-model-effectiveness.Rmd

Lines changed: 3 additions & 3 deletions
@@ -19,7 +19,7 @@ Once we have a model, we need to know how well it works. A quantitative approach
1919
The best approach to empirical validation involves using _resampling_ methods that will be introduced in Chapter \@ref(resampling). In this chapter, we will use the test set for illustration purposes and to motivate the need for empirical validation. Keep in mind that the test set can only be used once, as explained in Section \@ref(splitting-methods).
2020
:::
2121

22-
The choice of which metrics to examine can be critical. In later chapters, certain model parameters will be empirically optimized and a primary performance metric will be used to choose the best _sub-model_. Choosing the wrong method can easily result in unintended consequences. For example, two common metrics for regression models are the root mean squared error (RMSE) and the coefficient of determination (a.k.a. $R^2$). The former measures _accuracy_ while the latter measures _correlation_. These are not necessarily the same thing. Figure \@ref(fig: performance-reg-metrics) demonstrates the difference between the two.
22+
The choice of which metrics to examine can be critical. In later chapters, certain model parameters will be empirically optimized and a primary performance metric will be used to choose the best _sub-model_. Choosing the wrong method can easily result in unintended consequences. For example, two common metrics for regression models are the root mean squared error (RMSE) and the coefficient of determination (a.k.a. $R^2$). The former measures _accuracy_ while the latter measures _correlation_. These are not necessarily the same thing. Figure \@ref(fig:performance-reg-metrics) demonstrates the difference between the two.
2323
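
The distinction between accuracy and correlation in the paragraph above can be illustrated with simulated numbers. This is a minimal sketch in base R, not the code behind the chapter's figure: the data, the helper functions `rmse()` and `rsq()`, and all object names are assumptions made for illustration. Here $R^2$ is computed as the squared correlation between observed and predicted values.

```r
# Simulated outcome values (an assumption for illustration only).
set.seed(1)
observed <- runif(100, min = 0, max = 10)

# Predictions that are perfectly correlated with the outcome but
# systematically wrong: doubled and shifted upward.
biased_pred <- 2 * observed + 10

# Predictions that are unbiased but noisy.
noisy_pred <- observed + rnorm(100, sd = 1)

# Hypothetical helpers: root mean squared error and squared correlation.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
rsq <- function(truth, estimate) cor(truth, estimate)^2

c(rmse = rmse(observed, biased_pred), rsq = rsq(observed, biased_pred))
c(rmse = rmse(observed, noisy_pred), rsq = rsq(observed, noisy_pred))
```

The biased predictions have the higher $R^2$ even though their RMSE is far worse, which is why optimizing one metric does not guarantee good values of the other.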

2424
```{r performance-reg-metrics, echo = FALSE}
2525
#| fig.cap = "Observed versus predicted values for models that are optimized using the RMSE compared to the coefficient of determination.",
@@ -208,9 +208,9 @@ f_meas(two_class_example, truth, predicted)
208208

209209
For binary classification data sets, these functions have a standard argument called `event_level`. The _default_ is that the **first** level of the outcome factor is the event of interest.
210210

211-
```{block, type = "rmdnote"}
211+
:::rmdnote
212212
There is some heterogeneity in R functions in this regard; some use the first level and others the second to denote the event of interest. We consider it more intuitive that the first level is the most important. The second level logic is borne of encoding the outcome as 0/1 (in which case the second value is the event) and unfortunately remains in some packages. However, tidymodels (along with many other R packages) _require_ a categorical outcome to be encoded as a factor and, for this reason, the legacy justification for the second level as the event becomes irrelevant.
213-
```
213+
:::
214214

215215
As an example where the second class is the event:
216216
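
The hunk ends before the chapter's own example. As a stand-in, here is a minimal sketch of the `event_level` argument, assuming the yardstick package and its bundled `two_class_example` data (the same data used in the `f_meas()` call in the hunk header); the object names are illustrative.

```r
library(yardstick)

data(two_class_example)

# Default: the first factor level ("Class1") is treated as the event.
sens_first <- sensitivity(two_class_example, truth, predicted)

# Treat the second level ("Class2") as the event instead.
sens_second <- sensitivity(
  two_class_example, truth, predicted,
  event_level = "second"
)

# Specificity under the default event level: the non-event is "Class2".
spec_first <- specificity(two_class_example, truth, predicted)

sens_first
sens_second
spec_first
```

With two classes, switching the event level swaps the roles of event and non-event, so the sensitivity computed with `event_level = "second"` should match the specificity computed with the default.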
