Commit 5640b02

Merge pull request #223 from tidymodels/fig-tab-4-9
Figure/table updates for Ch 4 to 9
2 parents fd16b45 + f840adc commit 5640b02

11 files changed: 297 additions, 481 deletions

04-ames.Rmd

Lines changed: 61 additions & 13 deletions
@@ -9,7 +9,7 @@ tidymodels_prefer()
99

1010
# The Ames housing data {#ames}
1111

12-
The Ames housing data set [@ames] is an excellent resource for learning about models that we will use throughout this book. It contains data on `r format(nrow(ames), big.mark = ",")` properties in Ames, Iowa, including columns related to
12+
The Ames housing data set [@ames] is an excellent resource for learning about models that we will use throughout this book. It contains data on `r format(nrow(ames), big.mark = ",")` properties in Ames, Iowa, including columns related to:
1313

1414
* house characteristics (bedrooms, garage, fireplace, pool, porch, etc.),
1515
* location (neighborhood),
@@ -43,24 +43,42 @@ dim(ames)
4343

4444
## Exploring important features
4545

46-
It makes sense to start with the outcome we want to predict: the last sale price of the house (in USD):
46+
Let's start with the outcome we want to predict: the last sale price of the house (in USD), as shown in Figure \@ref(fig:ames-sale-price).
4747

48-
```{r ames-sale_price, out.width = '100%', fig.width=8, fig.height=3}
48+
```{r ames-sale-price-code, eval = FALSE}
4949
library(tidymodels)
5050
tidymodels_prefer()
5151
5252
ggplot(ames, aes(x = Sale_Price)) +
5353
geom_histogram(bins = 50)
5454
```
5555

56-
The data are right-skewed; there are more inexpensive houses than expensive ones. The median sale price was \$`r format(median(ames$Sale_Price), big.mark = ",")` and the most expensive house was \$`r format(max(ames$Sale_Price), big.mark = ",")`. When modeling this outcome, a strong argument can be made that the price should be log-transformed. The advantages of doing this are that no houses would be predicted with negative sale prices and that errors in predicting expensive houses will not have an undue influence on the model. Also, from a statistical perspective, a logarithmic transform may also _stabilize the variance_ in a way that makes inference more legitimate. Let's visualize the transformed data:
56+
```{r ames-sale-price-hist, ref.label = "ames-sale-price-code"}
57+
#| out.width = '100%',
58+
#| fig.width = 8,
59+
#| fig.height = 3,
60+
#| echo = FALSE,
61+
#| fig.cap = "Sale prices of houses in Ames, Iowa.",
62+
#| fig.alt = "A histogram of the sale prices of houses in Ames, Iowa. The distribution has a long right tail."
63+
```
64+
65+
The data are right-skewed; there are more inexpensive houses than expensive ones. The median sale price was \$`r format(median(ames$Sale_Price), big.mark = ",")` and the most expensive house was \$`r format(max(ames$Sale_Price), big.mark = ",")`. When modeling this outcome, a strong argument can be made that the price should be log-transformed. The advantages of doing this are that no houses would be predicted with negative sale prices and that errors in predicting expensive houses will not have an undue influence on the model. Also, from a statistical perspective, a logarithmic transform may _stabilize the variance_ in a way that makes inference more legitimate. Figure \@ref(fig:ames-log-sale-price-hist) visualizes the transformed data.
5766

58-
```{r ames-log-sale_price, out.width = '100%', fig.width=8, fig.height=3}
67+
```{r ames-log-sale-price-code, eval = FALSE}
5968
ggplot(ames, aes(x = Sale_Price)) +
60-
geom_histogram(bins = 50) +
69+
geom_histogram(bins = 50, col= "white") +
6170
scale_x_log10()
6271
```
6372

73+
```{r ames-log-sale-price-hist, ref.label = "ames-log-sale-price-code"}
74+
#| out.width = '100%',
75+
#| fig.width = 8,
76+
#| fig.height = 3,
77+
#| echo = FALSE,
78+
#| fig.cap = "Sale prices of houses in Ames, Iowa after a log (base 10) transformation.",
79+
#| fig.alt = "A histogram of the sale prices of houses in Ames, Iowa after a log (base 10) transformation. The distribution, while not perfectly symmetric, exhibits far less skewness."
80+
```
81+
6482
While not perfect, this will probably result in better models than using the untransformed data.
6583

6684
:::rmdwarning
@@ -75,44 +93,74 @@ Despite these drawbacks, the models used in this book utilize the log transforma
7593
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))
7694
```
7795

78-
Another important aspect of these data for our modeling are their geographic locations. This spatial information is contained in the data in two ways: a qualitative `Neighborhood` label as well as quantitative longitude and latitude data. To visualize the spatial information, let's use both together to plot the data on a map and color by neighborhood:
96+
Another important aspect of these data for our modeling is their geographic locations. This spatial information is contained in the data in two ways: a qualitative `Neighborhood` label as well as quantitative longitude and latitude data. To visualize the spatial information, let's use both together to plot the data on a map and color by neighborhood in Figure \@ref(fig:ames-map).
7997

80-
```{r ames-map, out.width = "100%", echo = FALSE, fig.cap = "Neighborhoods in Ames IA", warning = FALSE}
98+
```{r ames-map}
99+
#| out.width = "100%",
100+
#| echo = FALSE,
101+
#| warning = FALSE,
102+
#| fig.cap = "Neighborhoods in Ames, IA.",
103+
#| fig.alt = "A scatter plot of house locations in Ames superimposed over a street map. There is a significant area in the center of the map where no homes were sold."
81104
# See file extras/ames_sf.R
82105
knitr::include_graphics("premade/ames.png")
83106
```
84107

85-
We can see a few noticeable patterns. First, there is a void of data points in the center of Ames. This corresponds to Iowa State University. Second, while there are a number of neighborhoods that are geographically isolated, there are others that are adjacent to each other. For example, Timberland is located apart from almost all other neighborhoods:
108+
We can see a few noticeable patterns. First, there is a void of data points in the center of Ames. This corresponds to Iowa State University. Second, while there are a number of neighborhoods that are geographically isolated, there are others that are adjacent to each other. For example, as Figure \@ref(fig:ames-timberland) shows, Timberland is located apart from almost all other neighborhoods.
86109

87110
```{r ames-timberland , out.width = "80%", echo = FALSE, warning = FALSE}
111+
#| out.width = "80%",
112+
#| echo = FALSE,
113+
#| warning = FALSE,
114+
#| fig.cap = "Locations of homes in Timberland.",
115+
#| fig.alt = "A scatter plot of locations of homes in Timberland, located in the southern part of Ames."
88116
# See file extras/ames_sf.R
89117
knitr::include_graphics("premade/timberland.png")
90118
```
91119

92-
The Meadow Village neighborhood in Southwest Ames is like an island of properties ensconced inside the sea of properties that make up the Mitchell neighborhood:
120+
Figure \@ref(fig:ames-mitchell) visualizes how the Meadow Village neighborhood in Southwest Ames is like an island of properties ensconced inside the sea of properties that make up the Mitchell neighborhood.
93121

94122
```{r ames-mitchell , out.width = "60%", echo = FALSE, warning = FALSE}
123+
#| out.width = "60%",
124+
#| echo = FALSE,
125+
#| warning = FALSE,
126+
#| fig.cap = "Locations of homes in Meadow Village and Mitchell.",
127+
#| fig.alt = "A scatter plot of locations of homes in Meadow Village and Mitchell. The small number of Meadow Village properties are enclosed inside the ones labeled as being in Mitchell."
95128
# See file extras/ames_sf.R
96129
knitr::include_graphics("premade/mitchell.png")
97130
```
98131

99-
A detailed inspection of the map also shows that the neighborhood labels are not completely reliable. For example, there are some properties labeled as being in Northridge that are surrounded by houses in the adjacent Somerset neighborhood:
132+
A detailed inspection of the map also shows that the neighborhood labels are not completely reliable. For example, Figure \@ref(fig:ames-northridge) shows there are some properties labeled as being in Northridge that are surrounded by homes in the adjacent Somerset neighborhood.
100133

101134
```{r ames-northridge , out.width = "90%", echo = FALSE, warning = FALSE}
135+
#| out.width = "90%",
136+
#| echo = FALSE,
137+
#| warning = FALSE,
138+
#| fig.cap = "Locations of homes in Somerset and Northridge.",
139+
#| fig.alt = "A scatter plot of locations of homes in Somerset and Northridge. There are a few homes in Somerset mixed in the periphery of Northridge (and vice versa)."
102140
# See file extras/ames_sf.R
103141
knitr::include_graphics("premade/northridge.png")
104142
```
105143

106-
Also, there are ten isolated houses labeled as being in Crawford but are not close to the majority of the other houses in that neighborhood:
144+
Also, there are ten isolated homes labeled as being in Crawford that, as you can see in Figure \@ref(fig:ames-crawford), are not close to the majority of the other homes in that neighborhood.
107145

108146
```{r ames-crawford , out.width = "80%", echo = FALSE, warning = FALSE}
147+
#| out.width = "80%",
148+
#| echo = FALSE,
149+
#| warning = FALSE,
150+
#| fig.cap = "Locations of homes in Crawford.",
151+
#| fig.alt = "A scatter plot of locations of homes in Crawford. There is a large cluster of homes to the west of a small, separate cluster of properties also labeled as Crawford."
109152
# See file extras/ames_sf.R
110153
knitr::include_graphics("premade/crawford.png")
111154
```
112155

113-
Also notable is the "Iowa Department of Transportation (DOT) and Rail Road" neighborhood adjacent to the main road on the east side of Ames. There are several clusters of houses within this neighborhood as well as some longitudinal outliers; the two houses furthest east are isolated from the other locations.
156+
Also notable is the "Iowa Department of Transportation (DOT) and Rail Road" neighborhood adjacent to the main road on the east side of Ames, shown in Figure \@ref(fig:ames-dot_rr). There are several clusters of homes within this neighborhood as well as some longitudinal outliers; the two homes furthest east are isolated from the other locations.
114157

115158
```{r ames-dot_rr , out.width = "100%", echo = FALSE, warning = FALSE}
159+
#| out.width = "100%",
160+
#| echo = FALSE,
161+
#| warning = FALSE,
162+
#| fig.cap = "Homes labeled as 'Iowa Department of Transportation (DOT) and Rail Road'.",
163+
#| fig.alt = "A scatter plot of locations of homes labeled as 'Iowa Department of Transportation (DOT) and Rail Road'. The longitude distribution is right-skewed with a few outlying properties."
116164
# See file extras/ames_sf.R
117165
knitr::include_graphics("premade/dot_rr.png")
118166
```
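
Review note on the log-transformation paragraph in this file: the prose claims that a log transform reduces the right skew and that predictions made on the log scale cannot map back to negative prices. The following is a minimal sketch of both points, assuming simulated prices rather than the Ames data. The `price` vector, its parameters, and the `skewness()` helper are hypothetical and defined here only for illustration.

```r
# Hypothetical right-skewed prices; simulated, NOT the Ames data.
set.seed(123)
price <- rlnorm(1000, meanlog = log(160000), sdlog = 0.4)

# Simple moment-based skewness; an illustrative helper, not from any package.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

raw_skew <- skewness(price)         # skewness on the original USD scale
log_skew <- skewness(log10(price))  # skewness after the log10 transform

# Any value predicted on the log10 scale back-transforms to a positive price,
# even when the log-scale value itself is negative.
log_scale_predictions <- c(-1, 4.5, 5.2)
back_transformed <- 10^log_scale_predictions
```

Comparing `raw_skew` with `log_skew` shows how much of the asymmetry the transform removes for data of this shape. The Ames distribution may behave differently in detail.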

05-data-spending.Rmd

Lines changed: 5 additions & 2 deletions
@@ -58,7 +58,10 @@ These objects are data frames with the same _columns_ as the original data but o
5858

5959
Simple random sampling is appropriate in many cases but there are exceptions. When there is a dramatic _class imbalance_ in classification problems, one class occurs much less frequently than another. Using a simple random sample may haphazardly allocate these infrequent samples disproportionately into the training or test set. To avoid this, _stratified sampling_ can be used. The training/test split is conducted separately within each class and then these subsamples are combined into the overall training and test set. For regression problems, the outcome data can be artificially binned into _quartiles_ and then stratified sampling conducted four separate times. This is an effective method for keeping the distributions of the outcome similar between the training and test set.
6060

61-
```{r ames-sale-price, echo = FALSE, fig.cap = "The distribution of the sale price (in log units) for the Ames housing data. The vertical lines indicate the quartiles of the data."}
61+
```{r ames-sale-price, echo = FALSE}
62+
#| fig.cap = "The distribution of the sale price (in log units) for the Ames housing data. The vertical lines indicate the quartiles of the data.",
63+
#| fig.alt = "The distribution of the sale price (in log units) for the Ames housing data. The vertical lines indicate the quartiles of the data."
64+
6265
sale_dens <-
6366
density(ames$Sale_Price, n = 2^10) %>%
6467
tidy()
@@ -72,7 +75,7 @@ quart_plot <-
7275
geom_segment(data = quartiles,
7376
aes(x = value, xend = value, y = 0, yend = y),
7477
lty = 2) +
75-
xlab("Sale Price (log-10 USD)")
78+
labs(x = "Sale Price (log-10 USD)", y = NULL)
7679
quart_plot
7780
```
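
Review note on the stratified-sampling paragraph in this file: the text describes binning a numeric outcome into quartiles and then splitting within each bin. The base R sketch below shows one way that could look, assuming a simulated outcome rather than the Ames data. The data frame `dat`, the 80% proportion, and all object names are hypothetical. In practice, the `strata` argument of `rsample::initial_split()` serves this purpose, and this sketch is not a description of how that function is implemented.

```r
set.seed(502)

# Hypothetical outcome on the log10 scale; simulated, NOT the Ames data.
dat <- data.frame(id = 1:400, y = rnorm(400, mean = 5.2, sd = 0.18))

# Bin the outcome into four groups using its quartiles.
breaks <- quantile(dat$y, probs = seq(0, 1, by = 0.25))
dat$bin <- cut(dat$y, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Within each bin, draw 80% of the rows for training, then pool the draws.
train_ids <- unlist(lapply(split(dat$id, dat$bin), function(ids) {
  ids[sample.int(length(ids), size = floor(0.8 * length(ids)))]
}))

train <- dat[dat$id %in% train_ids, ]
test  <- dat[!dat$id %in% train_ids, ]
```

Because the split is done separately in each quartile bin, the training and test sets each receive the same share of low, middle, and high outcome values, which is what keeps their outcome distributions similar.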
7881
