Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 10 additions & 4 deletions 01-software-modeling.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -206,7 +206,8 @@ This iterative process is especially true for modeling. Figure \@ref(fig:softwar

After an initial sequence of these tasks, more understanding is gained regarding which types of models are superior as well as which sub-populations of the data are not being effectively estimated. This leads to additional EDA and feature engineering, another round of modeling, and so on. Once the data analysis goals are achieved, the last steps are typically to finalize, document, and communicate the model. For predictive models, it is common at the end to validate the model on an additional set of data reserved for this specific purpose.

As an example, @fes use data to model the daily ridership of Chicago's public train system using predictors such as the date, the previous ridership results, the weather, and other factors. An approximation of these authors' "inner monologue" when analyzing these data is, in order:
As an example, @fes use data to model the daily ridership of Chicago's public train system using predictors such as the date, the previous ridership results, the weather, and other factors. Table \@ref(tab:inner-monologue) shows an approximation of these authors' "inner monologue" when analyzing these data. Eventually, a model is selected that is able to achieve sufficient performance.


```{r software-monolog, echo = FALSE, results = 'as-is'}
monolog <-
Expand Down Expand Up @@ -251,21 +252,26 @@ if (knitr::is_html_output()) {
tab <-
monolog %>%
dplyr::select(Thoughts, Activity) %>%
kable() %>%
kable(
caption = "A hypothetical inner monologue of a model developer.",
label = "inner-monologue"
) %>%
kable_styling() %>%
column_spec(2, width = "25%") %>%
column_spec(1, width = "75%", italic = TRUE)
} else {
tab <-
monolog %>%
dplyr::select(Thoughts, Activity) %>%
kable() %>%
kable(
caption = "A hypothetical inner monologue of a model developer.",
label = "inner-monologue"
) %>%
kable_styling()
}
tab
```

and so on. Eventually, a model is selected that is able to achieve sufficient performance.

## Chapter summary {#software-summary}

Expand Down
86 changes: 64 additions & 22 deletions 03-base-r.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
knitr::opts_chunk$set(fig.path = "figures/")
data(crickets, package = "modeldata")
library(tidyverse)
library(kableExtra)
```

This book is about creating models with R. Before describing how to apply tidy data principles, let's review how models are created, trained, and used in the core R language (often called "base R"). This chapter is a brief illustration of core language conventions. It is not exhaustive but provides readers (especially those new to R) the basic, most commonly used motifs.
Expand All @@ -12,30 +13,44 @@ The S language, on which R is based, has had a rich data analysis environment si

## An example

To demonstrate these fundamentals, let's use experimental data from @mcdonald2009, by way of @mangiafico2015, on the relationship between the ambient temperature and the rate of cricket chirps per minute. Data were collected for two species: _O. exclamationis_ and _O. niveus_. The data are contained in a data frame called `crickets` with a total of `r nrow(crickets)` data points. These data are shown here in a `r pkg(ggplot2)` graph.
To demonstrate these fundamentals, let's use experimental data from @mcdonald2009, by way of @mangiafico2015, on the relationship between the ambient temperature and the rate of cricket chirps per minute. Data were collected for two species: _O. exclamationis_ and _O. niveus_. The data are contained in a data frame called `crickets` with a total of `r nrow(crickets)` data points. These data are shown in Figure \@ref(fig:cricket-plot) using the following `r pkg(ggplot2)` code.

```{r base-r-cricket-plot, out.width = '70%', fig.width=6, fig.height=4, warning = FALSE}
```{r base-r-cricket-plot, eval = FALSE}
library(tidyverse)

data(crickets, package = "modeldata")
names(crickets)

# Plot the temperature on the x-axis, the chirp rate on the y-axis. The plot
# elements will be colored differently for each species:
ggplot(crickets, aes(x = temp, y = rate, col = species)) +
ggplot(crickets, aes(x = temp, y = rate, col = species, pch = species)) +
# Plot points for each data point and color by species
geom_point() +
# Show a simple linear model fit created separately for each species:
geom_smooth(method = lm, se = FALSE) +
labs(x = "Temperature (C)", y = "Chirp Rate (per minute)")
geom_smooth(method = lm, se = FALSE, alpha = 1/2) +
labs(x = "Temperature (C)", y = "Chirp Rate (per minute)") +
scale_color_brewer(palette = "Paired")
```

The data exhibit fairly linear trends for each species. For a given temperature, _O. exclamationis_ appears to chirp more per minute than the other species. For an inferential model, the researchers might have specified the following null hypotheses prior to seeing the data:

* Temperature has no effect on the chirp rate.

* There are no differences between the species' chirp rate.


```{r cricket-plot, ref.label = "base-r-cricket-plot"}
#| out.width = '70%',
#| fig.width = 6,
#| fig.height = 4,
#| warning = FALSE,
#| message = FALSE,
#| echo = FALSE,
#| fig.cap = "The relationship between chirp rate and temperature for two different species of cricket.",
#| fig.alt = "A scatter plot of the chirp rate and temperature for two different species of cricket with linear trend lines per species. The trends are linearly increasing with a separation between the two species."
```


There may be some scientific or practical value in predicting the chirp rate but in this example we will focus on inference.

To fit an ordinary linear model in R, the `lm()` function is commonly used. The important arguments to this function are a model formula and a data frame that contains the data. The formula is _symbolic_. For example, the simple formula:
Expand Down Expand Up @@ -96,9 +111,9 @@ interaction_fit

This output is a little hard to read. For the species indicator variables, R mashes the variable name (`species`) together with the factor level (`O. niveus`) with no delimiter.

Before going into any inferential results for this model, the fit should be assessed using diagnostic plots. We can use the `plot()` method for `lm` objects. This method produces a set of four plots for the object, each showing different aspects of the fit. Two plots are shown here:
Before going into any inferential results for this model, the fit should be assessed using diagnostic plots. We can use the `plot()` method for `lm` objects. This method produces a set of four plots for the object, each showing different aspects of the fit. The code below produces the two plots are shown in Figure \@ref(fig:interaction-plots).

```{r interaction-plots, out.width = '100%', fig.width=8, fig.height=4.5, warning = FALSE}
```{r base-r-interaction-plots, eval = FALSE}
# Place two plots next to one another:
par(mfrow = c(1, 2))

Expand All @@ -109,6 +124,17 @@ plot(interaction_fit, which = 1)
plot(interaction_fit, which = 2)
```


```{r interaction-plots, ref.label= "base-r-interaction-plots"}
#| out.width = '100%',
#| fig.width = 8,
#| fig.height = 4.5,
#| warning = FALSE,
#| echo = FALSE,
#| fig.cap = "Residual diagnostic plots for the linear model containing interactions.",
#| fig.alt = "On the left is a scatter plot of the model residuals versus predicted values. There are no strong trends in the data. The right-hand panel shows a normal quantile-quantile plot where the points indicate that normality if probably a good assumption."
```

These appear reasonable enough to conduct inferential analysis.

:::rmdnote
Expand Down Expand Up @@ -206,18 +232,28 @@ R is quite different from Python in this respect. An advantage of R's diversity

Unfortunately, some of the syntactical diversity is due to a focus on the needs of the person developing the code instead of the needs of the person using the code. Inconsistencies between packages can be a stumbling block to R users.

Suppose your modeling project has an outcome with two classes. There are a variety of statistical and machine learning models you could choose from. In order to produce a class probability estimate for each sample, it is common for a model function to have a corresponding `predict()` method. However, there is significant heterogeneity in the argument values used by those methods to make class probability predictions; this heterogeneity can be difficult for even experienced users to navigate. A sampling of these argument values for different models is:

| Function | Package | Code |
| :----------- | :---------------------------------- | :-------------------------------------------- |
| `lda` | <span class="pkg">MASS</span> | `predict(object)` |
| `glm` | <span class="pkg">stats</span> | `predict(object, type = "response")` |
| `gbm` | <span class="pkg">gbm</span> | `predict(object, type = "response", n.trees)` |
| `mda` | <span class="pkg">mda</span> | `predict(object, type = "posterior")` |
| `rpart` | <span class="pkg">rpart</span> | `predict(object, type = "prob")` |
| various | <span class="pkg">RWeka</span> | `predict(object, type = "probability")` |
| `logitboost` | <span class="pkg">LogitBoost</span> | `predict(object, type = "raw", nIter)` |
| `pamr.train` | <span class="pkg">pamr</span> | `pamr.predict(object, type = "posterior")` |
Suppose your modeling project has an outcome with two classes. There are a variety of statistical and machine learning models you could choose from. In order to produce a class probability estimate for each sample, it is common for a model function to have a corresponding `predict()` method. However, there is significant heterogeneity in the argument values used by those methods to make class probability predictions; this heterogeneity can be difficult for even experienced users to navigate. A sampling of these argument values for different models is shown in Table \@ref(tab:probability-args).

```{r prob-args, echo = FALSE, results = "asis"}
prob_arg_info <-
tribble(
~ Function, ~Package, ~Code,
"`lda`" , "MASS" , "`predict(object)`" ,
"`glm`" , "stats" , "`predict(object, 'response')`" ,
"`gbm`" , "gbm" , "`predict(object, 'response', n.trees)`" ,
"`mda`" , "mda" , "`predict(object, 'posterior')`" ,
"`rpart`" , "rpart" , "`predict(object, 'prob')`" ,
"various" , "RWeka" , "`predict(object, 'probability')`" ,
"`logitboost`" , "LogitBoost" , "`predict(object, 'raw', nIter)`" ,
"`pamr.train`" , "pamr" , "`pamr.predict(object, 'posterior')`"
) %>%
kable(
caption = "Example argument names for different random forest functions.",
label = "probability-args",
escape = FALSE
) %>%
kable_styling(full_width = FALSE)
```

Note that the last example has a custom _function_ to make predictions instead of using the more common `predict()` interface (the generic `predict()` _method_). This lack of consistency is a barrier to day-to-day usage of R for modeling.

Expand Down Expand Up @@ -271,9 +307,9 @@ library(broom)
tidy(corr_res[[1]])
```

These results can be "stacked" and added to a `ggplot()`:
These results can be "stacked" and added to a `ggplot()`, as shown in Figure \@ref(fig:corr-plot).

```{r base-r-corr-plot}
```{r base-r-corr-plot, eval = FALSE}
corr_res %>%
# Convert each to a tidy format; `map_dfr()` stacks the data frames
map_dfr(tidy, .id = "predictor") %>%
Expand All @@ -283,6 +319,12 @@ corr_res %>%
labs(x = NULL, y = "Correlation with mpg")
```

```{r corr-plot, ref.label = "base-r-corr-plot"}
#| echo = FALSE,
#| fig.cap = "A plot of the correlations (and 95% confidence intervals) between predictors and the outcome in the `mtcars` data set.",
#| fig.alt = "A plot of the correlations (and 95% confidence intervals) between predictors and the outcome in the `mtcars` data set. None of the intervals overlap with zero. The car weight had the largest negative correlation and the rear axle ratio has the highest positive correlation."
```

Creating such a plot is possible using core R language functions, but automatically reformatting the results makes for more concise code with less potential for errors.

## Combining base R models and the tidyverse
Expand Down
71 changes: 60 additions & 11 deletions 04-ames.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -43,24 +43,42 @@ dim(ames)

## Exploring important features

It makes sense to start with the outcome we want to predict: the last sale price of the house (in USD):
It makes sense to start with the outcome we want to predict: the last sale price of the house (in USD). Figure \@ref(fig:ames-sale-price) is produced using:

```{r ames-sale_price, out.width = '100%', fig.width=8, fig.height=3}
```{r ames-sale-price-code, eval = FALSE}
library(tidymodels)
tidymodels_prefer()

ggplot(ames, aes(x = Sale_Price)) +
geom_histogram(bins = 50)
geom_histogram(bins = 50, col= "white")
```

```{r ames-sale-price-hist, ref.label = "ames-sale-price-code"}
#| out.width = '100%',
#| fig.width = 8,
#| fig.height = 3,
#| echo = FALSE,
#| fig.cap = "A histogram of the sale prices of houses in Ames Iowa.",
#| fig.alt = "A histogram of the sale prices of houses in Ames Iowa. The distribution has a long right tail."
```

The data are right-skewed; there are more inexpensive houses than expensive ones. The median sale price was \$`r format(median(ames$Sale_Price), big.mark = ",")` and the most expensive house was \$`r format(max(ames$Sale_Price), big.mark = ",")`. When modeling this outcome, a strong argument can be made that the price should be log-transformed. The advantages of doing this are that no houses would be predicted with negative sale prices and that errors in predicting expensive houses will not have an undue influence on the model. Also, from a statistical perspective, a logarithmic transform may also _stabilize the variance_ in a way that makes inference more legitimate. Let's visualize the transformed data:

```{r ames-log-sale_price, out.width = '100%', fig.width=8, fig.height=3}
```{r ames-log-sale-price-code, eval = FALSE}
ggplot(ames, aes(x = Sale_Price)) +
geom_histogram(bins = 50) +
geom_histogram(bins = 50, col= "white") +
scale_x_log10()
```

```{r ames-log-sale-price-hist, ref.label = "ames-log-sale-price-code"}
#| out.width = '100%',
#| fig.width = 8,
#| fig.height = 3,
#| echo = FALSE,
#| fig.cap = "A histogram of the sale prices of houses in Ames Iowa after a log (base 10) transformation.",
#| fig.alt = "A histogram of the sale prices of houses in Ames Iowa after a log (base 10) transformation. The distribution, while not perfectly symmetric, exhibits far less skewness."
```

While not perfect, this will probably result in better models than using the untransformed data.

:::rmdwarning
Expand All @@ -77,42 +95,73 @@ ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

Another important aspect of these data for our modeling are their geographic locations. This spatial information is contained in the data in two ways: a qualitative `Neighborhood` label as well as quantitative longitude and latitude data. To visualize the spatial information, let's use both together to plot the data on a map and color by neighborhood:

```{r ames-map, out.width = "100%", echo = FALSE, fig.cap = "Neighborhoods in Ames IA", warning = FALSE}
```{r ames-map}
#| out.width = "100%",
#| echo = FALSE,
#| warning = FALSE,
#| fig.cap = "Neighborhoods in Ames IA.",
#| fig.alt = "A scatter plot of house locations in Ames superimposed over a street map. There is a significant area in the center of the map where no houses were sold."
# See file extras/ames_sf.R
knitr::include_graphics("premade/ames.png")
```

We can see a few noticeable patterns. First, there is a void of data points in the center of Ames. This corresponds to Iowa State University. Second, while there are a number of neighborhoods that are geographically isolated, there are others that are adjacent to each other. For example, Timberland is located apart from almost all other neighborhoods:

```{r ames-timberland , out.width = "80%", echo = FALSE, warning = FALSE}
```{r ames-timberland}
#| out.width = "80%",
#| echo = FALSE,
#| warning = FALSE,
#| fig.cap = "A scatter plot of locations of houses in Timberland.",
#| fig.alt = "A scatter plot of locations of houses in Timberland, located in the southern part of Ames."
# See file extras/ames_sf.R
knitr::include_graphics("premade/timberland.png")
```

The Meadow Village neighborhood in Southwest Ames is like an island of properties ensconced inside the sea of properties that make up the Mitchell neighborhood:

```{r ames-mitchell , out.width = "60%", echo = FALSE, warning = FALSE}
```{r ames-mitchell}
#| out.width = "60%",
#| echo = FALSE,
#| warning = FALSE,
#| fig.cap = "A scatter plot of locations of houses in Meadow Village and Mitchell.",
#| fig.alt = "A scatter plot of locations of houses in Meadow Village and Mitchell. The small number of Meadow Village properties are enclosed inside the the ones labeled as being in Mitchell."
# See file extras/ames_sf.R
knitr::include_graphics("premade/mitchell.png")
```

A detailed inspection of the map also shows that the neighborhood labels are not completely reliable. For example, there are some properties labeled as being in Northridge that are surrounded by houses in the adjacent Somerset neighborhood:

```{r ames-northridge , out.width = "90%", echo = FALSE, warning = FALSE}
```{r ames-northridge}
#| out.width = "90%",
#| echo = FALSE,
#| warning = FALSE,
#| fig.cap = "A scatter plot of locations of houses in Somerset and Northridge.",
#| fig.alt = "A scatter plot of locations of houses in Somerset and Northridge. There are a few houses in Somerset mixed in the periphery of Northridge (and vice versa)."
# See file extras/ames_sf.R
knitr::include_graphics("premade/northridge.png")
```

Also, there are ten isolated houses labeled as being in Crawford but are not close to the majority of the other houses in that neighborhood:

```{r ames-crawford , out.width = "80%", echo = FALSE, warning = FALSE}
```{r ames-crawford}
#| out.width = "80%",
#| echo = FALSE,
#| warning = FALSE,
#| fig.cap = "A scatter plot of locations of houses in Crawford.",
#| fig.alt = "A scatter plot of locations of houses in Crawford. There is a large cluster of houses to the west of a small, separate cluster of properties also labeled as Crawford."
# See file extras/ames_sf.R
knitr::include_graphics("premade/crawford.png")
```

Also notable is the "Iowa Department of Transportation (DOT) and Rail Road" neighborhood adjacent to the main road on the east side of Ames. There are several clusters of houses within this neighborhood as well as some longitudinal outliers; the two houses furthest east are isolated from the other locations.

```{r ames-dot_rr , out.width = "100%", echo = FALSE, warning = FALSE}
```{r ames-dot_rr}
#| out.width = "100%",
#| echo = FALSE,
#| warning = FALSE,
#| fig.cap = "A scatter plot of locations of houses labeled as 'Iowa Department of Transportation (DOT) and Rail Road'.",
#| fig.alt = "A scatter plot of locations of houses labeled as 'Iowa Department of Transportation (DOT) and Rail Road'. The longitude distribution is right-skewed with a few outlying properties."

# See file extras/ames_sf.R
knitr::include_graphics("premade/dot_rr.png")
```
Expand Down
Loading