forked from rstudio-education/stat545
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path21_functions-practicum.Rmd
129 lines (88 loc) · 6.02 KB
/
21_functions-practicum.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
# Function-writing practicum {#functions-practicum}
```{r include = FALSE}
source("common.R")
```
<!--Original content: https://stat545.com/block012_function-regress-lifeexp-on-year.html-->
## Overview
We recently learned how to write our own R functions ([part 1](#functions-part1), [part 2](#functions-part2), [part 3](#functions-part3)).
Now we use that knowledge to write another useful function, within the context of the Gapminder data:
* Input: a data.frame that contains (at least) a life expectancy variable `lifeExp` and a variable for year `year`
* Output: a vector of estimated intercept and slope, from a linear regression of `lifeExp` on `year`
The ultimate goal is to apply this function to the Gapminder data for a specific country. We will eventually scale up to *all* countries using external machinery, e.g., the `dplyr::group_by()` + `dplyr::do()`.
## Load the Gapminder data
As usual, load gapminder. Load ggplot2 because we'll make some plots and load dplyr too.
```{r message = FALSE}
library(gapminder)
library(ggplot2)
library(dplyr)
```
## Get data to practice with
I extract the data for one country in order to develop some working code interactively.
```{r}
j_country <- "France" # pick, but do not hard wire, an example
(j_dat <- gapminder %>%
filter(country == j_country))
```
Always always always plot the data. Yes, even now.
```{r first-example-scatterplot}
p <- ggplot(j_dat, aes(x = year, y = lifeExp))
p + geom_point() + geom_smooth(method = "lm", se = FALSE)
```
## Get some code that works
Fit the regression:
```{r}
j_fit <- lm(lifeExp ~ year, j_dat)
coef(j_fit)
```
Whoa, check out that crazy intercept! Apparently the life expectancy in France around year 0 A.D. was minus 400 years! Never forget to sanity check a model. In this case, a reparametrization is in order. I think it makes more sense for the intercept to correspond to life expectancy in 1952, the earliest date in our dataset. Estimate the intercept eye-ball-o-metrically from the plot and confirm that we've got something sane and interpretable now.
```{r}
j_fit <- lm(lifeExp ~ I(year - 1952), j_dat)
coef(j_fit)
```
### Sidebar: regression stuff
There are two things above that might prompt questions.
First, how did I know to get the estimated coefficients from a fitted model via `coef()`? Years of experience. But how might a novice learn such things? Read [the documentation for `lm()`][rdocs-lm], in this case. The "See also" section advises us about many functions that can operate on fitted linear model objects, including, but by no means limited to, `coef()`. Read [the documentation on `coef()`][rdocs-coef] too.
Second, what am I doing here: `lm(lifeExp ~ I(year - 1952))`? I want the intercept to correspond to 1952 and an easy way to accomplish that is to create a new predictor on the fly: year minus 1952. The way I achieve that in the model formula, `I(year - 1952)`, uses the `I()` function which "inhibits interpretation/conversion of objects". By protecting the expression `year - 1952`, I ensure it is interpreted in the obvious arithmetical way.
## Turn working code into a function
Create the basic definition of a function and drop your working code inside. Add arguments and edit the inner code to match. Apply it to the practice data. Do you get the same result as before?
```{r}
le_lin_fit <- function(dat, offset = 1952) {
the_fit <- lm(lifeExp ~ I(year - offset), dat)
coef(the_fit)
}
le_lin_fit(j_dat)
```
I had to decide how to handle the offset. Given that I will scale this up to many countries, which, in theory, might have data for different dates, I chose to set a default of 1952. Strategies that compute the offset from data, either the main Gapminder dataset or the excerpt passed to this function, are also reasonable to consider.
I loathe the names on this return value. This is not my first rodeo and I know that, downstream, these will contaminate variable names and factor levels and show up in public places like plots and tables. Fix names early!
```{r}
le_lin_fit <- function(dat, offset = 1952) {
the_fit <- lm(lifeExp ~ I(year - offset), dat)
setNames(coef(the_fit), c("intercept", "slope"))
}
le_lin_fit(j_dat)
```
Much better!
## Test on other data and in a clean workspace
It's always good to rotate through examples during development. The most common error this will help you catch is when you accidentally hard-wire your example into your function. If you're paying attention to your informal tests, you will find it creepy that your function returns __exactly the same results__ regardless which input data you give it. This actually happened to me while I was writing this document, I kid you not! I had left `j_fit` inside the call to `coef()`, instead of switching it to `the_fit`. How did I catch that error? I saw the fitted line below, which clearly did not have an intercept in the late 60s and a positive slope, as my first example did. Figures are a mighty weapon in the fight against nonsense. I don't trust analyses that have few/no figures.
```{r second-example-scatterplot}
j_country <- "Zimbabwe"
(j_dat <- gapminder %>% filter(country == j_country))
p <- ggplot(j_dat, aes(x = year, y = lifeExp))
p + geom_point() + geom_smooth(method = "lm", se = FALSE)
le_lin_fit(j_dat)
```
The linear fit is comically bad, but yes I believe the visual line and the regression results match up.
It's also a good idea to clean out the workspace, rerun the minimum amount of code, and retest your function. This will help you catch another common mistake: accidentally relying on objects that were lying around in the workspace during development but that are not actually defined in your function nor passed as formal arguments.
```{r}
rm(list = ls())
le_lin_fit <- function(dat, offset = 1952) {
the_fit <- lm(lifeExp ~ I(year - offset), dat)
setNames(coef(the_fit), c("intercept", "slope"))
}
le_lin_fit(gapminder %>% filter(country == "Zimbabwe"))
```
## Are we there yet?
Yes.
Given how I plan to use this function, I don't feel the need to put it under formal unit tests or put in argument validity checks.
```{r links, child="links.md"}
```