# Regression {#regression}
```{r setup_reg, include=FALSE, purl=FALSE}
chap <- 6
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(tidy = FALSE, out.width = '\\textwidth')
options(scipen = 99, digits = 4)
```
Now that we are equipped with data visualization skills from Chapter \@ref(viz), data wrangling skills from Chapter \@ref(wrangling), and an understanding of the "tidy" data format from Chapter \@ref(tidy), we proceed to discuss one of the most commonly used statistical procedures: *regression*. Much as we saw with the Grammar of Graphics in Chapter \@ref(viz), the fundamental premise of (simple linear) regression is to *model* the relationship between
* An outcome/dependent/predicted variable $y$
* As a function of a predictor/independent/covariate variable $x$
Where do the different labels (outcome, dependent, predictor) come from? Regression, in its simplest form, can be viewed in two ways:
1. **For Explanation**: You want to study the relationship between an outcome variable $y$ and a set of explanatory variables, determine the significance of any relationships found, and have measures summarizing these relationships.
1. **For Prediction**: You want to predict an outcome variable $y$ based on the information contained in a set of predictor variables. You don't care so much about understanding how all the variables relate and interact; as long as you can make good predictions about $y$, you're fine.
In this chapter, we use the `flights` data frame in the `nycflights13` package to look at the relationship between departure delay, arrival delay, and other variables related to flights. We will also discuss the concept of *correlation* and how it is frequently incorrectly implied to also lead to *causation*. This chapter also introduces the `broom` package, which is a useful tool for summarizing the results of regression fits in "tidy" format.
### Needed packages {-}
Let's load all the packages needed for this chapter (this assumes you've already installed them). Read Chapter \@ref(packages) for information on how to install and load R packages.
```{r message=FALSE, warning=FALSE}
library(nycflights13)
library(ggplot2)
library(dplyr)
library(broom)
library(knitr)
```
```{r message=FALSE, warning=FALSE, echo=FALSE}
# Packages needed internally, but not in text.
library(mvtnorm)
```
<!--Subsection on Alaska Data -->
## Data: Alaskan Airlines delays {#data}
Say you are a junior airline analyst charged with exploring the relationship/association between departure delays and arrival delays for Alaska Airlines flights leaving New York City in 2013. You, however, don't have enough time to dig up information on all flights, and thus take a random sample of 50 flights. Is there a meaningful relationship between departure and arrival delays? Do higher departure delays lead to higher arrival delays? Most of us would assume so. Let us explore the relationship between these two variables using a scatterplot in Figure \@ref(fig:regplot1).
```{r regplot1, warning=FALSE, fig.cap="Departure and Arrival Flight Delays for a sample of 50 Alaskan flights from NYC"}
library(nycflights13)
data(flights)
set.seed(2017)
# Load Alaska data, deleting rows that have missing departure delay
# or arrival delay data
alaska_flights <- flights %>%
filter(carrier == "AS") %>%
filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
sample_n(50)
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point()
```
Note how we used the `dplyr` package's `sample_n()` function to sample 50 points at random. A similarly useful function is `sample_frac()`, which samples a specified fraction of the data.
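For instance, here is a small sketch of how `sample_frac()` could be used in place of `sample_n()`. It is for illustration only and is not part of our analysis, so we don't evaluate it:

```{r, eval=FALSE}
# Illustration only: draw a random 10% of Alaska Airlines flights
# (with complete delay information) instead of a fixed 50 rows
flights %>%
  filter(carrier == "AS") %>%
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
  sample_frac(size = 0.10)
```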
***
```{block lc9-1, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Does there appear to be a linear relationship with arrival delay and departure delay? In other words, could you fit a line to the data and have it explain well how `arr_delay` increases as `dep_delay` increases?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Is there only one possible line that fits the data "well"? How could you decide on which one is best if there are multiple options?
***
<!--Subsection on Correlation -->
## Correlation
One way to measure the association between two numerical variables is to compute their correlation. Specifically, the **correlation coefficient** measures the degree to which the points formed by two numerical variables fall on a line.
***
**Definition: Correlation Coefficient**
The *correlation coefficient* measures the strength of linear association between two variables.
**Properties of the Correlation Coefficient**: It is always between -1 and 1, where
- -1 indicates a perfect negative relationship
- 0 indicates no relationship
- +1 indicates a perfect positive relationship
***
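To see these properties in action, here is a small sketch using made-up numbers (not the flight delays) and the `cor()` function that we introduce below:

```{r}
x <- c(1, 2, 3, 4, 5)
cor(x, 3 * x + 2)           # perfect positive linear relationship: exactly 1
cor(x, -2 * x + 10)         # perfect negative linear relationship: exactly -1
cor(x, c(2, 5, 1, 4, 3))    # little linear pattern: close to 0
```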
We can look at a variety of different data sets and their corresponding correlation coefficients in the following plot.
```{r corr-coefs, echo=FALSE, fig.cap="Different Correlation Coefficients"}
library(mvtnorm)
correlation <- c(-0.9999, -0.9, -0.75, -0.3, 0, 0.3, 0.75, 0.9, 0.9999)
n_sim <- 100
values <- NULL
for(i in 1:length(correlation)){
rho <- correlation[i]
sigma <- matrix(c(5, rho * sqrt(50), rho * sqrt(50), 10), 2, 2)
sim <- rmvnorm(
n = n_sim,
mean = c(20,40),
sigma = sigma
) %>%
as_data_frame() %>%
mutate(correlation = round(rho,2))
values <- bind_rows(values, sim)
}
ggplot(data = values, mapping = aes(V1, V2)) +
geom_point() +
facet_wrap(~ correlation, ncol = 3) +
labs(x = "", y = "") +
theme(
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank()
)
```
***
```{block lc9-2, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Make a guess as to the value of the correlation cofficient between `arr_delay` and `dep_delay` in the `alaska_flights` data frame.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Do you think that the correlation coefficient between `arr_delay` and `dep_delay` is the same as the correlation coefficient between `dep_delay` and `arr_delay`? Explain.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What do you think the correlation between temperatures in Farenheit and temperatures in Celsius is?
***
We can calculate the correlation coefficient for our example of flight delays via the `cor()` function in R:
```{r, warning=FALSE, echo=TRUE}
alaska_flights %>%
summarize(correl = cor(dep_delay, arr_delay))
```
The sample correlation coefficient is denoted by $r$. In this case, $r = `r cor(alaska_flights$dep_delay, alaska_flights$arr_delay)`$.
***
```{block lc9-3, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Would you quantify the value of `correl` calculated above as being
- strongly positively linear,
- weakly positively linear,
- not linear,
- weakly negatively linear, or
- strongly negatively linear?
Discuss your choice and what it means about the relationship between `dep_delay` and `arr_delay`.
***
If you'd like a little more practice in determining the linear relationship between two variables by quantifying a correlation coefficient, you should check out the [Guess the Correlation](http://guessthecorrelation.com/) game online.
### Correlation does not imply causation
Just because arrival delays are related to departure delays in a somewhat linear fashion, we can't say with certainty that arrival delays are **caused entirely** by departure delays. Certainly it appears that as one increases, the other tends to increase as well, but that might not always be the case. We can only say that there is an **association** between them.
Causation is a tricky problem and frequently requires either carefully designed experiments or methods to control for the effects of potential confounding variables. Both of these approaches attempt either to remove all confounding variables or to take them into account as best they can, focusing only on the behavior of an outcome variable in the presence of the levels of the other variable(s).
Be careful as you read studies to make sure that the writers aren't falling into this fallacy of correlation implying causation. If you spot one, you may want to send them a link to [Spurious Correlations](http://www.tylervigen.com/spurious-correlations).
***
```{block lc9-4, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some other confounding
variables besides departure delay that we could attribute to an increase in arrival delays? Remember that a variable is something that has to **vary**!
***
<!--Subsection on SLR -->
## Simple linear regression
As suggested both visually and by their correlation coefficient of $r = `r cor(alaska_flights$dep_delay, alaska_flights$arr_delay)`$, there appears to be a strong positive linear association between these delay variables where
* The dependent/outcome variable $y$ is `arr_delay`
* The independent/explanatory variable $x$ is `dep_delay`
What would be the "best fitting line"? One example of a line that fits the data well can be computed by using **simple linear regression**. In Figure \@ref(fig:with-reg) we add the **least squares/best fitting/simple linear regression line** by adding a `geom_smooth()` layer to our plot, where `lm` is short for "linear model".
```{r with-reg, echo=TRUE, fig.cap="Regression line fit on delays"}
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red")
```
```{r, echo=FALSE}
# USED INTERNALLY: Least squares line values, used for in-text output
delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)
intercept <- tidy(delay_fit, conf.int=TRUE)$estimate[1] %>% round(3)
slope <- tidy(delay_fit, conf.int=TRUE)$estimate[2] %>% round(3)
CI_intercept <- c(tidy(delay_fit, conf.int=TRUE)$conf.low[1], tidy(delay_fit, conf.int=TRUE)$conf.high[1]) %>% round(3)
CI_slope <- c(tidy(delay_fit, conf.int=TRUE)$conf.low[2], tidy(delay_fit, conf.int=TRUE)$conf.high[2]) %>% round(3)
```
### Best Fitting Line
We now unpack one possible criterion for a line to be a "best fitting line" to a set of points. Let's choose an arbitrary point on the graph and color it blue:
```{r echo=FALSE}
best_fit_plot <- ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
annotate("point", x = 44, y = 7, color = "blue", size = 3)
best_fit_plot
```
Now consider this point's *deviation* from the regression line.
```{r echo=FALSE}
best_fit_plot <- best_fit_plot +
annotate("segment", x = 44, xend = 44, y = 7, yend = intercept + slope * 44,
color = "blue", arrow = arrow(length = unit(0.03, "npc")))
best_fit_plot
```
Do this for another point.
```{r echo=FALSE}
best_fit_plot <- best_fit_plot +
annotate("point", x = 15, y = 34, color = "blue", size = 3) +
annotate("segment", x = 15, xend = 15, y = 34, yend = intercept + slope * 15,
color = "blue", arrow = arrow(length = unit(0.03, "npc")))
best_fit_plot
```
And for another point.
```{r echo=FALSE}
best_fit_plot <- best_fit_plot +
annotate("point", x = 7, y = -20, color = "blue", size = 3) +
annotate("segment", x = 7, xend = 7, y = -20, yend = intercept + slope * 7,
color = "blue", arrow = arrow(length = unit(0.03, "npc")))
best_fit_plot
```
We repeat this process for each of the 50 points in our sample. The pattern that emerges here is that the least squares line minimizes the sum of the squared arrow lengths (hence "least squares") for all of the points. We square the arrow lengths so that positive and negative deviations of the same amount are treated equally. As you look at these points, you might think that a different line could fit the data better based on this criterion. That is not the case, as it can be proven via calculus and linear algebra that this line indeed minimizes the sum of the squared arrow lengths.
***
**Definitions:**
For $i$ ranging from 1 to $n$, the number of observations in your data set, we define the following:
* **Observed Value** $y_i$. Ex: The y-position of the black dots.
* **Fitted/Predicted Value** $\widehat{y}_i$. Ex: The y-position of the corresponding value on the red regression line. In other words, the blue arrow tips.
* **Residual** $\widehat{\epsilon}_i = y_i - \widehat{y}_i$: Ex: The length of the blue arrows.
***
Some observations on residuals:
* Note the order of the subtraction. You start at the non-pointy end of the arrow ($y_i$) and then subtract away what comes at the point ($\widehat{y}_i$).
* If the observed value is exactly equal to the fitted value, then the residual is 0.
* As suggested by the word residual (left over), residuals represent the lack of fit of the line to the data; in other words, they are the model's error.
* Of all possible lines, the least squares line minimizes the sum of all $n$ squared residuals.
They play an important part in regression analysis; we will revisit the topic in Section \@ref(resid).
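To make this criterion concrete, here is a rough sketch (using the `alaska_flights` sample from Section \@ref(data); the names `ls_fit`, `b0`, and `b1` are just throwaway names for this illustration) that compares the sum of squared residuals of the least squares line with that of a slightly different candidate line. The least squares line always comes out smaller:

```{r, eval=FALSE}
# Sketch: compare sums of squared residuals (illustration only)
ls_fit <- lm(arr_delay ~ dep_delay, data = alaska_flights)
b0 <- coef(ls_fit)[1]
b1 <- coef(ls_fit)[2]

# Sum of squared residuals for the least squares line
sum((alaska_flights$arr_delay - (b0 + b1 * alaska_flights$dep_delay))^2)
# Sum of squared residuals for a candidate line with a slightly larger slope
sum((alaska_flights$arr_delay - (b0 + (b1 + 0.5) * alaska_flights$dep_delay))^2)
```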
<!--Subsection on Equation of the line -->
## Equation of the line
Figure \@ref(fig:with-reg) displayed the fitted least squares line in red, which we now define as
$$\widehat{y} = b_0 + b_1 x$$
where $b_0$ and $b_1$ are the computed $y$-intercept and slope coefficients. We first use R's `lm()` function to fit a linear regression model (which we save in `delay_fit`) and then use the `tidy()` function in the `broom` package to display the $b_0$ and $b_1$ coefficients and further information about them in a **regression output table**. Almost any statistical software package you use for regression will output results that contain the following information.
```{r fit}
delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)
tidy(delay_fit, conf.int=TRUE) %>%
kable()
```
We see the regression output table has two rows, one corresponding to the $y$-intercept and the other the slope, with the first column "estimate" corresponding to their values. Thus, our equation is $$\widehat{y} = `r intercept` + `r slope` \, x.$$ It is usually preferred to actually write the names of the variables instead of $x$ and $y$ for context, so the line could also be written as $$\widehat{arr\_delay} = `r intercept` + `r slope` \, dep\_delay.$$
For the remainder of this Section, we answer the following questions:
* How do we interpret these coefficients? Section \@ref(interpretation)
* What are the additional columns in the regression output? Section \@ref(inference)
* How can I use the regression model to make predictions of arrival delay if I know the departure delay? Section \@ref(prediction)
<!--
These were the values of the first columns of the regression output table. We can also extract the coefficients by using the `coef` function:
```{r}
coef(delay_fit)
```
-->
### Coefficient Interpretation {#interpretation}
After you have determined your line of best fit, it is good practice to interpret the results to see if they make sense. The intercept $b_0 = `r intercept`$ can be interpreted as the average arrival delay associated with a departure delay of 0 minutes. In other words, flights that depart exactly on time have an average arrival delay of `r intercept` minutes, i.e., they tend to arrive early. One explanation for this is that Alaska Airlines is overestimating its flying times, and hence flights tend to arrive early. In this case, the intercept has a direct interpretation since there are observations with $x = 0$ values, in other words with a 0-minute departure delay. However, in other contexts, the intercept may have no direct interpretation.
The slope is the more interesting of the coefficients as it summarizes the relationship between $x$ and $y$. Slope is defined as rise over run, or the change in $y$ for every one unit increase in $x$. For our specific example, $b_1 = `r slope`$ means that for every one **minute** increase in the departure delay of Alaska Airlines flights from NYC, there is an associated average increase in arrival delay of `r slope` minutes. This estimate does make some practical sense. It would be strange if arrival delays went down as departure delays increased. We also expect that the longer a flight is delayed on departure, the longer it is likely to be delayed on arrival.
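A quick numeric check of this interpretation (a sketch, not needed for the rest of the analysis) is to compare the fitted values at departure delays one minute apart; the difference is exactly the slope:

```{r, eval=FALSE}
# The fitted arrival delay at dep_delay = 26 minus that at dep_delay = 25
# equals the slope coefficient b1
b0 <- coef(delay_fit)[1]
b1 <- coef(delay_fit)[2]
(b0 + b1 * 26) - (b0 + b1 * 25)
```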
**Important note**: The correlation coefficient and the slope of the regression line are not the same thing, as the correlation coefficient only measures the strength of linear association. They will always share the same sign (positive correlation coefficients correspond to positive slope coefficients and the same holds true for negative values), but otherwise they are not equal. For example, say we have 3 sets of points (red, green, blue) and their corresponding regression lines. Their regression lines all have different slopes, but the correlation coefficient is $r = 1$ for all 3. In other words, all three groups of points have a perfect (positive) linear relationship.
```{r, echo=FALSE, warning=FALSE, fig.height=4}
vals <- seq(-2, 2, length=20)
example <- data_frame(
x = rep(vals, 3),
y = c(0.01*vals, 1*vals, 3*vals),
slope = factor(rep(c(0.01, 1, 3), each = length(vals)))
)
ggplot(example, aes(x = x, y = y, col = slope)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE)
```
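Another way to see that the two are different quantities is a rescaling sketch: if we measured departure delay in hours instead of minutes, the correlation coefficient would not change, but the slope would. (This is just an illustration; the analysis above keeps delays in minutes.)

```{r, eval=FALSE}
# Correlation is unchanged by rescaling dep_delay from minutes to hours...
cor(alaska_flights$dep_delay, alaska_flights$arr_delay)
cor(alaska_flights$dep_delay / 60, alaska_flights$arr_delay)
# ...but the slope coefficient gets multiplied by 60
coef(lm(arr_delay ~ dep_delay, data = alaska_flights))[2]
coef(lm(arr_delay ~ I(dep_delay / 60), data = alaska_flights))[2]
```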
### Inference for regression {#inference}
Recall the regression output table from earlier:
```{r, echo=FALSE}
tidy(delay_fit, conf.int=TRUE) %>%
kable()
```
We saw in Chapter \@ref(ci) that when trying to estimate the true population mean $\mu$, the sample mean $\overline{x}$ will vary from sample to sample due to sampling variability. We characterized this variability using the sampling distribution of $\overline{x}$, whose standard deviation is called the standard error. In other words, the standard error quantifies how much $\overline{x}$ varies from sample to sample and can be thought of as a measure of how precise our estimates are.
All these notions apply in our regression example as well: recall from Section \@ref(data) that `alaska_flights` represents only a random sample of 50 Alaska Airlines flights and not all flights. Hence if we repeated the analysis with another random sample of 50 flights, the fitted line would likely change slightly due to sampling variability. In this case, the true population least squares line is defined by the formula $y = \beta_0 + \beta_1 x + \epsilon$ where
* $\beta_0$ is the true population intercept parameter
* $\beta_1$ is the true population slope parameter
* $\epsilon$ represents the error term
$\epsilon$ corresponds to the part of the response variable $y$ that remains unexplained after considering the predictor variable $x$. We will see in Section \@ref(resid) that ideally the errors should exhibit no systematic pattern: they should be normally distributed with mean 0 and constant variance.
The values $b_0 = `r intercept`$ and $b_1 = `r slope`$ are point estimates of $\beta_0$ and $\beta_1$, and thus the second column of the regression output table that has their values is called "estimate". The third column "std.error" represents the standard errors for each estimate.
**Hypothesis Testing**: The rows of the fourth and fifth columns correspond to the hypothesis tests $H_0: \beta_0 = 0 \mbox{ vs. } H_1: \beta_0 \neq 0$ and $H_0: \beta_1 = 0 \mbox{ vs. } H_1: \beta_1 \neq 0$. Of particular interest is the second hypothesis test because if $\beta_1 = 0$ then $y = \beta_0 + \epsilon$. Hence the value of $y$ does not depend on the value of $x$ at all; in other words, there is no relationship between them. Recall from Chapter \@ref(hypo) that any hypothesis test involves 1) an observed test statistic and 2) a p-value resulting from the comparison of the observed test statistic to a null distribution. The columns "statistic" and "p.value" correspond to these values. We leave the exact computation of the test statistic for a more advanced regression class and focus here only on interpreting the resulting p-value.
In our example, since the p-value corresponding to the hypothesis test $H_0: \beta_1 = 0 \mbox{ vs. } H_1: \beta_1 \neq 0$ is essentially 0, for any reasonable value of $\alpha$ we would reject $H_0$ in favor of $H_1$ and declare that there is a significant relationship between arrival delay and departure delay.
**Confidence Intervals**: Similarly, the rows of the "conf.low" and "conf.high" columns correspond to the left and right end points of 95% confidence intervals for $\beta_0$ and $\beta_1$. Similar to the 95% confidence interval for the population mean $\mu$, which takes the form $\overline{x} \pm 2 \cdot SE$, the 95% confidence intervals for $\beta_0$ and $\beta_1$ can be approximated as $\mbox{estimate} \pm 2 \cdot SE$. Once again, focusing on the confidence interval (`r CI_slope[1]`, `r CI_slope[2]`) for the slope coefficient $\beta_1$, we observe that it does not contain 0, suggesting again that departure delays and arrival delays have a significant relationship.
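As a rough sketch of where these columns come from, we can approximate the 95% confidence interval for the slope by hand from the "estimate" and "std.error" columns and compare it to "conf.low" and "conf.high" (which use the exact t-multiplier rather than 2); `slope_row` is just a throwaway name for this illustration:

```{r, eval=FALSE}
# Approximate 95% CI for the slope: estimate +/- 2 * standard error
slope_row <- tidy(delay_fit) %>%
  filter(term == "dep_delay")
slope_row$estimate + c(-2, 2) * slope_row$std.error
```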
However, for the conclusions of the hypothesis tests and confidence intervals to be valid, there are certain conditions that must be met, in particular relating to the behavior of the residuals. We will address these assumptions in the upcoming Section \@ref(conditions-for-regression).
<!--
We can also use the concept of shuffling to determine the standard error of our null distribution and conduct a hypothesis test for a population slope. Let's go back to our example on Alaskan flights that represent a sample of all Alaskan flights departing NYC in 2013. Let's test to see if we have evidence that a *positive* relationship exists between the departure delay and arrival delay for Alaskan flights. We will set up this hypothesis testing process as we have each before via the "There is Only One Test" diagram in Figure \@ref(fig:htdowney).
### Data
Our data is stored in `alaska_flights` and we are focused on the 50 measurements of `dep_delay` and `arr_delay` there.
### Test Statistic $\delta$
Our test statistic here is the sample slope coefficient that we denote with $b_1$.
### Observed effect $\delta^*$
```{r}
(b1_obs <- tidy(delay_fit)$estimate[2])
```
The calculated slope value from our observed sample is $b_1 = `r tidy(delay_fit)$estimate[2]`$.
### Model of $H_0$
We are looking to see if a positive relationship exists so $H_a: \beta_1 > 0$. Our null hypothesis is always in terms of equality so we have $H_0: \beta_1 = 0$.
### Simulated Data
Now to simulate the null hypothesis being true and recreating how our sample was created, we need to think about what it means for $\beta_1$ to be zero. If $\beta_1 = 0$, we said above that there is no relationship between the departure delay and arrival delay. If there is no relationship, then any one of the arrival delay values could have just as likely occurred with any of the other departure delay values instead of the one that it actually did fall with. We, therefore, have another example of shuffling in our simulating of data.
**Tactile simulation**
We could use a deck of 100 note cards to create a tactile simulation of this shuffling process. We would write the 50 different values of departure delays on each of the 50 cards, one per card. We would then do the same thing for the 50 arrival delays putting them on one per card.
Next, we would lay out each of the 50 departure delay cards and we would shuffle the arrival delay deck. Then, after shuffling the deck well, we would disperse the cards one per each one of the departure delay cards. We would then enter these new values in for arrival delay and compute a sample slope based on this shuffling. We could repeat this process many times, keeping track of our sample slope after each shuffle.
### Distribution of $\delta$ under $H_0$
We can build our randomization distribution in much the same way we did before using the `do` and `shuffle` functions. Here we will take advantage of the `coef` function we saw earlier to extract the slope and intercept coefficients. (Our focus will be on the slope here though.)
```{r, include=FALSE}
if(!file.exists("rds/rand_slope_distn.rds")){
rand_slope_distn <- mosaic::do(5000) *
(lm(formula = mosaic::shuffle(arr_delay) ~ dep_delay, data = alaska_flights) %>%
coef())
saveRDS(object = rand_slope_distn, "rds/rand_slope_distn.rds")
} else {
rand_slope_distn <- readRDS("rds/rand_slope_distn.rds")
}
```
```{r many_shuffles_reg, eval=FALSE}
library(mosaic)
rand_slope_distn <- mosaic::do(5000) *
(lm(formula = shuffle(arr_delay) ~ dep_delay, data = alaska_flights) %>%
coef())
names(rand_slope_distn)
```
We see that the names of our columns are `Intercept` and `dep_delay`. We want to look at `dep_delay` since that corresponds to the slope coefficients.
```{r, eval=FALSE}
ggplot(data = rand_slope_distn, mapping = aes(x = dep_delay)) +
geom_histogram(color = "white", bins = 20)
```
### The p-value
Recall that we want to see where our observed sample slope $\delta^* = `r tidy(delay_fit)$estimate[2]`$ falls on this distribution and then count all of the values to the right of it corresponding to $H_a: \beta_0 > 0$. To get a sense for where our values falls, we can shade all values at least as big as $\delta^*$.
```{r fig.cap="Shaded histogram to show p-value", eval=FALSE}
ggplot(data = rand_slope_distn, aes(x = dep_delay, fill = (dep_delay >= b1_obs))) +
geom_histogram(color = "white", bins = 20)
```
Since `r b1_obs` falls far to the right of this plot, we can say that we have a $p$-value of 0. We, thus, have evidence to reject the null hypothesis in support of there being a positive association between the departure delay and arrival delay of all Alaskan flights from NYC in 2013.
***
```{block lc9-5, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Repeat the inference above but this time for the correlation coefficient instead of the slope.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Use bootstrapping (of points) to determine a range of possible values for the population slope comparing departure delays to arrival delays for Alaskan flights in 2013 from NYC.
***
-->
### Predicting values {#prediction}
Let's say that we are waiting for our flight to leave New York City on Alaska Airlines and we are told that our flight is going to be delayed 25 minutes. What could we predict for our arrival delay based on the least squares line in Figure \@ref(fig:with-reg)? In Figure \@ref(fig:with-reg-predict), we denote a departure delay of $x = 25$ minutes with a dashed black line. The predicted arrival delay $\widehat{y}$ according to this regression model is $\widehat{y} = `r intercept` + `r slope` \times 25 = `r (intercept + slope * 25) %>% round(2)`$, indicated with the blue dot. This value makes some sense since flights that aren't greatly delayed at departure tend to make up time in the air to compensate.
```{r with-reg-predict, echo=FALSE, fig.cap="Predicting Arrival Delay when Departure Delay is 25 Minutes"}
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
geom_vline(xintercept = 25, linetype="dashed") +
annotate("point", x = 25, y = intercept + slope*25, color = "blue", size = 3)
```
Instead of manually calculating the fitted value $\widehat{y}$ for a given $x$ value, we can use the `augment()` function in the `broom` package to automate this. For example, we automate this procedure for departure delays of 25, 30, and 15 minutes. The three fitted values $\widehat{y}$ are in the `.fitted` column.
```{r}
new_alaskan_flight <- data_frame(dep_delay = c(25, 30, 15))
delay_fit %>%
augment(newdata = new_alaskan_flight) %>%
kable()
```
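If you prefer base R, the `predict()` function returns the same fitted values as a plain vector rather than the tidy data frame that `augment()` gives:

```{r, eval=FALSE}
# Base R equivalent of the augment() call above
predict(delay_fit, newdata = new_alaskan_flight)
```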
<!--Subsection on Conditions for Regression -->
## Conditions for regression {#conditions-for-regression}
In order for all inferences from regression to be valid (in particular the hypothesis tests and confidence intervals from Section \@ref(inference)), certain conditions must roughly hold.
1. Nearly normal residuals with mean 0 and constant variance. (Check quantile-quantile plot of standardized residuals.)
1. Equal variances across explanatory variable. (Check residual plot for non-uniform patterns.)
1. Independent observations. (Check residual plot for no time series-like patterns.)
As you can see, the *residuals* will play a large role in determining whether the conditions are met. In particular, the first two conditions can be roughly interpreted as requiring that there be no systematic pattern to the residuals. The residuals $\widehat{\epsilon}_i$ are estimates of the error term $\epsilon$ we discussed for the true population regression line, and this is a big reason why they play an important role in validating regression assumptions.
### Residual analysis {#resid}
The following diagram will help you to keep track of what is meant by a residual. Consider the observation marked by the blue dot:
```{r echo=FALSE}
ggplot(data = alaska_flights,
mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
annotate("point", x = 44, y = 7, color = "blue", size = 3) +
annotate("segment", x = 44, xend = 44, y = 7, yend = -14.155 + 1.218 * 44,
color = "blue", arrow = arrow(length = unit(0.03, "npc")))
```
Recall that $y_i$ is the observed value of the `arr_delay` variable (the y-position of the blue dot), $\widehat{y}_i$ is the fitted value of `arr_delay` (the value being pointed to on the red line), and the residual is $\widehat{\epsilon}_i = y_i - \widehat{y}_i$. We can quickly extract the values of all 50 residuals by using the `augment()` function in the `broom` package. Specifically, we are interested in the `.fitted` and `.resid` variables. Let's look at the residuals corresponding to the first 6 rows of data.
```{r}
regression_points <- augment(delay_fit) %>%
select(arr_delay, dep_delay, .fitted, .resid)
regression_points %>%
head() %>%
kable()
```
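As a quick sanity check (a sketch, not needed for the analysis), we can verify that the `.resid` column really is the observed value minus the fitted value; `resid_by_hand` is just an illustrative name:

```{r, eval=FALSE}
# .resid should equal arr_delay - .fitted for every row
regression_points %>%
  mutate(resid_by_hand = arr_delay - .fitted) %>%
  head() %>%
  kable()
```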
In Figure \@ref(fig:resid-histogram), we plot a histogram of the 50 residuals. Ideally, the residuals should be centered around 0.
```{r resid-histogram, fig.cap="Histogram of residuals"}
ggplot(data = regression_points, mapping = aes(x = .resid)) +
geom_histogram(binwidth=10) +
geom_vline(xintercept = 0, color = "blue")
```
Next, in Figure \@ref(fig:resid-plot), we plot the residuals against the fitted values.
```{r resid-plot, fig.cap="Plot of residuals versus fitted values"}
ggplot(data = regression_points, mapping = aes(x = .fitted, y = .resid)) +
geom_point() +
geom_abline(intercept = 0, slope = 0, color = "blue")
```
**Quantile-quantile plot**
```{r qqplot1}
ggplot(data = regression_points, mapping = aes(sample = .resid)) +
stat_qq()
```
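Newer versions of ggplot2 (3.0.0 and later) also provide `stat_qq_line()`, which adds a reference line that makes deviations from a straight line easier to spot; this is optional and not used in the figure above:

```{r, eval=FALSE}
# Optional: Q-Q plot with a reference line (requires ggplot2 >= 3.0.0)
ggplot(data = regression_points, mapping = aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line(color = "blue")
```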
**Checking conditions**:
1. We are looking to see if the points are scattered about the blue line at 0 relatively evenly as we look from left to right. We have some reason for concern here, as the large lump of values on the left is much more dispersed than those on the right.
2. The second condition is invalidated if there is a trigonometric pattern of up and down throughout the residual plot. That is not the case here.
3. We look at the *quantile-quantile plot* (Q-Q plot for short) for the third condition. We are looking to see if the residuals fall on a straight line, as we would expect if they were normally distributed. We see some curvature here as well. We should begin to wonder if regression was valid here, with both condition 1 and condition 3 in question.
We have reason to doubt whether a linear regression is valid here. Unfortunately, all too frequently regressions are run without checking these assumptions carefully. While small deviations from the assumptions can be OK, larger violations can completely invalidate the results and make any inferences unreliable and questionable.
<!--Subsection on Conclusion -->
## Conclusion
### What's to come?
In the last chapter of the textbook, we'll summarize the purpose of this book as well as present an excellent example of what goes into making an effective story via data.
### Script of R code
```{r include=FALSE, eval=FALSE, purl=FALSE}
knitr::purl("06-regress.Rmd", "docs/scripts/06-regression.R")
```
An R script file of all R code used in this chapter is available [here](http://ismayc.github.io/moderndiver-book/scripts/06-regression.R).