---
title: "Introduction to Regression with R"
output: learnr::tutorial
runtime: shiny_prerendered
---
```{r setup, include=FALSE}
library(learnr)
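# simulate the demonstration dataset (demoDat) of participant characteristics used throughout this tutorial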
set.seed(123)
par(mar = c(4,4,1,1))
nSamples <- 250
height<-rnorm(nSamples, 180, 20)
sex<-as.factor(sample(c("male", "female"), size = nSamples, replace = TRUE, prob = c(0.45,0.55)))
height<-height + rnorm(nSamples, 10,5)*(as.numeric(sex)-1)
age<-floor(runif(nSamples, 40,60))
weight<- height * 0.7 - 44 + rnorm(nSamples,0,12)
weight<-weight + rnorm(nSamples, 3, 2)*(as.numeric(sex)-1) + rnorm(nSamples, 0.005, 0.001)*(as.numeric(sex)-1) * height
weight <- weight + age * rnorm(nSamples, 0.04, 0.03)
bmi <- weight/(height/100)^2
smoker<-sample(c(0,1), size = nSamples, replace = TRUE, prob = c(0.8,0.2))
t2diabetes <- sample(c(0,1), size = nSamples, replace = TRUE, prob = c(0.8,0.2))
t2diabetes[sample(which(bmi > 25),10)]<-1
t2diabetes[sample(which(smoker == 1),5)]<-1
exercise_hours <- rpois(nSamples, 1) + rpois(nSamples, 2)*(1-t2diabetes) + rpois(nSamples, 1) * (as.numeric(sex)-1)
alcohol_units <- rpois(nSamples, 3) + rpois(nSamples, 5)*(1-t2diabetes) + rpois(nSamples, 3) * (as.numeric(sex)-1) + rpois(nSamples, 1)*rpois(nSamples, 6)*(1-t2diabetes)
exercise_hours[which(weight < 60)]<-rpois(sum(weight < 60), 12)
alcohol_units[which(bmi > 37)]<-alcohol_units[which(bmi > 37)] + rpois(sum(bmi > 37),5)
alcohol_units[which(weight > 140)]<-rpois(sum(weight > 140),50)
demoDat <-data.frame(age, height, weight, bmi,exercise_hours, alcohol_units)
```
## Overview of Workshop
Welcome to **Introduction to Regression with R**. Our aim is to provide you with a comprehensive introduction to the statistical tool of regression and how to perform these analyses in R.
By the end of this session you will be able to:
* describe what regression is
* fit a range of regression models with R including:
* simple linear regression
* multiple linear regression
* describe the concept behind hypothesis testing in regression analysis
* describe and evaluate the assumptions behind hypothesis testing in regression analysis
* interpret the coefficients of a regression model
* make predictions from a regression model
* describe the link between regression and other common statistical tools
While it is delivered as a stand-alone session, it is designed as part of a series of Regression with R workshops in which the content develops these ideas further to give you a comprehensive understanding of how regression can be used to address a broad range of questions.
The complete series includes:
1. Introduction to Regression with R
2. Regression Analysis in R: Adapting to Varied Data Types
3. Mixed Effects Regression with R
### Pre-requisites
This course does not include an introduction to R, or how to set up and use R or RStudio. It is assumed you are comfortable coding in R and are familiar with:
* how to write and execute commands in the R console
* what type of variables are available in R and how to work with these
### Course Notes
This tutorial contains the course notes, example code snippets with explanations, exercises for you to try (with solutions) and quiz questions to test your knowledge. Attending a workshop on this topic means there are people on hand to help if you have any questions or issues with the materials. However, these materials have been designed such that you should also be able to work through them independently.
You can navigate through the sections using the menu on the side. Please note that the data required for the examples and exercises is preloaded within this interactive document, so the commands/exercises only work within it. They **won't work** within the RStudio console window. When you come to run them on your own datasets, you will need to ensure your data is loaded and edit the syntax to model the relevant variables.
## Introduction to Regression
### What is Regression?
Regression analysis is a broad category of analyses where the objective is to statistically quantify relationships between variables.
It enables you to:
* understand which variables affect other variables and how
* make predictions from a new set of data
It involves fitting a prespecified model to data, where the model is a mathematical description of the relationship between the variables. Observed data are then used to estimate the values of the model's coefficients.
A regression model requires:
* dependent variable(s) - the outcome or variable you are trying to predict
* independent variable(s) - the predictors, features, covariates or explanatory variable(s)
You may also know regression as fitting a line to data, as in the example below. We can think of a line as a graphical representation of a model. Note that by "line" we are not limited to just a straight line.
```{r straight line example, echo = FALSE, fig.height= 4, fig.width=4}
set.seed(123)
par(mar = c(4,4,1,1))
nSamples <- 25
height<-rnorm(nSamples, 180, 20)
weight<- height * 0.62 - 44 + rnorm(nSamples,0,5)
plot(height, weight, pch = 16, xlab = "Height (cm)", ylab = "Weight (kg)")
abline(a = -44, b = 0.62)
```
### What is a line?
To understand a bit more about regression parameters we are going to recap what a line is, specifically a straight line, in a mathematical sense. If we have two continuous variables such as height and weight, measured on the same set of individuals, we can visualise the relationship with a scatterplot like the one above. From this plot we can try to generalise the relationship by drawing a straight line through the points. From this line we can make predictions of what weight a person would be expected to have, if we know their height.
What enables us to do this is the fact that we can represent this line, and therefore this relationship, as an equation. The standard equation for a straight line relating two variables Y and X is:
$$Y = \beta_{0} + \beta_{1} X$$
You may have learnt this previously as
$$Y = c + m X$$
These two equations are equivalent; we have just used different notation for the coefficients. The coefficients are the values that we multiply our predictor variables by. They are unknown, but can be estimated from our observed data. They are sometimes called parameters.
There are two regression coefficients in this equation:
1. Intercept ($\beta_{0}$ ) - This is the value of the outcome variable (Y) when the predictor (X) is set to 0.
2. Slope coefficient ($\beta_{1}$) - This is the change in the outcome variable (Y) for each unit of the predictor variable (X).
When we know the values of these coefficients we can then input different values of X to make predictions for Y. We can also use these values to describe how Y changes as a function of X changing.
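As a minimal sketch of this idea, with made-up coefficient values (an intercept of 1 and a slope of 2, purely for illustration), we can compute the predicted value of Y for a few values of X:
```{r, echo = TRUE}
# illustrative (made-up) coefficients
beta0 <- 1   # intercept
beta1 <- 2   # slope
x <- c(-2, 0, 1, 3)
# predicted Y for each value of X using the equation of the line
beta0 + beta1 * x
```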
Importantly, changing the values of these coefficients changes the position of the line in the graph and ultimately the relationship between X and Y. Below we show a number of different lines; can you characterise what is happening?
```{r straight line multiple, echo = FALSE, fig.height= 6, fig.width=8, fig.pos="center"}
par(mar = c(4,4,1,1))
par(mfrow = c(2,2))
plot(0, 1, type = "n", xlab = "X", ylab = "Y", main = "Y = 1 + 2X", xlim = c(-5,5), ylim = c(-10,10))
abline(a = 1, b = 2, lwd = 1.5)
abline(v = 0)
abline(h = 0)
plot(0, 1, type = "n", xlab = "X", ylab = "Y", main = "Y = 1 + -2X", xlim = c(-5,5), ylim = c(-10,10))
abline(a = 1, b = -2, lwd = 1.5)
abline(v = 0)
abline(h = 0)
plot(0, 1, type = "n", xlab = "X", ylab = "Y", main = "Y = 5 + 2X", xlim = c(-5,5), ylim = c(-10,10))
abline(a = 5, b = 2, lwd = 1.5)
abline(v = 0)
abline(h = 0)
plot(0, 1, type = "n", xlab = "X", ylab = "Y", main = "Y = 5 + 4X", xlim = c(-5,5), ylim = c(-10,10))
abline(a = 5, b = 4, lwd = 1.5)
abline(v = 0)
abline(h = 0)
```
What you should observe is that changing $\beta_{1}$, i.e. the coefficient for the X variable, changes the slope of the line:
* the direction of the line changes depending on whether the coefficient is negative or positive
* the steepness of the slope is determined by the magnitude of this coefficient.
This is the coefficient that captures the relationship between our variables X and Y and enables us to model an unlimited number of linear relationships between these variables.
You might also have observed that if we change the value of $\beta_0$, i.e. the intercept, the line moves up and down. The intercept is important if we are interested in making predictions, but not so important if we want to understand how changing X influences Y.
### Fitting a Simple Linear Regression Model in R
We want to fit our line to a specific set of data, where we have collected paired values of X and Y, to enable us to estimate the values of $\beta_{0}$ and $\beta_{1}$. It is beyond the scope of this workshop to examine the precise mathematical details of how this is done, but it is important to understand that the principle behind the methodology is to draw the line that "best" fits the data by having the lowest total error. Here the error is defined as the difference between the observed value of Y and the value of Y predicted from X by our estimated model, and the total error is calculated by, in effect, summing these errors across all observations.
In R, linear regression models can be fitted with the base function `lm()`. Let's look at our height and weight example. These data are available within this tutorial in the R object `demoDat`. In our R code we will use the formula style to specify the model we want to fit. You may recognise this type of syntax from other statistical functions such as `t.test()` or even plotting functions such as `boxplot()`. The equation of the line we wish to fit needs to be provided as an argument to the `lm()` function:
```{r, eval = FALSE, echo = TRUE}
lm(weight ~ height, data = demoDat)
```
The dependent variable (taking the place of Y in the standard equation for a line above) goes on the left hand side of the `~` symbol. The predictor variable goes on the right hand side (taking the place of X in the standard equation for a line above). The R code we have written specifies the model:
$$weight = \beta_0 + \beta_1 height$$
Note that we did not need to explicitly specify either the
* intercept ($\beta_0$)
or
* regression coefficient ($\beta_1$)
R knows to add these in automatically.
Let's run this bit of code:
```{r, echo = TRUE}
lm(weight ~ height, data = demoDat)
```
If we execute just the `lm()` function it only prints a subset of the possible output:
* the formula we called
* the estimates of the coefficients derived from our observed data.
From these coefficients we can specify the line that has been calculated to represent the relationship between these variables.
```{r, echo = FALSE}
model <- lm(weight ~ height, data = demoDat)
```
We can see that the estimated value for the intercept is `r signif(summary(model)$coefficients[1,1], 4)` and the estimated value for the height slope coefficient is `r signif(summary(model)$coefficients[2,1], 4)`. As the height regression coefficient is positive, we can conclude that weight increases as the participants get taller. More than that, we can quantify by how much: the value of the regression coefficient for height tells us how much weight changes for a one unit increase in height. To interpret this we need to know the units of our variables. In this example height is measured in cm and weight in kg, so the value of our regression coefficient means that for each extra centimetre of height, an individual's weight increases by a mean of `r signif(summary(model)$coefficients[2,1],2)` kg.
```{r,echo=FALSE}
equation = paste0("$weight =", signif(summary(model)$coefficients[1,1],3), " + ", signif(summary(model)$coefficients[2,1],3), " * Height$")
```
We can write our estimated regression model as:
`r equation`.
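With these estimates we can make predictions. As a short sketch (the height of 170 cm is just an illustrative value), we can either plug a height into the equation manually using `coef()`, or use the `predict()` function:
```{r, echo = TRUE}
fit <- lm(weight ~ height, data = demoDat)
# manual prediction: intercept + slope * height
coef(fit)[1] + coef(fit)[2] * 170
# equivalent prediction using predict()
predict(fit, newdata = data.frame(height = 170))
```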
### Simple Linear Regression Exercise
*Let's practice fitting and interpreting the output of a simple linear regression model.*
Write the R code required to characterise how `bmi` changes as the participants `age`. Both of these variables are available in the `demoDat` object you already have loaded in this tutorial. You can enter your code into the box below and press the blue `Run code` button on the top right hand side of the box to execute the code. There is also a `Solution` button to check your solution or if you are stuck.
```{r ols, exercise=TRUE}
```
```{r ols-solution}
lm(bmi ~ age, data = demoDat)
```
```{r quiz1, echo=FALSE}
quiz(caption = "Questions on Exercise Above",
question("What is the value of the intercept?",
answer("0.04", message = "This is the slope coefficient."),
answer("1", message = "Look at the coefficients in the output."),
answer("24.2", correct = TRUE),
answer("47.4", message = "Did you put age and bmi in the formula the correct way round?"),
allow_retry = TRUE
),
question("The slope coefficient is 0.04, which is the correct interpretation?",
answer("bmi increases by 0.04 per year of age", correct = TRUE),
answer("age increases by 0.04 year per 1 unit of bmi"),
allow_retry = TRUE
),
question("What is the predicted BMI for a participant aged 45?",
answer("1.9", message = "Did you forget to add the intercept?"),
answer("24.2", message = "Did you forget to add the contribution from the slope parameter?"),
answer("25.2", message = "Did you forget to multiple the slope parameter by the age?"),
answer("26.1", correct = TRUE),
allow_retry = TRUE)
)
```
## Hypothesis Testing for Regression Models
If you have run regression models in other software, or have seen the results of regression analysis presented in scientific reports, you might be wondering where the p-values are. Don't worry, they are there. Before we look at how to extract this information, we will first go over how hypothesis testing works in the context of regression.
### The Theory
Recall how earlier we saw that the relationship between X and Y could be written down as a mathematical equation:
$$Y = \beta_{0} + \beta_{1} X$$
The important parameter for understanding the relationship between X and Y is $\beta_1$. Specifically, the magnitude of $\beta_1$ tells us about the strength of the relationship between X and Y. When we looked at the different lines you could get by changing the values of $\beta_0$ and $\beta_1$, there were two very specific scenarios we didn't consider: when either $\beta_0 = 0$ or $\beta_1 = 0$. Let's see what happens to our equation in these situations:
If $\beta_0 = 0$ then our equation becomes
$$Y = 0 + \beta_{1} X = \beta_{1}X$$
Recall that the intercept is the position at which the line crosses the y-axis. If $\beta_0 = 0$ then our line will cross the y-axis at the origin (i.e. the point (0,0)).
If $\beta_1 = 0$ then our equation becomes
$$Y = \beta_0 + 0 X = \beta_{0}$$
When you multiply anything by zero the answer is always 0. If the coefficient for our X variable is 0, then regardless of the value of X this term will compute to 0. We can therefore simplify our prediction equation and remove the X term. Our predictive equation for Y is no longer dependent on, or influenced by, the predictor variable X.
This is essentially saying that there is no relationship between X and Y. This is the concept that underlies hypothesis testing in regression analysis. If there is a relationship between X and Y, the regression parameter for X needs to be non-zero. If it is non-zero, then the value of X will have some effect on the prediction of Y, albeit potentially quite small, but an effect nonetheless.
To statistically test for an effect we require an explicitly stated null hypothesis and alternative hypothesis. We then use our observed data to determine whether there is enough evidence to reject the null hypothesis in favour of the alternative. For each individual regression coefficient these hypotheses are:
$$H_{null}: \beta = 0$$
$$H_{alternative}: \beta \neq 0$$
To test these hypotheses with the observed data we implement the following steps:
1. accumulate the evidence from the data
2. use a statistical theory or the data to establish how normal or extreme this result is
3. apply some criteria to draw a conclusion in the context of the proposed hypotheses
We have already started to accumulate the evidence by estimating the value of the regression coefficient. This estimate is then converted into a t-statistic (by dividing it by its estimated standard error), which enables us to use a known statistical distribution to quantify how likely it is to occur. The relevant distribution here is the Student's t-distribution. With this knowledge we can then calculate a p-value, which we use to decide which hypothesis our data supports. You might recognise the Student's t-distribution if you are familiar with t-tests. Hold this thought; the link between t-tests and regression is explored in the second session in this series.
### Statistical Assumptions
To enable us to complete this process, we have made a series of assumptions about the data. If these do not hold true, this process potentially falls apart. So understanding the context with which hypothesis testing has been formulated is important for performing accurate and meaningful statistical inference.
For hypothesis testing of a regression coefficient for the predictor variable X on an outcome variable Y the assumptions are:
* The dependent variable Y has a linear relationship with the independent variable X.
* The errors (residuals) are normally distributed.
* The observed samples are independent.
* The variances across the range of predictor variables are equal.
Let's revisit our R code to see how we can extract the results of the hypothesis testing for our regression model.
### Hypothesis Testing of Regression Parameters in R
We don't actually need to get R to do any further calculations to get the p-values for our statistical hypothesis tests; these are calculated when you execute `lm()`. By default, though, this output is not printed to the console, so we do need to use some additional commands to extract it. This is largely achieved with the `summary()` function. We could chain these commands together, i.e. `summary(lm(...))`, or we could save the output of `lm()` to a variable (for example `model`) and run it as two lines of code as shown below:
```{r}
model<-lm(weight ~ height, data = demoDat)
summary(model)
```
We can see that the `summary()` function expands on the information printed to the console compared to just running the `lm()` function.
From top to bottom, we have:
* the model formula
* a statistical summary of the residuals (i.e. errors)
* a table of regression coefficients
* model fit statistics
We are probably most interested in the coefficients table, which contains a summary of each estimated regression coefficient. The results for each coefficient are provided on a separate row, with the intercept in the top row followed by one row for each explanatory variable; there are four columns. It is worth noting at this point that the intercept is also a regression coefficient which we can test. However, we almost never pay much attention to this result as
1. it is almost always significantly non-zero
2. it doesn't tell us anything about relationships between the variables, which is typically our main motivation for hypothesis testing in regression analysis
The columns in the coefficients table are as follows:
| Column | Content |
|-------|------|
| Estimate | estimated regression coefficient |
| Std. Error | standard error of the estimated regression coefficient |
| t value | test statistic (the ratio of the previous two columns) |
| Pr(>\|t\|) | p-value comparing the t value to the Student's t-distribution |
We can see from the coefficients table above that the slope parameter for height has a p-value < 0.05. Therefore we can conclude that it is significantly non-zero and reject the null hypothesis in favour of the alternative hypothesis. There is a statistically significant association between height and weight in this dataset.
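As a brief sketch of how the numbers in the coefficients table relate to each other (you never need to do this by hand, `summary()` does it for you), the t value is simply the estimate divided by its standard error, and the p-value comes from comparing that t value to a Student's t-distribution with the model's residual degrees of freedom:
```{r, echo = TRUE}
coefTab <- summary(model)$coefficients
# t value for height = estimate / standard error
coefTab["height", "Estimate"] / coefTab["height", "Std. Error"]
# two-sided p-value from the Student's t-distribution
tval <- coefTab["height", "t value"]
2 * pt(abs(tval), df = df.residual(model), lower.tail = FALSE)
```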
We can visualise this estimated model by generating a scatterplot of the observed data and adding the fitted regression line to the plot. This can be done simply by passing the output of the `lm()` function to the `abline()` function.
```{r, fig.height= 5, fig.width=5, fig.pos="center"}
par(mar = c(4,4,1,1)) # adjust the margins around the plot
plot(demoDat$height, demoDat$weight, pch = 16, xlab = "Height (cm)", ylab = "Weight (kg)")
abline(model, col = "red")
```
We can see the positively correlated relationship between height and weight. The fitted regression model appears to represent the data well but there is some noise around the fitted line. In other words the model is not perfect. Potentially there are other factors or variables that could improve the prediction of weight further.
### Hypothesis Testing of Regression Coefficients Exercise
*Let's see if age has a significant effect on weight or bmi.*
Write the R code required to test the following using a regression model:
1. Is weight significantly associated with age?
2. Is bmi significantly associated with age?
```{r ols-signif, exercise=TRUE}
```
```{r ols-signif-solution}
model1<-lm(weight ~ age, data = demoDat)
summary(model1)
model2<-lm(bmi ~ age, data = demoDat)
summary(model2)
```
```{r quiz2, echo=FALSE}
quiz(caption = "Questions on exercise above",
question("Which of these statements are true for the relationship between weight and age? Apply a significance threshold of 0.05. Select all that apply.",
answer("The intercept is significantly non-zero.", correct = TRUE),
answer("Age is significantly associated with weight."), allow_retry = TRUE
),
question("Which of these statements are true for the relationship between BMI and age? Apply a significance threshold of 0.05. Select all that apply.",
answer("The intercept is significantly non-zero.", correct = TRUE),
answer("Age is significantly associated with BMI."), allow_retry = TRUE
)
)
```
### Checking the Assumptions of Linear Regression
As we mentioned before, by making assumptions about the characteristics of our data we can do significance testing. It is prudent, therefore, to check whether these assumptions hold true so that we can be confident in the conclusions we have drawn. However, this is far from common practice, for a number of reasons.
Firstly, the assumptions are to some degree subjective and assessing them is not an exact science. It can be hard to know what to look for if you don't have much experience in fitting regression models. The underlying mathematics of regression is robust to some deviation from these assumptions, so there is some tolerance we can allow. Common tools for assessing properties like normality might not place the threshold for deviation where the statistical model needs it; they are often more conservative, and you might therefore prematurely reject a valid result.
Secondly (and perhaps most pertinently), it is not possible to check many of the assumptions without fitting the model first. You therefore run the risk that, if you find a significant effect, you get excited and forget to check the assumptions in favour of investigating a more exciting follow-up question. However, as the assumptions are essential for deriving the formulae used to estimate and test the regression parameters, if they are violated we need to be cautious about accepting the results, especially if they are significant.
Functionality to do this is provided within R. We can easily generate four diagnostic plots, with which we can visually inspect some of the assumptions of linear regression, by applying the `plot()` function to the output of an `lm()` call. Let's see these plots for the height and weight example from above.
```{r, fig.width = 8, fig.height = 8, fig.align = "center"}
model<- lm(weight ~ height, data = demoDat)
par(mfrow = c(2,2))
plot(model)
```
Decisions on the validity of the results are then made by assessing these plots. Below we outline what is good and bad.
##### Plot 1: Residuals vs Fitted
This plot shows, for each observation, the error (residual) in the fitted model against the prediction. When looking at this plot we need to ask: do the residuals show a non-linear pattern? Ideally they should be randomly scattered across the range of fitted values. Otherwise it suggests that the accuracy of the model is related to characteristics of the samples you are trying to predict (e.g. the model is more accurate for shorter individuals).
**Good:**
* no evidence of a relationship between these variables (i.e. an equal number of points above and below the line, or a random scatter)
**Bad:**
* a pattern in the points – may be indicative of a non-linear relationship between your outcome and independent variables.
##### Plot 2: Normal Q-Q
Q-Q plot is an abbreviation for quantile-quantile plot. It is used to compare the distribution of an observed variable against an expected distribution, in this case the normal distribution. Here we are asking: are the residuals normally distributed? If the observed distribution is in line with the expected distribution, the points will follow the x=y diagonal line.
**Good:**
* the points follow the dotted diagonal line. Some deviation at the extremes (start and end) is to be expected and tolerated.
**Bad:**
* the points deviate from the dotted diagonal line. This is particularly concerning if the deviation is non-symmetric or affects points in the middle of the distribution.
##### Plot 3: Scale-Location
Are the residuals equally spread across the range of predictions?
**Good:**
* horizontal line with no evidence of a relationship between these variables
**Bad:**
* a non-horizontal line, or more/fewer points in one corner of the plot, may indicate heteroscedasticity (i.e. where the variance of the residuals changes with the value of the predictor)
##### Plot 4: Residuals vs Leverage
We want our line of best fit to capture the relationship that best represents the full sample and is not driven by one or two extreme samples. In this plot we look at the leverage, a statistical measure of the influence of each sample on the overall results, and we are asking whether any individual samples are overly influencing the fit. This is more problematic if a sample with high leverage is at the extreme end of the distribution (i.e. has a particularly high or low residual value).
**Good:**
* all points inside the red dashed lines
**Bad:**
* any values in the upper-right or lower-right corner, or cases outside a dashed red line, indicate samples that don't fit the trend in the data and are biasing the result.
As these require a subjective assessment of the plots, it can be difficult to decide whether a plot looks good or bad. This is particularly challenging where the data has only a small number of observations. We should also bear in mind that linear regression is fairly robust to the violation of many of the assumptions. Therefore, even if the plots don't look ideal, that does not automatically mean that the result is invalid. This is where statistics moves away from fixed quantities and into a grey area of subjectivity. Experience is the only real aid here: the more regression models you fit and the more plots you look at, the better you will be able to gauge what is a major or minor deviation.
For contrast, let's fit another linear model where the assumptions might not hold true. Let's look at the relationship between weight and hours exercised per week (`exercise_hours`). When we asked the participants how much exercise they did each week they were effectively counting the number of hours, so this variable is a count variable. Count variables are typically Poisson distributed and are limited to whole, non-negative numbers, so this is not strictly a continuous variable and therefore potentially violates one of our assumptions.
```{r, echo = FALSE}
hist(demoDat$exercise_hours, xlab = "Hours exercised each week", ylab = "Number of participants", main = "")
```
In the histogram above we can see a non-symmetrical distribution with a hard boundary of 0 on the left and a long tail on the right hand side.
Let's take a look at what happens if we use it as a predictor variable against weight.
```{r, fig.width = 8, fig.height = 8, fig.align = "center"}
summary(lm(weight ~ exercise_hours, data = demoDat))
par(mfrow = c(2,2))
plot(lm(weight ~ exercise_hours, data = demoDat))
```
**Plot 1: Residuals vs Fitted** We can see vertical lines of dots, which reflects the fact that the observed X variable only contains whole numbers. On the right hand side the points appear reasonably random in the space, but there is perhaps some bias on the left hand side.
**Plot 2: Normal QQ** Points largely on the diagonal line in the centre of the distribution but some deviation at both extremes.
**Plot 3: Scale-Location** We see the vertical lines again and more points on the right hand side, with possibly some unusual empty spaces where points are absent on the left hand side.
**Plot 4: Residuals vs Leverage** There are some samples with extreme values (on the far right) that are not quite in line with the rest of the points, but they are not outside the accepted region.
In conclusion, we see less desirable behaviour in our observed data when we try to force a discrete variable into a methodology designed for continuous variables.
### Inspecting the Assumptions Exercise
*Let's have a go at creating and interpreting some diagnostic plots.*
Write the R code required to determine if a regression model is an appropriate tool for assessing the relationship between:
1. age and weight
2. bmi and average number of alcohol units drunk per week (`alcohol_units`)
Use the provided plots to assess the assumptions of linear regression.
```{r ols-plots, exercise=TRUE}
```
```{r ols-plots-solution}
summary(lm(weight ~ age, data = demoDat))
plot(lm(weight ~ age, data = demoDat))
summary(lm(bmi ~ alcohol_units, data = demoDat))
plot(lm(bmi ~ alcohol_units, data = demoDat))
```
## Multiple Linear Regression
### Expanding the regression framework
So far we have considered a single type of regression analysis, with one continuous predictor variable and one continuous outcome variable, fitting a straight line between them. This has enabled us to get to grips with the core concepts, but regression can do so much more than this. It is an incredibly flexible framework that can handle
* different types of variables
* multiple variables
* different relationships between variables
We will now look at how we extend the methodology to allow more complex analytical designs. You should think of regression as a modular approach where you select the necessary components depending on the
* properties of the outcome variable(s)
* properties of the predictor variable(s)
* the relationship that you want to model between each predictor variable and outcome variable.
Different types of regression analysis you may be familiar with include:
**Simple Linear Regression**
* 1 continuous outcome variable
* 1 predictor variable
**Multiple Linear Regression**
* 1 continuous outcome variable
* $> 1$ predictor variable
**Multivariate Linear Regression**
* multiple correlated dependent variables
* $> 0$ predictor variable
Next, we are going to discuss *Multiple Linear Regression* which allows us to look at the effect of multiple predictor variables simultaneously on an outcome variable.
### Multiple Linear Regression
To include multiple predictor variables in our existing regression framework, we simply need to add these additional terms to our model/equation.
For example if we have two variables $X_1$ and $X_2$ and we want to use these together to predict Y we can fit the model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 $$
To accommodate the extra predictor variable we need to include an extra regression coefficient. This is still considered a linear regression model because we are combining the effects of these predictor variables in a linear (additive) manner.
In R we don't need any new functions to fit a multiple regression model; we can fit these models with the same `lm()` function. As before, R will automatically add the right number of regression coefficients.
Let's look at an example where we model the effect of both age and height on weight.
```{r}
modelM<-lm(weight ~ age + height, data = demoDat)
summary(modelM)
```
To report the results of this extra regression parameter we have a third row in the coefficients table. Each regression coefficient included in this table, regardless of how many there are, is tested under the same hypothesis framework we discussed for simple linear regression. The results for each coefficient are interpreted in the same way. From this analysis we can see that height is significantly associated with weight (p-value < 0.05) but age is not (p-value > 0.05).
All of the assumptions from simple linear regression hold for multiple linear regression with the addition of one more:
* The dependent variable Y has a linear relationship with the independent variables X.
* The errors (residuals) are normally distributed.
* The observed samples are independent.
* The variances across the range of predictor variables are equal.
* *The independent variables are not highly correlated with each other.*
We can again use the `plot()` function to inspect these for a multiple regression model fit.
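One simple check of the new assumption, that the predictors are not highly correlated, is to look at the correlation between them. This is only a rough guide (there is no universal cut-off), but values close to -1 or 1 would be a concern:
```{r, echo = TRUE}
# correlation between the two predictors used in the model above
cor(demoDat$age, demoDat$height)
```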
### Multiple Regression Analysis Exercise
*Let's practice fitting a regression model with multiple predictor variables.*
Fit and interpret the following regression models.
1. `bmi` predicted by `age` and `alcohol_units`.
2. `bmi` predicted by `age`, `exercise_hours` and `alcohol_units`.
Apply a significance threshold of 0.05.
```{r multiple-regression, exercise = TRUE}
```
```{r multiple-regression-solution}
summary(lm(bmi ~ age + alcohol_units, data = demoDat))
summary(lm(bmi ~ age + exercise_hours + alcohol_units, data = demoDat))
```
```{r quiz3, echo=FALSE}
quiz(caption = "Questions on exercise above",
question("When just looking at the effect of age and alcohol on BMI, what is the effect of alcohol? Apply a significance threshold of 0.05 to answer this question.",
answer("More alcohol decreases BMI, but not significantly."),
answer("More alcohol significantly decreases BMI."),
answer("Less alcohol decreases BMI, but not significantly.", correct = TRUE),
answer("Less alcohol significantly decreases BMI."), allow_retry = TRUE
),
question("What happens to the results for alcohol when exercise is also included in the model? Apply a significance threshold of 0.05 to answer this question.",
answer("The magnitude of the effect between alcohol and bmi decreases and remains non-significant."),
answer("The magnitude of the effect between alcohol and bmi increases but remains non-significant.", correct = TRUE),
answer("The magnitude of the effect between alcohol and bmi decreases and is now significant."),
answer("The magnitude of the effect between alcohol and bmi increases and is now significant."), allow_retry = TRUE)
)
```
### Assessing the Effect of Multiple Predictor Variables Simultaneously
There are many reasons why you might want to model multiple predictor variables simultaneously.
1. You are interested in understanding the effects of multiple different factors.
2. You think there are some variables that might bias or confound your analysis and you want to adjust for this.
In the second scenario, some predictor variables are just included so that their effect is captured, but you are not explicitly interested in their results.
In the first scenario, you might be interested in quantifying the effect of each predictor term individually. This can be achieved by looking at the regression coefficients for each term in turn, as we did for the example above. Alternatively, you might be interested in the combined effect of multiple terms in predicting an outcome. We can do this within our regression analysis by reframing the null and alternative hypotheses.
Let's define our regression model for two predictor variables $X_1$ and $X_2$ as:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 $$
Previously we were interested in whether a single regression coefficient was non-zero, as in that case the regression parameter would cause some quantity to be added to the prediction. The null hypothesis for testing the joint effect of two predictors is based on the same underlying concept, that there is no effect on the outcome from the specified predictor variables. The mathematical definition of this null effect needs to be altered, though, to reflect that we have multiple predictor variables and regression coefficients.
If there is no effect of $X_1$ **and** $X_2$ on Y, then the regression coefficients for both terms need to be zero, such that they both cancel out of the equation and it simplifies to just the intercept. For there to be an improvement in the predictive ability of the model, at least one of the two predictor variables needs to have a non-zero coefficient. Importantly, we only need one of them to improve the prediction to conclude that together they have an effect.
This gives us the null hypothesis of
$$H_{null}: \beta_1 = \beta_2 = 0$$
and the alternative hypothesis of
$$H_{alternative}: \beta_1 \neq 0\text{ or }\beta_2 \neq 0$$
Another way of thinking about this analysis is that we are comparing two models, with and without the terms of interest. Our statistical question boils down to: which of these models is a better fit to the data?
The more complex model 1:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 $$
or
the simpler model 2:
$$Y = \beta_0 $$
To perform this statistical test we need to use the F-distribution rather than the t-distribution. Our test statistic is now based on the idea of variance explained. Mention of either the F-distribution or "an analysis of variance explained" might trigger an association with ANOVA (analysis of variance), and indeed there is a link here: the R function needed to run this analysis is `anova()`. But first we need to define and fit the two models we want to compare. To specify the simplest model possible, the one with just an intercept term, we include just a `1` on the right hand side of the formula.
```{r}
## the more complex "full" model:
model.full <- lm(weight ~ age + height, data = demoDat)
## the simpler "null" model:
model.null <- lm(weight ~ 1, data = demoDat)
# compare the model fits
anova(model.null, model.full)
```
The output is a table of test statistics in which an F-value is reported along with its p-value. In this example we have a (very) significant p-value; therefore we conclude that a model with both age and height is significantly better than a model with just the intercept. From our previous analysis of the individual regression coefficients, we know that this is predominantly due to the height variable.
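If you want to pull the F-statistic or its p-value out of this table programmatically (for example, in an analysis script), the output of `anova()` behaves like a data frame; a brief sketch:
```{r, echo = TRUE}
modelComparison <- anova(model.null, model.full)
# F statistic and p-value for the comparison (second row of the table)
modelComparison[["F"]][2]
modelComparison[["Pr(>F)"]][2]
```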
## Quiz
```{r finalQuiz, echo=FALSE}
quiz(caption = "Have ago at these questions to test your knowledge.",
question("Which of these are reasons to do regression analysis? Select all that apply.",
answer("Clustering of data"),
answer("Prediction", correct = TRUE),
answer("Hypothesis testing", correct = TRUE),
answer("Data summarisation")),
question("In simple linear regression, how many variables are involved?",
answer("One independent variable and one dependent variable", correct = TRUE),
answer("Two independent variables and one dependent variable"),
answer("One independent variable and two dependent variables"),
answer("Two independent variables and two dependent variables")),
question("What is the interpretation of the intercept?",
answer("The mean of the outcome when all predictor variables equal 0.", correct = TRUE),
answer("The mean of the outcome when all predictor variables equal 1.")
),
question("When you have multiple predictor variables is this called:",
answer("Multivariate linear regression"),
answer("Multiple linear regression", correct = TRUE)),
question("Which of these is the null hypothesis for significance testing of a single regression coefficient?",
answer("The value of the regression coefficient = 0.", correct = TRUE),
answer("The value of the regression coefficient is greater than 0."),
answer("The value of the regression coefficient is less than 0."),
answer("The value of the regression coefficient is not equal to 0.")),
question("For the regression model coded in R as $test\\_score \\sim age$, how many regression coefficients are there?",
answer("1"),
answer("2", correct = TRUE),
answer("3"),
answer("4")),
question("For the regression model coded in R as $test\\_score \\sim 1$, how many regression coefficients are there?",
answer("1", correct = TRUE),
answer("2"),
answer("3"),
answer("4")),
question("For the full model $y \\sim age + education\\_years$, which of these are valid null models? Select all that apply",
answer("$y \\sim 1$", correct = TRUE),
answer("$y \\sim age + sex$"),
answer("$y \\sim age + status$"),
answer("$y \\sim education\\_years + age$"))
)
```
## Summary
In this session we have covered a number of concepts:
* An overview of what regression is
* How to represent a line as an equation
* What the regression coefficients represent
We have covered two different types of regression analysis:
* Simple Linear Regression
* Multiple Linear Regression
We have looked at what hypothesis or significance testing means for a regression model, and used it to test
* the effect of individual predictor variables on an outcome
* the joint effect of multiple predictor variables on an outcome
As a consequence we have drawn out the overlap between regression and ANOVA.
In summary, regression is a modular framework that can be used to address an endless range of analytical questions.
### What next?
In the Extras section for this tutorial there are details on
* how to extract the regression coefficients from the regression results
* whether to use hypothesis testing of regression coefficients or model comparison for regression models with a single predictor variable
* how to extract the variance explained statistics for a regression model
In the next two workshops in this Regression series the following content will be covered.
Regression Analysis in R: Adapting to Varied Data Types:
* how categorical variables are modeled in regression
* logistic regression
* introduction to generalised linear regression
Mixed Effects Regression with R:
* how to handle observed data that are not independent.
* how to model relationships influenced by a third variable
## Extras
#### Extracting summary statistics from a model fit in R
If you are new to R, here we will run through some details of the type of objects these results are stored in and how to access specific elements. This can be helpful for writing automated analysis scripts. Because it needs to contain different types of data in different formats and structures, the output of the regression model fit is stored in a bespoke object, with slots for the different parts of the output. These slots are named and can be accessed using `$`. For example, to extract just the table of estimated regression coefficients, which is named `coefficients`, we use the following command:
```{r}
model <- lm(bmi ~ age + sex, demoDat)
summary(model)$coefficients
```
We can determine the type of object the coefficients table is stored in, using the function `class()`.
```{r}
class(summary(model)$coefficients)
mode(summary(model)$coefficients)
```
The output of the command tells us it is stored in a matrix, which is a data type in R with rows and columns. A similar data type is the data.frame. The difference between these two is that a matrix can only contain one data type, which we can determine with the function `mode()`; here it contains exclusively numeric values. In contrast, in a data frame each column can be a different data type. Our demoDat data is stored in a data.frame, and the output of the `str()` function tells us the data type assigned to each column.
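For example, running `str()` on the demoDat data frame shows its structure and the type of each column:
```{r, echo = TRUE}
str(demoDat)
```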
Let's say we wanted to extract a single value from this matrix; there are a number of ways we can do this. For example, let's extract the p-value for the age regression slope parameter using the subsetting operator `[`.
We can provide the row (2nd) and column (4th) number of the matrix entry we want:
```{r}
summary(model)$coefficients[2,4]
```
Alternatively we can specify either the column or row name:
```{r, eval = FALSE}
summary(model)$coefficients["age",4]
```
We can see a list of all components we can extract from the output of `lm()` by running `names()` on the lm object. All of these can be extracted with the `$`.
```{r}
names(model)
model$call ## example of how to extract any of the components listed by the previous command.
```
Similarly, we can run `names()` on the `summary(lm)` object, as shown here, to get a list of all the slots available from that object.
```{r}
names(summary(model))
```
Note these are different from those available for the model object; for example, the $R^{2}$ and $\overline{R}^{2}$ are only extractable from the `summary(model)` object.
```{r}
summary(model)$r.squared
summary(model)$adj.r.squared
```
Note that as well as directly accessing these slots using `$`, there also exist some predefined functions to extract the commonly requested outputs from the model fit. We have already taken advantage of one of these, `summary()`; others include `coef()`, `effects()`, `residuals()`, `fitted()` and `predict.lm()`.
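As a quick sketch of a few of these extractor functions applied to the model fitted above:
```{r, echo = TRUE}
coef(model)            # the estimated regression coefficients
head(fitted(model))    # fitted (predicted) values for the first few observations
head(residuals(model)) # residuals for the first few observations
```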
### Simple Linear Regression Link between F-test and T-test
We can also use an F-test to test a single predictor variable in our model.
```{r}
model<-lm(weight ~ height, data = demoDat)
summary(model)
anova(model)
```
In the `summary(model)` output we can see at the bottom the results of testing the full model with an F-test. If we want to see the full table of sums-of-squares statistics, we can use the `anova()` function on our fitted regression model.
Comparing this table with the coefficients table, we can see that the p-value from the t-test of the height regression parameter and the p-value from the F-test for the full model are identical. This is not a coincidence, and is always true for the specific case of simple linear regression.
#### Extracting Variance Explained Statistics
Finally, we will look at the $R^{2}$ and $\overline{R}^{2}$ statistics. We can see from the `summary(model)` output above that these are automatically calculated. For the simple linear regression model we have fitted:
$R^{2}$ = `r summary(model)$r.squared`
$\overline{R}^{2}$ = `r summary(model)$adj.r.squared`