This repository has been archived by the owner on Sep 15, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 5
/
r4asme14 quantitative.Rmd
364 lines (259 loc) · 10.3 KB
/
r4asme14 quantitative.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
---
title: "14: Analysis of quantitative data"
subtitle: "R 4 ASME"
author: Author – Andrea Mazzella [(GitHub)](https://github.com/andreamazzella)
output: html_notebook
---
-------------------------------------------------------------------------------
## Contents
* Univariable linear regression models
* categorical exposure
* continuous exposure
* scatterplots
* ANOVA (F test)
* quadratic models
* Multivariable linear regression models
* ANOVA (partial F test)
* Plotting residuals
* Histograms
* Quantile-quantile plots
* Residuals vs fitted scatterplots
* Log transformations
-------------------------------------------------------------------------------
## Packages, options, and data management
```{r message=FALSE, warning=FALSE}
# Load packages
library("haven")
library("magrittr")
library("broom")
library("tidyverse")
# Limit significant digits to 3, remove scientific notation
options(digits = 3, scipen = 9)
# Change ggplot default theme to B&W
theme_set(theme_bw())
```
Import and explore `whitehal.dta`. It contains data on risk factors for coronary heart disease; the variables we will explore were all collected at time of entry, so the data is cross-sectional.
These are the variables of interest:
* *Outcome*: systolic blood pressure (continuous: `sbp`)
* *Exposure*: job grade (categorical: `grade4`)
* *Confounder*: age (continuous: `agein`)
```{r import}
# Import the dataset
whitehall <- read_stata("whitehal.dta")
```
```{r, include=FALSE}
# Explore data types
whitehall %>% select(sbp, grade4, agein) %>% glimpse()
```
As you can see, there are no value labels, so I'll add them using the dataset help file.
```{r data_management, include=FALSE}
# Rename and factorise variables, label values
whitehall %<>%
mutate(grade4 = factor(grade4,
levels = c(1, 2, 3, 4),
labels = c("admin", "profess", "clerical", "other"))) %>%
select(id, sbp, grade4, agein)
# Check it worked ok
glimpse(whitehall)
#Summarise
summary(whitehall)
```
# 1. Summarise a continuous variable by a categorical one
Summarise the sBP (mean and standard deviation) across the four job grade strata.
```{r}
whitehall %>% group_by(grade4) %>% summarise(n(), mean(sbp), sd(sbp))
```
# 2. Linear regression (univariable, categorical exposure)
Fit a linear regression model for systolic blood pressure by job grade.
What's the estimate (and 95% CI) of average sBP in a man in the "admin" group? And for a man in the "other" group?
How does this compare with your answer to Q1?
Linear regression is done with `lm()`.
`fct_relevel()` has a similar function to `b4` in Stata.
*Did you know?* If you put a command with an assignment into brackets, it will do the assignment *and* print its output.
```{r}
# Linear regression
(linear <- lm(sbp ~ grade4, whitehall))
tidy(linear, conf.int = T)
# Making "other" the baseline group
lm(sbp ~ fct_relevel(grade4, "other"), whitehall) %>% tidy(conf.int = T)
```
Mean sBP for man in grade 1 (admin) = 133 mmHg (129-138)
Mean sBP for man in grade 4 (other) = 138 (135-142)
# 3. ANOVA: F test
Is there evidence to suggest that sBP is associated with job grade?
In order to get the F statistic, you need to use `anova()`, comparing this model with a model with only the outcome.
```{r}
anova(linear,
lm(sbp ~ 1, whitehall))
```
F = 5.27
ANOVA p = 0.0013
There is good evidence for an association between job grade and sBP.
# 4. Continuous exposures
Explore the association between age and sBP with:
* scatter plots
* categorising age
* regression models treating age as:
- linear
- quadratic
- categorical
## 4a. Scatter plot
Let's explore visually this association with a simple scatter plot.
`geom_smooth()` automatically fits a line to the data.
```{r}
(scatter <- whitehall %>% ggplot(aes(x = agein, y = sbp)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
labs(title = "Systolic blood pressure by age",
x = "Age (years)",
y = "Systolic Blood Pressure (mmHg)"))
```
It's difficult to say wthether the relationship is linear or not by looking at the plot.
## 4b. Categorise a continuous variable and summarise by categories
Let's now summarise sBP by age categories, splitting age into 5-year groups from 40 to 60, and then one last 10-year group.
```{r}
# Categorise
whitehall %<>% mutate(age_grp = cut(agein, breaks = c(seq(40, 60, 5), 70)))
# Ensure it worked
whitehall %>%
group_by(age_grp) %>%
summarise(min(agein), max(agein))
```
We can now summarise sBP by age group.
```{r}
whitehall %>%
group_by(age_grp) %>%
summarise(n(), mean(sbp), sd(sbp))
```
sBP appears appears to "accelerate" with ade, so maybe a linear model is not the best way of fitting these data.
## 4c. Quadratic model
In order to fit a quadratic model we first need to centre the age around its mean, and then square this difference.
```{r}
whitehall %<>% mutate(age_cent = agein - mean(agein),
age_cent_sq = age_cent ^ 2)
```
We can then fit two models: one with this centered age, the other with its quadratic.
```{r}
# Linear model
linear_age <- lm(sbp ~ age_cent, whitehall)
tidy(linear_age)
# Quadratic model
quadr_age <- lm(sbp ~ age_cent + age_cent_sq, whitehall)
tidy(quadr_age)
# F test
anova(linear_age, quadr_age)
```
F test for the quadratic term
F = 6.53
p = 0.011
This means that there is good evidence that the quadratic model fits the data better than a model with a linear age effect.
Let's look at the quadratic fit and compare it with the linear one.
Note that you don't need to predict the sbp values and plot those, you can simply add the quadratic model to `geom_smooth()`.
```{r}
scatter + geom_smooth(method = "lm",
formula = y ~ x + I(x^2),
colour = "yellow")
```
You can see that the quadratic and linear lines are almost the same.
## 4d. Model with categorical age
An alternative would be using age in groups as explanatory variable.
```{r}
categ <- lm(sbp ~ age_grp, whitehall)
tidy(categ)
anova(categ,
lm(sbp ~ 1, whitehall))
```
There is still extremely good evidence of an association.
# 5. Multivariable linear regression
Before looking at the data: could age confound this association between job grade and sBP?
## 5a. Bonus scatterplot
We can visualise this with one regression line per each grade4 stratum. In ggplot we do this by moving the grade4 variable form the `geom_point` layer to the main layer.
```{r}
whitehall %>% ggplot(aes(x = agein, y = sbp, colour = grade4)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm",
formula = y ~ x + I(x^2),
se = F) +
viridis::scale_color_viridis(discrete = T) +
labs(title = "Systolic blood pressure by age and job grade",
subtitle = "Quadratic effects",
x = "Age (years)",
y = "Systolic Blood Pressure (mmHg)",
colour = "Job grade")
```
## 5b. Adding a covariate and partial F test
Now add job grade as a covariate to the quadratic age model.
After adjusting for age, is there evidence that job grade is associated with sBP?
```{r}
# Multivariable regression
quadr_multi <- lm(sbp ~ grade4 + age_cent + age_cent_sq, whitehall)
tidy(quadr_multi)
# ANOVA: partial F test
anova(quadr_multi,
lm(sbp ~ age_cent + age_cent_sq, whitehall))
```
ANOVA (partial F-test) shows no evidence of association between job grade and sBP, once age is accounted for. This means that age is a confounder in this relationship.
# 6 Plotting residuals
We need to check the assumption that residuals are normally distributed, with either a histogram or a Q–Q plot
## 6a. Histogram of residuals
You use `resid()` to access the model residuals.
```{r}
# Histogram
whitehall %>% mutate(residual = resid(quadr_multi)) %>%
ggplot(aes(residual)) + geom_histogram()
```
## 6b. Q–Q plot of residuals
A *quantile-quantile plot* is used to assess two distributions: in this case, the distribution of the residuals (y axis) vs the Normal distribution (x axis). Each point represents a quantile in both distribution. If distribution of quantiles were exactly Normal, all points will fall on the `y = x` line.
In ggplot this is done with `geom_qq()`; for some reason you have to specify the variable in the `aes` option `sample =`. For reference, `stat_qq_line()` adds the line `y = x`.
```{r}
# Q–Q plot
whitehall %>% mutate(residual = resid(quadr_multi)) %>%
ggplot(aes(sample = residual)) +
geom_qq() +
stat_qq_line(colour = "green")
```
Both plots reveal some skew to the right. See Q7 for an attempt at minimising this skew.
## 6c. Residuals vs fitted scatterplot
We can also plot the residuals vs the fitted sBPs.
```{r}
whitehall %>% mutate(residual = resid(quadr_multi),
predicted = predict(quadr_multi)) %>%
ggplot(aes(predicted, residual)) +
geom_point(alpha = 0.25) +
geom_hline(yintercept = 0, colour = "red")
```
There doesn't seem to be a clear relation between these, and the residual variance seems to be constant.
# 7 Log transformation
A log transformation of the outcome variable can ofter remove skew.
## 7a. Fit a log outcome linear regression model.
Fit a simple linear regression model, with log_sbp depending only on grade4.
```{r}
log_linear <- lm(log(sbp) ~ grade4, whitehall)
```
## 7b. Assess residual distribution
As per Q6.
```{r}
# Histogram
whitehall %>% mutate(residual = resid(log_linear)) %>%
ggplot(aes(residual)) + geom_histogram()
# Q–Q plot
whitehall %>% mutate(residual = resid(log_linear)) %>%
ggplot(aes(sample = residual)) +
geom_qq() +
stat_qq_line(colour = "green")
# Residuals vs fitted
whitehall %>% mutate(residual = resid(log_linear),
predicted = predict(log_linear, type = "response")) %>%
ggplot(aes(predicted, residual)) +
geom_point(alpha = 0.25) +
geom_hline(yintercept = 0, colour = "red")
```
The distribution of residuals is now less skewed; the plot of residuals vs fitted shows no clear pattern, and the variance seems constant: the model assumptions are now more appropriate.
(NB: there are only four predicted log sBPs now, because this model only contains a categorical exposure).
## 7c. Exponentiating
How do you interpret the model estimates?
NB: we can back-transform the output from the log scale by exponentiating. This can be done automatically with function `tidy()`.
```{r}
tidy(log_linear, exponentiate = T, conf.int = T)
```
-------------------------------------------------------------------------------