---
title: "R Growth Curve Analysis & Eyetracking Workshop: Tutorial 4: Growth Curve Analyses"
author: "Brock Ferguson"
date: "July 1, 2015"
output:
  html_document:
    toc: true
    theme: readable
---
Load required packages.
```{r}
library(ggplot2)
library(lme4)
library(dplyr)
```
Load our eyetracking data.
```{r}
data <- read.csv('data-eyetracking.csv')
```
Note: In this tutorial, we will do by-subjects time analyses only (aggregating across trials within subjects) for simplicity, but you can do by-items analyses in the same way by aggregating within trials instead of within subjects.
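For reference, a by-items version of the binning used below might look like the following sketch. It assumes your data contain a trial identifier column, here called `Trial` -- a hypothetical name, so check your own file for the actual column:
```{r, eval=FALSE}
# hypothetical by-items binning: group by a trial identifier instead of subjects
binned_items <- data %>%
  mutate(TimeS = TimeFromSubphaseOnset / 1000,
         Bin = TimeFromSubphaseOnset %/% 50) %>%
  group_by(Trial, Target, Bin) %>%   # 'Trial' is assumed; substitute your trial column
  summarise(PropAnimal = mean(Animate), y = sum(Animate),
            N = length(Animate), TimeS = min(TimeS)) %>%
  ungroup()
```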
# Bin data within participants by time
Let's prep these data for analysis much like we prepped for the empirical logit analysis.
```{r}
# rescale, bin, aggregate, and transform our DVs all in one step
binned <- data %>%
  mutate(TimeS = TimeFromSubphaseOnset / 1000,
         Bin = TimeFromSubphaseOnset %/% 50) %>%  # re-scale, bin
  group_by(ParticipantName, Target, Bin) %>%      # aggregate within bins
  summarise(PropAnimal = mean(Animate), y = sum(Animate), N = length(Animate), TimeS = min(TimeS)) %>%
  mutate(elog = log( (y + .5) / (N - y + .5) ),   # empirical logit
         wts = 1/(y + .5) + 1/(N - y + .5),       # optional weights
         Arcsin = asin(sqrt(PropAnimal))) %>%     # arcsin-sqrt
  ungroup()
```
# Fit our old linear model
Here's a model similar to the linear model we fit before:
```{r}
binned$TargetC <- ifelse(binned$Target == 'Animal', .5, -.5)
binned$TargetC <- scale(binned$TargetC, center=T, scale=F)
model <- lmer(elog ~ TargetC*TimeS + (1 + TargetC + TimeS | ParticipantName), data = binned)
summary(model)
```
Visualize the data and model fit:
```{r}
ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model, binned, re.form=NA)), fun.y=mean, geom="line")
```
We already know that this model isn't a great fit to our data. By assuming linear growth, it produces some odd estimates; most notably, it estimates that our two groups already differ at timepoint 0 (which we know not to be true -- the differences emerge over time).
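You can see this artifact directly in the fixed effects: with a single linear time term, the `TargetC` coefficient is the model's estimated group difference at `TimeS = 0`. A quick check on the model we just fit:
```{r}
# the TargetC estimate here is the predicted condition difference at TimeS = 0
fixef(model)
```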
# Natural polynomial growth curve analysis
Let's do our first stab at examining non-linear growth by creating and entering natural polynomials into the model.
```{r}
binned <- binned %>%
  mutate(TimeS_2 = TimeS^2,
         TimeS_3 = TimeS^3,
         TimeS_4 = TimeS^4)
head(binned)
ggplot(binned, aes(x=TimeS, y=TimeS)) +
  geom_point() +
  geom_point(aes(y=TimeS_2), color='red') +
  geom_point(aes(y=TimeS_3), color='blue') +
  geom_point(aes(y=TimeS_4), color='green')
```
Each natural polynomial accelerates at a different rate. They have an interesting property: when combined in a model, they capture the slope of the data at successive "bends". That is, the 2nd-order (quadratic) term captures the first bend, the 3rd-order (cubic) term captures the second bend, and so on.
In this way, models that include these natural polynomials can model non-linear growth.
```{r}
model <- lmer(elog ~ Target*(TimeS + TimeS_2 + TimeS_3 + TimeS_4) + (1 + Target + TimeS + TimeS_2 + TimeS_3 + TimeS_4 | ParticipantName), data = binned)
summary(model)
```
We have some convergence issues caused by overparameterization, so let's scale back on these polynomials to something that seems more reasonable. A good rule of thumb is to count the number of "bends" in the data before breaking it down by condition.
How many bends do we see?
```{r}
ggplot(binned, aes(x=TimeS, y=elog)) +
  stat_smooth(method="loess")
```
There appear to be two bends in the data which, because an Nth-order polynomial can capture up to N-1 bends, means we should include 3 polynomial terms.
```{r}
model <- lmer(elog ~ Target*(TimeS + TimeS_2 + TimeS_3) + (1 + Target + TimeS + TimeS_2 + TimeS_3 | ParticipantName), data = binned)
summary(model)
```
There is still a convergence error, likely because the scales of our predictors are too different, as before. We will ignore this for now, because our final GCA approach will fix it.
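If you wanted to tackle the convergence warning right away, one common workaround is to put the time terms on a common scale before refitting. A minimal sketch (we don't use this below, since orthogonal polynomials will solve the problem more elegantly):
```{r, eval=FALSE}
# standardizing the time terms puts the predictors on comparable scales,
# which often helps lmer's optimizer converge
binned_scaled <- binned %>%
  mutate(TimeS_z   = as.numeric(scale(TimeS)),
         TimeS_2_z = as.numeric(scale(TimeS_2)),
         TimeS_3_z = as.numeric(scale(TimeS_3)))
```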
In the meantime, let's see what this model did for us. Looking at the estimates above, it seems to have gotten rid of our timepoint-0 main effect of Target (yes!) and instead shows some strong interactions over time. This seems promising... but let's visualize.
```{r}
ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model, binned, re.form=NA)), fun.y=mean, geom="line")
```
Awesome! Much better.
Let's store this model because we are going to use it for comparison later.
```{r}
natural_model <- model
```
# Orthogonal polynomial growth curve analysis
## Solving the multicollinearity of natural polynomials
We just did our first non-linear growth curve analysis, but it was sub-optimal for two reasons:
(1) the polynomial terms we generated are highly correlated with one another, and multicollinearity makes a linear model's estimates unstable and hard to interpret
(2) our model had trouble converging because our predictors are on very different scales
Thankfully, we have something that will help: *orthogonal polynomials*.
Let's first consider problem (1): our natural polynomials are highly correlated with each other. Here you can see this in a standard correlation matrix.
```{r}
cor(binned[, c('TimeS','TimeS_2','TimeS_3','TimeS_4')])
```
This is not a good thing when we are trying to attribute variance to each predictor independently. This isn't unique to time-based models -- any linear model suffers when its predictors are highly correlated.
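As a quick illustration (a side check, not part of the pipeline), regress one time term on another and look at the shared variance:
```{r}
# TimeS_2 is almost perfectly predicted by TimeS over this window --
# exactly the multicollinearity problem described above
summary(lm(TimeS_2 ~ TimeS, data = binned))$r.squared
```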
So, what we can do is actually create replacement timecodes for linear, quadratic, cubic, etc. change over time that we *know* by design will be uncorrelated.
`poly()` will generate higher-order polynomials for us, with a vector length equivalent to the length of our original time vector.
We'll go up to 6th-order polynomials, but we'll stick to the first 3 for most of our models.
```{r}
orthogonal_polynomials <- poly(sort(as.vector(unique(binned$TimeS))), 6)
head(orthogonal_polynomials)
```
Column 1 grows linearly. Column 2 grows quadratically. Column 3 grows cubically... etc.
Visualize this, and verify that they are indeed uncorrelated.
```{r}
ggplot(data.frame(orthogonal_polynomials), aes(x=X1, y=X1)) +
  geom_point() +
  geom_point(aes(y=X2), color='red') +
  geom_point(aes(y=X3), color='blue') +
  geom_point(aes(y=X4), color='green') +
  geom_point(aes(y=X5), color='purple') +
  geom_point(aes(y=X6), color='yellow')
cor(orthogonal_polynomials[, c(1:6)])
round(cor(orthogonal_polynomials[, c(1:6)]), 5)
```
Perfect!
I like to merge them into the original dataframe using this technique, which allows for missing data from any given participant.
```{r}
time_codes <- data.frame(
  sort(as.vector(unique(binned$TimeS))),
  orthogonal_polynomials[, c(1:6)]
)
colnames(time_codes) <- c('TimeS','ot1','ot2','ot3','ot4','ot5','ot6')
binned <- merge(binned, time_codes, by='TimeS')
```
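A quick sanity check that the merge worked -- the orthogonal codes should now line up with `TimeS`:
```{r}
head(binned[, c('TimeS', 'ot1', 'ot2', 'ot3')])
```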
## Orthogonal modeling
Now let's model our data exactly like we did before but using these orthogonal polynomials:
```{r}
model <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 | ParticipantName), data = binned)
summary(model)
```
Great fit! No errors.
Interestingly, we are back to seeing a main effect of `TargetC`, though... Why? The reason is simple: our natural polynomials all started at timepoint 0, meaning that main effects represented differences at the *start* of the time window. In contrast, orthogonal polynomials are, by default, centered at 0, meaning that main effects represent *average* differences (across time) between levels of a factor.
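As a rough check on this interpretation (not part of the original pipeline), the `TargetC` estimate should be in the ballpark of the average `elog` difference between conditions across the whole window:
```{r}
# with centered time terms, the Target effect approximates the
# average condition difference across the time window
binned %>%
  group_by(Target) %>%
  summarise(meanElog = mean(elog))
```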
Let's visualize our data and model fit:
```{r}
ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model, binned, re.form=NA)), fun.y=mean, geom="line")
```
Compare this model to our natural polynomial model.
```{r}
summary(natural_model)
summary(model)
```
We can use the same methods as before to get confidence intervals, test for Type III significance, etc.
```{r}
# confint(model)
# this takes a long time with a model this complex...
drop1(model, ~., test="Chisq")
```
`drop1()` suggests that all of our parameters are reliable predictors.
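If you want confidence intervals without the long wait of profile-based `confint()`, Wald intervals are a much faster (though cruder) approximation:
```{r}
# Wald CIs are quick but approximate; profile or bootstrap CIs are more accurate
confint(model, method="Wald")
```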
Let's try adding 4th and 5th orthogonal polynomials manually and seeing their effects on this model.
```{r}
model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned)
summary(model_quartic)
ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model, binned, re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 3rd-order model
  stat_summary(aes(y=predict(model_quartic, binned, re.form=NA)), fun.y=mean, geom="line") # 4th-order model
anova(model, model_quartic)
```
Despite the very underwhelming difference in the plotted fits, model comparison says that `ot4` is a significant predictor.
One reason is that we not only added `ot4` as a fixed effect; we also added it to the random effects structure. When examining the influence of a fixed effect, it's best to keep your random effects structure constant. Because we changed the random structure too, `anova()` reports that our models differ by 8 degrees of freedom, when we only wanted to test the main effect of `ot4` and its interaction with `TargetC` (i.e., two parameters, a 2-df difference).
```{r}
model_cubic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned, REML=F)
model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 | ParticipantName), data = binned, REML=F)
anova(model_cubic, model_quartic)
```
Despite the underwhelming difference in the visualized model fits, the comparison still says the quartic model is a significantly better fit.
What about a 5th polynomial?
```{r}
model_quartic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 + ot5 | ParticipantName), data = binned, REML=F)
model_quintic <- lmer(elog ~ TargetC*(ot1 + ot2 + ot3 + ot4 + ot5) + (1 + TargetC + ot1 + ot2 + ot3 + ot4 + ot5 | ParticipantName), data = binned, REML=F)
ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model_quartic, binned, re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 4th-order model
  stat_summary(aes(y=predict(model_quintic, binned, re.form=NA)), fun.y=mean, geom="line") # 5th-order model
anova(model_quartic, model_quintic)
```
Adding a 5th polynomial did not improve our fit.
Here's a final reminder of how bad our linear model was:
```{r}
model_linear <- lmer(elog ~ TargetC*(ot1) + (1 + TargetC + ot1 | ParticipantName), data = binned)
summary(model_linear)
ggplot(binned, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model_quartic, binned, re.form=NA)), fun.y=mean, geom="line", linetype='dashed') + # 4th-order model
  stat_summary(aes(y=predict(model_linear, binned, re.form=NA)), fun.y=mean, geom="line") # linear model
```
## Growth curve analyses with 3+ levels of a factor
I like to design experiments with only 2 levels per factor for simplicity, but sometimes a factor has 3 levels, and then main effects no longer equal simple effects.
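To see what treatment (dummy) coding looks like for a 3-level factor, here is a quick illustration with a toy factor:
```{r}
# each column compares one level to the reference (first) level
contrasts(factor(c('Animal', 'Artefact', 'Neutral')))
```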
To demonstrate, let's add a third "Neutral" Target level that will follow a similar trajectory to the "Animal" level but shifted down towards 50% chance.
```{r}
new_condition <- binned[which(binned$Target == 'Animal'), ]
new_condition$Target <- 'Neutral'
#new_condition$y <- new_condition$y - round(new_condition$N / 3)
new_condition$y <- new_condition$y + round(rnorm(length(new_condition$y),-.5,2))
new_condition$y <- ifelse(new_condition$y > new_condition$N,new_condition$N,new_condition$y)
new_condition[which(new_condition$y < 1), 'y'] <- 1
new_condition$PropAnimal <- new_condition$y / new_condition$N
new_condition$elog <- log( (new_condition$y + .5) / (new_condition$N - new_condition$y + .5) )
new_condition$wts <- 1/(new_condition$y + .5) + 1/(new_condition$N - new_condition$y + .5)
new_condition$Arcsin <- asin(sqrt(new_condition$PropAnimal))
binned_3levels <- rbind(binned,new_condition)
binned_3levels$Target <- factor(binned_3levels$Target)
ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point")
```
Fit a model with Target treatment-coded.
```{r}
model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels)
summary(model)
ggplot(binned_3levels, aes(x=TimeS, y=elog, color=Target)) +
  stat_summary(fun.y=mean, geom="point") +
  stat_summary(aes(y=predict(model, binned_3levels, re.form=NA)), fun.y=mean, geom="line")
```
In order to interpret these simple effects, we need to remember which level is our reference level.
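A quick way to check: the first level returned by `levels()` is the reference level, and `contrasts()` shows which comparison each simple effect represents.
```{r}
levels(binned_3levels$Target)
contrasts(binned_3levels$Target)
```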
Get main effects via model comparison...
```{r}
model_null <- lmer(elog ~ Target*(ot1 + ot2) + ot3 + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels)
summary(model_null)
anova(model,model_null)
```
Get simple effects by re-ordering factor levels.
```{r}
levels(binned_3levels$Target)
binned_3levels$Target <- factor(binned_3levels$Target,c('Neutral','Animal','Artefact'))
levels(binned_3levels$Target)
model <- lmer(elog ~ Target*(ot1 + ot2 + ot3) + (1 + Target + ot1 + ot2 + ot3 | ParticipantName), data = binned_3levels)
summary(model)
```
Clean up our workspace.
```{r}
ls()
rm(list=ls())
```