# Power {#ch-power}
## Introduction {#sec:power-introduction}
With statistical testing of H0, we determine the probability $P$ of the
observed differences or effects (or of even larger differences or effects
than those observed) if H0 were true, that is, if
the observed difference had to be attributed solely to chance
(see §\@ref(sec:empiricalcycle) and
Chapter \@ref(ch-testing)). If the probability $P$ is very small, then
we have found results which are very improbable if H0 were true.
We then conclude that H0 is presumably *not true*, and we therefore reject
H0. We then call the difference or effect found "significant"
(Latin: 'meaning making'). However, there is still a probability
$P$ that the difference found is actually a fluke, and that, by rejecting
H0, we are making a Type I error (i.e. wrongly rejecting H0,
see
§\@ref(sec:testing-introduction)). As we compare $P$ against a chosen
significance level $\alpha$, this $\alpha$
is thus also the probability that we are making a Type I error.
At least as important, however, is the opposite error of wrongly *not*
rejecting H0, a Type II error. Examples of such errors are:
not convicting a suspect who is in fact guilty, letting a 'spam'
email message through into my mailbox, examining a patient and nevertheless
not noticing their illness, concluding that birds are silent when
they are in fact singing
(Example 13.1), or wrongly concluding that two
groups do not differ when an important difference does in fact exist
between the two groups. The probability of a Type II error is referred to
with the symbol $\beta$.
If H0 is in fact not true (there is a difference, the message is 'spam',
birds are singing, etc.), then H0 should be rejected, and
$\beta$ should thus be as small as possible. The probability of *rightly*
rejecting H0 is then $(1-\beta)$ (see complement rule \@ref(eq:complementrule));
this probability $(1-\beta)$ is called the *power*.
Power can be interpreted as **the probability of the researcher
being right** (H0 is rejected) **when she is indeed right** (H0 is untrue).
The probabilities of Type I and Type II errors have to be weighed up
carefully against each other. In many studies, the values
$\alpha=.05$ (significance level) and $\beta=.20$
(power $=.80$) are used. With these values, an implicit weighting is made
that a Type I error is 4× as grave as a Type II error.
For some studies that might be justified, but it is also easily
conceivable that, under certain circumstances, a Type II error is
actually more grave or serious than a Type I error. If we consider both types of
error more or less equally grave, then we should strive for
a smaller $\beta$ and larger power [@Rose08].
The power of a study depends on three factors: (i) the effect
size $d$, which in turn depends on the measured difference
$D$ and the variation $s$ in the observations
(formula \@ref(eq:d-standardized)), (ii) the sample size $N$, and (iii)
the significance level $\alpha$. In the following sections, we will
discuss each of these factors separately, and, when doing so, keep the other
two factors as constant as possible. For this discussion, we will use the
depictions of calculated power
(Figures \@ref(fig:powercontours-alpha01) and
\@ref(fig:powercontours-alpha05)). The depicted power contours are specifically
for a *t*-test for independent samples
(§\@ref(sec:ttest-indep)) with two-sided testing.
The relations discussed below also apply to other statistical
tests.
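
Such power values can be computed point by point with R's built-in `power.t.test()` function, which is also used to generate the figures below. A minimal sketch, with illustrative values for $d$, $n$ and $\alpha$ (with the default `sd = 1`, the argument `delta` corresponds to the standardised effect size $d$):

```{r power-single-point, eval=FALSE}
# power of a two-sided t-test for two independent groups,
# for one illustrative combination of effect size, group size and alpha:
power.t.test(delta = 0.5,                  # standardised effect size d (sd = 1)
             n = 50,                       # observations per group
             sig.level = .05,              # two-sided significance level alpha
             type = "two.sample",
             alternative = "two.sided")$power
```
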
```{r powercontours-alpha01, echo=FALSE, fig.cap="Power expressed in contours (see shading), dependent on the standardised effect size (d) and the sample size (n), according to a two-sided t-test for unpaired, independent observations, with significance level alpha=.01."}
# adapted from plot.powercontours.2050705.R
# HQ 20150705
alpha <- .01
x1 <- seq(0.1, 1.0, by = .05)   # effect size d
x2 <- seq(2, 150, by = 5)       # sample size n per group
zz <- matrix(NA, nrow = length(x1), ncol = length(x2))
rownames(zz) <- x1
colnames(zz) <- x2
for (i in x1) {
  for (j in x2) {
    zz[as.character(i), as.character(j)] <-
      power.t.test(delta = i, n = j, sig.level = alpha,
                   type = "two.sample", alternative = "two.sided")$power
  }
}
filled.contour(x1, x2, zz, levels = c(0:10)/10,
               col = gray(c(10:0)/10), key.title = title(main = "Power"),
               xlab = "Effect size (d)", ylab = "Sample size (n per group)")
title(main = "t-test for\nindependent samples", line = 1.1)
```
```{r powercontours-alpha05, echo=FALSE, fig.cap="Power expressed in contours (see shading), dependent on the standardised effect size (d) and the sample size (n), according to a two-sided t-test for unpaired, independent observations, with significance level alpha=.05."}
# adapted from plot.powercontours.2050705.R
# HQ 20150705
alpha <- .05
x1 <- seq(0.1, 1.0, by = .05)   # effect size d
x2 <- seq(2, 150, by = 5)       # sample size n per group
zz <- matrix(NA, nrow = length(x1), ncol = length(x2))
rownames(zz) <- x1
colnames(zz) <- x2
for (i in x1) {
  for (j in x2) {
    zz[as.character(i), as.character(j)] <-
      power.t.test(delta = i, n = j, sig.level = alpha,
                   type = "two.sample", alternative = "two.sided")$power
  }
}
filled.contour(x1, x2, zz, levels = c(0:10)/10,
               col = gray(c(10:0)/10), key.title = title(main = "Power"),
               xlab = "Effect size (d)", ylab = "Sample size (n per group)")
title(main = "t-test for\nindependent samples", line = 1.1)
```
## Relation between effect size and power {#sec:effectsize-power}
The two figures \@ref(fig:powercontours-alpha01) and
\@ref(fig:powercontours-alpha05) show that, in general, the larger the effect to be
tested (further to the right in each figure), the larger the power.
This is not surprising: under the same circumstances, a larger effect has
a higher probability of being detected in a statistical test. A moderately large effect of
$d=.5$, with $n=30$ observations in each group, only has a probability
of $.48$ of being detected (if $\alpha=.05$, Figure \@ref(fig:powercontours-alpha05)). In a study
with $n=30$ observations per group, it is thus effectively a gamble
whether the researcher will actually detect such an effect and
reject H0. Put otherwise, the probability of a Type I error is
admittedly safely low ($\alpha=.05$), but the probability of a Type II error is
more than $10\times$ as large, and thus dangerously high ($\beta=.52$) [@Rose08, Ch.12].
A larger effect of $d=.8$, for example, results in a power of
$.86$ under the same conditions. The probability of a Type II error here, $\beta=.14$, is
admittedly also larger than the probability of a Type I error, but the
ratio $\beta/\alpha$ is considerably less skewed.
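
These power values can be checked directly with `power.t.test()`; a minimal sketch, using the same settings as in the text ($n=30$ per group, two-sided testing, $\alpha=.05$):

```{r power-check-effectsize, eval=FALSE}
# power for a moderately large effect (d = .5) with n = 30 per group:
power.t.test(delta = 0.5, n = 30, sig.level = .05,
             type = "two.sample", alternative = "two.sided")$power   # approx. .48
# power for a large effect (d = .8) in the same design:
power.t.test(delta = 0.8, n = 30, sig.level = .05,
             type = "two.sample", alternative = "two.sided")$power   # approx. .86
```
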
As researchers, we have only an indirect influence on effect size.
We of course have no influence on the true raw difference $D$ in the
population. For the power, however, it is not the raw difference $D$
that matters, but rather the standardised difference $d=D/s$
(§\@ref(sec:ttest-effectsize)). Thus, if we ensure that the standard
deviation $s$ decreases, then $d$ will increase, and with it the power
(figures \@ref(fig:powercontours-alpha01) and \@ref(fig:powercontours-alpha05));
we then have a higher probability
of actually detecting an effect!
With this goal in mind, researchers always strive to neutralise
disrupting influences from all kinds of other factors as much as possible.
After all, such disrupting influences produce extra variability in the observations, and,
with that, a lower power in the statistical testing.
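
A small numerical sketch of this point, with made-up values purely for illustration: suppose the true raw difference between the groups is $D=3$ units, with $n=30$ per group and $\alpha=.05$. Reducing the variability from $s=6$ to $s=3.75$ raises the standardised effect size from $d=0.5$ to $d=0.8$, and with it the power (cf. the values computed above):

```{r power-sd-example, eval=FALSE}
D <- 3                                       # hypothetical raw difference between groups
power.t.test(delta = D/6, n = 30)$power      # s = 6    -> d = 0.5, power approx. .48
power.t.test(delta = D/3.75, n = 30)$power   # s = 3.75 -> d = 0.8, power approx. .86
```
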
In a well-designed study, we want to determine in advance what
the power will be, and how large the sample should be (see below).
For this, we need an estimate of the smallest effect size $d$
which we still want to be able to detect
(§\@ref(sec:ttest-effectsize)) [@Quene10]. To estimate this effect size, firstly,
an estimate of the raw difference $D$ between the groups or conditions is needed.
Secondly, an estimate is needed of the variability $s$
in the observations. These estimates can often be deduced from
earlier publications, in which the standard deviation
$s$ is usually reported. If no earlier research reports are available,
then $s$ can be roughly estimated from some informal
'pilot' observations: take the difference between the highest and the lowest
of these (the range), divide this range by 4, and use the result as
a rough estimate of $s$ [@PD08].
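
A minimal sketch of this range/4 rule of thumb in R, with hypothetical pilot observations and a hypothetical smallest difference of interest `D.min` (both invented here purely for illustration):

```{r pilot-estimate-example, eval=FALSE}
pilot <- c(12, 17, 14, 21, 16, 19)    # hypothetical informal pilot observations
s.rough <- diff(range(pilot)) / 4     # rough estimate of s: range divided by 4
D.min <- 2                            # hypothetical smallest raw difference of interest
d.min <- D.min / s.rough              # smallest standardised effect size to detect
d.min
```
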
## Relation between sample size and power {#sec:samplesize-power}
The relation between the sample size $N$ and the power of a study
is illustrated in
Figure \@ref(fig:powercontours-alpha01) for a strict significance
level $\alpha=.01$, and in
Figure \@ref(fig:powercontours-alpha05) for the most commonly used
significance level $\alpha=.05$. The figures show that, in general, the larger the
sample (further upwards), the larger the power.
The increase is steeper (power increases more quickly) with larger
effects (right-hand side) than with smaller effects (left-hand side). Put differently:
with small effects, the sample is in practice almost always too small to detect
these effects with sufficient power. We already saw this in
Example 13.3 (Chapter \@ref(ch-testing)).
The two figures \@ref(fig:powercontours-alpha01) and
\@ref(fig:powercontours-alpha05) are based on the comparison between two groups which are equally large, each with
precisely half of the observations ($n_1=n_2=N/2$). That is also the most efficient design.
The power depends on the *harmonic mean* of $n_1$ and $n_2$ (see §\@ref(sec:harmonicmean)), and
that harmonic mean is never larger than the arithmetic mean of the two group sizes (and smaller whenever the groups differ in size). It is thus
advisable to ensure that the groups or samples which you compare are approximately equally large.
---
> *Example 14.1*: In a study, two groups of participants are compared, with $n_1=10$ and $n_2=50$
($N=n_1+n_2=10+50=60$). The harmonic mean of $n_1=10$ and $n_2=50$ is $H\approx17$. This study thus has about the same
power as a smaller study with two equally large groups of 17 participants each, i.e. 34 participants in total. In this
study, 26 more participants than necessary have thus been investigated (and burdened)
(see also §\@ref(sec:design)) [@ACA11, p.295].
---
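
A minimal sketch of this calculation in R, using the group sizes of Example 14.1:

```{r harmonicmean-example, eval=FALSE}
n1 <- 10
n2 <- 50
H <- 2 / (1/n1 + 1/n2)   # harmonic mean of the two group sizes
H                        # approx. 16.7, i.e. about 17 per group
```
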
## Relation between significance level and power {#sec:significancelevel-power}
The relation between the significance level $\alpha$ and the power is
illustrated by the difference between the two figures
\@ref(fig:powercontours-alpha01) and
\@ref(fig:powercontours-alpha05). For each combination of effect size
and sample size, the power is lower in
Figure \@ref(fig:powercontours-alpha01) for $\alpha=.01$ than in
Figure \@ref(fig:powercontours-alpha05) for $\alpha=.05$. If we choose a higher
significance level $\alpha$, then the probability of rejecting H0
increases, and with it the power, i.e. the probability of correctly rejecting H0 when H0
is untrue (see Table \@ref(tab:H0H1outcomes)).
Unfortunately, however, with a higher significance
level $\alpha$, the probability of wrongly rejecting H0 (i.e. of making a
Type I error) also increases. The investigator
must therefore make a well-considered trade-off between Type I errors (with
probability $\alpha$) and Type II errors (with probability $\beta$); as said earlier,
this trade-off has to depend on the seriousness of (the consequences of) these
two types of errors.
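
The effect of $\alpha$ can again be illustrated with `power.t.test()`, holding the effect size and group size fixed (illustrative values) and varying only the significance level; this reproduces, at one point, the difference between the two figures:

```{r power-alpha-example, eval=FALSE}
# same effect size and group size, two different significance levels:
power.t.test(delta = 0.5, n = 30, sig.level = .01,
             type = "two.sample", alternative = "two.sided")$power
power.t.test(delta = 0.5, n = 30, sig.level = .05,
             type = "two.sample", alternative = "two.sided")$power   # approx. .48 (see above)
```
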
## Disadvantages of insufficient power
Unfortunately, many examples can be found of 'underpowered' research
in the domain of language and communication. Such research has too small
a probability of rejecting H0 when the investigated effect does indeed
exist (H0 is not true). Why is that bad [@Quene10]?

Firstly, the Type II error which occurs here can have serious consequences:
a treatment method which is actually better is not recognised as such,
a patient is not diagnosed or is diagnosed wrongly, a useful innovation is wrongly pushed
aside. This error hinders the growth of our knowledge and insight, and hinders
scientific progress (see also
Example 3.2 in Chapter \@ref(ch-integrity)).
```{r underpoweredeffectsizes, echo=FALSE, fig.cap="Effect sizes (along the horizontal axis) found in simulated experiments (two-sided t-test for independent observations, alpha=.05), broken down according to sample size (left panel $n=20$, right panel $n=80$) and according to the testing result (dark symbols: significant; light symbols: not significant). The true effect size between groups is always $d=0.5$, indicated by the grey dashed line. The mean effect size found among the significant outcomes is indicated by the black dashed line. For each sample size, 100 simulations have been carried out (along the vertical axis)."}
# adapted from plot.underpoweredeffectsizes.20170225.R
# HQ 20150707; adapted HQ 20170225 for inaugural lecture, with 2 panels
set.seed(2003)
nsim <- 100
ii <- 1:nsim
nn <- c(20, 80)    # values of n per group to simulate
dnominal <- 0.5    # true effect size d is fixed at 0.5

# pooled standard deviation of two samples
sd.pooled <- function(x1, x2) {
  nx1 <- length(x1)
  nx2 <- length(x2)
  var.pooled <- ((nx1-1)*var(x1) + (nx2-1)*var(x2)) / (nx1+nx2-2)
  return(sqrt(var.pooled))
}

# matrices to store the observed effect size (d) and p-value of each simulation
dres <- matrix(NA, nrow=length(ii), ncol=length(nn))
pres <- matrix(NA, nrow=length(ii), ncol=length(nn))
rownames(dres) <- ii; colnames(dres) <- nn
rownames(pres) <- ii; colnames(pres) <- nn
for (n in nn) {
  for (i in ii) {
    x1 <- rnorm(n, dnominal, 1)   # group with effect d = 0.5
    x2 <- rnorm(n, 0, 1)          # baseline group
    dres[as.character(i), as.character(n)] <- (mean(x1)-mean(x2)) / sd.pooled(x1, x2)
    pres[as.character(i), as.character(n)] <- t.test(x1, x2)$p.value
  }
}

op <- par(mfrow=c(1, length(nn)), mar=c(5, 0.5, 5, 0.5))
for (n in 1:length(nn)) {
  dreported <- mean(dres[pres[ii, as.character(nn[n])] < .05, as.character(nn[n])])
  thebias <- (dreported - dnominal)
  thepower <- round(table(pres[ii, as.character(nn[n])] < .05)[2] / nsim, 2)
  plot(dres[ii, as.character(nn[n])], ii,
       pch=ifelse(pres[ii, as.character(nn[n])] < .05, 19, 1),
       xlim=c(dnominal-1, dnominal+1), yaxt="n",
       xlab="", ylab="", type="n",
       main=paste("n1 = n2 =", nn[n],
                  "\nsd(d) =", round(sd(dres[, as.character(nn[n])]), 2),
                  "\npower = ", thepower, "\nbias =", round(thebias, 2)))
  # reference lines, drawn before the points
  abline(v=dnominal, lty=2, lwd=2, col="grey")    # true effect size
  abline(v=dreported, lty=3, lwd=2)               # mean d of significant outcomes
  points(dres[ii, as.character(nn[n])], ii,
         pch=ifelse(pres[ii, as.character(nn[n])] < .05, 19, 1))
}
mtext("effect size (d)", side=1, line=-2, outer=TRUE)
par(op)
```
The outcomes of simulated experiments with different
sample sizes, and thus with different power, are summarised
in Figure \@ref(fig:underpoweredeffectsizes). We explain the second disadvantage on the basis of
this somewhat complex figure. In the left panel of Figure
\@ref(fig:underpoweredeffectsizes), we can see that the different
(simulations of) 'underpowered' studies show a mixed
picture. Some of these studies do show a significant effect (dark symbols),
and many other studies do not (light symbols).
Such a mixed picture usually leads to follow-up research, in which
people try to find out *why* the effect does occur in some studies
and not in others. Might the difference in results be attributable
to differences in stimuli? participants? tasks?
instruments? All that follow-up research is *superfluous*, though: the mixed
picture from these studies can be explained by the low
power of each individual study. The needless follow-up research
costs a great deal of time and money (and indirect costs,
see Chapter \@ref(ch-integrity)), and comes at the cost of other, more useful
research [@Schm96, p.118]. Put otherwise: one well-designed study
with more than sufficient power can prevent many needless follow-up studies.
The third disadvantage is based on the experience that studies in which
a significant effect is found (dark symbols) have a higher probability
of being reported; this phenomenon is called
'publication bias' or the 'file drawer problem'. After all, a positive result
often does get published, whilst a negative result often disappears
into a file drawer. Combined with low power, this leads to the third disadvantage,
namely an overestimation or 'bias' of the reported effect size.
In an underpowered study, after all, an effect
must be quite large in order to be found at all. In the left panel, we can see that
a significant effect has been found only $31\times$. The average effect size of these
31 significant outcomes is $\overline{d_{\textrm{signif}}}=0.90$ (black dashed line), i.e.
a distortion or 'bias' of $0.40$ relative to the actual
$d=0.50$ (grey dashed line)[^fn14-1]. In the right panel, we can see that
a significant effect has been found $91\times$ (thus the power here is sufficient).
The mean effect size of these 91 significant outcomes hardly deviates
from the actual $d$. Moreover, the standard deviation of the reported
effect size is smaller, and that is again important for later research, meta-analyses,
and systematic reviews.
Fourthly, the mixed picture from the different studies, sometimes with significant
outcomes and sometimes without, and with great variation in the reported
effect size, carries the danger that these
outcomes are taken less seriously by 'consumers' of
scientific knowledge (practitioners, health insurers, developers,
policy makers, etc.). In this way, these consumers get the impression
that the scientific evidence for the investigated effect is not strong,
and/or that the researchers disagree about whether the effect exists
and, if so, how large it is [@Kolf93] (Figure \@ref(fig:underpoweredeffectsizes)).
This too hinders scientific progress, and it hinders the use
of scientific insights in societal applications.
To avoid all these objections, researchers have to take the desired power
of a study into account at an early stage. Designing and conducting
a study with insufficient power is, after all, at odds with the
ethical and moral principles of diligence and responsibility discussed earlier
(§\@ref(sec:integrity-introduction)).
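
In practice this means carrying out a prospective power analysis while planning the study. A minimal sketch, assuming (only for illustration) that the smallest effect size of interest is $d=0.5$ and that the conventional values $\alpha=.05$ and power $=.80$ are used:

```{r power-planning-example, eval=FALSE}
# required group size for the smallest effect size of interest (here d = 0.5),
# with alpha = .05 (two-sided) and desired power = .80:
power.t.test(delta = 0.5, power = .80, sig.level = .05,
             type = "two.sample", alternative = "two.sided")$n
# yields roughly 64 participants per group
```
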
[^fn14-1]: A replication study which does have sufficient power thus typically finds a smaller effect than the
original 'underpowered' study. The smaller effect found in the replication study is then often not
significant either. We then say that the replication study "fails to replicate" the effect that was significant in
the original study, whereas that original effect was in fact a spurious finding.