# Freestyle
```{r}
#| include: false
source("./_common.R")
```
> ... in which we learn about ANOVA, explore ggplot extensions and
> build an interactive web application with Shiny.
{{< youtube gZ0u6b_1ej4 >}}
## Setup
Today we are using quite a few packages.
In the lecture they are of course introduced one by one,
but because it is good practice to keep all your imports
and dependencies near the top of your analysis,
they all show up here already.
```{r}
library(multcomp)
library(patchwork)
library(gganimate)
library(palmerpenguins)
library(broom)
library(tidyverse)
```
## ANOVA
Let's get started.
ANOVA stands for Analysis of Variance.
We start with our familiar penguins dataset and ask
the question: "Is there a difference in bill length between the species?"
```{r}
penguins %>%
  ggplot(aes(species, bill_length_mm)) +
  geom_boxplot(outlier.color = NA) +
  geom_jitter(alpha = 0.2, width = 0.2)
```
Visually, it certainly looks like it.
We could run a bunch of t-tests to compare the groups,
which would be 3 in total (Adelie–Chinstrap, Adelie–Gentoo, Gentoo–Chinstrap),
and then we would have to correct for multiple testing.
At this point, running only one of the tests (like Adelie vs. Chinstrap)
because it looks so promising is not an option!
Looking at the data visually is already a form of comparison,
so we would be cheating if we formed our hypothesis only after
looking at the data.
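Just to illustrate what that approach would look like, here is a small sketch using base R's `pairwise.t.test` with a Bonferroni correction; it is only for comparison and not how we proceed below:

```{r}
#| eval: false
# Sketch only: all pairwise t-tests with a Bonferroni correction.
# Missing bill lengths are dropped first to keep the comparison clean.
penguins %>%
  drop_na(bill_length_mm) %>%
  with(pairwise.t.test(bill_length_mm, species, p.adjust.method = "bonferroni"))
```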
**ANOVA** is a way of comparing multiple groups and tells us if there
is a difference between the groups at all.
It does not, however, tell us between which groups the difference exists,
just that there is an overall difference.
In order to get p-values for direct comparisons between the groups
we run a **post-hoc** test (Latin for "after the fact") on our ANOVA result.
First I am sliding in a little extra code chunk, because later down
the line we might want to compare all groups to e.g. a baseline or
control group (see the section about Dunnett below), and this baseline
is taken to be the first factor level of the grouping variable.
With `fct_relevel` we can move a level to the first (or any other)
position.
```{r}
penguins <- penguins %>%
  mutate(species = fct_relevel(species, "Gentoo"))
```
Now we create an ANOVA fit with `aov`.
```{r}
anova_fit <- aov(bill_length_mm ~ species, data = penguins)
anova_fit
```
And explore it further with various functions.
```{r}
summary(anova_fit)
```
```{r}
tidy(anova_fit)
```
### Tukey Post-Hoc Test
The most straightforward post-hoc test, which is already built into R,
is "Tukey's Honest Significant Differences",
which compares every group with every other group.
```{r}
TukeyHSD(anova_fit) %>%
  tidy()
```
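Base R also provides a `plot()` method for the `TukeyHSD` result that draws the confidence interval for each pairwise difference; a quick sketch:

```{r}
# Confidence intervals for all pairwise differences from Tukey's test.
plot(TukeyHSD(anova_fit))
```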
### Dunnett Post-Hoc Test
If we have a control group we use Dunnett's test to compare all other
groups to the control group.
This means we have to do fewer comparisons and end up with higher statistical power.
Unfortunately, Dunnett's test is not built into base R, but we can get the necessary
functions from the `multcomp` package.
When loading the `multcomp` package we have to be careful
to load it before the `tidyverse`,
because `multcomp` also loads a bunch of other packages and one
of them brings its own `select` function, so loading it after
the `tidyverse` would mask the tidyverse `select` and make our life very hard.
```{r}
glht(anova_fit, mcp(species = "Dunnett")) %>%
  tidy()
```
Note that this package can also do Tukey's test by changing "Dunnett"
to "Tukey".
## ggplot extensions
In order to plot these nice significance stars on top of our
comparisons, let me first introduce you to ggplot extensions:
<https://exts.ggplot2.tidyverse.org/>
This is a collection of packages that extend ggplot in some shape or form
or simply work very well with it.
### `ggsignif`, `ggpubr`
One of these is `ggsignif`, which we can use to add comparison
brackets to our plot.
```{r}
penguins %>%
  ggplot(aes(species, bill_length_mm)) +
  geom_boxplot(outlier.color = NA) +
  geom_jitter(alpha = 0.2, width = 0.2) +
  ggsignif::geom_signif(
    comparisons = list(c("Gentoo", "Adelie")),
    annotations = c("hello")
  )
```
`ggsignif` can also run its own comparisons, using
a Wilcoxon rank sum test by default, but especially
for statistical tests I prefer to run them myself
rather than blindly trusting yet another package.
Additionally, we have a different type of test here (ANOVA),
and we already have our values, so we use it just to manually
add text to the annotation:
```{r}
stars <- function(p) {
  case_when(
    p <= 0.001 ~ "***",
    p <= 0.01  ~ "**",
    p <= 0.05  ~ "*",
    TRUE       ~ "ns"
  )
}
dunnett <- glht(anova_fit, mcp(species = "Dunnett")) %>%
  tidy() %>%
  mutate(contrast = str_split(contrast, " - "),
         stars = stars(adj.p.value))

plt <- penguins %>%
  ggplot(aes(species, bill_length_mm)) +
  geom_boxplot(outlier.color = NA) +
  geom_jitter(alpha = 0.2, width = 0.2) +
  ggsignif::geom_signif(
    comparisons = dunnett$contrast,
    annotations = dunnett$stars,
    y_position = c(60, 65)
  )
plt
```
### `patchwork`
I also saved the plot to a variable and will now create a second
one just to show you the `patchwork` package,
which can combine plots into neat layouts:
```{r}
plt2 <- ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = species)) +
  geom_point()

((plt | plt2) / plt2) + plot_layout(guides = "collect")
```
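`patchwork` can also tag the individual panels for you; as a small sketch (the tag style is just one option), `plot_annotation()` adds the labels:

```{r}
# Same layout as above, with panels tagged A, B, C.
((plt | plt2) / plt2) +
  plot_annotation(tag_levels = "A") +
  plot_layout(guides = "collect")
```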
### `ggfortify`
`ggfortify` might not strictly be necessary, but I want to mention it
in order to talk about `autoplot`.
`autoplot` is already part of `ggplot2` by default.
It creates an automatic plot for certain objects, like models for example,
or an ANOVA result.
`ggfortify` adds `autoplot` methods for many more kinds of objects.
```{r}
library(ggfortify)
autoplot(anova_fit)
```
I want to mention this here because it has an important implication
for communication.
Plots serve two purposes.
The first purpose is to generate insight for you, the data analyst.
Autoplots are often very helpful for this.
But the second purpose is to communicate your findings to a reader!
It takes time and effort to refine a plot from something that generated
an insight for you, after having spent a lot of time with your data,
into something that also generates insight for a reader who sees the data for the first time.
Too many people stop at the first step and publish their plots
as is, without thinking much about the reader.
Keep this in mind as you advance in your scientific career.
> Communication is key!
Both code and plots are means of communicating.
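As a small example of that second step, here is one possible (by no means the only) way to polish the `plt` object from above with explicit labels, units, and a lighter theme:

```{r}
# One possible reader-facing polish of the earlier comparison plot.
plt +
  labs(
    x = NULL,
    y = "Bill length (mm)",
    title = "Bill length differs between penguin species",
    caption = "Brackets show Dunnett post-hoc comparisons against Gentoo"
  ) +
  theme_minimal()
```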
### `esquisse`
Two more cool packages: the `esquisse` package lets you
create ggplots interactively with a GUI!
But make sure to always save the code it generates to ensure
reproducibility!
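A minimal sketch of how you might launch it (assuming the package is installed; the chunk is not evaluated because it opens an interactive window):

```{r}
#| eval: false
# Opens an interactive drag-and-drop plot builder for the penguins data.
# Copy the generated ggplot code back into your script afterwards.
esquisse::esquisser(penguins)
```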
### `gganimate`
And `gganimate` extends `ggplot` with the ability to use time as a dimension:
```{r}
gapminder::gapminder %>%
  ggplot(aes(gdpPercap, lifeExp, size = pop, color = country)) +
  geom_point() +
  scale_x_log10() +
  transition_time(year) +
  scale_color_manual(values = gapminder::country_colors) +
  guides(color = "none",
         size = "none")
```
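To render the animation explicitly and write it to a file, you would typically use `animate()` and `anim_save()`; a sketch (not evaluated here, and the file name is just an example):

```{r}
#| eval: false
# Render the last ggplot (with its transition) and save it as a GIF.
anim <- animate(last_plot(), nframes = 100, fps = 10)
anim_save("gapminder.gif", animation = anim)
```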
## Build Apps with `shiny`
You can find the code written during the lecture in: `example-app/`.
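That folder contains the actual app from the lecture; as a reminder of the general structure only, here is a minimal, hypothetical sketch of a Shiny app (a UI, a server function, and `shinyApp()` tying them together), not the lecture app itself:

```{r}
#| eval: false
library(shiny)
library(ggplot2)
library(palmerpenguins)

# UI: a dropdown to pick a species and a slot for the plot.
ui <- fluidPage(
  selectInput("species", "Species", choices = levels(penguins$species)),
  plotOutput("bill_plot")
)

# Server: redraw the histogram whenever the selected species changes.
server <- function(input, output) {
  output$bill_plot <- renderPlot({
    ggplot(subset(penguins, species == input$species), aes(bill_length_mm)) +
      geom_histogram()
  })
}

shinyApp(ui, server)
```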
## Exercises
### The Whole Deal
Data analysis takes practice.
For this last exercise you get the option to show what you
have learned on a fresh dataset.
But people have different interests, so I am leaving
the choice open.
Choose a dataset from the TidyTuesday project:
<https://github.com/rfordatascience/tidytuesday#datasets>
and write your data analysis report.
Aim to write down some questions you have about the
topic and answer them with the available data.
Explore it with plots and tables and try
to end up with 1 or 2 high-quality plots
that clearly communicate your findings.
Above all, have fun!
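If you are unsure how to get the data into R, one option (just a suggestion, not a requirement) is the `tidytuesdayR` helper package; the date below is only an example week:

```{r}
#| eval: false
# Load all datasets from one TidyTuesday week (the date is an example).
tt <- tidytuesdayR::tt_load("2020-07-28")
names(tt)
```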
## Feedback
I will send around a link to a feedback form.
It is anonymous.