-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathR-ecology-lesson--03-dplyr.diff
351 lines (318 loc) · 10.7 KB
/
R-ecology-lesson--03-dplyr.diff
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
diff --git a/home/zhian/Documents/Carpentries/Git/zkamvar--postmaul/data/rmd-repos/datacarpentry--R-ecology-lessonR3/_site/03-dplyr.md b/home/zhian/Documents/Carpentries/Git/zkamvar--postmaul/data/rmd-repos/datacarpentry--R-ecology-lessonR4/_site/03-dplyr.md
index ca77bab..8ac831f 100644
--- a/home/zhian/Documents/Carpentries/Git/zkamvar--postmaul/data/rmd-repos/datacarpentry--R-ecology-lessonR3/_site/03-dplyr.md
+++ b/home/zhian/Documents/Carpentries/Git/zkamvar--postmaul/data/rmd-repos/datacarpentry--R-ecology-lessonR4/_site/03-dplyr.md
@@ -485,13 +485,15 @@ surveys %>%
```
:::
+ #> `summarise()` ungrouping output (override with `.groups` argument)
+
You may also have noticed that the output from these calls doesn't run
off the screen anymore. It's one of the advantages of `tbl_df` over data
frame.
You can also group by multiple columns:
-::: {#cb20 .sourceCode}
+::: {#cb21 .sourceCode}
``` {.sourceCode .r}
surveys %>%
group_by(sex, species_id) %>%
@@ -500,6 +502,8 @@ surveys %>%
```
:::
+ #> `summarise()` regrouping output by 'sex' (override with `.groups` argument)
+
Here, we used `tail()` to look at the last six rows of our summary.
Before, we had used `head()` to look at the first six rows. We can see
that the `sex` column contains `NA` values because some animals had
@@ -511,7 +515,7 @@ this, we can remove the missing values for weight before we attempt to
calculate the summary statistics on weight. Because the missing values
are removed first, we can omit `na.rm = TRUE` when computing the mean:
-::: {#cb21 .sourceCode}
+::: {#cb23 .sourceCode}
``` {.sourceCode .r}
surveys %>%
filter(!is.na(weight)) %>%
@@ -520,12 +524,14 @@ surveys %>%
```
:::
+ #> `summarise()` regrouping output by 'sex' (override with `.groups` argument)
+
Here, again, the output from these calls doesn't run off the screen
anymore. If you want to display more data, you can use the `print()`
function at the end of your chain with the argument `n` specifying the
number of rows to display:
-::: {#cb22 .sourceCode}
+::: {#cb25 .sourceCode}
``` {.sourceCode .r}
surveys %>%
filter(!is.na(weight)) %>%
@@ -535,12 +541,14 @@ surveys %>%
```
:::
+ #> `summarise()` regrouping output by 'sex' (override with `.groups` argument)
+
Once the data are grouped, you can also summarize multiple variables at
the same time (and not necessarily on the same variable). For instance,
we could add a column indicating the minimum weight for each species for
each sex:
-::: {#cb23 .sourceCode}
+::: {#cb27 .sourceCode}
``` {.sourceCode .r}
surveys %>%
filter(!is.na(weight)) %>%
@@ -550,11 +558,13 @@ surveys %>%
```
:::
+ #> `summarise()` regrouping output by 'sex' (override with `.groups` argument)
+
It is sometimes useful to rearrange the result of a query to inspect the
values. For instance, we can sort on `min_weight` to put the lighter
species first:
-::: {#cb24 .sourceCode}
+::: {#cb29 .sourceCode}
``` {.sourceCode .r}
surveys %>%
filter(!is.na(weight)) %>%
@@ -565,10 +575,12 @@ surveys %>%
```
:::
+ #> `summarise()` regrouping output by 'sex' (override with `.groups` argument)
+
To sort in descending order, we need to add the `desc()` function. If we
want to sort the results by decreasing order of mean weight:
-::: {#cb25 .sourceCode}
+::: {#cb31 .sourceCode}
``` {.sourceCode .r}
surveys %>%
filter(!is.na(weight)) %>%
@@ -578,6 +590,8 @@ surveys %>%
arrange(desc(mean_weight))
```
:::
+
+ #> `summarise()` regrouping output by 'sex' (override with `.groups` argument)
:::
::: {#counting .section .level4}
@@ -588,7 +602,7 @@ found for each factor or combination of factors. For this task,
**`dplyr`** provides `count()`. For example, if we wanted to count the
number of rows of data for each sex, we would do:
-::: {#cb26 .sourceCode}
+::: {#cb33 .sourceCode}
``` {.sourceCode .r}
surveys %>%
count(sex)
@@ -600,7 +614,7 @@ grouping by a variable, and summarizing it by counting the number of
observations in that group. In other words, `surveys %>% count()` is
equivalent to:
-::: {#cb27 .sourceCode}
+::: {#cb34 .sourceCode}
``` {.sourceCode .r}
surveys %>%
group_by(sex) %>%
@@ -608,9 +622,11 @@ surveys %>%
```
:::
+ #> `summarise()` ungrouping output (override with `.groups` argument)
+
For convenience, `count()` provides the `sort` argument:
-::: {#cb28 .sourceCode}
+::: {#cb36 .sourceCode}
``` {.sourceCode .r}
surveys %>%
count(sex, sort = TRUE)
@@ -622,7 +638,7 @@ rows/observations for *one* factor (i.e., `sex`). If we wanted to count
*combination of factors*, such as `sex` and `species`, we would specify
the first and the second factor as the arguments of `count()`:
-::: {#cb29 .sourceCode}
+::: {#cb37 .sourceCode}
``` {.sourceCode .r}
surveys %>%
count(sex, species)
@@ -635,7 +651,7 @@ For instance, we might want to arrange the table above in (i) an
alphabetical order of the levels of the species and (ii) in descending
order of the count:
-::: {#cb30 .sourceCode}
+::: {#cb38 .sourceCode}
``` {.sourceCode .r}
surveys %>%
count(sex, species) %>%
@@ -645,7 +661,7 @@ surveys %>%
From the table above, we may learn that, for instance, there are 75
observations of the *albigula* species that are not specified for its
-sex (i.e. `NA`).
+sex (i.e. `NA`).
> ### Challenge {#challenge-2 .challenge}
>
@@ -655,7 +671,7 @@ sex (i.e. `NA`).
> ### Answer {#answer-2 .toc-ignore}
>
> ::: {style="background: #fff;"}
-> ::: {#cb31 .sourceCode}
+> ::: {#cb39 .sourceCode}
> ``` {.sourceCode .r}
> surveys %>%
> count(plot_type)
@@ -672,7 +688,7 @@ sex (i.e. `NA`).
> ### Answer {#answer-3 .toc-ignore}
>
> ::: {style="background: #fff;"}
-> ::: {#cb32 .sourceCode}
+> ::: {#cb40 .sourceCode}
> ``` {.sourceCode .r}
> surveys %>%
> filter(!is.na(hindfoot_length)) %>%
@@ -685,6 +701,8 @@ sex (i.e. `NA`).
> )
> ```
> :::
+>
+> #> `summarise()` ungrouping output (override with `.groups` argument)
> :::
> :::
>
@@ -695,7 +713,7 @@ sex (i.e. `NA`).
> ### Answer {#answer-4 .toc-ignore}
>
> ::: {style="background: #fff;"}
-> ::: {#cb33 .sourceCode}
+> ::: {#cb42 .sourceCode}
> ``` {.sourceCode .r}
> surveys %>%
> filter(!is.na(weight)) %>%
@@ -771,13 +789,19 @@ each genus in each plot over the entire survey period. We use
and variables of interest, and create a new variable for the
`mean_weight`.
-::: {#cb34 .sourceCode}
+::: {#cb43 .sourceCode}
``` {.sourceCode .r}
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>%
group_by(plot_id, genus) %>%
summarize(mean_weight = mean(weight))
+```
+:::
+
+ #> `summarise()` regrouping output by 'plot_id' (override with `.groups` argument)
+::: {#cb45 .sourceCode}
+``` {.sourceCode .r}
str(surveys_gw)
```
:::
@@ -787,7 +811,7 @@ across multiple rows, 196 observations of 3 variables. Using `spread()`
to key on `genus` with values from `mean_weight` this becomes 24
observations of 11 variables, one row for each plot.
-::: {#cb35 .sourceCode}
+::: {#cb46 .sourceCode}
``` {.sourceCode .r}
surveys_spread <- surveys_gw %>%
spread(key = genus, value = mean_weight)
@@ -801,7 +825,7 @@ str(surveys_spread)
We could now plot comparisons between the weight of genera in different
plots, although we may wish to fill in the missing values first.
-::: {#cb36 .sourceCode}
+::: {#cb47 .sourceCode}
``` {.sourceCode .r}
surveys_gw %>%
spread(genus, mean_weight, fill = 0) %>%
@@ -836,7 +860,7 @@ called `genus` and value called `mean_weight` and use all columns except
`plot_id` for the key variable. Here we exclude `plot_id` from being
`gather()`ed.
-::: {#cb37 .sourceCode}
+::: {#cb48 .sourceCode}
``` {.sourceCode .r}
surveys_gather <- surveys_spread %>%
gather(key = "genus", value = "mean_weight", -plot_id)
@@ -857,7 +881,7 @@ and it's easier to specify what to gather than what to leave alone. And
if the columns are directly adjacent, we don't even need to list them
all out - just use the `:` operator!
-::: {#cb38 .sourceCode}
+::: {#cb49 .sourceCode}
``` {.sourceCode .r}
surveys_spread %>%
gather(key = "genus", value = "mean_weight", Baiomys:Spermophilus) %>%
@@ -878,13 +902,19 @@ surveys_spread %>%
> ### Answer {#answer-5 .toc-ignore}
>
> ::: {style="background: #fff;"}
-> ::: {#cb39 .sourceCode}
+> ::: {#cb50 .sourceCode}
> ``` {.sourceCode .r}
> surveys_spread_genera <- surveys %>%
> group_by(plot_id, year) %>%
> summarize(n_genera = n_distinct(genus)) %>%
> spread(year, n_genera)
+> ```
+> :::
+>
+> #> `summarise()` regrouping output by 'plot_id' (override with `.groups` argument)
>
+> ::: {#cb52 .sourceCode}
+> ``` {.sourceCode .r}
> head(surveys_spread_genera)
> ```
> :::
@@ -898,7 +928,7 @@ surveys_spread %>%
> ### Answer {#answer-6 .toc-ignore}
>
> ::: {style="background: #fff;"}
-> ::: {#cb40 .sourceCode}
+> ::: {#cb53 .sourceCode}
> ``` {.sourceCode .r}
> surveys_spread_genera %>%
> gather("year", "n_genera", -plot_id)
@@ -921,7 +951,7 @@ surveys_spread %>%
> ### Answer {#answer-7 .toc-ignore}
>
> ::: {style="background: #fff;"}
-> ::: {#cb41 .sourceCode}
+> ::: {#cb54 .sourceCode}
> ``` {.sourceCode .r}
> surveys_long <- surveys %>%
> gather("measurement", "value", hindfoot_length, weight)
@@ -940,7 +970,7 @@ surveys_spread %>%
> ### Answer {#answer-8 .toc-ignore}
>
> ::: {style="background: #fff;"}
-> ::: {#cb42 .sourceCode}
+> ::: {#cb55 .sourceCode}
> ``` {.sourceCode .r}
> surveys_long %>%
> group_by(year, measurement, plot_type) %>%
@@ -948,6 +978,8 @@ surveys_spread %>%
> spread(measurement, mean_value)
> ```
> :::
+>
+> #> `summarise()` regrouping output by 'year', 'measurement' (override with `.groups` argument)
> :::
> :::
:::
@@ -983,7 +1015,7 @@ data.
Let's start by removing observations of animals for which `weight` and
`hindfoot_length` are missing, or the `sex` has not been determined:
-::: {#cb43 .sourceCode}
+::: {#cb57 .sourceCode}
``` {.sourceCode .r}
surveys_complete <- surveys %>%
filter(!is.na(weight), # remove missing weight
@@ -1000,7 +1032,7 @@ how often each species has been observed, and filter out the rare
species; then, we will extract only the observations for these more
common species:
-::: {#cb44 .sourceCode}
+::: {#cb58 .sourceCode}
``` {.sourceCode .r}
## Extract the most common species_id
species_counts <- surveys_complete %>%
@@ -1020,13 +1052,13 @@ To make sure that everyone has the same data set, check that
Now that our data set is ready, we can save it as a CSV file in our
`data` folder.
-::: {#cb45 .sourceCode}
+::: {#cb59 .sourceCode}
``` {.sourceCode .r}
write_csv(surveys_complete, path = "data/surveys_complete.csv")
```
:::
-Page built on: 📆 2020-07-24 ‒ 🕢 19:45:38
+Page built on: 📆 2020-07-24 ‒ 🕢 19:46:31
:::
------------------------------------------------------------------------