-
Notifications
You must be signed in to change notification settings - Fork 41
/
05-data_tidying.Rmd
524 lines (386 loc) · 12.8 KB
/
05-data_tidying.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
# Data tidying
```{r, include=FALSE}
library(tidyverse)
```
**Learning objectives:**
- Classify datasets as **tidy** or **non-tidy.**
- **Pivot** data to make it tidy.
- Recognize reasons for **non-tidy data**.
Apart from the book content, these slides also borrow from [`vignette("pivot", "tidyr")`](https://tidyr.tidyverse.org/articles/pivot.html) and contain some new diagrams.
## Tidy data: what? {-}
![](images/6-tidy-1.png)
- **variables** are the **columns**
- **observations** are the **rows**
- **values** are the **cells**
**`<meaningful things>`** are in **`<places>`**
## Tidy data: why? {-}
- packages in the **tidyverse** are designed to work with **tidy data**
- hence tidy data allow to focus on the data questions
- placing variables in **columns** allows to take advantage of R’s **vectorized** nature
## An example (1) {-}
```{r include=FALSE}
table0 <- table1 |>
pivot_wider(
names_from = year,
values_from = c(cases, population),
)
```
```{r}
table0
```
## An example (2) {-}
```{r}
table1
```
## An example (3) {-}
```{r}
table2
```
## Which example dataframe was tidy? {-}
## Which example dataframe was tidy? {-}
```{r}
table1
```
## The challenge (1) {-}
- most real data is **untidy**: to facilitate some goal other than analysis
- so you often need to **tidy** the original data. This takes two steps:
1. determine the underlying variables and observations
1. pivot your data into a tidy form
## The challenge (2) {-}
It may be difficult to define what a variable is, especially if you miss context.
- e.g. columns `measurement_1` and `measurement_2`: replicates of the same variable, or referring to different methods / states?
If you cannot figure out, then it’s fine to say a variable is whatever makes your **analysis** easiest.
## Two different scenarios (1) {-}
1. 'too wide' is most common: _values_ (of a variable) have ended up _in **column names**_
- tidy by making the data **longer**: pivot these values into their own column (variable)
## Too wide {-}
```{r}
table4a # only has the 'cases', not the 'population'
```
## Two different scenarios (2) {-}
1. 'too wide' is most common: _values_ (of a variable) have ended up _in **column names**_
- tidy by making the data **longer**: pivot these values into their own column (variable)
1. 'too long' is less common: _names_ of several variables exist as _**cells in one column**_, their values in another
- tidy by making the data **wider**: pivot the column with variable names into several columns (variables)
## Too long {-}
```{r}
table2
```
## Tidy your data (1) {-}
- Make too wide data **longer** with `tidyr::pivot_longer()`.
- Make too long data **wider** with `tidyr::pivot_wider()`.
We'll focus on the relationship between both functions in different situations.
## Tidy your data (2) {-}
Most function arguments contain the strings `names` or `values`:
- **`names`**: refers to **variable** whose _values_ generate **column names** in **wide** table
- **`values`**: refers to **variable** whose _values_ generate **cells** in **wide** table
These arguments always get the **name** of the corresponding variable in the **long** table!
## 1. One `names` variable and one `values` variable {-}
![](images/5_diagram_1.png)
## Example {-}
```{r}
table4a
table_tidy <-
table4a |>
pivot_longer(
cols = !country,
names_to = "year",
values_to = "cases"
)
table_tidy
```
```{r}
table_tidy |>
pivot_wider(
names_from = year,
values_from = cases
)
```
## 2. Multiple `names` variables and one `values` variable {-}
![](images/5_diagram_2.png)
## Example {-}
```{r include=FALSE}
perc_wide <- tibble(
"2018_A" = 100,
"2018_B" = 0,
"2019_A" = 0,
"2019_B" = 100,
"2020_A" = 40,
"2020_B" = 60
)
```
```{r}
perc_wide
perc_tidy <- perc_wide |>
pivot_longer(
cols = everything(),
names_to = c("year", "type"),
names_sep = "_",
values_to = "percentage"
)
perc_tidy
perc_tidy |>
pivot_wider(
names_from = c(year, type),
names_sep = "_",
values_from = percentage
)
```
## Intermezzo: missing values (1) {-}
```{r}
billboard
```
```{r}
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank"
)
```
## Intermezzo: missing values (2) {-}
Notice the effect of `values_drop_na` in `pivot_longer()`!
```{r}
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
```
## Intermezzo: missing values (3) {-}
```{r include=FALSE}
weekdays <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
daily <- tibble(
day = factor(c("Tue", "Thu", "Fri", "Mon"), levels = weekdays),
value = c(2, 3, 1, 5)
)
```
Suppose we have a factor variable `day`:
```{r}
daily
levels(daily$day)
```
Now you want a wide table including missing values for the absent factor levels.
Does this work?
```{r}
daily |>
pivot_wider(
names_from = day,
values_from = value
)
```
Obviously not.
## Intermezzo: missing values (4) {-}
Notice the effect of `names_expand` in `pivot_wider()`!
```{r}
daily |>
pivot_wider(
names_from = day,
values_from = value,
names_expand = TRUE
)
```
## Intermezzo: missing values (5) {-}
Remember this example:
```{r}
perc_tidy
```
```{r}
perc_wide
```
## Intermezzo: missing values (6) {-}
Suppose we have this, and you want to make it **wide** and add **zeros**, not `NA`s:
```{r include=FALSE}
percentages <- tibble(
year = c(2018, 2019, 2020, 2020),
type = factor(c("A", "B", "A", "B"), levels = c("A", "B")),
percentage = c(100, 100, 40, 60)
)
```
```{r}
percentages
percentages |>
pivot_wider(
names_from = c(year, type),
values_from = percentage
)
```
That's not it...
## Intermezzo: missing values (7) {-}
Notice the effect of `names_expand` and `values_fill` in `pivot_wider()`!
```{r}
percentages |>
pivot_wider(
names_from = c(year, type),
values_from = percentage,
names_expand = TRUE,
values_fill = 0
)
```
## Recap {-}
So far we had:
1. One `names` variable and one `values` variable
1. Multiple `names` variables and one `values` variable
Plus tricks to generate missing values or deal with them.
Let's consider two more situations:
3. The `values` variable's name is included in the wide-table's column names
4. Multiple `values` variables' names are included in the wide-table's column names
## 3. The `values` variable's name is included in the wide-table's column names {-}
![](images/5_diagram_3.png)
## Example {-}
We take the example from the first situation, but with the name of the `values` variable included in the wide table:
```{r include=FALSE}
cases <- table_tidy |>
pivot_wider(
names_from = year,
values_from = cases,
names_glue = "{.value}_{year}"
)
```
```{r}
cases
```
We can now extract this name _instead_ of defining it with `values_to`:
```{r}
cases_long <- cases |>
pivot_longer(
cols = !country,
names_to = c(".value", "year"),
names_sep = "_"
)
cases_long
identical(cases_long, table_tidy)
```
Recreate the wide table:
```{r}
cases_long |>
pivot_wider(
names_from = year,
values_from = cases,
names_glue = "{.value}_{year}"
)
```
## 4. Multiple `values` variables' names are included in the wide-table's column names {-}
![](images/5_diagram_4.png)
## Example {-}
```{r}
household
household_tidy <- household |>
pivot_longer(
cols = !family,
names_to = c(".value", "child"),
names_sep = "_",
values_drop_na = TRUE
)
household_tidy
household_tidy |>
pivot_wider(
names_from = child,
values_from = c(dob, name),
names_glue = "{.value}_{child}"
)
```
## There's more! {-}
See [book chapter](https://r4ds.hadley.nz/data-tidy) and especially the [vignette](https://tidyr.tidyverse.org/articles/pivot.html)!
- drop or add a (wide-table) column name prefix with the `names_prefix` argument
- drop 'id' columns (i.e. non-pivoted columns) in `pivot_wider()` with the `id_cols` argument
- generate rows with missing values in `pivot_wider()` with the `id_expand` argument
- aggregating values in `pivot_wider()`
- combining `pivot_longer()` and `pivot_wider()` for more complex situations
- for more complex cases, instead of `names_*` and `values_*` arguments, you can construct a single specification dataframe that you can provide to both `pivot_longer_spec()` and `pivot_wider_spec()`
- and more
## Meeting Videos {-}
### Cohort 5 {-}
`r knitr::include_url("https://www.youtube.com/embed/5lsv2ftMouA")`
<details>
<summary> Meeting chat log </summary>
```
00:07:37 Federica Gazzelloni: Hello everyone!
00:07:53 Sandra Muroy: Hi Federica!
00:07:58 Adeyemi Olusola: hello everyon3.
00:09:56 Ryan Metcalf: Good day Cohort!
00:10:15 Sandra Muroy: Hi Adeyemi and Ryan :)
00:13:15 Becki R. (she/her): Hello!
00:13:38 Becki R. (she/her): yes
00:28:57 Keuntae Kim: punctuation mark also works for the numeric column name.
00:29:14 Keuntae Kim: table4a$`1999`
00:29:16 Keuntae Kim: I think..
00:29:47 Keuntae Kim: you got that... haha
00:37:53 Njoki Njuki Lucy: Thank you Jon, you just answered my question :)
00:44:12 Federica Gazzelloni: thanks
00:51:17 lucus w: The new {knitr} syntax allows inline chuck options. Better yet you can use yaml styling
00:52:00 lucus w: ```{r}
00:52:39 lucus w: ```{r} #| echo:FALSE ```
00:58:05 Ryan Metcalf: I know realize….I have an older version of the Print book versus the online version! Now I know why I was scratching my head!
00:58:13 Ryan Metcalf: “Now”
01:02:10 Keuntae Kim: Agree. reshape/reshape seem to begin deprecated even if there are still users (like me). 😀
01:02:14 Jon Harmon (jonthegeek): Yeah, I don't think they've done a printing since pivot_* came out.
01:13:45 lucus w: Good job on going through the exercise but maybe randomly select some
01:13:57 Federica Gazzelloni: agree
01:14:53 Keuntae Kim: agree, but exercises are very interesting. they are what we usually encounter when wrangling the data in practice.
01:17:12 lucus w: Sorry folks I have to drop.. good job Sandra!!
01:20:11 Keuntae Kim: Excellent job Sandra! You really dive in the contents very deeply!
01:20:27 Ryan Metcalf: Great Job Sandra!
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/PB3dBof2q6I")`
<details>
<summary> Meeting chat log </summary>
```
00:11:11 Njoki Njuki Lucy: Hi everyone!
00:12:06 Becki R. (she/her): Hello!
00:14:06 Susie N: You could also just do as.numeric correct?
00:14:09 Susie N: Just not as nice
00:19:26 Jon Harmon (jonthegeek): Yup, the convert just figures out the as.* for you automatically.
00:22:11 Keuntae Kim: https://tldp.org/LDP/Bash-Beginners-Guide/html/sect_04_01.html
00:22:38 Keuntae Kim: simple intro. I used this for my text mining analysis.
00:23:56 Jon Harmon (jonthegeek): I use regexr.com a lot to test out regex quickly.
00:24:47 Keuntae Kim: Thank you for the link!
00:24:52 Federica Gazzelloni: thanks
00:50:11 Susie N: Amazing job breaking it all down!
00:50:41 Njoki Njuki Lucy: Thank you Sandra!
00:55:00 Becki R. (she/her): Thanks, Sandra!
00:55:36 Keuntae Kim: In the wide-format, it is easy to produce additional derivative variables--such as percentages?
00:56:24 Keuntae Kim: It is because, in the wide format, each row represents one single unique observation.
00:56:58 Keuntae Kim: Census data
00:57:03 Keuntae Kim: is organized in the wide format.
00:57:26 Jon Harmon (jonthegeek): Read vignettes here for cases where it makes sense to be wide: https://cran.r-project.org/web/packages/widyr/index.html
00:58:17 Federica Gazzelloni: thanks Sandra
```
</details>
### Cohort 6 {-}
`r knitr::include_url("https://www.youtube.com/embed/e5rO37mHrKQ")`
<details>
<summary> Meeting chat log </summary>
```
00:11:55 Marielena Soilemezidi: hi guys! :)
00:12:14 Adeyemi Olusola: Hello
00:12:16 Daniel Adereti: Hallo!
00:14:58 Marielena Soilemezidi: yep!
00:18:42 Daniel Adereti: The link to the paper: https://www.jstatsoft.org/article/view/v059i10
00:34:12 Daniel Adereti: why is table 2 not tidy again?
00:34:26 Marielena Soilemezidi: it mixes populations and cases
00:44:13 Marielena Soilemezidi: nope! All good :)
00:44:17 Daniel Adereti: none from me
01:12:20 Daniel Adereti: We are overtime though, maybe we conclude with Pivotwider next week?
01:13:06 Marielena Soilemezidi: thank you, Adeyemi!
01:13:06 Daniel Adereti: Fantastic class! Thanks Adeyemi!
01:13:14 Marielena Soilemezidi: great presentation :)
01:13:16 Folashade Bayode: Thank you
```
</details>
### Cohort 7 {-}
`r knitr::include_url("https://www.youtube.com/embed/2zHJ9v8Or9Y")`
<details>
<summary> Meeting chat log </summary>
```
00:54:04 Betsy Muriithi: Sorry for joining so late; I got the dates mixed up.. I though I was doing my presentation next Monday 😔
```
</details>
### Cohort 8 {-}
`r knitr::include_url("https://www.youtube.com/embed/xKthmzkTerU")`
### Cohort 9 {-}
`r knitr::include_url("https://www.youtube.com/embed/tr7b8Fpiezk")`
### Cohort 10 {-}