-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path610_data_wrangling_2.Rmd
530 lines (349 loc) · 16 KB
/
610_data_wrangling_2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
<!--
This file by Martin Monkman
is licensed under a Creative Commons Attribution 4.0 International License
https://creativecommons.org/licenses/by/4.0/
-->
# Data wrangling—continued {#wrangling2}
## Setup
This chunk of R code loads the packages that we will be using.
```{r setup_610, eval=FALSE}
library(tidyverse)
library(gapminder)
```
## Introduction {#wrangling2-intro}
This chapter revisits the process of data wrangling with some additional techniques.
### `ungroup()`
When we use a `group_by()` and `summarize()` pairing, we can calculate summary statistics for each group in our data.
In this example, we calculate the difference in a country's life expectancy from the continent's mean life expectancy
- for example, the difference in 2007 between life expectancy in Canada and the mean life expectancy of countries in the Americas
What we had seen before was the use of `group_by() |> summarize()` to create summary statistics by a group:
```{r}
gapminder |>
filter(year == 2007) |>
group_by(continent) |>
summarize(lifeExp_mean = mean(lifeExp))
```
But how can we compare the Canada value to the Americas mean shown here? One strategy would be to join this table to the original, using "continent" as the key value.
But there is another solution: `group_by() |> mutate()`
Step 1:
* Filter for 2007
* Then group by continent and
* _Mutate_ to get the mean continental life expectancy
**Note:** the tibble that results from this retains the "Groups:". It is important to be aware of this, since any future manipulations of this table will be based on that grouping.
```{r}
gm_life_2007 <-
gapminder |>
# select and filter
select(country, continent, year, lifeExp) |>
filter(year == 2007) |>
# group_by |> mutate
group_by(continent) |>
mutate(lifeExp_con_mean = mean(lifeExp))
gm_life_2007
```
To that data frame, we can append another mutate to calculate the difference between the individual country and the continent.
Note that addition of the the `ungroup()` function, which removes the grouping.
In the resulting table, we see the observations for 2007, with all the countries in the Americas. Note that they all have the same value of "lifeExp_con_mean", which is subtracted from "lifeExp" to get the "lifeExp_diff" variable. Countries which have a life expectancy above the continental mean show positive values, while those where the life expectancy is below, have negative values.
```{r}
gm_life_2007 |>
ungroup() |>
# subtract continent mean from individual country
mutate(lifeExp_diff = lifeExp - lifeExp_con_mean) |>
# select and filter for just the Americas
filter(continent == "Americas")
```
In the version below, it's a single pipe that uses both year and continent as the grouping variables. This gets the same result with the added comparison of the individual country to the world average for that year, for every year in the data:
```{r}
gm_life <- gapminder |>
# select
select(country, continent, year, lifeExp) |>
# continent life expectancy
group_by(year, continent) |>
mutate(lifeExp_con_mean = mean(lifeExp)) |>
ungroup() |>
mutate(lifeExp_con_diff = lifeExp - lifeExp_con_mean) |>
# world life expectancy
group_by(year) |>
mutate(lifeExp_earth_mean = mean(lifeExp)) |>
ungroup() |>
mutate(lifeExp_earth_diff = lifeExp - lifeExp_earth_mean)
gm_life |>
filter(year == 2007 & continent == "Americas")
```
Now we can create a plot to compare Canada to the averages of the continent and the world over time:
```{r}
gm_life |>
filter(country == "Canada") |>
ggplot() +
geom_line(aes(x = year, y = lifeExp_con_diff), colour = "red") +
geom_line(aes(x = year, y = lifeExp_earth_diff), colour = "blue")
```
This isn't a story about Canada's life expectancy getting worse—life expectancy in Canada improved from 68.8 in 1952 to 80.7 in 2007.
What the chart shows are the substantial improvements in the life expectancy for people elsewhere around the world (the blue line) and in the Americas (the red line). While life expectancy in Canada improved and remains well above average, for many countries of the world, life expectancy is much, much longer than it was 70 years ago.
#### Moneyball data
A similar approach was used to create the "pay_index" variable in the Moneyball assignment.
First, we read the data in the file:
```{r}
mlb_pay_wl <- read_csv("data/mlb_pay_wl_original.csv")
mlb_pay_wl
```
You will see that the file contains the team payroll, wins and losses, but not the two calculated variables that we used in the linear regression model.
To replicate the file we used in the assignment, we need to add
1. a win-loss percentage
```{r}
# win-loss percentage is the number of wins, divided by the total number of games played
mlb_pay_wl <- mlb_pay_wl |>
mutate(wl_pct = round(w / (w + l), 3))
mlb_pay_wl
```
2. the pay index, comparing the individual team payroll to the average of all the teams that season, where 100 is the league average.
Note that this requires a group/ungroup within the pipe.
```{r}
mlb_pay_wl <- mlb_pay_wl |>
group_by(year_num) |>
mutate(league_avg_pay = mean(est_payroll)) |>
ungroup() |>
mutate(pay_pct_league = est_payroll / league_avg_pay * 100)
mlb_pay_wl
```
Let's create a histogram to show the distribution of team payroll, relative to the league average:
```{r}
mlb_pay_wl |>
ggplot(aes(x = pay_pct_league)) +
geom_histogram()
```
### `across()`
The `across()` function:
>makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside in "data-masking" functions like summarise() and mutate(). See
These examples are variations of the ones from the [{dplyr} reference "Apply a function (or functions) across multiple columns"](https://dplyr.tidyverse.org/reference/across.html). That resource, and the related vignette ["colwise"](https://dplyr.tidyverse.org/articles/colwise.html) have more examples and details.
For our examples, we will use the `penguins` data table from the [{palmerpenguins} package](https://allisonhorst.github.io/palmerpenguins/).
```{r}
library(palmerpenguins)
penguins
```
In this example, the rounding function is applied to the two bill measurements:
```{r}
penguins |>
mutate(across(c(bill_length_mm, bill_depth_mm), round))
```
Since those measures are of type "double" and the other two numeric measures are integer, we could identify the variables we want to round using the type, via the `is.double` argument:
```{r}
penguins |>
mutate(across(where(is.double), round))
```
It's also possible to use the `across()` function to apply the same function(s) to multiple variables. Here we will use the `group-by() |> summarize()` functions to calculate the mean of the bill measurements.
* Note also the `starts_with()` function that is used to identify the columns of interest.
* The `mean()` function has a tilde in front. In R, the "~" is used to indicate a function, so
`~mean(.x, na.rm = TRUE)`
is a shortcut for
`function(.x){mean(x, na.rm = TRUE)}` where ".x" is a placeholder for every variable we have defined in our `across()` function.
(It might help to think of this like a looping function—for every variable, the `summarise()` loops through the `mean()` function.)
```{r}
penguins |>
group_by(species) |>
summarise(across(starts_with("bill_"),
~mean(.x, na.rm = TRUE)))
```
In this example, the `list` argument defines a list of functions to be applied. By putting the name of the function ahead of the function (e.g. `mean = mean`), that term gets appended to the name of the variables created by the `summarise()` function. (You may want to see what happens when you change this to `mean, sd` or `cat = mean, dog = sd`!)
```{r}
penguins |>
group_by(species) |>
summarise(across(starts_with("bill_"),
list(mean = mean, sd = sd),
na.rm = TRUE))
```
### Calculations across rows
Another common data manipulation outcome we seek is the average _across_ our variables...so far everything we've done has been _down_ the variables. R is really good at the latter. Here's some techniques to apply a function across a row.
(This is drawn from the {dplyr} vignette ["Row-wise operations"](https://dplyr.tidyverse.org/articles/rowwise.html))
First we create a little tibble:
```{r}
df <- tibble(x = 1:2, y = 3:4, z = 5:6)
df
```
Then we apply the `rowwise()` function:
```{r}
df |> rowwise()
```
It looks just the same, but you will note the "Rowwise:" indicator above the table. This means that any functions that are applied will run across the rows instead of the usual columnwise direction.
```{r}
df |>
rowwise() |>
mutate(m = mean(c(x, y, z)))
```
Without the `rowwise()` function we get 3.5 for both rows. This the mean of "x" and "y" and "z" (that is, the mean of the integers 1 through 6):
```{r}
df |>
mutate(m = mean(c(x, y, z)))
```
In the examples below, we will use a bigger tibble:
```{r}
df2 <- tibble(id = letters[1:6], w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df2
```
To calculate the sum of the columns, we could `mutate()` to get a new column, or `summarize()` for just the total.
By putting the "id" in the `rowwise()` function, it acts as a grouping variable _across_ the rows.
```{r}
# mutate for full table
df2 |>
rowwise(id) |>
mutate(total = sum(c(w, x, y, z)))
```
```{r}
# summarize for the total only
df2 |>
rowwise(id) |>
summarize(total = sum(c(w, x, y, z)))
```
To streamline the specification of the variables we use, the `c_across()` function can be applied, indicating the range of the variables we want to sum and average:
```{r}
df2 |>
rowwise(id) |>
summarize(total = sum(c_across(w:z)),
average = mean(c_across(w:z)))
```
There are two other functions that streamline this syntax still further: `rowSums` and `rowMeans`.
**Note:** this function has the `rowwise()` built in!
```{r}
df2 |>
mutate(total = rowSums(across(where(is.numeric))))
```
### Comparing values between rows
It is common in time series analysis to compare values that are in different rows. For example, in a data table with quarterly revenue data, you might want to report on the change from one quarter to the next. Or with monthly data where there is a strong seasonal trend comparing one month to the next might not be all that useful (think of the difference in retail sales in January compared to December), so a comparison to the same month of the previous year would reveal more insight. The [New Housing Price Index example](#walkthrough) we saw earlier incorporates both month-over-month and year-over-year comparisions.
The {dplyr} package has two functions, `lag()` and `lead()`, that allow us access to the value in a different row.
#### Exercies:
Using the {gapminder} data, find the country with biggest jump in life expectancy (between any two consecutive records) for each continent.
```{r}
# One of many solutions
gapminder |>
group_by(country) |>
mutate(prev_lifeExp = lag(lifeExp),
jump = lifeExp - prev_lifeExp) |>
arrange(desc(jump), continent)
# Another solution
gapminder |>
group_by(country) |>
mutate(prev_lifeExp = lag(lifeExp),
jump = lifeExp - prev_lifeExp) |>
ungroup() |>
group_by(continent) |>
slice(which.max(jump)) |>
arrange(desc(jump))
# Another solution
gapminder |>
group_by(country) |>
mutate(prev_lifeExp = lag(lifeExp),
jump = lifeExp - prev_lifeExp) |>
ungroup() |>
group_by(continent) |>
slice_max(jump) |>
arrange(desc(jump))
```
## Recode variables with `case_when()`
We sometimes find ourselves in a situation where we want to recode our existing variables into other categories. For example, sometimes you have a handful of categories that make up the bulk of the total, and summarizing the smaller categories into an "all other" makes the table or plot easier to read. Think of the populations of the 13 Canadian provinces and territories: the four most populous provinces (Ontario, Quebec, British Columbia, and Alberta) account for roughly 85% of the country's population...the other 15% are spread across nine provinces and territories. We might want to show a table with only five rows, with the smallest nine provinces and territories grouped into a single row.
We can do this with a function in {dplyr}, `case_when()`.
```{r}
canpop <- read_csv("data/canpop.csv")
canpop
```
In this first solution, we name the largest provinces separately:
```{r}
canpop |>
mutate(pt_grp = case_when(
# evaluation ~ new value
province_territory == "Ontario" ~ "Ontario",
province_territory == "Quebec" ~ "Quebec",
province_territory == "British Columbia" ~ "British Columbia",
province_territory == "Alberta" ~ "Alberta",
# all others get recoded as "other"
TRUE ~ "other"
))
```
But that's a lot of typing. A more streamlined approach is to put the name of our provinces in a list, and then recode our new variable with the value from the original variable:
```{r}
canpop |>
mutate(pt_grp = case_when(
province_territory %in% c("Ontario", "Quebec", "British Columbia", "Alberta") ~ province_territory,
TRUE ~ "other"
))
```
But we can do even better, by using a comparison to create a population threshold:
```{r}
canpop <- canpop |>
mutate(pt_grp = case_when(
population > 4000000 ~ province_territory,
TRUE ~ "other"
))
canpop
```
Now we can use that new variable to group a table:
```{r}
canpop |>
group_by(pt_grp) |>
summarise(pt_pop = sum(population)) |>
arrange(desc(pt_pop))
```
In another example, we might want to group a series of dates by decade.
```{r}
date_list <- tribble(
~ref_date,
"2000-02-29",
"2004-02-29",
"2011-01-01",
"2014-03-31",
"2019-09-01",
"2023-03-28"
) |>
mutate(ref_date = ymd(ref_date))
date_list
```
To create the decade groupings, the `case_when()` follows a `mutate()` function. Note that we are using `lubridate::year()` to extract the year value for our comparison.
```{r}
date_list <- date_list |>
mutate(ref_decade = case_when(
year(ref_date) < 2010 ~ "aughts",
year(ref_date) < 2020 ~ "teens",
TRUE ~ "other"
))
date_list
```
Another option for this problem would be to round the year value down to the nearest 10. This uses the `round_any()` function from the {plyr} package. (The `round()` function from base R rounds to the nearest 10, so a year like 2019 gets rounded up to the 2020s. The `floor()` function, related to `round()`, rounds down, but only to the nearest integer.)
```{r}
# to round year value down to first year of the decade
date_list |>
mutate(ref_decade = plyr::round_any(year(ref_date), accuracy = 10, f = floor))
```
### Multiple comparisons
In this example from the [`case_when()` reference page:](https://dplyr.tidyverse.org/reference/case_when.html), it uses different combinations of three variables to create a single fourth variable. It uses the built-in data set "starwars", a list of characters from the movies.
```{r}
# case_when is particularly useful inside mutate when you want to
# create a new variable that relies on a complex combination of existing
# variables
starwars
```
For this example, we mutate a new variable based on whether a character is a robot, or whether they are large (defined by height or mass).
```{r}
starwars |>
select(name:mass, gender, species) |>
mutate(
type = case_when(
species == "Droid" ~ "robot",
height > 150 | mass > 200 ~ "large",
TRUE ~ "other"
)
)
```
And order matters! The function `case_when()` finds all of the cases that meet the first criteria, and then applies the second criteria to the remaining observations. If we switch the order and move "robot" below "large", we find (as one example) that C-3PO is now in the large category, as they are over 150 cm in height.
```{r}
# create "type" variable
starwars |>
select(name:mass, species) |>
mutate(
type = case_when(
# large first, then robot
height > 150 | mass > 200 ~ "large",
species == "Droid" ~ "robot",
TRUE ~ "other"
)
)
```
-30-