-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path150_intro_to_data_wrangling.Rmd
728 lines (383 loc) · 23.2 KB
/
150_intro_to_data_wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
<!-- This file by Martin Monkman
is licensed under a Creative Commons Attribution 4.0 International License
https://creativecommons.org/licenses/by/4.0/
The tables of functions are modified from originals by Gina Reynolds, [Tidyverse in Action](https://github.com/EvaMaeRey/tidyverse_in_action) -->
# Introduction to data wrangling {#data-wrangle}
## Setup
This chunk of R code loads the packages that we will be using.
```{r setup150}
library(tidyverse)
library(gapminder)
```
Data is abundant, and as data analysts we often have to spend time working with our data tables to get it ready for whatever analysis we need to do. "Data wrangling" and "data transformation" are two terms that are used to describe the process of altering the structure of our data.^["Data cleaning" is a whole other thing—we will encounter that later.]
There are three broad categories of data transformation:
- **alter the structure** of the existing data tables. This could be through adding or removing rows or columns to our data table, including calculation of new variables, or importation of additional rows (observations) and joining two tables together.
- **summarize the content** of the source data tables. For example, if our data table has the population of all the countries in the world, we might want a summary table that has the table with the total for each continent.
- with both our existing data or our summary tables, we may want **change the layout**, either for analysis purposes or to make it easier for human readers to absorb the information.
In this section, we will begin to work with all three transformations.
### Reading: Data wrangling {#data-wrangle-reading}
Please refer to Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund [_R for Data Science_, 2nd ed., Data transformation](https://r4ds.hadley.nz/data-transform) for more information about this section.
For this section, and for other exercises throughout this course, we will be using a subset of the Gapminder data. These data tables contain information about the countries of the world, and were used by Hans Rosling to explain that there have been significant changes in the quality of people's lives around the world.
* [The best stats you've ever seen](https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen?language=en)— this is the video that introduced me to Hans Rosling.
* [200 Countries, 200 Years, 4 Minutes](https://www.youtube.com/watch?v=jbkSRLYSojo&ab_channel=BBC)—recreating the chart in this video is one of the capstone project options
* It's outside the scope of this course, but I also recommend Rosling's book _Factfulness_[@Rosling_2018], which goes deeper into the changes in the world and how data explains those changes.
* An article related to the work of Hans Rosling can be found here: ["The world is awful. The world is much better. The world can be much better."](https://ourworldindata.org/much-better-awful-can-be-better), by Max Roser, the Founder and Director of [Our World In Data](https://ourworldindata.org/), an organization dedicated to sharing data and analysis to make progress against the world’s largest problems.
## Transform Data: 1a Alter the structure {#data-wrangle-transform1}
![_dplyr hex_](static/img_tidyverse/dplyr.png){width=250, height=250}
> The [{dplyr} reference site](https://dplyr.tidyverse.org/index.html)
> [Data Transformation Cheat Sheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) (PDF file)
When working with data, particularly large data sets, you encounter a situation where you need to:
* subset the data so that it contains only those _records_ (rows) you are interested in (Why deal with a data table that has every province in Canada when you just need the British Columbia data?)
* subset the data so that it contains only those _variables_ (columns) you are interested in (Why keep looking at variables that do not factor into your analysis, or you will not print in your report?)
* create new variables, often through calculations based on variables in your data
To achieve these three goals, we will be using functions from the tidyverse package {dplyr}. The functions are verbs that describe what they do. Here's a list of the {dplyr} functions that you'll be using:
|function |action |
| :-- | :-- |
| `filter()` | *keep rows*|
| `select()` | *keep variables (or drop them -var or re-order them)*|
| `mutate()` | *create a new variable*|
| `arrange()` | *sort from smallest to largest*|
| `arrange(desc())` | *sort from largest to smallest*|
**Other operators**
In addition to these functions, R has a number of operators that add significantly to your code.
**Arithmetic operators**
| Operator | Description |
| :-- | :-- |
| + | *plus* |
| - | *minus* |
| \* | *multiplication* |
| / | *division* |
| ^ | *exponential* |
**Boolean operators** are a special type of operator that return TRUE or FALSE.
| Operator | Description |
| :-- | :-- |
| == | *equal, tests equality* |
| != | *not equal, tests inequality* |
| \> | *greater than, tests greater than* (also >=) |
| < | *less than, tests less than* (also <=) |
| %in% | *contains, tests inclusion* |
Boolean operators can be combined with *and* or *or*.
| Operator | Description |
| :-- | :-- |
| `&` | *and, returns true if preceding and following are both true, else false* |
| | | *or, returns true if either preceding and following are true, else false* |
> see Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund [_R for Data Science_, 2nd ed., "Comparisions"](https://r4ds.hadley.nz/logicals.html#comparisons) and ["Boolean algebra"](https://r4ds.hadley.nz/logicals.html#boolean-algebra).]
## Introducing the pipe operator ` |> ` {#data-wrangle-pipe}
(see [_R for Data Science_, 2nd ed., The Pipe](https://r4ds.hadley.nz/data-transform.html#sec-the-pipe) for a deep dive into pipes)
The pipe operator ` |> ` allows us to name the data frame first, and then have the next function work with the outcome of the first line.
For example, we could write
`function(data, argument)`
or, with a pipe
`data |> function(argument)`
Read the pipe symbol as "then". We start with our data, _then_ we apply the function.
If we want to do anything with the output from the first statement, we have to assign it to an intermediate object, and then use that intermediate object in the second statement. With the pipe, we can add another pipe and the results of the first statement are passed to the second statement without the intermediate object.
As we get to increasingly complex statements, the power of this will become more evident.
This code calculates the mean of life expectancy (the `lifeExp` variable) in the gapminder data (note that we are accessing the `lifeExp` variable by adding the dollar sign to the dataframe name).
```{r}
mean(gapminder$lifeExp)
```
Another way to do this would be using the pipe operator. The symbol is `|>` and links one line of script to the next. The way to read this is as meaning "and then..."
At this point, we only have one step in our chain, but in future code, you might have multiple steps linked one after the other.
So this little chunk would be read as
* we start with the dataframe "gapminder", "and then..."
* we calculate a variable called "mean_of_life_expectancy" by running the `mean()` function on the variable "lifeExp"
```{r}
gapminder |>
summarise(mean_of_life_expectancy = mean(lifeExp))
```
> **IMPORTANT NOTE**
> The pipe operator `|>` is in base R, but it wasn't always. Prior to its introduction there was a pipe symbol available through the {magrittr} package, which looks like this: `%>%`. You will still see this version in older texts (such as the first edition of _R for Data Science_) and course materials. It was used in BIDA302, so if you come across older course material, it might use the `%>%` pipe operator.
## Filter for rows
The calculations above are for the entire dataframe. Often in data analysis, we want to analyze a single group. We might want to examine life expectancy changes in a single country, or for calculate the average across the world for a single year. For this type of analysis, we can use the `filter()` function.
In the code chunk below, we use the `filter()` function to get the rows for Canada. This will return the 12 rows that have the records for Canada.
```{r}
gapminder |>
filter(country == "Canada")
```
In our next code chunk, we use filter to get the rows for every country, for the single year 2002. Note that because the "year" variable is a numeric type (specifically, an integer) we do not put the year value (2002) inside quotation marks.
```{r}
gapminder |>
filter(year == 2002)
```
We can use this filter as the first step in a pipe that allows us to calculate the average life expectancy for all of the countries in 2002.
```{r}
gapminder |>
filter(year == 2002) |>
summarise(global_life_expectancy = mean(lifeExp))
```
## Creating lists {#data-wrangle-lists}
In this example, we filter to get three countries, "New Zealand", "Canada", and "United States".
Note that to get all three, the statement is _OR_, not _AND_! There are no cases where `country == "New Zealand"` and simultaneously `country == "Canada"` ...
```{r}
gapminder |>
filter(country == "New Zealand" |
country == "Canada" |
country == "United States")
```
Note as well that the order of the values in the `filter()` function does not matter—the table that results is in the same order as the source table.
Instead of this rather verbose syntax, we can achieve the same result by creating a list.
To create a list in R, we use the `c()` function.
```{r filter6_example-solution}
# answer
gapminder |>
filter(country %in% c("New Zealand", "Canada", "United States"))
```
Another option would be to create a named object that contains our list. That list can then be referenced in a `filter()` statement using the `%in%` operator as the evaluator.
```{r}
# alternate answer
country_list <- c("New Zealand", "Canada", "United States") # create object that is list of countries
gapminder |>
filter(country %in% country_list) # filter uses that object
```
This final option has a significant benefit. Now if we need to use that list of countries in another part of our analysis, we can just summon the object, rather than typing the entire list again.
This also reduces the risk of inadvertent error if we want to change the countries we are analyzing. If we have a named list object, we only need to change it in the one place, and that change will be incorporated throughout our code. Contrast that with a find-and-replace approach, where you might accidentally miss one of the places it appears, or you might have a typo in one place.
**NOTE:** Watch your quotation marks! In many cases double and single quotes are interchangable, but fancy "smart" quotes to start and end are not valid.
This example uses the `flights` table from the {nycflights13} package, and filters for Alaska Airlines Inc., which has the value "AS" in the variable "carrier".
```{r, eval=FALSE}
# will work...double quotes
nycflights13::flights |> filter(carrier == "AS")
# will work...single quotes
nycflights13::flights |> filter(carrier == 'AS')
# will not work due to "smart" quotes
nycflights13::flights |> filter(carrier == “AS”)
```
## Select columns {#data-wrangle-select}
You've already seen `filter()` to choose particular rows in your dataframe. What if you want to remove some columns (variables)?
Here we will use `select()` to reduce the number of variables in our gapminder table to "country", "year", and "lifeExp":
```{r}
gapminder |>
select(country, year, lifeExp)
```
You can use the minus sign to drop variable--that is, to define ones to _not_ select.
In this example, note we have to use `c()` to create a list:
```{r}
gapminder |>
select(-c(continent, pop, gdpPercap))
```
You can also define a range of columns to select:
```{r}
gapminder |>
select(country, year:pop)
```
## Add new variables
You can calculate new variables that may be useful in your analysis. The {dplyr} package function for this is `mutate()`.
For this example, we will use the "mpg" data set, which is a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. For more information about the data set, see the reference page for the dataset: https://ggplot2.tidyverse.org/reference/mpg.html
The first thing we want to do is to take a quick look at the data so we get a sense of its structure, using the `ls.str()` function:
```{r}
ls.str(mpg)
```
We can also view the data as a table:
```{r}
mpg
```
_vectors of the same length_
As part of our analysis, we want to understand the ratio of gas consumption to the size of the engine, measured as the variable `displ` (short for "displacement"). To calculate that ratio, we divide the highway consumption (`hwy`) by the displacement.
The {dplyr} function to create a new variable is `mutate()`. This is shown in the code below:
```{r}
mpg <- mpg |>
mutate(mpg_per_cubic_litre = hwy / displ)
```
Notice that there's now a new variable added to the data called `mpg_per_cubic_litre`.
In that example, the two columns are of the same length—for each car (each row in the data) there is a `hwy` value and a `displ` value.
_Vector of the length = 1_
In other instances, we might want to have a calculation using a constant—we might want to multiply every one of our cases by 2, or add 10 to each.
In the example below, we will add a new variable to our data set using `mutate()`.
This will convert miles per US gallon to litres per 100 kilometers. The first step is to create a constant value, based on the number of gallons in a litre and the number of kilometers in a mile.
Once we have that constant, we can use it in an equation to calculate the litres per 100 kilometers for every car in the data.
```{r}
# calculate conversion constant
# a US gallon is 3.785 litres
# a kilometer is 1.61 miles
lp100km <- (100 * 3.785) / (1.61)
lp100km
# divide constant by hwy mpg to get hwy litres per 100 km
mpg |>
mutate(hwy_lp100km = lp100km / hwy)
```
## Summarize (or Summarise)
<!-- This section of the file is by Charlotte Wickham (with some modifications by Martin Monkman), and is licensed under a Creative Commons Attribution 4.0 International License, adapted from the original work at https://github.com/rstudio/master-the-tidyverse by RStudio. -->
Take a quick look at the data in the {gapminder} package.
```{r}
gapminder
```
R has a built-in function to generate some descriptive (or summary) statistics. The first is `summary()` which gives a few key statistics about the distribution.
```{r}
summary(gapminder)
```
The `summarize()` (or `summarise()`) functions allows us to use a pipe sequence to compute statistics about the data. In the example below, we will calculate three statistics:
* The first (minimum) year in the dataset
* The last (maximum) year in the dataset
* The number of unique countries
```{r}
# answer
gapminder |>
summarize(year_min = min(year),
year_max = max(year),
n_countries = n_distinct(country))
```
Everything we have done so far has been with to manipulate the entire data file (filtering and selecting), or calculating descriptive statistics for all the values in our data. Often in our data analysis we need to compare _within_ our data. Here are some questions we can answer through the gapminder data:
1. What was the median life expectancy in African countries in 2007?
2. What was the median life expectancy by continent in 2007?
3. What was the median life expectancy by continent for each year in the data?
To answer the first of these, we can filter the data to get only the African countries and 2007.
But for the second and third, we need to _group_ our data by continent and year, and then _summarize_ those groups.
The R package {dplyr} has a grouping function `group_by()`. Once the data is grouped, we apply a `summarize()` (or `summarise()` if you prefer!) function with our descriptive statistics.
Let's walk through the steps for the first question:
Extract the rows for African countries in 2007.
Then find:
1. The number of unique countries
2. The median life expectancy of African countries as a group
```{r}
# answer
gapminder |>
# filter for the continent and the year
filter(continent == "Africa" & year == 2007) |>
# calculate the summary statistics
summarise(n_countries = n_distinct(country),
lifeExp_med = median(lifeExp))
```
### Grouping
We can use the {dplyr} function `group_by()` to define our grouping variables. If we group all of the countries by the continent they are part of, we can calculate a statistic for the whole continent.
The second question was to find the median life expectancy by continent in 2007.
```{r}
# answer
gapminder |>
# filter for the year
filter(year == 2007) |>
# apply the grouping variable, and then summarize
group_by(continent) |>
summarize(lifeExp_mean = mean(lifeExp))
```
#### Grouping with multiple variables
We can group by multiple variables.
Find the median life expectancy by continent for each year in the dataframe.
```{r}
gapminder |>
group_by(continent, year) |>
summarize(lifeExp_med = median(lifeExp))
```
Create a summary table with the population and total GDP by continent for the year 1952
```{r}
# answer
gapminder |>
filter(year == 1952) |> # filter to get the records we need
mutate(totalGDP = pop * gdpPercap) |> # note that we need to add the new variable calculated above!
group_by(continent) |> # this defines the grouping category
summarize(pop = sum(pop), #
totalGDP = sum(totalGDP)) #
```
### Counting the number of cases
A common step in a data analysis project is to count the number of cases in our dataset, often by groups.
For these examples, we will use the `mpg` dataset, and produce a summary table that shows the number of vehicles by class (compact, suv, etc) and number of cylinders in the engine (the variable `cyl`).
In this section, we will look at three different approaches to get to the same result.
**`n()`**
Here we start with the original ungrouped dataset, applying the `group_by()`, then use the `n()` function inside a `summarise()` to count the number of cases (or observations).
```{r 150-mpg-summary}
mpg |>
group_by(class, cyl) |>
summarise(n = n())
```
To get a sorted result, we would need to add an `arrange(desc())` function to our pipe:
```{r}
mpg |>
group_by(class, cyl) |>
summarise(n = n()) |>
arrange(desc(n))
```
**`tally()`**
Another approach to count the number of rows in each group is to use the `tally()` function, which takes the place of the `summarize(n = n())` function.
```{r}
mpg |>
group_by(class, cyl) |>
tally()
```
If we want to sort in the `tally()` function, we can add a `sort = TRUE` argument. Note that this behaves differently than `arrange()`; `sort = TRUE` is largest to smallest.
```{r}
mpg |>
group_by(class, cyl) |>
tally(sort = TRUE)
```
**`count()`**
The most streamlined way we could get to our grouped summary table with counts is by using the `count()` function. This function merges both the grouping and tallying functions.
The categories we want to count become the arguments of the `count()` function.
```{r}
mpg |>
count(class, cyl)
```
Similar to tally, we can include a `sort = TRUE` argument.
```{r}
mpg |>
count(class, cyl, sort = TRUE)
```
Without the grouping variables, these three statements produce the same result:
```{r}
mpg |> summarise(n = n())
mpg |> tally()
mpg |> count()
```
There are other counting functions within {dplyr}, see the reference page https://dplyr.tidyverse.org/reference/count.html for more details.
We will return to data wrangling [later in the course](#wrangling2), and explore more complex calculations.
## Take aways
* Extract cases with `filter()`. For filtering, we have to specify the _values_ we want and the _variable_ that they are in.
* Choose the rows we want from our dataframes with `select()`. To select the columns we want, we have to specify the _variable(s)_ that we want.
* Create new variables, with `mutate()`
* Filter using a list with `%in%`
* Connect operations with the pipe symbol `|>`
* Make tables of summaries with `summarise()`
* Do groupwise operations with `group_by()`
* Count the number of cases with `tally()` and `count()`
## Functions for data wrangling
### Reference: {dplyr} functions
The functions in {dplyr} (and many other packages) are verbs that describe what they do.
|function |action |
| :-- | :-- |
| `filter()` | *keep rows*|
| `mutate()` | *create a new variable*|
| `group_by()` | *declare subsets in data*|
| `select()` | *keep variables (or drop them -var or re-order them)*|
| `rename()` | *renaming variables*|
| `summarize()` | *summarize the data, by groups if they have been declared*|
| `distinct()` | *returns only rows that are unique*|
| `case_when()` | *is used for "recoding" variable, often used with mutate()*|
|function |action |
| :-- | :-- |
| `length` <br> `dplyr::count()` | *number of cases* |
| `tally()` | *counting (by groups if group_by() applied)*|
The full list of {dplyr} functions is here: https://dplyr.tidyverse.org/reference/index.html
### Reference: Descriptive statistics
These statistics can all be generated individually, using the functions in the table below. Other statistical measures about a variable can be calculated with the appropriate function. Note that these are just a few of the functions that are included in base R; there are also other statistical measures available through various packages.
| statistic | function |
| :-- | :-- |
| mean | `mean()` |
| median | `median()` |
| mode | `mode()` |
| minimum and maximum | `min()` `max()` |
| range | `range()` |
| quantile | `quantile()` (default: quartiles) |
| — | — |
| sum | `sum()` |
| standard deviation | `sd()` |
| variance | `var()` |
| kurtosis | `kurtosis()` |
| skewness | `skewness()` |
| correlation coefficient | `cor()` |
| correlation coefficient (two variables) | `cor()` |
| — | — |
| count (number of cases) | `length()` |
| number of unique cases | `n_distinct()` |
### Reference: Boolean operators
Boolean operators are a special type of operator that return TRUE or FALSE. They are used when we want to compare values.
| Operator | Description |
| :-- | :-- |
| == | *equal, tests equality* |
| != | *not equal, tests inequality* |
| \> | *greater than, tests greater than* (also >=) |
| < | *less than, tests less than* (also <=) |
| %in% | *contains, tests inclusion* |
Boolean operators can be combined with *and* or *or*.
| Operator | Description |
| :-- | :-- |
| `&` | *and, returns true if preceeding and following are both true, else false* |
| | | *or, returns true if either preceeding and following are true, else false* |
.footnote[see Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund [_R for Data Science_, 2nd ed., "Comparisions""](https://r4ds.hadley.nz/logicals.html#comparisons) and ["Boolean algebra"](https://r4ds.hadley.nz/logicals.html#boolean-algebra).]
-30-