forked from moderndive/ModernDive_book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
05-wrangling.Rmd
executable file
·934 lines (629 loc) · 44.8 KB
/
05-wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
# Data Wrangling via dplyr {#wrangling}
<!--
- Make sure to refer back to plots in the viz chapter and consider how the
material here relates to answering the questions posed in viz chapter
-->
```{r setup_wrangling, include=FALSE, purl=FALSE}
chap <- 5
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = '\\textwidth',
fig.height = 4,
warning = FALSE
)
# This bit of code is a bug fix on asis blocks, which we use to show/not show LC solutions, which are written like markdown text. In theory, it shouldn't be necessary for knitr versions <=1.11.6, but I've found I still need to for everything to knit properly in asis blocks. More info here:
# https://stackoverflow.com/questions/32944715/conditionally-display-block-of-markdown-text-using-knitr
library(knitr)
knit_engines$set(asis = function(options) {
if (options$echo && options$eval) knit_child(text = options$code)
})
# This controls which LC solutions to show. Options for solutions_shown: "ALL" (to show all solutions), or subsets of c('5-1', '5-2','5-3', '5-4'), including the null vector c('') to show no solutions.
solutions_shown <- c('')
show_solutions <- function(section){return(solutions_shown == "ALL" | section %in% solutions_shown)}
```
Let's briefly recap where we have been so far and where we are headed. In Chapter \@ref(tidy), we discussed what it means for data to be tidy. We saw that this refers to observations corresponding to rows and variables being stored in columns (one variable for every column). The entries in the data frame correspond to different combinations of observations (specific instances of observational units) and variables. In the `flights` data frame, we saw that each row corresponds to a different flight leaving New York City. In other words, the **observational unit** of the `flights` tidy data frame is a flight. The variables are listed as columns, and for `flights` these columns include both quantitative variables like `dep_delay` and `distance` and also categorical variables like `carrier` and `origin`. An entry in the table corresponds to a particular flight on a given day and a particular value of a given variable representing that flight.
Armed with this knowledge and looking back on Chapter \@ref(viz), we see that organizing data in this tidy way makes it easy for us to produce graphics, specically a set of 5 common graphics we termed the *5 Named Graphics* (5NG):
1. scatter-plots
1. line-graphs
1. boxplots
1. histograms
1. barplots
We can simply specify what variable/column we would like on one axis, (if applicable) what variable we'd like on the other axis, and what type of plot we'd like to make by specifying the `geom`etric object in question. We can also vary aesthetic attributes of the geometric objects in question (points, lines, bar), such as the size and color, along the values of another variable in this tidy dataset. Recall the Gapminder example from Figure \@ref(fig:gapminder).
Lastly, in a few spots in Chapter \@ref(viz) and Chapter \@ref(tidy), we hinted at some ways to summarize and wrangle data to suit your needs, using the `filter()` and `inner_join()` functions. This chapter expands on these functions by giving a variety of examples using what we term the *Five Main Verbs* (5MV) in the `dplyr` package [@R-dplyr].
### Needed packages {-}
Let's load all the packages needed for this chapter (this assumes you've already installed them). Read Chapter \@ref(packages) for information on how to install and load R packages.
```{r, message=FALSE}
library(dplyr)
library(ggplot2)
library(nycflights13)
library(knitr)
```
```{r message=FALSE, warning=FALSE, echo=FALSE}
# Packages needed internally, but not in text.
```
<!--Subsection on Pipe -->
## The pipe `%>%`
Before we introduce the five main verbs, we first introduce the pipe operator (`%>%`). Just as the `+` sign was used to add layers to a plot created using `ggplot()`, the pipe operator allows us to chain together `dplyr` data wrangling functions. The pipe operator can be read as "*then*". The `%>%` operator allows us to go from one step in `dplyr` to the next easily so we can, for example:
* `filter` our data frame to only focus on a few rows *then*
* `group_by` another variable to create groups *then*
* `summarize` this grouped data to calculate the mean for each level of the group.
The piping syntax will be our major focus throughout the rest of this book and you'll find that you'll quickly be addicted to the chaining with some practice.
<!--Subsection on Verbs -->
## Five Main Verbs - The 5MV
The `d` in `dplyr` stands for data frames, so the functions in dplyr are built for working with objects of the data frame type. For now, we focus on the 5MV: the five most commonly used functions that help wrangle and summarize data. A description of these verbs follows, with each subsection devoted to an example of that verb, or a combination of a few verbs, in action.
- `filter`: Pick rows based on conditions about their values
- `summarize`: Create summary measures of variables either
+ over the entire data frame or
+ over groups of observations on variables using `group_by`
- `mutate`: Create a new variable in the data frame by mutating existing ones
- `arrange`: Arrange/sort the rows based on one or more variables
Just as we had the Five Named Graphs for data visualization using `ggplot2` in Chapter \@ref(viz), we have the 5MV here (The Five Main Verbs in `dplyr`) for data wrangling. All of the 5MVs follow the same syntax, with the argument before the pipe `%>%` being the name of the data frame, then the name of the verb, followed with other arguments specifying which criteria you'd like the verb to work with in parentheses.
Keep in mind, there are more advanced functions than just these five and you'll see some examples of this near the end of this Chapter in \@ref(other-verbs), but with the 5MV you'll be able to perform a broad array of data wrangling tasks.
### 5MV#1: Filter observations using filter {#filter}
```{r filter, echo=FALSE, fig.cap="Filter diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE}
knitr::include_graphics("images/filter.png")
```
The `filter` function here works much like the "Filter" option in Microsoft Excel; it allows you to specify criteria about values of a variable in your dataset and then chooses only those rows that match that criteria. We begin by focusing only on flights from New York City to Portland, Oregon. The `dest` code (or airport code) for Portland, Oregon is `"PDX"`. Run the following and look at the resulting spreadsheet to ensure that only flights heading to Portland are chosen here:
```{r, eval=FALSE}
portland_flights <- flights %>%
filter(dest == "PDX")
View(portland_flights)
```
Note the following:
* The ordering of the commands:
+ Take the data frame `flights` *then*
+ `filter` the data frame so that only those where the `dest` equals `"PDX"` are included.
* The double equal sign `==` for testing for equality, and not `=`. You are almost guaranteed to make the mistake at least once of only including one equals sign.
You can combine multiple criteria together using operators that make comparisons:
- `|` corresponds to "or"
- `&` corresponds to "and"
We can often skip the use of `&` and just separate our conditions with a comma. You'll see this in the example below.
In addition, you can use other mathematical checks (similar to `==`):
- `>` corresponds to "greater than"
- `<` corresponds to "less than"
- `>=` corresponds to "greater than or equal to"
- `<=` corresponds to "less than or equal to"
- `!=` corresponds to "not equal to"
To see many of these in action, let's select all flights that left JFK airport heading to Burlington, Vermont (`"BTV"`) or Seattle, Washington (`"SEA"`) in the months of October, November, or December. Run the following
```{r, eval=FALSE}
btv_sea_flights_fall <- flights %>%
filter(origin == "JFK", (dest == "BTV" | dest == "SEA"), month >= 10)
View(btv_sea_flights_fall)
```
Note: even though colloquially speaking one might say "all flights leaving Burlington, Vermont *and* Seattle, Washington," in terms of computer logical operations, we really mean "all flights leaving Burlington, Vermont *or* Seattle, Washington." For a given row in the data, `dest` can be "BTV", "SEA", or something else, but not "BTV" and "SEA" at the same time.
Another example uses the `!` to pick rows that *don't* match a condition. The `!` can be read as "not". Here we are selecting rows corresponding to flights that didn't go to Burlington, VT or Seattle, WA.
```{r, eval=FALSE}
not_BTV_SEA <- flights %>%
filter(!(dest == "BTV" | dest == "SEA"))
View(not_BTV_SEA)
```
As a final note we point out that `filter()` should often be the first verb you'll apply to your data. This cleans your dataset to only those rows you care about, or put differently, it narrows down the scope to just the observations your care about.
***
```{block lc-filter, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What's another way using the "not" operator `!` we could filter only the rows that are not going to Burlington, VT nor Seattle, WA in the `flights` data frame? Test this out using the code above.
```{asis lc-filter-solutions, include=show_solutions('5-1')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc), ")")`** What's another way using the "not" operator `!` we could filter only the rows that are not going to Burlington, VT nor Seattle, WA in the `flights` data frame? Test this out using the code above.
```
```{r, eval=FALSE, echo=show_solutions('5-1')}
# Original in book
not_BTV_SEA <- flights %>%
filter(!(dest == "BTV" | dest == "SEA"))
# Alternative way
not_BTV_SEA <- flights %>%
filter(!dest == "BTV" & !dest == "SEA")
# Yet another way
not_BTV_SEA <- flights %>%
filter(dest != "BTV" & dest != "SEA")
```
***
### 5MV#2: Summarize variables using summarize
The next common task when working with data is to be able to summarize data: take a large number of values and summarize then with a single value. While this may seem like a very abstract idea, something has simple as the sum, the smallest value, and the largest values are all summaries of a large number of values.
```{r sum1, echo=FALSE, fig.cap="Summarize diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE}
knitr::include_graphics("images/summarize1.png")
```
```{r sum2, echo=FALSE, fig.cap="Another summarize diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE}
knitr::include_graphics("images/summary.png")
```
We can calculate the standard deviation and mean of the temperature variable `temp` in the `weather` data frame of `nycflights13` in one step using the `summarize` function in `dplyr` (See Appendix \@ref(appendixA) ):
```{r}
summary_temp <- weather %>%
summarize(mean = mean(temp), std_dev = sd(temp))
kable(summary_temp)
```
We've created a small data frame here called `summary_temp` that includes both the `mean` and the `std_dev` of the `temp` variable in `weather`. Notice as shown in Figures \@ref(fig:sum1) and \@ref(fig:sum2), the data frame `weather` went from many rows to a single row of just the summary values in the data frame `summary_temp`.
But why are the values returned `NA`? This stands for `not available/applicable` and is how R encodes *missing values*; if in a data frame for a particular row and column no value exists, `NA` is stored instead. Furthermore, by default any time you try to summarize a number of values (using `mean()` and `sd()` for example) that has one or more missing values, then `NA` is returned.
Values can missing missing for many reasons. Perhaps the data was collected but someone forgot to enter it? Perhaps the data was not collected at all because it was too difficult? Perhaps there was an erroneous value that someone entered that has been correct to read as missing? You'll often encounter issues with missing values.
You can summarize all non-missing values by setting the `na.rm` argument to TRUE (`rm` is short for "remove"). This will remove any `NA` missing values and only return the summary value for all non-missing values. So the code below computes the mean and standard deviation of all non-missing values. Notice how the `na.rm=TRUE` are set as arguments to the `mean()` and `sd()` functions, and not to the `summarize()` function.
```{r}
summary_temp <- weather %>%
summarize(mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE))
kable(summary_temp)
```
It is not good practice to include a `na.rm = TRUE` in your summary commands by default; you should attempt to run them without this argument. This is because removing missing data can have an impact on your analyses. In fact, an entire branch of the field of statistics deals with missing data. The take away point is that `na.rm = TRUE` should only be used after you aware of the implications of its use.
<!--
If we'd like to access either of these values directly we can use the `$` to specify a column in a data frame. For example:
```{r}
#summary_temp$mean
```
-->
What other summary functions can we use inside the `summarize()` verb? Any function in R that takes a vector of values and returns just one. Here are just a few:
* `min()` and `max()`: the minimum and maximum values respectively
* `IQR()`: Interquartile range
* `sum()`: the sum
* `n()`: a count of the number of rows/observations in each group. This particular summary function will make more sense in the `group_by` chapter.
***
```{block lc-summarize, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five year intervals. He notices that a large number of patients have missing data points because the patient has died, so he chooses to ignore these patients in his analysis. What is wrong with this doctor's approach?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Modify the above `summarize` function to create `summary_temp` to also use the `n()` summary function: `summarize(count = n())`. What does the returned value correspond to?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why doesn't the following code work? Run the code line by line instead of all at once, and then look at the data. In other words, run `summary_temp <- weather %>% summarize(mean = mean(temp, na.rm = TRUE))` first.
```{r eval=FALSE}
summary_temp <- weather %>%
summarize(mean = mean(temp, na.rm = TRUE)) %>%
summarize(std_dev = sd(temp, na.rm = TRUE))
```
```{asis lc-summarize-solutions, include=show_solutions('5-2')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc-2), ")")`** A doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five year intervals. They notice that a large number of patients have missing data points because the patient has died, so they choose to ignore these patients in their analysis. What is wrong with this doctor's approach?
The missing patients may have died of lung cancer! So to ignore them might seriously **bias** your results! It is very important to think of what the consequences on your analysis are of ignoring missing data! Ask yourself:
+ There is a systematic reasons why certain values are missing? If so, you might be biasing your results!
+ If there isn't, then it might be ok to "sweep missing values under the rug."
**`r paste0("(LC", chap, ".", (lc-1), ")")`** Modify the above `summarize` function to create `summary_temp` to also use the `n()` summary function: `summarize(count = n())`. What does the returned value correspond to?
It corresponds to a count of the number of observations/rows:
```
```{r, eval=show_solutions('5-2'), echo=show_solutions('5-2')}
data(weather)
weather %>%
summarize(count = n())
```
```{asis, include=show_solutions('5-2')}
**`r paste0("(LC", chap, ".", (lc), ")")`** Why doesn't the following code work?
```
```{r eval=FALSE, include=show_solutions('5-2')}
summary_temp <- weather %>%
summarize(mean = mean(temp, na.rm = TRUE)) %>%
summarize(std_dev = sd(temp, na.rm = TRUE))
```
```{asis, include=show_solutions('5-2')}
Consider the output of only running the first two lines:
```
```{r, eval=show_solutions('5-2'), echo=show_solutions('5-2')}
weather %>%
summarize(mean = mean(temp, na.rm = TRUE))
```
```{asis, include=show_solutions('5-2')}
Because after the first `summarize()`, the variable `temp` disappears as it has been collapsed to the value `mean`. So when we try to run the second `summarize()`, it can't find the variable temp` to compute the standard deviation of.
```
***
### 5MV#3: Group rows using group_by
```{r groupsummarize, echo=FALSE, fig.cap="Group by and summarize diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE}
knitr::include_graphics("images/group_summary.png")
```
It's often more useful to summarize a variable based on the groupings of another variable. Let's say, we are interested in the mean and standard deviation of temperatures but *grouped by month*. To be more specific: we want the mean and standard deviation of temperatures
1. split by month.
1. sliced by month.
1. aggregated by month.
1. collapsed over month.
Run the following code (recall the discussion of the `kable()` function from the `knitr` package in Chapter \@ref(explore-dataframes)):
```{r}
summary_monthly_temp <- weather %>%
group_by(month) %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
kable(summary_monthly_temp)
```
This code is identical to the previous code that created `summary_temp`, with an extra `group_by(month)` added. Grouping the `weather` dataset by `month` and then passing this new data frame into `summarize` yields a data frame that shows the mean and standard deviation of temperature for each month in New York City. Note: Since each row in `summary_monthly_temp` represents a summary of different rows in `weather`, the observational units have changed.
It is important to note that `group_by` doesn't change the data frame. It sets *meta-data* (data about the data), specifically the group structure of the data. It is only after we apply the `summarize` function that the data frame changes.
If we would like to remove this group structure meta-data, we can pipe the resulting data frame into the `ungroup()` function. For example, say the group structure meta-data is set to be by month via `group_by(month)`, all future summarizations will be reported on a month-by-month basis. If however, we would like to no longer have this and have all summarizations be for all data in a single group (in this case over the entire year of 2013), then pipe the data frame in question through and `ungroup()` to remove this.
We now revisit the `n()` counting summary function we introduced in the previous section. For example, suppose we'd like to get a sense for how many flights departed each of the three airports in New York City:
```{r}
by_origin <- flights %>%
group_by(origin) %>%
summarize(count = n())
kable(by_origin)
```
We see that Newark (`"EWR"`) had the most flights departing in 2013 followed by `"JFK"` and lastly by LaGuardia (`"LGA"`). Note there is a subtle but important difference between `sum()` and `n()`. While `sum()` simply adds up a large set of numbers, the latter counts the number of times each of many different values occur.
You are not limited to grouping by one variable! Say you wanted to know the number of flights leaving each of the three New York City airports *for each month*, we can also group by a second variable `month`: `group_by(origin, month)`.
```{r}
by_monthly_origin <- flights %>%
group_by(origin, month) %>%
summarize(count = n())
kable(by_monthly_origin)
```
Alternatively, you can use the shortcut `count()` function in `dplyr` to get the same result:
```{r}
by_monthly_origin2 <- flights %>%
dplyr::count(origin, month)
kable(by_monthly_origin2)
```
***
```{block lc-groupby, type='learncheck'}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Recall from Chapter \@ref(viz) when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the `summary_monthly_temp` data frame tell us about temperatures in New York City throughout the year?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Recreate `by_monthly_origin`, but instead of grouping via `group_by(origin, month)`, group variables in a different order `group_by(month, origin)`. What differs in the resulting dataset?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How could we identify how many flights left each of the three airports for each `carrier`?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How does the `filter` operation differ from a `group_by` followed by a `summarize`?
```{asis lc-groupby-solutions, include=show_solutions('5-2')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc-4), ")")`** What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC?
```
```{r, echo=show_solutions('5-2')}
library(dplyr)
library(nycflights13)
summary_temp_by_month <- weather %>%
group_by(month) %>%
summarize(
mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE)
)
```
```{r, echo=FALSE, include=show_solutions('5-2')}
kable(summary_temp_by_month)
```
```{asis, include=show_solutions('5-2')}
The standard deviation is a quantification of **spread** and **variability**. We
see that the period in November, December, and January has the most variation in
weather, so you can expect very different temperatures on different days.
**`r paste0("(LC", chap, ".", (lc - 3), ")")`** What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC?
```
```{r, echo=show_solutions('5-2'), include=show_solutions('5-2')}
summary_temp_by_day <- weather %>%
group_by(year, month, day) %>%
summarize(
mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE)
)
summary_temp_by_day
```
```{asis, include=show_solutions('5-2')}
Note: `group_by(day)` is not enough, because `day` is a value between 1-31. We need to `group_by(year, month, day)`
**`r paste0("(LC", chap, ".", (lc - 2), ")")`** Recreate `by_monthly_origin`, but instead of grouping via `group_by(origin, month)`, group variables in a different order `group_by(month, origin)`. What differs in the resulting dataset?
```
```{r, echo=show_solutions('5-2'), include=show_solutions('5-2')}
by_monthly_origin <- flights %>%
group_by(month, origin) %>%
summarize(count = n())
```
```{r, eval=FALSE, echo=show_solutions('5-2')}
by_monthly_origin
```
```{r, echo=FALSE, include=show_solutions('5-2')}
kable(by_monthly_origin)
```
```{asis, include=show_solutions('5-2')}
The difference is they are organized/sorted by `month` first, then `origin`
**`r paste0("(LC", chap, ".", (lc-1), ")")`** How could we identify how many flights left each of the three airports for each `carrier`?
We could summarize the count from each airport using the `n()` function, which *counts rows*.
```
```{r, echo=show_solutions('5-2'), include=show_solutions('5-2')}
count_flights_by_airport <- flights %>%
group_by(origin, month) %>%
summarize(count=n())
```
```{r, eval=FALSE, include=show_solutions('5-2')}
count_flights_by_airport
```
```{r, echo=FALSE, include=show_solutions('5-2')}
kable(count_flights_by_airport)
```
```{asis, include=show_solutions('5-2')}
All remarkably similar!
Note: the `n()` function counts rows, whereas the `sum(VARIABLE_NAME)`
funciton sums all values of a certain numerical variable `VARIABLE_NAME`.
**`r paste0("(LC", chap, ".", (lc), ")")`** How does the `filter` operation differ from a `group_by` followed by a `summarize`?
* `filter` picks out rows from the original dataset without modifying them, whereas
* `group_by %>% summarize` computes summaries of numerical variables, and hence
reports new values.
```
***
### 5MV#4: Create new variables/change old variables using mutate
```{r select, echo=FALSE, fig.cap="Mutate diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE}
knitr::include_graphics("images/mutate.png")
```
When looking at the `flights` dataset, there are some clear additional variables that could be calculated based on the values of variables already in the dataset. Passengers are often frustrated when their flights departs late, but change their mood a bit if pilots can make up some time during the flight to get them to their destination close to when they expected to land. This is commonly referred to as "gain" and we will create this variable using the `mutate` function. Note that we have also overwritten the `flights` data frame with what it was before as well as an additional variable `gain` here, or put differently, the `mutate()` command outputs a new data frame which then gets saved over the origainl `flights` data frame.
```{r}
flights <- flights %>%
mutate(gain = arr_delay - dep_delay)
```
Why did we overwrite `flights` instead of assigning the resulting data frame to a new object, like `flights_with_gain`? As a rough rule of thumb, as long as you are not losing information that you might need later, it's acceptable practice to overwrite data frames. However, if you overwrite existing variables and/or change the observational units, recovering the original information might prove difficult. In this case, it might make sense to create a new data object.
Let's look at summary measures of this `gain` variable and even plot it in the form of a histogram:
```{r}
gain_summary <- flights %>%
summarize(
min = min(gain, na.rm = TRUE),
q1 = quantile(gain, 0.25, na.rm = TRUE),
median = quantile(gain, 0.5, na.rm = TRUE),
q3 = quantile(gain, 0.75, na.rm = TRUE),
max = max(gain, na.rm = TRUE),
mean = mean(gain, na.rm = TRUE),
sd = sd(gain, na.rm = TRUE),
missing = sum(is.na(gain))
)
kable(gain_summary)
```
We've recreated the `summary` function we saw in Chapter \@ref(viz) here using the `summarize` function in `dplyr`.
```{r message=FALSE, fig.cap="Histogram of gain variable"}
ggplot(data = flights, mapping = aes(x = gain)) +
geom_histogram(color = "white", bins = 20)
```
We can also create multiple columns at once and even refer to columns that were just created in a new column. Hadley produces one such example in Chapter 5 of "R for Data Science" [@rds2016]:
```{r}
flights <- flights %>%
mutate(
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
```
***
```{block lc-mutate, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What do positive values of the `gain` variable in `flights` correspond to? What about negative values? And what about a zero value?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Could we create the `dep_delay` and `arr_delay` columns by simply subtracting `dep_time` from `sched_dep_time` and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in `flights`.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What can we say about the distribution of `gain`? Describe it in a few sentences using the plot and the `gain_summary` data frame values.
```{asis lc-mutate-solutions, include=show_solutions('5-2')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc - 2), ")")`** What do positive values of the `gain` variable in `flights` correspond to? What about negative values? And what about a zero value?
* Say a flight departed 20 minutes late, i.e. `dep_delay=20`
* Then arrived 10 minutes late, i.e. `arr_delay=10`.
* Then `gain = arr_delay - dep_delay = 10 - 20 = -10` is negative, so it "made
up time in the air".
0 means the departure and arrival time were the same, so no time was made up in
the air. We see in most cases that the `gain` is near 0 minutes.
I never understood this. If the pilot says "we're going make up time in the air"
because of delay by flying faster, why don't you always just fly faster to begin
with?
**`r paste0("(LC", chap, ".", (lc - 1), ")")`** Could we create the `dep_delay` and `arr_delay` columns by simply subtracting `dep_time` from `sched_dep_time` and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in `flights`.
No because you can't do direct arithmetic on times. The difference in time between 12:03 and 11:59 is 4 minutes, but `1293-1159 = 134`
**`r paste0("(LC", chap, ".", (lc), ")")`** What can we say about the distribution of `gain`? Describe it in a few sentences using the plot and the `gain_summary` data frame values.
Most of the time the gain is a little under zero, most of the time the gain is between -50 and 50 minutes. There are some extreme cases however!
```
***
### 5MV#5: Reorder the data frame using arrange {#arrange}
One of the most common things people working with data would like to do is sort the data frames by a specific variable in a column. Have you ever been asked to calculate a median by hand? This requires you to put the data in order from smallest to highest in value. The `dplyr` package has a function called `arrange` that we will use to sort/reorder our data according to the values of the specified variable. This is often used after we have used the `group_by` and `summarize` functions as we will see.
Let's suppose we were interested in determining the most frequent destination airports from New York City in 2013:
```{r, eval}
freq_dest <- flights %>%
group_by(dest) %>%
summarize(num_flights = n())
freq_dest
```
You'll see that by default the values of `dest` are displayed in alphabetical order here. We are interested in finding those airports that appear most:
```{r}
freq_dest %>% arrange(num_flights)
```
This is actually giving us the opposite of what we are looking for. It tells us the least frequent destination airports first. To switch the ordering to be descending instead of ascending we use the `desc` (`desc`ending) function:
```{r}
freq_dest %>% arrange(desc(num_flights))
```
<!--Chapter on joins-->
## Joining data frames {#joins}
Another common task is joining/merging two different datasets. For example, in the `flights` data, the variable `carrier` lists the carrier code for the different flights. While `"UA"` and `"AA"` might be somewhat easy to guess for some (United and American Airlines), what are "VX", "HA", and "B6"? This information is provided in a separate data frame `airlines`.
```{r eval=FALSE}
View(airlines)
```
We see that in `airports`, `carrier` is the carrier code while `name` is the full name of the airline. Using this table, we can see that "VX", "HA", and "B6" correspond to Virgin America, Hawaiian Airlines, and JetBlue respectively. However, will we have to continually look up the carrier's name for each flight in the `airlines` dataset? No! Instead of having to do this manually, we can have R automatically do the "looking up" for us.
Note that the values in the variable `carrier` in `flights` match the values in the variable `carrier` in `airlines`. In this case, we can use the variable `carrier` as a *key variable* to join/merge/match the two data frames by. Hadley and Garrett [@rds2016] created the following diagram to help us understand how the different datasets are linked:
```{r reldiagram, echo=FALSE, fig.cap="Data relationships in nycflights13 from R for Data Science", purl=FALSE}
knitr::include_graphics("images/relational-nycflights.png")
```
### Joining by Key Variables
In both `flights` and `airlines`, the key variable we want to join/merge/match the two data frames with has the same name in both datasets: `carriers`. We make use of the `inner_join()` function to join by the variable `carrier`.
```{r eval=FALSE}
flights_joined <- flights %>%
inner_join(airlines, by = "carrier")
View(flights)
View(flights_joined)
```
We observed that the `flights` and `flights_joined` are identical except that `flights_joined` has an additional variable `name` whose values were drawn from `airlines`.
A visual representation of the `inner_join` is given below [@rds2016]:
```{r ijdiagram, echo=FALSE, fig.cap="Diagram of inner join from R for Data Science", purl=FALSE}
knitr::include_graphics("images/join-inner.png")
```
There are more complex joins available, but the `inner_join` will solve nearly all of the problems you'll face in our experience.
### Joining by Key Variables with Different Names
Say instead, you are interested in all the destinations of flights from NYC in 2013 and ask yourself:
- "What cities are these airports in?"
- "Is `"ORD"` Orlando?"
- "Where is `"FLL"`?
The `airports` data frame contains airport codes:
```{r eval=FALSE}
View(airports)
```
However, looking at both the `airports` and `flights` and the visual representation of the relations between the data frames in Figure \@ref(fig:ijdiagram), we see that in:
* `airports` the airport code is in the variable `faa`
* `flights` the airport code is in the variable `origin`
So to join these two datasets, our `inner_join` operation involves a `by` argument that accounts for the different names:
```{r, eval=FALSE}
flights %>%
inner_join(airports, by = c("dest" = "faa"))
```
Let's construct the sequence of commands that computes the number of flights from NYC to each destination, but also includes information about each destination airport:
```{r, eval=FALSE}
named_dests <- flights %>%
group_by(dest) %>%
summarize(num_flights = n()) %>%
arrange(desc(num_flights)) %>%
inner_join(airports, by = c("dest" = "faa")) %>%
rename(airport_name = name)
View(named_dests)
```
In case you didn't know, `"ORD"` is the airport code of Chicago O'Hare airport and `"FLL"` is the main airport in Fort Lauderdale, Florida, which we can now see in our `named_freq_dests` data frame.
***
```{block lc-join, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Looking at Figure \@ref(fig:reldiagram), when joining `flights` and `weather` (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of `year`, `month`, `day`, `hour`, and `origin`, and not just `hour`?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What surprises you about the top 10 destinations from NYC in 2013?
```{asis lc-join-solutions, include=show_solutions('5-3')}
**`r paste0("(LC", chap, ".", (lc - 1), ")")`** Looking at Figure \@ref(fig:reldiagram), when joining `flights` and `weather` (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of `year`, `month`, `day`, `hour`, and `origin`, and not just `hour`?
Because `hour` is simply a value between 0 and 23; to identify a *specific* hour, we need to know which year, month, day and at which airport.
**`r paste0("(LC", chap, ".", (lc), ")")`** What surprises you about the top 10 destinations from NYC in 2013?
This question is subjective! What surprises me is the high number of flights to Boston. Wouldn't it be easier and quicker to take the train?
```
***
<!--Subsection on Other Verbs -->
## Optional: Other verbs {#other-verbs}
On top of the following examples of other verbs, if you'd like to see more examples on using `dplyr`, the 5MV, and `%>%` with the `nycflights13` dataset, check out [Chapter 5](http://r4ds.had.co.nz/transform.html) of Hadley and Garrett's book [@rds2016].
### Select variables using select {#select}
```{r selectfig, echo=FALSE, fig.cap="Select diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE}
knitr::include_graphics("images/select.png")
```
We've seen that the `flights` data frame in the `nycflights13` package contains many different variables. The `names` function gives a listing of all the columns in a data frame; in our case you would run `names(flights)`. You can also identify these variables by running the `glimpse` function in the `dplyr` package:
```{r, eval=FALSE}
glimpse(flights)
```
However, say you only want to consider two of these variables, say `carrier` and `flight`. You can `select` these:
```{r, eval=FALSE}
flights %>%
select(carrier, flight)
```
Another one of these variables is `year`. If you remember the original description of the `flights` data frame (or by running `?flights`), you'll remember that this data correspond to flights in 2013 departing New York City. The `year` variable isn't really a variable here in that it doesn't vary... `flights` actually comes from a larger dataset that covers many years. We may want to remove the `year` variable from our dataset since it won't be helpful for analysis in this case. We can deselect `year` by using the `-` sign:
```{r, eval=FALSE}
flights_no_year <- flights %>%
select(-year)
names(flights_no_year)
```
Or we could specify a ranges of columns:
```{r, eval=FALSE}
flight_arr_times <- flights %>%
select(month:day, arr_time:sched_arr_time)
flight_arr_times
```
The `select` function can also be used to reorder columns in combination with the `everything` helper function. Let's suppose we'd like the `hour`, `minute`, and `time_hour` variables, which appear at the end of the `flights` dataset, to actually appear immediately after the `day` variable:
```{r, eval=FALSE}
flights_reorder <- flights %>%
select(month:day, hour:time_hour, everything())
names(flights_reorder)
```
in this case `everything()` picks up all remaining variables. Lastly, the helper functions `starts_with`, `ends_with`, and `contains` can be used to choose column names that match those conditions:
```{r, eval=FALSE}
flights_begin_a <- flights %>%
select(starts_with("a"))
flights_begin_a
```
```{r, eval=FALSE}
flights_delays <- flights %>%
select(ends_with("delay"))
flights_delays
```
```{r, eval=FALSE}
flights_time <- flights %>%
select(contains("time"))
flights_time
```
### Rename variables using rename {#rename}
Another useful function is `rename`, which as you may suspect renames one column to another name. Suppose we wanted `dep_time` and `arr_time` to be `departure_time` and `arrival_time` instead in the `flights_time` data frame:
```{r, eval=FALSE}
flights_time_new <- flights %>%
select(contains("time")) %>%
rename(departure_time = dep_time,
arrival_time = arr_time)
names(flights_time)
```
Note that in this case we used a single `=` sign with the `rename()`. Ex: `departure_time = dep_time`. This is because we are not testing for equality like we would using `==`, but instead we want to assign a new variable `departure_time` to have the same values as `dep_time` and then delete the variable `dep_time`.
It's easy to forget if the new name comes before or after the equals sign. I usually remember this as "New Before, Old After" or NBOA. You'll receive an error if you try to do it the other way:
```
Error: Unknown variables: departure_time, arrival_time.
```
### Find the top number of values using top_n
We can also use the `top_n` function which automatically tells us the most frequent `num_flights`. We specify the top 10 airports here:
```{r, eval=FALSE}
named_dests %>%
top_n(n = 10, wt = num_flights)
```
We'll still need to arrange this by `num_flights` though:
```{r, eval=FALSE}
named_dests %>%
top_n(n = 10, wt = num_flights) %>%
arrange(desc(num_flights))
```
**Note:** Remember that I didn't pull the `n` and `wt` arguments out of thin air. They can be found by using the `?` function on `top_n`.
We can go one stop further and tie together the `group_by` and `summarize` functions we used to find the most frequent flights:
```{r, eval=FALSE}
ten_freq_dests <- flights %>%
group_by(dest) %>%
summarize(num_flights = n()) %>%
arrange(desc(num_flights)) %>%
top_n(n = 10)
View(ten_freq_dests)
```
***
```{block lc-other-verbs, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some ways to select all three of the `dest`, `air_time`, and `distance` variables from `flights`? Give the code showing how to do this in at least three different ways.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How could one use `starts_with`, `ends_with`, and `contains` to select columns from the `flights` data frame? Provide three different examples in total: one for `starts_with`, one for `ends_with`, and one for `contains`.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why might we want to use the `select` function on a data frame?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013.
```{asis lc-other-verbs-solutions, include=show_solutions('5-4')}
**`r paste0("(LC", chap, ".", (lc - 2), ")")`** How could one use `starts_with`, `ends_with`, and `contains` to select columns from the `flights` data frame? Provide three different examples in total: one for `starts_with`, one for `ends_with`, and one for `contains`.
```
```{r, echo=show_solutions('5-4')}
library(dplyr)
library(nycflights13)
```
```{r, echo=show_solutions('5-4'), eval=FALSE}
# The regular way:
flights %>%
select(dest, air_time, distance)
# Since they are sequential columns in the dataset
flights %>%
select(dest:distance)
# Not as effective, by removing everything else
flights %>%
select(-year, -month, -day, -dep_time, -sched_dep_time, -dep_delay, -arr_time,
-sched_arr_time, -arr_delay, -carrier, -flight, -tailnum, -origin,
-hour, -minute, -time_hour)
```
```{asis, include=show_solutions('5-4')}
**`r paste0("(LC", chap, ".", (lc - 1), ")")`** Why might we want to use the `select` function on a data frame?
```
```{r, echo=TRUE, eval=FALSE}
# Anything that starts with "d"
flights %>%
select(starts_with("d"))
# Anything related to delays:
flights %>%
select(ends_with("delay"))
# Anything related to departures:
flights %>%
select(contains("dep"))
```
```{asis, include=show_solutions('5-4')}
**`r paste0("(LC", chap, ".", (lc), ")")`** Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013.To narrow down the data frame, to make it easier to look at. Using `View()` for example.
```
***
<!--Subsection on Conclusion -->
## Conclusion
### Review Questions
Review questions have been designed using the `fivethirtyeight` R package [@R-fivethirtyeight] with links to the corresponding FiveThirtyEight.com articles in our free DataCamp course **Effective Data Storytelling using the `tidyverse`**. The material in this chapter is covered in the chapters of the DataCamp course available below:
- [Filtering, Grouping, & Summarizing](https://campus.datacamp.com/courses/effective-data-storytelling-using-the-tidyverse/filtering-grouping-summarizing)
- [dplyr Review](https://campus.datacamp.com/courses/effective-data-storytelling-using-the-tidyverse/dplyr-review-8?ex=1)
### What's to come?
This concludes the **Data Exploration** unit of this book. You should start feeling more and more confident about both plotting variables (or multiple variables together) in various datasets and wrangling data as we've done in this chapter. You are encouraged to step back through the code in earlier chapters and make changes as you see fit based on your updated knowledge.
In Chapter \@ref(sim), we'll begin to build the pieces needed to understand how this unit of **Data Exploration** can tie into statistical inference in the **Inference** part of the book. Remember that the focus throughout is on data visualization and we'll see that next when we discuss sampling, resampling, and bootstrapping. These ideas will lead us into hypothesis testing and confidence intervals.
### Resources
As we saw with the RStudio cheatsheet on [data visualization](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf), RStudio has also created a cheatsheet for data wrangling entitled ["Data Transformation with dplyr"](https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf).
<!--
* By clicking [here](https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf)
* Or by clicking the RStudio Menu Bar -> Help -> Cheatsheets -> "Data Manipulation with `dplyr`, `tidyr`"
(The RStudio interface has not been updated to include JUST the dplyr cheatsheet)
We will focus only on the `dplyr` functions in this book, but you are encouraged to also explore `tidyr` if you are presented with data that is not in the tidy format that we have specified as the preferred option for our purposes.
-->
### Script of R code
```{r include=FALSE, eval=FALSE, purl=FALSE}
knitr::purl("05-wrangling.Rmd", "docs/scripts/05-wrangling.R")
```
An R script file of all R code used in this chapter is available [here](http://ismayc.github.io/moderndiver-book/scripts/05-manip.R).