# The Hadleyverse {#hadley}
The _Hadleyverse_, short for "Hadley Wickham's universe", is a set of packages that make it easier to handle data.
If you are developing packages, be careful: depending on these packages may introduce many dependencies and compatibility issues.
If you are analyzing data, and the portability of your functions to other users, machines, and operating systems is not a concern, you will LOVE these packages.
The term Hadleyverse refers to __all__ of Hadley's packages, but here we mention only a useful subset, which can be installed collectively via the __tidyverse__ package (as shown right after the list):
- __ggplot2__ for data visualization. See the Plotting Chapter \@ref(plotting).
- __dplyr__ for data manipulation.
- __tidyr__ for data tidying.
- __readr__ for data import.
- __stringr__ for character strings.
- __anytime__ for time data.
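As noted above, the collection can be installed and loaded in one go via the meta-package:
```{r, eval=FALSE}
# One-time install of the tidyverse meta-package, then load it
install.packages("tidyverse")
library(tidyverse)
```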
## readr
The __readr__ package [@readr] replaces base functions for importing and exporting data, such as `read.table`.
It is faster and has a cleaner syntax.
We will not go into the details, and refer the reader to the official documentation [here](http://readr.tidyverse.org/articles/readr.html) and to the [R for data science](http://r4ds.had.co.nz/data-import.html) book.
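As a small illustration, here is a sketch of a typical import-export round trip with __readr__; the file name `flights.csv` is hypothetical.
```{r, eval=FALSE}
library(readr)
# read_csv() is the readr counterpart of read.csv(): faster,
# does not convert strings to factors, and returns a tibble.
flights <- read_csv("flights.csv")      # hypothetical file name
write_csv(flights, "flights_copy.csv")  # the matching export function
```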
## dplyr
When you think of data frame operations, think __dplyr__ [@dplyr].
Notable utilities in the package include:
- `select()` Select columns from a data frame.
- `filter()` Filter rows according to some condition(s).
- `arrange()` Sort / re-order rows in a data frame.
- `mutate()` Create new columns or transform existing ones.
- `group_by()` Group a data frame by some factor(s), usually in conjunction with `summarise()`.
- `summarise()` Summarize some values from the data frame, or across groups.
- `inner_join(x, y, by = "col")` Return all rows from `x` where there are matching values in `y`, and all columns from `x` and `y`. If there are multiple matches between `x` and `y`, all combinations of the matches are returned.
- `left_join(x, y, by = "col")` Return all rows from `x`, and all columns from `x` and `y`. Rows in `x` with no match in `y` will have `NA` values in the new columns. If there are multiple matches between `x` and `y`, all combinations of the matches are returned.
- `right_join(x, y, by = "col")` Return all rows from `y`, and all columns from `x` and `y`. Rows in `y` with no match in `x` will have `NA` values in the new columns. If there are multiple matches between `x` and `y`, all combinations of the matches are returned.
- `anti_join(x, y, by = "col")` Return all rows from `x` where there are no matching values in `y`, keeping only the columns from `x`.
The following examples involve `data.frame` objects, but __dplyr__ can handle other classes as well.
In particular, it can handle `data.table` objects from the __data.table__ package [@datatable], which is designed for very large data sets.
__dplyr__ can also work with data stored in a database.
In that case, it converts your commands to the appropriate SQL syntax and issues them to the database.
This has the advantage that (a) you do not need to know the specific SQL implementation of your database, and (b) you can enjoy the optimized algorithms provided by the database supplier.
For more on this, see the [databases vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html).
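As a rough sketch of this database workflow (assuming the __DBI__, __RSQLite__, and __dbplyr__ packages are installed), dplyr code on a remote table is translated to SQL, which `show_query()` reveals:
```{r, eval=FALSE}
library(dplyr)
# An in-memory SQLite database serves as a stand-in for a real database.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, nycflights13::flights, "flights")  # load the data into the database
flights_db <- tbl(con, "flights")               # a remote table, not a data frame
flights_db %>%
  filter(month == 1, day == 1) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  show_query()                                  # the SQL that dplyr generates
DBI::dbDisconnect(con)
```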
The following examples are taken from [Kevin Markham](https://github.com/justmarkham/dplyr-tutorial/blob/master/dplyr-tutorial.Rmd).
The `nycflights13::flights` data contains delay data for US flights.
```{r}
library(nycflights13)
flights
```
The data is of class `tbl_df`, which is an extension of the `data.frame` class, designed for large data sets.
Notice that the printing of `flights` is short, even without calling the `head` function. This is a feature of the `tbl_df` class: printing a plain `data.frame` would try to print all the rows, which can take a long time.
```{r}
class(flights) # a tbl_df is an extension of the data.frame class
```
Let's filter the observations from the first day of the first month.
Notice how much more readable the __dplyr__ syntax is, especially with piping, compared to the base syntax.
```{r, eval=FALSE}
flights[flights$month == 1 & flights$day == 1, ] # old style
library(dplyr)
filter(flights, month == 1, day == 1) #dplyr style
flights %>% filter(month == 1, day == 1) # dplyr with piping.
```
More single-table operations: filtering, sorting, selecting, and transforming.
```{r, eval=FALSE}
filter(flights, month == 1 | month == 2) # First OR second month.
slice(flights, 1:10) # selects first ten rows.
arrange(flights, year, month, day) # sort
arrange(flights, desc(arr_delay)) # sort descending
select(flights, year, month, day) # select columns year, month, and day
select(flights, year:day) # select column range
select(flights, -(year:day)) # drop columns
rename(flights, tail_num = tailnum) # rename column tailnum to tail_num
# add new computed columns
mutate(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60)
# you can refer to columns you just created! (gain)
mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
# keep only the new variables, dropping the original columns
transmute(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
# simple statistics
summarise(flights,
  delay = mean(dep_delay, na.rm = TRUE)
)
# random subsample
sample_n(flights, 10)
sample_frac(flights, 0.01)
```
We now perform operations on subgroups.
We group observations by the plane's tail number (`tailnum`), and compute the count, average distance traveled, and average delay.
We group with `group_by`, and compute subgroup statistics with `summarise`.
```{r, eval=FALSE}
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
avg.dist = mean(distance, na.rm = TRUE),
avg.delay = mean(arr_delay, na.rm = TRUE))
delay
```
We can group by several variables, with a hierarchy.
We then collapse the hierarchy one level at a time.
```{r, eval=FALSE}
daily <- group_by(flights, year, month, day)
per_day <- summarise(daily, flights = n())
per_month <- summarise(per_day, flights = sum(flights))
per_year <- summarise(per_month, flights = sum(flights))
```
Things to note:
- Every call to `summarise` collapses one level in the grouping hierarchy. The output of `group_by` remembers the hierarchy of aggregation, and `summarise` collapses along this hierarchy, as the sketch below illustrates.
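A small sketch of this peeling, using `group_vars()` to inspect the grouping variables that remain after each `summarise` (continuing with `daily` from the previous chunk):
```{r, eval=FALSE}
group_vars(daily)      # "year" "month" "day"
per_day <- summarise(daily, flights = n())
group_vars(per_day)    # "year" "month" -- one grouping level collapsed
per_month <- summarise(per_day, flights = sum(flights))
group_vars(per_month)  # "year" -- another level collapsed
```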
We can use __dplyr__ for two-table operations, i.e., _joins_.
For this, we join the flight data with the airline, weather, plane, and airport data that accompany it in __nycflights13__.
```{r}
library(dplyr)
airlines
# select the subset of interesting flight data.
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
# left joins, automatically matching on the common columns.
flights2 %>% left_join(airlines)
flights2 %>% left_join(weather)
# left join, with the matching column given by name
flights2 %>% left_join(planes, by = "tailnum")
# left join where the key columns have different names
flights2 %>% left_join(airports, by = c("dest" = "faa"))
```
Types of joins, with their SQL equivalents.
```{r}
# Create simple data
(df1 <- tibble(x = c(1, 2), y = 2:1))
(df2 <- tibble(x = c(1, 3), a = 10, b = "a"))
# Return only matched rows
df1 %>% inner_join(df2) # SELECT * FROM x JOIN y ON x.a = y.a
# Return all rows in df1.
df1 %>% left_join(df2) # SELECT * FROM x LEFT JOIN y ON x.a = y.a
# Return all rows in df2.
df1 %>% right_join(df2) # SELECT * FROM x RIGHT JOIN y ON x.a = y.a
# Return all rows.
df1 %>% full_join(df2) # SELECT * FROM x FULL JOIN y ON x.a = y.a
# Return the rows of df1 that have a match in df2, keeping only df1's columns
df1 %>% semi_join(df2, by = "x") # SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
```
## tidyr
## reshape2
## stringr
## anytime
## Bibliographic Notes
## Practice Yourself