# The Hadleyverse {#hadley}
The _Hadleyverse_, short for "Hadley Wickham's universe", is a set of packages that make it easier to handle data.
If you are developing packages, be careful: depending on these packages may introduce many dependencies and compatibility issues.
If you are analyzing data, and the portability of your functions to other users, machines, and operating systems is not a concern, you will LOVE these packages.
The term Hadleyverse refers to __all__ of Hadley's packages, but here we mention only a useful subset, which can be installed collectively via the __tidyverse__ package (as shown right after the list):
- __ggplot2__ for data visualization. See the Plotting Chapter \@ref(plotting).
- __dplyr__ for data manipulation.
- __tidyr__ for data tidying.
- __readr__ for data import.
- __stringr__ for character strings.
- __anytime__ for time data.
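As noted above, the collection can be installed and loaded in one go via the meta-package:
```{r, eval=FALSE}
# One-time install of the tidyverse meta-package, then load it
install.packages("tidyverse")
library(tidyverse)
```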
## readr
The __readr__ package [@readr] replaces base functions for importing and exporting data, such as `read.table`.
It is faster and has a cleaner syntax.
We will not go into the details, and refer the reader to the official documentation [here](http://readr.tidyverse.org/articles/readr.html) and to the [R for data science](http://r4ds.had.co.nz/data-import.html) book.
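As a small illustration, here is a sketch of a typical import-export round trip with __readr__; the file name `flights.csv` is hypothetical.
```{r, eval=FALSE}
library(readr)
# read_csv() is the readr counterpart of read.csv(): faster,
# does not convert strings to factors, and returns a tibble.
flights <- read_csv("flights.csv")      # hypothetical file name
write_csv(flights, "flights_copy.csv")  # the matching export function
```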
## dplyr
When you think of data frame operations, think __dplyr__ [@dplyr].
Notable utilities in the package include:
- `select()` Select columns from a data frame.
- `filter()` Filter rows according to some condition(s).
- `arrange()` Sort / re-order rows in a data frame.
- `mutate()` Create new columns or transform existing ones.
- `group_by()` Group a data frame by some factor(s), usually in conjunction with `summarise()`.
- `summarise()` Summarize some values from the data frame, or across groups.
- `inner_join(x, y, by = "col")` Return all rows from `x` where there are matching values in `y`, and all columns from `x` and `y`. If there are multiple matches between `x` and `y`, all combinations of the matches are returned.
- `left_join(x, y, by = "col")` Return all rows from `x`, and all columns from `x` and `y`. Rows in `x` with no match in `y` will have `NA` values in the new columns. If there are multiple matches between `x` and `y`, all combinations of the matches are returned.
- `right_join(x, y, by = "col")` Return all rows from `y`, and all columns from `x` and `y`. Rows in `y` with no match in `x` will have `NA` values in the new columns. If there are multiple matches between `x` and `y`, all combinations of the matches are returned.
- `anti_join(x, y, by = "col")` Return all rows from `x` where there are no matching values in `y`, keeping only the columns from `x`.
The following examples involve `data.frame` objects, but __dplyr__ can handle other classes as well.
In particular, it can handle `data.table` objects from the __data.table__ package [@datatable], which is designed for very large data sets.
__dplyr__ can also work with data stored in a database.
In that case, it converts your commands to the appropriate SQL syntax and issues them to the database.
This has the advantage that (a) you do not need to know the specific SQL implementation of your database, and (b) you can enjoy the optimized algorithms provided by the database supplier.
For more on this, see the [databases vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html).
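As a rough sketch of this database workflow (assuming the __DBI__, __RSQLite__, and __dbplyr__ packages are installed), dplyr code on a remote table is translated to SQL, which `show_query()` reveals:
```{r, eval=FALSE}
library(dplyr)
# An in-memory SQLite database serves as a stand-in for a real database.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, nycflights13::flights, "flights")  # load the data into the database
flights_db <- tbl(con, "flights")               # a remote table, not a data frame
flights_db %>%
  filter(month == 1, day == 1) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  show_query()                                  # the SQL that dplyr generates
DBI::dbDisconnect(con)
```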
The following examples are taken from [Kevin Markham](https://github.com/justmarkham/dplyr-tutorial/blob/master/dplyr-tutorial.Rmd).
The `nycflights13::flights` data contains delay data for US flights.
```{r}
library(nycflights13)
flights
```
The data is of class `tbl_df`, which is an extension of the `data.frame` class, designed for large data sets.
Notice that the printing of `flights` is short, even without calling the `head` function. This is a feature of the `tbl_df` class: printing a plain `data.frame` would try to print all the rows, which can take a long time.
```{r}
class(flights) # a tbl_df is an extension of the data.frame class
```
Let's filter the observations from the first day of the first month.
Notice how much more readable the __dplyr__ syntax is, especially with piping, compared to the base syntax.
```{r, eval=FALSE}
flights[flights$month == 1 & flights$day == 1, ] # old style
library(dplyr)
filter(flights, month == 1, day == 1) #dplyr style
flights %>% filter(month == 1, day == 1) # dplyr with piping.
```
More single-table operations: filtering, sorting, selecting, and transforming.
```{r, eval=FALSE}
filter(flights, month == 1 | month == 2) # First OR second month.
slice(flights, 1:10) # selects first ten rows.
arrange(flights, year, month, day) # sort
arrange(flights, desc(arr_delay)) # sort descending
select(flights, year, month, day) # select columns year, month, and day
select(flights, year:day) # select column range
select(flights, -(year:day)) # drop columns
rename(flights, tail_num = tailnum) # rename column tailnum to tail_num
# add new computed columns
mutate(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60)
# you can refer to columns you just created! (gain)
mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
# keep only the new variables, dropping the original columns
transmute(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
# simple statistics
summarise(flights,
  delay = mean(dep_delay, na.rm = TRUE)
)
# random subsample
sample_n(flights, 10)
sample_frac(flights, 0.01)
```
We now perform operations on subgroups.
We group observations by the plane's tail number (`tailnum`), and compute the count, average distance traveled, and average delay.
We group with `group_by`, and compute subgroup statistics with `summarise`.
```{r, eval=FALSE}
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
avg.dist = mean(distance, na.rm = TRUE),
avg.delay = mean(arr_delay, na.rm = TRUE))
delay
```
We can group by several variables, with a hierarchy.
We then collapse the hierarchy one level at a time.
```{r, eval=FALSE}
daily <- group_by(flights, year, month, day)
per_day <- summarise(daily, flights = n())
per_month <- summarise(per_day, flights = sum(flights))
per_year <- summarise(per_month, flights = sum(flights))
```
Things to note:
- Every call to `summarise` collapses one level in the grouping hierarchy. The output of `group_by` remembers the hierarchy of aggregation, and `summarise` collapses along this hierarchy, as the sketch below illustrates.
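A small sketch of this peeling, using `group_vars()` to inspect the grouping variables that remain after each `summarise` (continuing with `daily` from the previous chunk):
```{r, eval=FALSE}
group_vars(daily)      # "year" "month" "day"
per_day <- summarise(daily, flights = n())
group_vars(per_day)    # "year" "month" -- one grouping level collapsed
per_month <- summarise(per_day, flights = sum(flights))
group_vars(per_month)  # "year" -- another level collapsed
```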
We can use __dplyr__ for two-table operations, i.e., _joins_.
For this, we join the flight data with the airline, weather, plane, and airport data that accompany it in __nycflights13__.
```{r}
library(dplyr)
airlines
# select the subset of interesting flight data.
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
# left joins, automatically matching on the common columns.
flights2 %>% left_join(airlines)
flights2 %>% left_join(weather)
# left join, with the matching column given by name
flights2 %>% left_join(planes, by = "tailnum")
# left join where the key columns have different names
flights2 %>% left_join(airports, by = c("dest" = "faa"))
```
Types of joins, with their SQL equivalents.
```{r}
# Create simple data
(df1 <- tibble(x = c(1, 2), y = 2:1))
(df2 <- tibble(x = c(1, 3), a = 10, b = "a"))
# Return only matched rows
df1 %>% inner_join(df2) # SELECT * FROM x JOIN y ON x.a = y.a
# Return all rows in df1.
df1 %>% left_join(df2) # SELECT * FROM x LEFT JOIN y ON x.a = y.a
# Return all rows in df2.
df1 %>% right_join(df2) # SELECT * FROM x RIGHT JOIN y ON x.a = y.a
# Return all rows.
df1 %>% full_join(df2) # SELECT * FROM x FULL JOIN y ON x.a = y.a
# Return the rows of df1 that have a match in df2, keeping only df1's columns
df1 %>% semi_join(df2, by = "x") # SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
```
## tidyr
## reshape2
## stringr
## anytime
## Bibliographic Notes
## Practice Yourself