Skip to content

Latest commit

 

History

History
116 lines (99 loc) · 3.24 KB

ex07_group-by-summarise.md

File metadata and controls

116 lines (99 loc) · 3.24 KB

Work on groups of rows via dplyr::group_by() + summarise()

Jenny Bryan 2018-04-11

What if you need to work on groups of rows? Such as the groups induced by the levels of a factor.

You do not need to … split the data frame into mini-data-frames, loop over them, and glue it all back together.

Instead, use dplyr::group_by(), followed by dplyr::summarize(), to compute group-wise summaries.

library(tidyverse)

iris %>%
  group_by(Species) %>%
  summarise(pl_avg = mean(Petal.Length), pw_avg = mean(Petal.Width))
#> # A tibble: 3 x 3
#>   Species    pl_avg pw_avg
#>   <fct>       <dbl>  <dbl>
#> 1 setosa       1.46  0.246
#> 2 versicolor   4.26  1.33 
#> 3 virginica    5.55  2.03

What if you want to return summaries that are not just a single number?

This does not “just work”.

iris %>%
  group_by(Species) %>%
  summarise(pl_qtile = quantile(Petal.Length, c(0.25, 0.5, 0.75)))
#> Error in summarise_impl(.data, dots): Column `pl_qtile` must be length 1 (a summary value), not 3

Solution: package as a length-1 list that contains 3 values, creating a list-column.

iris %>%
  group_by(Species) %>%
  summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75))))
#> # A tibble: 3 x 2
#>   Species    pl_qtile 
#>   <fct>      <list>   
#> 1 setosa     <dbl [3]>
#> 2 versicolor <dbl [3]>
#> 3 virginica  <dbl [3]>

Q from @jcpsantiago via Twitter: How would you unnest so the final output is a data frame with a factor column quantile with levels “25%”, “50%”, and “75%”?

A: I would map() tibble::enframe() on the new list column, to convert each entry from named list to a two-column data frame. Then use tidyr::unnest() to get rid of the list column and return to a simple data frame and, if you like, convert quantile into a factor.

iris %>%
  group_by(Species) %>%
  summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75)))) %>%
  mutate(pl_qtile = map(pl_qtile, enframe, name = "quantile")) %>%
  unnest() %>%
  mutate(quantile = factor(quantile))
#> # A tibble: 9 x 3
#>   Species    quantile value
#>   <fct>      <fct>    <dbl>
#> 1 setosa     25%       1.40
#> 2 setosa     50%       1.50
#> 3 setosa     75%       1.58
#> 4 versicolor 25%       4.00
#> 5 versicolor 50%       4.35
#> 6 versicolor 75%       4.60
#> 7 virginica  25%       5.10
#> 8 virginica  50%       5.55
#> 9 virginica  75%       5.88

If something like this comes up a lot in an analysis, you could package the key “moves” in a function, like so:

enquantile <- function(x, ...) {
  qtile <- enframe(quantile(x, ...), name = "quantile")
  qtile$quantile <- factor(qtile$quantile)
  list(qtile)
}

This makes repeated downstream usage more concise.

iris %>%
  group_by(Species) %>%
  summarise(pl_qtile = enquantile(Petal.Length, c(0.25, 0.5, 0.75))) %>%
  unnest()
#> # A tibble: 9 x 3
#>   Species    quantile value
#>   <fct>      <fct>    <dbl>
#> 1 setosa     25%       1.40
#> 2 setosa     50%       1.50
#> 3 setosa     75%       1.58
#> 4 versicolor 25%       4.00
#> 5 versicolor 50%       4.35
#> 6 versicolor 75%       4.60
#> 7 virginica  25%       5.10
#> 8 virginica  50%       5.55
#> 9 virginica  75%       5.88