dance

Dancing 💃 with the stats, aka tibble() dancing 🕺. dance is a sort of reinvention of dplyr's classic verbs on a more modern stack: it leans heavily on vctrs and rlang.

Installation

You can install the development version from GitHub.

# install.packages("pak")
pak::pkg_install("romainfrancois/dance")

Usage

We’ll illustrate tibble dancing with iris grouped by Species.

library(dance)
g <- iris %>% group_by(Species)

waltz(), polka(), tango(), charleston()

These are in the neighborhood of dplyr::summarise().

waltz() takes a grouped tibble and a list of formulas, and returns a tibble with one column per supplied formula and one row per group. It does not prepend the grouping variables (see tango() for that).

g %>% 
  waltz(
    Sepal.Length = ~mean(Sepal.Length), 
    Sepal.Width  = ~mean(Sepal.Width)
  )
#> # A tibble: 3 x 2
#>   Sepal.Length Sepal.Width
#>          <dbl>       <dbl>
#> 1         5.01        3.43
#> 2         5.94        2.77
#> 3         6.59        2.97

polka() peels off one layer of grouping and returns the grouping variables, one row per group:

g %>% 
  polka()
#> # A tibble: 3 x 1
#>   Species   
#>   <fct>     
#> 1 setosa    
#> 2 versicolor
#> 3 virginica

tango() binds the results of polka() and waltz(), so it is the closest to dplyr::summarise():

g %>% 
  tango(
    Sepal.Length = ~mean(Sepal.Length), 
    Sepal.Width  = ~mean(Sepal.Width)
  )
#> # A tibble: 3 x 3
#>   Species    Sepal.Length Sepal.Width
#>   <fct>             <dbl>       <dbl>
#> 1 setosa             5.01        3.43
#> 2 versicolor         5.94        2.77
#> 3 virginica          6.59        2.97
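
For readers coming from dplyr, the relationship between polka(), waltz() and tango() can be sketched with dplyr itself. This is only an approximation of the outputs shown above, assuming dplyr is installed; it says nothing about how dance computes them. The names keys, summaries and result are made up for the example.

```r
library(dplyr)

g <- iris %>% group_by(Species)

# Roughly polka(): the grouping columns only, one row per group.
keys <- group_keys(g)

# Roughly waltz(): the summaries, without the grouping columns.
summaries <- g %>%
  summarise(
    Sepal.Length = mean(Sepal.Length),
    Sepal.Width  = mean(Sepal.Width)
  ) %>%
  ungroup() %>%
  select(-Species)

# Roughly tango(): keys and summaries side by side.
result <- bind_cols(keys, summaries)
result
```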

charleston() is like tango(), but it packs the new columns into a tibble column:

g %>% 
  charleston(
    Sepal.Length = ~mean(Sepal.Length), 
    Sepal.Width  = ~mean(Sepal.Width)
  )
#> # A tibble: 3 x 2
#>   Species    data$Sepal.Length $Sepal.Width
#>   <fct>                  <dbl>        <dbl>
#> 1 setosa                  5.01         3.43
#> 2 versicolor              5.94         2.77
#> 3 virginica               6.59         2.97
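
The data$Sepal.Length header above is how tibble prints a data frame column: data is a single column that is itself a tibble, which is why the result reports only 2 columns. A minimal sketch of that structure, built by hand with the tibble package (packed is a hypothetical name, and the values are the rounded means from the output above):

```r
library(tibble)

# One ordinary column plus one column that is itself a tibble.
packed <- tibble(
  Species = factor(c("setosa", "versicolor", "virginica")),
  data = tibble(
    Sepal.Length = c(5.01, 5.94, 6.59),
    Sepal.Width  = c(3.43, 2.77, 2.97)
  )
)

ncol(packed)             # Species and data
names(packed$data)       # names of the inner columns
packed$data$Sepal.Width  # an inner column is reached with two `$`
```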

swing, twist

There is no waltz_at(), tango_at(), etc. Instead, we can apply either the same function to a set of columns or a set of functions to the same column.

For this, we need to learn new dance moves:

swing() and twist() are for applying the same function to a set of columns:

library(tidyselect)

g %>% 
  tango(swing(mean, starts_with("Petal")))
#> # A tibble: 3 x 3
#>   Species    Petal.Length Petal.Width
#>   <fct>             <dbl>       <dbl>
#> 1 setosa             1.46       0.246
#> 2 versicolor         4.26       1.33 
#> 3 virginica          5.55       2.03

g %>% 
  tango(data = twist(mean, starts_with("Petal")))
#> # A tibble: 3 x 2
#>   Species    data$Petal.Length $Petal.Width
#>   <fct>                  <dbl>        <dbl>
#> 1 setosa                  1.46        0.246
#> 2 versicolor              4.26        1.33 
#> 3 virginica               5.55        2.03

They differ in the type of column they create and in how the columns are named:

  • swing() makes as many new columns as are selected by the tidy selection. The columns are named using a .name glue pattern, so we can swing() several times.
g %>% 
  tango(
    swing(mean, starts_with("Petal"), .name = "mean_{var}"), 
    swing(median, starts_with("Petal"), .name = "median_{var}"), 
  )
#> # A tibble: 3 x 5
#>   Species mean_Petal.Leng… mean_Petal.Width median_Petal.Le…
#>   <fct>              <dbl>            <dbl>            <dbl>
#> 1 setosa              1.46            0.246             1.5 
#> 2 versic…             4.26            1.33              4.35
#> 3 virgin…             5.55            2.03              5.55
#> # … with 1 more variable: median_Petal.Width <dbl>
  • twist() instead creates a single data frame column.
g %>% 
  tango(
    mean   = twist(mean, starts_with("Petal")), 
    median = twist(median, starts_with("Petal")), 
  )
#> # A tibble: 3 x 3
#>   Species    mean$Petal.Length $Petal.Width median$Petal.Leng… $Petal.Width
#>   <fct>                  <dbl>        <dbl>              <dbl>        <dbl>
#> 1 setosa                  1.46        0.246               1.5           0.2
#> 2 versicolor              4.26        1.33                4.35          1.3
#> 3 virginica               5.55        2.03                5.55          2
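
The .name argument is described above as a glue pattern. As a rough illustration of what such a pattern expands to, here is the glue package applied directly to the selected column names. How dance evaluates the pattern internally is not shown in this README, so treat this as an assumption; selected and new_names are made-up names.

```r
library(glue)

selected <- c("Petal.Length", "Petal.Width")

# `{var}` is replaced by each selected column name in turn.
new_names <- as.character(glue("mean_{var}", var = selected))
new_names
#> [1] "mean_Petal.Length" "mean_Petal.Width"
```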

The first argument of swing() and twist() is either a function or a formula that uses . as a placeholder. Subsequent arguments are tidyselect selections.
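
The formula shorthand reads like the usual rlang lambda, where . stands for the value being transformed. The sketch below uses rlang::as_function() purely to show what such a formula means; whether dance calls it internally is an assumption, and center is a made-up name.

```r
library(rlang)

# A formula with `.` as placeholder becomes a one-argument function.
center <- as_function(~ . - mean(.))
center(c(1, 2, 6))
#> [1] -2 -1  3

# A plain function is passed through unchanged.
identical(as_function(mean), mean)
#> [1] TRUE
```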

You can combine swing() and twist() in the same tango() or waltz():

g %>% 
  tango(
    swing(mean, starts_with("Petal"), .name = "mean_{var}"), 
    median = twist(median, contains("."))
  )
#> # A tibble: 3 x 4
#>   Species mean_Petal.Leng… mean_Petal.Width median$Sepal.Le… $Sepal.Width
#>   <fct>              <dbl>            <dbl>            <dbl>        <dbl>
#> 1 setosa              1.46            0.246              5            3.4
#> 2 versic…             4.26            1.33               5.9          2.8
#> 3 virgin…             5.55            2.03               6.5          3  
#> # … with 2 more variables: $Petal.Length <dbl>, $Petal.Width <dbl>

rumba, zumba

Similarly, rumba() and zumba() apply several functions to a single column. rumba() creates one column per function, and zumba() packs them into a data frame column.

g %>% 
  tango(
    rumba(Sepal.Width, mean = mean, median = median, .name = "Sepal_{fun}"), 
    Petal = zumba(Petal.Width, mean = mean, median = median)
  )
#> # A tibble: 3 x 4
#>   Species    Sepal_mean Sepal_median Petal$mean $median
#>   <fct>           <dbl>        <dbl>      <dbl>   <dbl>
#> 1 setosa           3.43          3.4      0.246     0.2
#> 2 versicolor       2.77          2.8      1.33      1.3
#> 3 virginica        2.97          3        2.03      2
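
In dplyr, the closest counterpart to rumba() is probably across() with a named list of functions. This sketch assumes dplyr 1.0.0 or later is installed and reproduces the Sepal_mean and Sepal_median columns from the output above; sepal_stats is a made-up name.

```r
library(dplyr)

sepal_stats <- iris %>%
  group_by(Species) %>%
  summarise(
    # `{.fn}` is replaced by the name of each function in the list.
    across(Sepal.Width, list(mean = mean, median = median), .names = "Sepal_{.fn}")
  )
sepal_stats
```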

salsa, chacha, samba, madison

Now we enter the realm of dplyr::mutate() with:

  • salsa() : to create new columns
  • chacha(): to reorganize a grouped tibble so that data for each group is contiguous
  • samba() : chacha() + salsa()
g %>% 
  salsa(
    Sepal = ~Sepal.Length * Sepal.Width, 
    Petal = ~Petal.Length * Petal.Width
  )
#> # A tibble: 150 x 2
#>    Sepal Petal
#>    <dbl> <dbl>
#>  1  17.8 0.280
#>  2  14.7 0.280
#>  3  15.0 0.26 
#>  4  14.3 0.3  
#>  5  18   0.280
#>  6  21.1 0.68 
#>  7  15.6 0.42 
#>  8  17   0.3  
#>  9  12.8 0.280
#> 10  15.2 0.15 
#> # … with 140 more rows

You can use swing(), twist(), rumba() and zumba() here too. If you also want the original data, use samba() instead of salsa():

g %>% 
  samba(centered = twist(~ . - mean(.), everything(), -Species))
#> # A tibble: 150 x 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows, and 4 more variables: centered$Sepal.Length <dbl>,
#> #   $Sepal.Width <dbl>, $Petal.Length <dbl>, $Petal.Width <dbl>
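
As a point of comparison, salsa() and samba() behave roughly like dplyr::transmute() and dplyr::mutate(). The sketch below uses dplyr only, assuming it is installed. The names areas, centered and centered_Sepal.Length are made up for the example, and the result is a flat column rather than the packed centered column shown above.

```r
library(dplyr)

# Roughly salsa(): only the new columns.
areas <- iris %>%
  transmute(
    Sepal = Sepal.Length * Sepal.Width,
    Petal = Petal.Length * Petal.Width
  )

# Roughly samba(): the original columns plus a per-group computation.
centered <- iris %>%
  group_by(Species) %>%
  mutate(centered_Sepal.Length = Sepal.Length - mean(Sepal.Length)) %>%
  ungroup()
```

Because mean() is evaluated per group, each species ends up centered on its own mean rather than on the overall mean.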

madison() packs the columns salsa() would have created into a data frame column:

g %>% 
  madison(swing(~ . - mean(.), starts_with("Sepal")))
#> # A tibble: 150 x 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows, and 2 more variables: data$Sepal.Length <dbl>,
#> #   $Sepal.Width <dbl>

bolero and mambo

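bolero() is similar to dplyr::filter(). If you want to apply the same predicate to a tidyselection of columns, mambo() can build the formulas for you. Judging by the outputs below, the predicate must hold for every selected column by default, and for at least one of them when .op = or is supplied: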

g %>% 
  bolero(~Sepal.Width > 4)
#> # A tibble: 3 x 5
#> # Groups:   Species [3]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.7         4.4          1.5         0.4 setosa 
#> 2          5.2         4.1          1.5         0.1 setosa 
#> 3          5.5         4.2          1.4         0.2 setosa

g %>% 
  bolero(mambo(~. > 4, starts_with("Sepal")))
#> # A tibble: 3 x 5
#> # Groups:   Species [3]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.7         4.4          1.5         0.4 setosa 
#> 2          5.2         4.1          1.5         0.1 setosa 
#> 3          5.5         4.2          1.4         0.2 setosa

g %>% 
  bolero(mambo(~. > 4, starts_with("Sepal"), .op = or))
#> # A tibble: 150 x 5
#> # Groups:   Species [3]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows
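
For comparison, the three bolero() calls above map onto dplyr::filter(), with if_all() and if_any() playing the role that mambo() plays here. This is a sketch that assumes dplyr 1.0.4 or later is installed; wide, both and either are made-up names.

```r
library(dplyr)

# A single predicate on one column.
wide <- iris %>% filter(Sepal.Width > 4)

# The same predicate on every Sepal column, combined with "and".
both <- iris %>% filter(if_all(starts_with("Sepal"), ~ .x > 4))

# The same predicate, combined with "or".
either <- iris %>% filter(if_any(starts_with("Sepal"), ~ .x > 4))

c(nrow(wide), nrow(both), nrow(either))
#> [1]   3   3 150
```

Every Sepal.Length in iris exceeds 4, which is why the "or" version keeps all 150 rows and the "and" version keeps the same 3 rows as the Sepal.Width filter.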