pwiser

PACKAGE NO LONGER MAINTAINED. See across2x() in dplyover for roughly equivalent functionality to that shown in README below.

The goal of pwiser is to make applying arbitrary functions across combinations of columns within {dplyr} easy. Currently, the only function is pairwise(), which applies a function to all pairs of columns.

pairwise() is an altered version of dplyr::across() and, similarly, is meant to be used within the mutate() / transmute() and summarise() verbs. pwiser sprang from conversations on an RStudio Community thread and related conversations.
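To make "all pairs of columns" concrete, here is a minimal base R sketch of the idea. It is not how pwiser is implemented, and the toy data frame `df` is made up for illustration; it only shows a two-argument function being applied to every unordered pair of columns.

```r
# Base R sketch (not pwiser's implementation): apply a two-argument
# function to every unordered pair of columns in a data frame.
df <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6), c = c(3, 1, 2))

# combn() enumerates the unordered pairs of column names as a 2-row matrix
col_pairs <- combn(names(df), 2)

# Apply `-` to each pair; name each result after the two columns involved
pair_diffs <- lapply(seq_len(ncol(col_pairs)), function(i) {
  df[[col_pairs[1, i]]] - df[[col_pairs[2, i]]]
})
names(pair_diffs) <- paste(col_pairs[1, ], col_pairs[2, ], sep = "_")

as.data.frame(pair_diffs)
```

pairwise() wraps this kind of bookkeeping so that it can be done inside dplyr verbs with tidyselect column selection.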

Example within `summarise()`

library(dplyr)
library(pwiser)
library(palmerpenguins)

penguins <- na.omit(penguins)

pairwise() respects grouped dataframes:

# When using `pairwise()` within `summarise()`, the function(s) applied should
# return an output of length 1 (for each group). (You could instead wrap the result in `list()` to get a list column.)
cor_p_value <- function(x, y){
  stats::cor.test(x, y)$p.value
}

penguins %>% 
  group_by(species) %>% 
  summarise(pairwise(contains("_mm"), 
                     cor_p_value, 
                     .is_commutative = TRUE),
            n = n())
#> # A tibble: 3 x 5
#>   species  bill_length_mm_bill~ bill_length_mm_flipp~ bill_depth_mm_flipp~     n
#>   <fct>                   <dbl>                 <dbl>                <dbl> <int>
#> 1 Adelie               1.51e- 6              4.18e- 5             1.34e- 4   146
#> 2 Chinstr~             1.53e- 9              4.92e- 5             2.16e- 7    68
#> 3 Gentoo               7.34e-16              1.80e-16             1.40e-19   119

Setting .is_commutative = TRUE can save time on redundant calculations.
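A function is commutative in this sense when swapping its two arguments does not change the result. The base R sketch below, which uses the built-in `mtcars` data rather than anything from pwiser, shows why the reversed pair is redundant for a correlation but not for a ratio.

```r
x <- mtcars$mpg
y <- mtcars$wt

# cor() is symmetric in its arguments, so the reversed pair adds nothing new
cor_is_symmetric <- isTRUE(all.equal(cor(x, y), cor(y, x)))

# Division is not symmetric, so both orderings carry information
ratio_is_symmetric <- isTRUE(all.equal(x / y, y / x))

c(cor = cor_is_symmetric, ratio = ratio_is_symmetric)
```

For functions like `/` or `-`, leave .is_commutative at FALSE so that both orderings are kept.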

Equivalently, you could write this with .x and .y in a lambda function:

penguins %>% 
  group_by(species) %>% 
  summarise(pairwise(contains("_mm"), 
                     ~stats::cor.test(.x, .y)$p.value, 
                     .is_commutative = TRUE),
            n = n())

Example within `mutate()`

You can apply multiple functions via a named list:

penguins %>% 
  mutate(pairwise(contains("_mm"), 
                  list(ratio = `/`, difference = `-`),
                  .names = "features_{.fn}_{.col_x}_{.col_y}")) %>% 
  glimpse()
#> Rows: 333
#> Columns: 20
#> $ species                                              <fct> Adelie, Adelie, A~
#> $ island                                               <fct> Torgersen, Torger~
#> $ bill_length_mm                                       <dbl> 39.1, 39.5, 40.3,~
#> $ bill_depth_mm                                        <dbl> 18.7, 17.4, 18.0,~
#> $ flipper_length_mm                                    <int> 181, 186, 195, 19~
#> $ body_mass_g                                          <int> 3750, 3800, 3250,~
#> $ sex                                                  <fct> male, female, fem~
#> $ year                                                 <int> 2007, 2007, 2007,~
#> $ features_ratio_bill_length_mm_bill_depth_mm          <dbl> 2.090909, 2.27011~
#> $ features_difference_bill_length_mm_bill_depth_mm     <dbl> 20.4, 22.1, 22.3,~
#> $ features_ratio_bill_length_mm_flipper_length_mm      <dbl> 0.2160221, 0.2123~
#> $ features_difference_bill_length_mm_flipper_length_mm <dbl> -141.9, -146.5, -~
#> $ features_ratio_bill_depth_mm_bill_length_mm          <dbl> 0.4782609, 0.4405~
#> $ features_difference_bill_depth_mm_bill_length_mm     <dbl> -20.4, -22.1, -22~
#> $ features_ratio_bill_depth_mm_flipper_length_mm       <dbl> 0.10331492, 0.093~
#> $ features_difference_bill_depth_mm_flipper_length_mm  <dbl> -162.3, -168.6, -~
#> $ features_ratio_flipper_length_mm_bill_length_mm      <dbl> 4.629156, 4.70886~
#> $ features_difference_flipper_length_mm_bill_length_mm <dbl> 141.9, 146.5, 154~
#> $ features_ratio_flipper_length_mm_bill_depth_mm       <dbl> 9.679144, 10.6896~
#> $ features_difference_flipper_length_mm_bill_depth_mm  <dbl> 162.3, 168.6, 177~

Use .names to customize the names of the output columns.
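As a rough illustration of how the .names spec in the example above expands, the base R sketch below fills in the three placeholders by plain string substitution. `fill_spec()` is a hypothetical helper written for this illustration; it is not part of pwiser or dplyr, and the package's own name handling may work differently.

```r
# fill_spec() is a hypothetical helper, not part of pwiser or dplyr.
# It substitutes the function name and the two column names into the spec.
fill_spec <- function(spec, fn, col_x, col_y) {
  out <- gsub("{.fn}", fn, spec, fixed = TRUE)
  out <- gsub("{.col_x}", col_x, out, fixed = TRUE)
  gsub("{.col_y}", col_y, out, fixed = TRUE)
}

first_name <- fill_spec("features_{.fn}_{.col_x}_{.col_y}",
                        fn = "ratio",
                        col_x = "bill_length_mm",
                        col_y = "bill_depth_mm")
first_name
#> [1] "features_ratio_bill_length_mm_bill_depth_mm"
```

This matches the first new column in the glimpse() output above.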

Installation

Install from GitHub with:

# install.packages("devtools")
devtools::install_github("brshallo/pwiser")

Computation Speed

For problems with lots of data you should use more efficient approaches.

Matrix operations (compared to dataframes) are much more computationally efficient for problems involving combinations (which can get big very quickly). We’ve done nothing to optimize the computation of functions run through pwiser.

For example, when calculating Pearson correlations, pairwise() calculates the correlation separately for each pair, whereas stats::cor() (or corrr::correlate(), which calls cor() under the hood) uses R’s matrix operations to calculate all correlations simultaneously.
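The difference between the two strategies can be sketched in base R on a small built-in dataset. The column subset of `mtcars` is an arbitrary choice for illustration, and the per-pair loop only mimics the approach described above; it is not pwiser's code.

```r
# One cor() call on the whole data frame versus one cor() call per pair.
num_cols <- mtcars[, c("mpg", "disp", "hp", "wt")]

# All correlations at once, returned as a 4 x 4 matrix
cor_matrix <- cor(num_cols)

# One call per unordered pair of columns
col_pairs <- combn(names(num_cols), 2)
cor_by_pair <- apply(col_pairs, 2, function(p) {
  cor(num_cols[[p[1]]], num_cols[[p[2]]])
})

# Look up the same pairs in the matrix by (row name, column name)
cor_from_matrix <- unname(cor_matrix[t(col_pairs)])

# Both routes give the same numbers; only the amount of work differs
all.equal(unname(cor_by_pair), cor_from_matrix)
```

With 4 columns that is 6 separate calls against 1; the gap grows quadratically with the number of columns.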

library(modeldata)

data(cells)
cells_numeric <- select(cells, where(is.numeric))

dim(cells_numeric)
#> [1] 2019   56

Let’s do a speed test using the 56 numeric columns from the cells dataset imported from {modeldata}, which means 1540 pairwise combinations or 3080 permutations.
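The pair counts for 56 columns can be checked directly in base R:

```r
n_cols <- 56

n_combinations <- choose(n_cols, 2)      # unordered pairs
n_permutations <- n_cols * (n_cols - 1)  # ordered pairs

c(combinations = n_combinations, permutations = n_permutations)
#> combinations permutations 
#>         1540         3080
```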

library(corrr)
if (!requireNamespace("dplyover")) devtools::install_github('TimTeaFan/dplyover')
library(dplyover)

set.seed(123)

microbenchmark::microbenchmark(
  cor = cor(cells_numeric),
  correlate = correlate(cells_numeric),
  colpair_map = colpair_map(cells_numeric, cor),
  pairwise = summarise(cells_numeric, pairwise(where(is.numeric), cor, .is_commutative = TRUE)),
  dplyover = summarise(cells_numeric, across2x(where(is.numeric), where(is.numeric), cor, .comb = "minimal")),
  times = 10L,
  unit = "ms")
#> Unit: milliseconds
#>         expr       min        lq       mean     median        uq       max
#>          cor    5.0872    5.2673    6.04834    5.84305    6.7173    7.5421
#>    correlate   40.2647   43.8159   47.96368   46.35755   50.4519   61.3870
#>  colpair_map  609.4319  621.8832  675.88510  658.82900  673.0988  885.6218
#>     pairwise  232.2557  239.8043  278.56514  256.20245  289.0009  439.8503
#>     dplyover 1393.0210 1411.3117 1719.47016 1722.58500 1849.7877 2223.6372
#>  neval  cld
#>     10 a   
#>     10 a   
#>     10   c 
#>     10  b  
#>     10    d

The stats::cor() and corrr::correlate() approaches are many times faster than using pairwise(). However, pairwise() still only takes about a quarter of a second (median of roughly 256 ms) to calculate 1540 correlations in this case. Hence, on relatively constrained problems pairwise() is still quite usable, though there are many cases where you should go for a matrix-based solution.

pairwise() seems to be faster than corrr::colpair_map() (a more apples-to-apples comparison as both can handle arbitrary functions), though much of this speed difference goes away when .is_commutative = FALSE.

pairwise() (at the moment) seems to also be faster than running the equivalent operation with dplyover::across2x().

Session info

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyover_0.0.8.9000  corrr_0.4.3          modeldata_0.1.0     
#> [4] palmerpenguins_0.1.0 pwiser_0.0.1.9000    dplyr_1.0.6         
#> 
#> loaded via a namespace (and not attached):
#>  [1] pillar_1.6.1         compiler_3.5.1       tools_3.5.1         
#>  [4] digest_0.6.27        lattice_0.20-35      evaluate_0.14       
#>  [7] lifecycle_1.0.0      tibble_3.1.2         gtable_0.3.0        
#> [10] pkgconfig_2.0.3      rlang_0.4.11         Matrix_1.2-14       
#> [13] DBI_1.1.1            cli_2.5.0            rstudioapi_0.13     
#> [16] microbenchmark_1.4-7 yaml_2.2.1           mvtnorm_1.1-1       
#> [19] xfun_0.23            stringr_1.4.0        knitr_1.33          
#> [22] generics_0.1.0       vctrs_0.3.8          grid_3.5.1          
#> [25] tidyselect_1.1.1     glue_1.4.2           R6_2.5.0            
#> [28] fansi_0.5.0          survival_3.1-12      rmarkdown_2.8       
#> [31] multcomp_1.4-17      TH.data_1.0-10       purrr_0.3.4         
#> [34] ggplot2_3.3.3        magrittr_2.0.1       codetools_0.2-15    
#> [37] MASS_7.3-50          splines_3.5.1        scales_1.1.1        
#> [40] ellipsis_0.3.2       htmltools_0.5.1.1    assertthat_0.2.1    
#> [43] colorspace_2.0-1     sandwich_3.0-1       utf8_1.2.1          
#> [46] stringi_1.6.2        munsell_0.5.0        crayon_1.4.1        
#> [49] zoo_1.8-9

Limitations

See issue #1 for notes on limitations of the current set-up.
