More efficient pivot_longer() #1392

Merged: 7 commits into tidyverse:main, Nov 3, 2022


mgirlich
Contributor

@mgirlich mgirlich commented Sep 5, 2022

library(tidyr)

relig_income_long <- vctrs::vec_rep(relig_income, 10e3)

bench::mark(
  fastest = pivot_longer(relig_income_long, !religion, names_to = "income", values_to = "count"),
  # fastest = pivot_longer(relig_income_long, !religion, names_to = "income", values_to = "count", cols_vary = "fastest"),
  # slowest = pivot_longer(relig_income_long, !religion, names_to = "income", values_to = "count", cols_vary = "slowest"),
  drop_na = pivot_longer(relig_income_long, !religion, names_to = "income", values_to = "count", values_drop_na = TRUE),
  check = FALSE
)
# CRAN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 fastest      43.3ms   57.8ms     13.5     98.6MB     21.2
#> 2 drop_na     126.8ms  135.1ms      7.35   158.6MB     22.0

# MAIN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_na     139.3ms  143.7ms      6.87   126.1MB     18.9
#> 2 fastest      35.1ms   44.1ms     17.9     62.5MB     15.9
#> 3 slowest      28.3ms     37ms     22.7     55.6MB     26.4

# VEC_UNCHOP()
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 2 fastest      36.6ms   46.4ms     17.3     62.5MB     15.4
#> 3 slowest      28.4ms   38.4ms     21.9     55.6MB     25.6

# 
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_na      44.8ms   44.8ms      22.3    71.2MB    268. 

fish_encounters_long <- vctrs::vec_rbind(!!!vctrs::vec_rep(list(fish_encounters), 1e3), .names_to = "id")

bench::mark(
  basic = pivot_wider(fish_encounters_long, names_from = station, values_from = seen),
  values_fn = pivot_wider(fish_encounters_long, names_from = station, values_from = seen, values_fn = identity),
  check = FALSE
)
# CRAN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 basic        11.5ms   12.6ms     71.3     11.6MB     7.92
#> 2 values_fn   122.5ms    130ms      7.63    20.8MB    15.3

# This PR - No effect from replacing `vec_c()` with `vec_unchop()`?
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 basic          12ms   13.4ms     66.7     11.6MB     7.85
#> 2 values_fn     115ms  126.9ms      7.51    19.9MB    15.0

Created on 2022-09-05 with reprex v2.0.2
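
For readers unfamiliar with the `cols_vary` argument that appears in the commented-out benchmark lines above, here is a minimal sketch of what the two settings do to row order. It assumes a tidyr version that supports `cols_vary`; the toy tibble `df` is made up for illustration and is not part of the benchmark.

```r
library(tidyr)

# Toy input (hypothetical): two rows, two columns to pivot.
df <- tibble::tibble(id = 1:2, x = c("a", "b"), y = c("c", "d"))

# "fastest" (the default) keeps values from the same input row together.
fastest <- pivot_longer(df, c(x, y), cols_vary = "fastest")

# "slowest" keeps values from the same input column together.
slowest <- pivot_longer(df, c(x, y), cols_vary = "slowest")

fastest$name
slowest$name
```

With `"slowest"` the output is effectively the pivoted columns stacked one after another, which may be why it benchmarks slightly faster above.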

@hadley
Member

hadley commented Oct 20, 2022

Let's reconsider the vec_assign() early exit in vctrs.

@hadley hadley closed this Oct 20, 2022
@DavisVaughan
Member

I would have actually recommended the usage of vec_any_missing() here, which is what this PR is pushing towards.

99% of the time, when I avoid a vec_assign() call by checking vec_any_missing() first, I ALSO get to avoid a vec_detect_missing() call (or something similar) that would generate the location vector.

i.e. this pattern

if (vec_any_missing(x)) {
  loc <- vec_detect_missing(x)
  out <- vec_assign(out, loc, 0)
}

So even if we optimize vec_assign() to add an early exit (which I am still not convinced is worth the added complexity), I'd probably still make the changes proposed in this PR and use vec_any_missing() to avoid the call entirely, since that also avoids vec_detect_missing().
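
A minimal, self-contained sketch of that guard, assuming a vctrs version that provides `vec_any_missing()` and `vec_detect_missing()`. The helper name `replace_missing()` is hypothetical and only for illustration; it is not tidyr's actual implementation.

```r
library(vctrs)

# Hypothetical helper, for illustration only (not tidyr's implementation).
replace_missing <- function(x, value) {
  # Cheap check first: returns a single TRUE/FALSE without building
  # a full-length location vector.
  if (!vec_any_missing(x)) {
    # Nothing to replace: skip both the detection pass and the assignment.
    return(x)
  }

  # Only now pay for the logical location vector and the assignment.
  loc <- vec_detect_missing(x)
  vec_assign(x, loc, value)
}

with_na <- replace_missing(c(1, NA, 3), 0)
without_na <- replace_missing(c(1, 2, 3), 0)
```

When `x` has no missing values, the function returns the input untouched, so neither `vec_detect_missing()` nor `vec_assign()` allocates anything.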

@hadley hadley reopened this Oct 31, 2022
Merge commit '803a02a381a28eee890caa5ed2924858225e0fa0'

#Conflicts:
#	R/pivot-long.R
#	R/replace_na.R
#	R/unnest-longer.R
#	R/utils.R
@DavisVaughan
Member

Benchmark extracted from the original comment, mostly related to values_drop_na = TRUE

library(tidyr)

relig_income_long <- vctrs::vec_rep(relig_income, 10e3)

bench::mark(
  drop_na = pivot_longer(relig_income_long, !religion, names_to = "income", values_to = "count", values_drop_na = TRUE),
  iterations = 100
)

# Main - More memory allocs, more work, more gc
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_na       110ms    182ms      5.51     126MB     15.1

# This PR - way less memory usage due to not slicing/detecting missing values when not needed
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_na      47.4ms   49.6ms      20.1    64.5MB     97.1

@DavisVaughan
Member

Another one for replace_na(). Here we replace missing values in the numeric columns of flights with 1L. There are 14 columns but not all of them have missing values, so for many of the columns we can skip some of the work.

library(tidyr)
library(dplyr, warn.conflicts = FALSE)
library(nycflights13)

names <- names(select(flights, where(is.numeric)))
replace <- as.list(rlang::rep_named(names, 1L))

# replacing in 14 columns, but not all have missing values
length(names)
#> [1] 14

bench::mark(replace_na(flights, replace), iterations = 100)

# Main
#> # A tibble: 1 × 6
#>   expression                        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 replace_na(flights, replace)   19.2ms   21.3ms      43.4    44.4MB     209.

# This PR
#> # A tibble: 1 × 6
#>   expression                        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 replace_na(flights, replace)   10.2ms   11.1ms      83.8    17.4MB     41.3

Created on 2022-11-03 with reprex v2.0.2.9000
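
To see which columns can skip the replacement work, a per-column check like the one below can help. This is a rough sketch on a toy data frame standing in for `flights` (an assumption, so that it runs without nycflights13); it again assumes a vctrs version with `vec_any_missing()`.

```r
library(vctrs)

# Toy data frame standing in for `flights` (made up for illustration).
df <- data.frame(
  a = c(1L, NA, 3L),
  b = c(4L, 5L, 6L),
  c = c(NA, NA, 9L)
)

# One logical per column: TRUE if the column has at least one missing value.
has_missing <- vapply(df, vec_any_missing, logical(1))
has_missing

# Only these columns need any replacement work; the rest can be left as is.
needs_work <- names(df)[has_missing]
needs_work
```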

@DavisVaughan DavisVaughan merged commit f6b9509 into tidyverse:main Nov 3, 2022
@DavisVaughan
Member

Thanks!

@mgirlich mgirlich deleted the pivot_longer-efficiency branch January 2, 2023 08:40