More efficient `pivot_longer()` #1392

mgirlich · 2022-09-05T06:38:41Z

Replace vec_c(!!!x) with vec_unchop(). It is supposed to be more efficient, although I didn't see any effect in the benchmarks below (see the comment on purrr#894, "Apply vctrs principles to map() and modify()").
Avoid vec_slice() if no value is missing, because vec_slice() does not have a fast path (yet?).
Add notes where vec_any_missing() would be useful.
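
For readers unfamiliar with the two ways of combining, here is a small sketch of the first change. It assumes a vctrs version that exports list_unchop() (the name a later commit in this PR switches to); the input list `x` is made up for illustration and this is not the tidyr code itself.

```r
library(vctrs)

# Hypothetical input: a list of vectors that should become one vector.
x <- list(1:3, 4:6, 7:9)

# Splice the list elements into vec_c() as individual arguments ...
combined_c <- vec_c(!!!x)

# ... or hand the list over directly, without splicing.
combined_unchop <- list_unchop(x)

# Both approaches should give the same combined vector.
identical(combined_c, combined_unchop)
```

Whether the second form is measurably faster depends on the input; the benchmarks in this PR did not show a difference.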

library(tidyr)

relig_income_long <- vctrs::vec_rep(relig_income, 10e3)

bench::mark(
  fastest = pivot_longer(relig_income_long, !religion, names_to = "income", values_to = "count"),
  # fastest = pivot_longer(relig_income_long, !religion, names_to = "income", values_to = "count", cols_vary = "fastest"),
  # slowest = pivot_longer(relig_income_long, !religion, names_to = "income", values_to = "count", cols_vary = "slowest"),
  drop_na = pivot_longer(relig_income_long, !religion, names_to = "income", values_to = "count", values_drop_na = TRUE),
  check = FALSE
)
# CRAN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 2 drop_na     126.8ms  135.1ms      7.35   158.6MB     22.0
#> 1 fastest      43.3ms   57.8ms     13.5     98.6MB     21.2

# MAIN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_na     139.3ms  143.7ms      6.87   126.1MB     18.9
#> 2 fastest      35.1ms   44.1ms     17.9     62.5MB     15.9
#> 3 slowest      28.3ms     37ms     22.7     55.6MB     26.4

# VEC_UNCHOP()
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 2 fastest      36.6ms   46.4ms     17.3     62.5MB     15.4
#> 3 slowest      28.4ms   38.4ms     21.9     55.6MB     25.6

# 
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_na      44.8ms   44.8ms      22.3    71.2MB    268. 

fish_encounters_long <- vctrs::vec_rbind(!!!vctrs::vec_rep(list(fish_encounters), 1e3), .names_to = "id")

bench::mark(
  basic = pivot_wider(fish_encounters_long, names_from = station, values_from = seen),
  values_fn = pivot_wider(fish_encounters_long, names_from = station, values_from = seen, values_fn = identity),
  check = FALSE
)
# CRAN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 basic        11.5ms   12.6ms     71.3     11.6MB     7.92
#> 2 values_fn   122.5ms    130ms      7.63    20.8MB    15.3

# This PR - No effect by replacing `vec_c()` with `vec_unchop()`?
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 basic          12ms   13.4ms     66.7     11.6MB     7.85
#> 2 values_fn     115ms  126.9ms      7.51    19.9MB    15.0

Created on 2022-09-05 with reprex v2.0.2

hadley · 2022-10-20T12:30:58Z

Let's reconsider the vec_assign() early exit in vctrs.

DavisVaughan · 2022-10-31T17:41:20Z

I would actually have recommended using vec_any_missing() here, which is what this PR is pushing towards.

99% of the time, when I avoid a vec_assign() call by checking vec_any_missing() first, I ALSO get to avoid a vec_detect_missing() call (or something similar) to generate the location vector.

i.e. this pattern:

if (vec_any_missing(x)) {
  loc <- vec_detect_missing(x)
  out <- vec_assign(out, loc, 0)
}

So even if we optimize vec_assign() with an early exit (which I am still not convinced is worth the added complexity), I'd probably still make the changes proposed in this PR and use vec_any_missing() to avoid the call entirely, since that also avoids vec_detect_missing().
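
A minimal sketch of the guarded pattern next to the unguarded one, assuming a vctrs version that exports vec_any_missing() and vec_detect_missing(). The helper names replace_unguarded() and replace_guarded() are hypothetical and exist only for this illustration; they are not tidyr or vctrs functions.

```r
library(vctrs)

# Hypothetical helper: always builds the logical location vector and
# calls vec_assign(), even when `x` has no missing values.
replace_unguarded <- function(x, value) {
  loc <- vec_detect_missing(x)
  vec_assign(x, loc, value)
}

# Hypothetical helper: checks first, so both vec_detect_missing() and
# vec_assign() are skipped when nothing is missing.
replace_guarded <- function(x, value) {
  if (vec_any_missing(x)) {
    loc <- vec_detect_missing(x)
    x <- vec_assign(x, loc, value)
  }
  x
}

with_na <- c(1, NA, 3)
without_na <- c(1, 2, 3)

# Same result when something is missing ...
identical(replace_guarded(with_na, 0), replace_unguarded(with_na, 0))

# ... and the input comes back unchanged when nothing is.
identical(replace_guarded(without_na, 0), without_na)
```

The point of the guard is less the skipped vec_assign() than the skipped location vector, which is an allocation as long as the input.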

DavisVaughan · 2022-11-03T19:53:43Z

Benchmark extracted from the original comment, mostly related to values_drop_na = TRUE

library(tidyr)

relig_income_long <- vctrs::vec_rep(relig_income, 10e3)

bench::mark(
  drop_na = pivot_longer(relig_income_long, !religion, names_to = "income", values_to = "count", values_drop_na = TRUE),
  iterations = 100
)

# Main - More memory allocs, more work, more gc
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_na       110ms    182ms      5.51     126MB     15.1

# This PR - way less memory usage due to not slicing/detecting missing values when not needed
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_na      47.4ms   49.6ms      20.1    64.5MB     97.1
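
The saving described above can be sketched roughly as follows. drop_missing_rows() is a hypothetical helper, not the actual tidyr internals; it only illustrates skipping vec_detect_missing() and vec_slice() when there is nothing to drop, and it assumes a vctrs version that exports vec_any_missing().

```r
library(vctrs)

# Hypothetical helper: drop missing rows, but only do the work if needed.
# For data frames, vctrs treats a row as missing only when every column
# in that row is missing.
drop_missing_rows <- function(x) {
  # Nothing to drop: return the input as is, with no slicing at all.
  if (!vec_any_missing(x)) {
    return(x)
  }
  vec_slice(x, !vec_detect_missing(x))
}

complete <- data.frame(a = 1:3, b = c("x", "y", "z"))
partial <- data.frame(a = c(1L, NA, 3L), b = c("x", NA, "z"))

nrow(drop_missing_rows(complete))
nrow(drop_missing_rows(partial))
```

In the benchmark data no value is missing, so the guarded path returns early on every call, which is consistent with the drop in memory allocation reported above.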

DavisVaughan · 2022-11-03T19:59:21Z

Another one for replace_na(). Here we replace missing values in the numeric columns of flights with 1L. There are 14 columns but not all of them have missing values, so for many of the columns we can skip some of the work.

library(tidyr)
library(dplyr, warn.conflicts = FALSE)
library(nycflights13)

names <- names(select(flights, where(is.numeric)))
replace <- as.list(rlang::rep_named(names, 1L))

# replacing in 14 columns, but not all have missing values
length(names)
#> [1] 14

bench::mark(replace_na(flights, replace), iterations = 100)

# Main
#> # A tibble: 1 × 6
#>   expression                        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 replace_na(flights, replace)   19.2ms   21.3ms      43.4    44.4MB     209.

# This PR
#> # A tibble: 1 × 6
#>   expression                        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 replace_na(flights, replace)   10.2ms   11.1ms      83.8    17.4MB     41.3

Created on 2022-11-03 with reprex v2.0.2.9000

DavisVaughan · 2022-11-03T20:18:40Z

Thanks!

mgirlich added 3 commits September 4, 2022 12:11

Replace vec_c() by vec_unchop()

12bf490

Only use vec_slice() if a value is missing

bd49a84

Add reminders to use vec_any_missing()

020607c

hadley mentioned this pull request Oct 20, 2022

Early exit for empty index in vec_assign() r-lib/vctrs#1590

Open

hadley closed this Oct 20, 2022

hadley reopened this Oct 31, 2022

DavisVaughan added 3 commits November 3, 2022 15:29

Merged main into mgirlich:pivot_longer-efficiency

a582716

Merge commit '803a02a381a28eee890caa5ed2924858225e0fa0' #Conflicts: # R/pivot-long.R # R/replace_na.R # R/unnest-longer.R # R/utils.R

vec_unchop() -> list_unchop()

5f83182

Use vec_any_missing() where it is easy and makes sense

eee54cf

NEWS bullets

74485c5

DavisVaughan merged commit f6b9509 into tidyverse:main Nov 3, 2022

mgirlich deleted the pivot_longer-efficiency branch January 2, 2023 08:40
