More efficient pivot_longer()
#1392
Let's reconsider the
I would have actually recommended the usage of 99% of the time, when I avoid a i.e. this pattern:
if (vec_any_missing(x)) {
  loc <- vec_detect_missing(x)
  out <- vec_assign(out, loc, 0)
}
So even if we optimize
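The guard pattern in this comment can be sketched as a small helper. This is only a minimal illustration, assuming vctrs >= 0.5.0 (which provides `vec_any_missing()` and `vec_detect_missing()`); the helper name `replace_missing()` is hypothetical and is not part of tidyr or vctrs.

```r
library(vctrs)

# Hypothetical helper: replace missing values in `x` with `value`,
# skipping the detect/assign work entirely when nothing is missing.
replace_missing <- function(x, value) {
  if (!vec_any_missing(x)) {
    # Fast path: no missing values, so return the input as-is
    return(x)
  }
  loc <- vec_detect_missing(x)  # logical vector marking the missing elements
  vec_assign(x, loc, value)     # returns a modified copy of `x`
}

replace_missing(c(1, NA, 3), 0)
replace_missing(c(1, 2, 3), 0)
```

The cheap `vec_any_missing()` check comes first so that columns without missing values never pay for the logical vector from `vec_detect_missing()` or the copy made by `vec_assign()`.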
Merge commit '803a02a381a28eee890caa5ed2924858225e0fa0' #Conflicts: # R/pivot-long.R # R/replace_na.R # R/unnest-longer.R # R/utils.R
Benchmark extracted from the original comment, mostly related to
library(tidyr)
relig_income_long <- vctrs::vec_rep(relig_income, 10e3)
bench::mark(
drop_na = pivot_longer(relig_income_long, !religion, names_to = "income", values_to = "count", values_drop_na = TRUE),
iterations = 100
)
# Main - More memory allocs, more work, more gc
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 drop_na 110ms 182ms 5.51 126MB 15.1
# This PR - way less memory usage due to not slicing/detect missing when not needed
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 drop_na 47.4ms 49.6ms 20.1 64.5MB 97.1
Another one for
library(tidyr)
library(dplyr, warn.conflicts = FALSE)
library(nycflights13)
names <- names(select(flights, where(is.numeric)))
replace <- as.list(rlang::rep_named(names, 1L))
# replacing in 14 columns, but not all have missing values
length(names)
#> [1] 14
bench::mark(replace_na(flights, replace), iterations = 100)
# Main
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 replace_na(flights, replace) 19.2ms 21.3ms 43.4 44.4MB 209.
# This PR
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 replace_na(flights, replace) 10.2ms 11.1ms 83.8 17.4MB 41.3
Created on 2022-11-03 with reprex v2.0.2.9000
Thanks!
- Replace vec_c(!!!x) by vec_unchop(). It is supposed to be more efficient, although I didn't see any effect. (Apply vctrs principles to map() and modify() purrr#894 (comment))
- Skip vec_slice() if no value is missing, because vec_slice() does not have a fast path (yet?)
- vec_any_missing() would be useful

Created on 2022-09-05 with reprex v2.0.2
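As a rough sketch of the points listed in this description: the helper below drops missing values but skips `vec_slice()` when nothing is missing, and the last lines check that combining a list of vectors gives the same result with either function. It assumes vctrs >= 0.5.0, where `vec_unchop()` goes by the name `list_unchop()`. The helper `drop_missing()` is hypothetical and is not tidyr's actual implementation.

```r
library(vctrs)

# Hypothetical helper: drop missing elements, but only slice when needed.
drop_missing <- function(x) {
  if (!vec_any_missing(x)) {
    # Nothing to drop, so avoid the copy that vec_slice() would make
    return(x)
  }
  vec_slice(x, !vec_detect_missing(x))
}

drop_missing(c("a", NA, "c"))

# Combining a list of vectors: splicing into vec_c() vs. list_unchop()
pieces <- list(1:2, 3L, 4:6)
identical(vec_c(!!!pieces), list_unchop(pieces))
```

Both ways of combining should produce the same vector; any performance difference would have to be measured, for example with `bench::mark()` as in the benchmarks in this thread.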