group_by() + slice_max() quite slow #216
Probably because it uses:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
DT <- rbindlist(rep(list(mtcars), 1000), idcol = "id")
DF <- as_tibble(DT)
DT %>%
group_by(id) %>%
slice_max(mpg, n = 2) %>%
show_query()
#> `_DT1`[, .SD[order(mpg, decreasing = TRUE)][frankv(-mpg, ties.method = "min",
#> na.last = "keep") <= 2L], keyby = .(id)]

Created on 2021-03-04 by the reprex package (v1.0.0)
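The generated query sorts and ranks .SD within every group. For reference, a plainer handwritten data.table spelling of "top two rows per group" avoids the ranking step entirely (a sketch; note that head() drops slice_max()'s with_ties behaviour):

library(data.table)
DT <- rbindlist(rep(list(mtcars), 1000), idcol = "id")

# order once, then take the first two rows of each group
DT[order(-mpg), head(.SD, 2), by = id]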
Benchmarking the dplyr and dtplyr versions against a handwritten data.table query built on .I:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
DT <- rbindlist(rep(list(mtcars), 1000), idcol = "id")
DF <- as_tibble(DT)
bench::mark(
df = DF %>%
group_by(id) %>%
slice_max(mpg, n = 2) %>%
ungroup(),
dt = DT %>%
group_by(id) %>%
slice_max(mpg, n = 2) %>%
collect(),
I = DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
iterations = 2,
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 df 72.8ms 74.9ms 13.4 4.46MB 26.7
#> 2 dt 734.1ms 741.7ms 1.35 70.79MB 17.5
#> 3 I 388.7ms 398.4ms 2.51 33.6MB 17.6

Created on 2021-03-05 by the reprex package (v1.0.0)

profvis::profvis({
  DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1]
})
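The .I idiom computes row numbers per group in an inner query and then subsets the table once, instead of materializing a .SD copy of every group. A minimal illustration on a toy table, using the documented no-i form:

library(data.table)
toy <- data.table(id = c(1L, 1L, 2L, 2L), x = c(3, 5, 2, 8))

# inner query: for each id, the row number (within toy) of the largest x
idx <- toy[, .I[which.max(x)], by = id]$V1
toy[idx]  # a single subset of the whole table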
There is indeed an open issue with frankv()'s performance. Swapping in base rank() makes a large difference when there are many small groups:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
DT <- rbindlist(rep(list(mtcars), 1e3), idcol = "id")
DF <- as_tibble(DT)
# many small groups
bench::mark(
dplyr = DF %>%
group_by(id) %>%
slice_max(mpg, n = 2) %>%
ungroup(),
dtplyr = DT %>%
group_by(id) %>%
slice_max(mpg, n = 2) %>%
collect(),
I_frankv = DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
I_rank = DT[DT[order(mpg, decreasing = TRUE), .I[rank(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
iterations = 2,
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 76.4ms 77.9ms 12.8 4.46MB 32.1
#> 2 dtplyr 697.5ms 710.3ms 1.41 70.79MB 17.6
#> 3 I_frankv 377.4ms 380.6ms 2.63 33.6MB 19.7
#> 4 I_rank 19ms 21ms 47.6 1.87MB 23.8
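Since the benchmark runs with check = FALSE, it is worth a quick sanity check (a sketch) that base rank() agrees with frankv() under ties.method = "min" and na.last = "keep"; this should return TRUE:

x <- c(mtcars$mpg, NA)  # include an NA to exercise na.last = "keep"
all.equal(
  rank(-x, ties.method = "min", na.last = "keep"),
  data.table::frankv(-x, ties.method = "min", na.last = "keep")
)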
# few large groups
DT <- rbindlist(rep(list(mtcars), 1e6))
DT[, id := rep.int(1:4, times = .N / 4)]
DF <- as_tibble(DT)
bench::mark(
dplyr = DF %>%
group_by(id) %>%
slice_max(mpg, n = 2) %>%
ungroup(),
dtplyr = DT %>%
group_by(id) %>%
slice_max(mpg, n = 2) %>%
collect(),
I_frankv = DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
I_rank = DT[DT[order(mpg, decreasing = TRUE), .I[rank(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
iterations = 2,
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 7.36s 7.46s 0.134 3.17GB 0.268
#> 2 dtplyr 6.88s 7.28s 0.137 5.41GB 0.481
#> 3 I_frankv 4.64s 4.66s 0.215 2.07GB 0.107
#> 4 I_rank 5.2s 5.25s 0.191 2.55GB 0.191

Created on 2021-03-05 by the reprex package (v1.0.0)

So the performance benefits of rank() over frankv() depend on the group structure: rank() wins decisively with many small groups, while frankv() comes out slightly ahead with a few large groups.
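If this pattern comes up often, the idiom can be wrapped in a small helper. A hypothetical sketch (top_n_by() is not a dtplyr or data.table function; col and by are column names given as strings, and rows with an NA rank are dropped, so it only approximates slice_max()):

library(data.table)

# keep rows whose descending rank within each group is <= n,
# with slice_max()-style "min" tie handling
top_n_by <- function(dt, col, n, by) {
  idx <- dt[, .I[which(rank(-.SD[[1L]], ties.method = "min", na.last = "keep") <= n)],
            by = by, .SDcols = col]$V1
  dt[idx]
}

DT <- rbindlist(rep(list(mtcars), 10), idcol = "id")
top_n_by(DT, "mpg", 2L, "id")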
From the original report: the dplyr version is way faster than the dtplyr one.

Created on 2021-03-04 by the reprex package (v1.0.0)