group_by() + slice_max() quite slow #216

Closed
mgirlich opened this issue Mar 4, 2021 · 3 comments · Fixed by #217

Comments

@mgirlich
Collaborator

mgirlich commented Mar 4, 2021

The dplyr version is much faster than the dtplyr one (roughly 8x by median time in this benchmark):

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
DT <- rbindlist(rep(list(mtcars), 1000), idcol = "id")

DF <- as_tibble(DT)

bench::mark(
  df = DF %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2),
  dt = DT %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    collect(),
  iterations = 2,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 df           90.7ms   97.1ms     10.3     4.45MB     30.9
#> 2 dt          755.2ms  774.6ms      1.29   70.78MB     18.1

Created on 2021-03-04 by the reprex package (v1.0.0)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.3 (2020-10-10)
#>  os       macOS Big Sur 10.16         
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       UTC                         
#>  date     2021-03-04                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                          
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.2)                  
#>  backports     1.2.1      2020-12-09 [1] CRAN (R 4.0.2)                  
#>  bench         1.1.1      2020-01-13 [1] CRAN (R 4.0.2)                  
#>  cli           2.3.1      2021-02-23 [1] CRAN (R 4.0.3)                  
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.0.3)                  
#>  data.table  * 1.14.0     2021-02-21 [1] CRAN (R 4.0.3)                  
#>  DBI           1.1.1      2021-01-15 [1] CRAN (R 4.0.3)                  
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.2)                  
#>  dplyr       * 1.0.5      2021-02-25 [1] Github (tidyverse/dplyr@7a96866)
#>  dtplyr      * 1.1.0.9000 2021-03-04 [1] local                           
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)                  
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.1)                  
#>  fansi         0.4.2      2021-01-15 [1] CRAN (R 4.0.2)                  
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)                  
#>  generics      0.1.0      2020-10-31 [1] CRAN (R 4.0.2)                  
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)                  
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.2)                  
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.2)                  
#>  knitr         1.31       2021-01-27 [1] CRAN (R 4.0.3)                  
#>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.0.3)                  
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.2)                  
#>  pillar        1.5.0      2021-02-22 [1] CRAN (R 4.0.3)                  
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.2)                  
#>  profmem       0.6.0      2020-12-13 [1] CRAN (R 4.0.2)                  
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.2)                  
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.2)                  
#>  reprex        1.0.0      2021-01-27 [1] CRAN (R 4.0.2)                  
#>  rlang         0.4.10     2020-12-30 [1] CRAN (R 4.0.2)                  
#>  rmarkdown     2.7        2021-02-19 [1] CRAN (R 4.0.3)                  
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.2)                  
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.2)                  
#>  stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)                  
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.2)                  
#>  styler        1.3.2      2020-02-23 [1] CRAN (R 4.0.2)                  
#>  tibble        3.1.0      2021-02-25 [1] CRAN (R 4.0.2)                  
#>  tidyselect    1.1.0      2020-05-11 [1] CRAN (R 4.0.2)                  
#>  utf8          1.1.4      2018-05-24 [1] CRAN (R 4.0.2)                  
#>  vctrs         0.3.6.9000 2021-02-17 [1] Github (r-lib/vctrs@9af59e9)    
#>  withr         2.4.1      2021-01-26 [1] CRAN (R 4.0.2)                  
#>  xfun          0.21       2021-02-10 [1] CRAN (R 4.0.3)                  
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.2)                  
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
@hadley
Member

hadley commented Mar 4, 2021

Probably because the generated query uses .SD and not .I:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
DT <- rbindlist(rep(list(mtcars), 1000), idcol = "id")

DF <- as_tibble(DT)

DT %>% 
  group_by(id) %>% 
  slice_max(mpg, n = 2) %>% 
  show_query()
#> `_DT1`[, .SD[order(mpg, decreasing = TRUE)][frankv(-mpg, ties.method = "min", 
#>     na.last = "keep") <= 2L], keyby = .(id)]

Created on 2021-03-04 by the reprex package (v1.0.0)
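
To make the .SD versus .I distinction concrete, here is a minimal sketch on a small copy of the data. It reuses the frankv() filter from the query above but is hand-written, not dtplyr's actual translation; the names `via_SD`, `idx` and `via_I` are only for illustration. The .SD form builds a sub-table for every group, whereas the .I form only collects row numbers per group and subsets DT once at the end.

```r
library(data.table)

# Small stand-in for the benchmark data: 3 copies of mtcars, one group each
DT <- rbindlist(rep(list(mtcars), 3), idcol = "id")

# .SD form: builds a sub-data.table for each group, then filters it
via_SD <- DT[, .SD[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], by = id]

# .I form: collect the matching row numbers per group, then subset DT once
idx <- DT[, .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], by = id]$V1
via_I <- DT[idx]

# Both approaches should select the same rows
all.equal(via_SD, via_I)
```

With many small groups, the per-group overhead of constructing .SD is paid once per group, which is why the .I idiom is commonly suggested for this kind of filter.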

@mgirlich
Collaborator Author

mgirlich commented Mar 5, 2021

Using .I roughly doubles the speed, but it is still considerably slower than dplyr. The bottleneck seems to be frankv():

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
DT <- rbindlist(rep(list(mtcars), 1000), idcol = "id")
DF <- as_tibble(DT)

bench::mark(
  df = DF %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    ungroup(),
  dt = DT %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    collect(),
  I = DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
  iterations = 2,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 df           72.8ms   74.9ms     13.4     4.46MB     26.7
#> 2 dt          734.1ms  741.7ms      1.35   70.79MB     17.5
#> 3 I           388.7ms  398.4ms      2.51    33.6MB     17.6

Created on 2021-03-05 by the reprex package (v1.0.0)

profvis::profvis({DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1]})

[Image: profvis output for the expression above]

@mgirlich
Collaborator Author

mgirlich commented Mar 5, 2021

There is indeed an open issue about frank() by group: Rdatatable/data.table#3988.
The simplest solution seems to be to use rank() instead. Here are some benchmarks:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

DT <- rbindlist(rep(list(mtcars), 1e3), idcol = "id")
DF <- as_tibble(DT)

# many small groups
bench::mark(
  dplyr = DF %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    ungroup(),
  dtplyr = DT %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    collect(),
  I_frankv = DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
  I_rank = DT[DT[order(mpg, decreasing = TRUE), .I[rank(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
  iterations = 2,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr        76.4ms   77.9ms     12.8     4.46MB     32.1
#> 2 dtplyr      697.5ms  710.3ms      1.41   70.79MB     17.6
#> 3 I_frankv    377.4ms  380.6ms      2.63    33.6MB     19.7
#> 4 I_rank         19ms     21ms     47.6     1.87MB     23.8

# few large groups
DT <- rbindlist(rep(list(mtcars), 1e6))
DT[, id := rep.int(1:4, times = .N / 4)]
DF <- as_tibble(DT)

bench::mark(
  dplyr = DF %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    ungroup(),
  dtplyr = DT %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    collect(),
  I_frankv = DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
  I_rank = DT[DT[order(mpg, decreasing = TRUE), .I[rank(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
  iterations = 2,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr         7.36s    7.46s     0.134    3.17GB    0.268
#> 2 dtplyr        6.88s    7.28s     0.137    5.41GB    0.481
#> 3 I_frankv      4.64s    4.66s     0.215    2.07GB    0.107
#> 4 I_rank         5.2s    5.25s     0.191    2.55GB    0.191

Created on 2021-03-05 by the reprex package (v1.0.0)

The performance benefit of frankv() is relatively small even for large groups (8 million rows per group in this case), so I think it makes sense to switch to rank() here.
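
As a quick sanity check that the swap should not change results, the sketch below compares the two functions on a toy vector with a tie and a missing value, using the same arguments as the benchmarks. The vector `x` and the names `r_dt` and `r_base` are made up for illustration.

```r
library(data.table)

# Toy vector with a tie (21 appears twice) and a missing value
x <- c(21, NA, 33.9, 21, 30.4)

# Descending ranks via data.table and via base R, same arguments as in the benchmarks
r_dt   <- frankv(-x, ties.method = "min", na.last = "keep")
r_base <- rank(-x, ties.method = "min", na.last = "keep")

# Both should agree, with the NA kept as NA
identical(as.integer(r_dt), as.integer(r_base))
```

This only checks one small numeric input; it is not a substitute for the package's own tests.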

hadley pushed a commit that referenced this issue Mar 5, 2021
And ensure it works with character columns. Fixes #216. Fixes #218.
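
The commit itself is not shown in this thread. As a purely illustrative sketch of why character columns need care: unary minus is not defined for character vectors, so an expression like rank(-x) fails on them. One possible workaround, which is an assumption here and not necessarily what the fix does, is to map the values to sortable numbers with xtfrm() before negating. The names `x`, `neg_fails` and `desc_rank` are hypothetical.

```r
x <- c("b", "a", "c", "a")

# Negating a character vector is an error in R; capture it instead of stopping
neg_fails <- inherits(tryCatch(-x, error = function(e) e), "error")

# xtfrm() maps values to numbers that sort the same way, so they can be negated
# (assumed workaround, not taken from the referenced commit)
desc_rank <- rank(-xtfrm(x), ties.method = "min", na.last = "keep")
```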