group_by() + slice_max() quite slow #216

mgirlich · 2021-03-04T14:25:52Z

The dplyr version is way faster than the dtplyr one (about 8x in this benchmark):

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
DT <- rbindlist(rep(list(mtcars), 1000), idcol = "id")

DF <- as_tibble(DT)

bench::mark(
  df = DF %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2),
  dt = DT %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    collect(),
  iterations = 2,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 df           90.7ms   97.1ms     10.3     4.45MB     30.9
#> 2 dt          755.2ms  774.6ms      1.29   70.78MB     18.1

Created on 2021-03-04 by the reprex package (v1.0.0)

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.3 (2020-10-10)
#>  os       macOS Big Sur 10.16         
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       UTC                         
#>  date     2021-03-04                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                          
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.2)                  
#>  backports     1.2.1      2020-12-09 [1] CRAN (R 4.0.2)                  
#>  bench         1.1.1      2020-01-13 [1] CRAN (R 4.0.2)                  
#>  cli           2.3.1      2021-02-23 [1] CRAN (R 4.0.3)                  
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.0.3)                  
#>  data.table  * 1.14.0     2021-02-21 [1] CRAN (R 4.0.3)                  
#>  DBI           1.1.1      2021-01-15 [1] CRAN (R 4.0.3)                  
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.2)                  
#>  dplyr       * 1.0.5      2021-02-25 [1] Github (tidyverse/dplyr@7a96866)
#>  dtplyr      * 1.1.0.9000 2021-03-04 [1] local                           
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)                  
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.1)                  
#>  fansi         0.4.2      2021-01-15 [1] CRAN (R 4.0.2)                  
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)                  
#>  generics      0.1.0      2020-10-31 [1] CRAN (R 4.0.2)                  
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)                  
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.2)                  
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.2)                  
#>  knitr         1.31       2021-01-27 [1] CRAN (R 4.0.3)                  
#>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.0.3)                  
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.2)                  
#>  pillar        1.5.0      2021-02-22 [1] CRAN (R 4.0.3)                  
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.2)                  
#>  profmem       0.6.0      2020-12-13 [1] CRAN (R 4.0.2)                  
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.2)                  
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.2)                  
#>  reprex        1.0.0      2021-01-27 [1] CRAN (R 4.0.2)                  
#>  rlang         0.4.10     2020-12-30 [1] CRAN (R 4.0.2)                  
#>  rmarkdown     2.7        2021-02-19 [1] CRAN (R 4.0.3)                  
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.2)                  
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.2)                  
#>  stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)                  
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.2)                  
#>  styler        1.3.2      2020-02-23 [1] CRAN (R 4.0.2)                  
#>  tibble        3.1.0      2021-02-25 [1] CRAN (R 4.0.2)                  
#>  tidyselect    1.1.0      2020-05-11 [1] CRAN (R 4.0.2)                  
#>  utf8          1.1.4      2018-05-24 [1] CRAN (R 4.0.2)                  
#>  vctrs         0.3.6.9000 2021-02-17 [1] Github (r-lib/vctrs@9af59e9)    
#>  withr         2.4.1      2021-01-26 [1] CRAN (R 4.0.2)                  
#>  xfun          0.21       2021-02-10 [1] CRAN (R 4.0.3)                  
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.2)                  
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

hadley · 2021-03-04T14:46:42Z

Probably because it uses .SD and not .I:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
DT <- rbindlist(rep(list(mtcars), 1000), idcol = "id")

DF <- as_tibble(DT)

DT %>% 
  group_by(id) %>% 
  slice_max(mpg, n = 2) %>% 
  show_query()
#> `_DT1`[, .SD[order(mpg, decreasing = TRUE)][frankv(-mpg, ties.method = "min", 
#>     na.last = "keep") <= 2L], keyby = .(id)]

Created on 2021-03-04 by the reprex package (v1.0.0)
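
For readers unfamiliar with the two idioms, here is a minimal sketch of the difference on a toy table. The table and the variable names (`via_SD`, `rows`, `via_I`) are made up for illustration, and it takes the first two rows per group with `seq_len(2)` rather than the rank-based filter that dtplyr generates, so it is not the dtplyr translation itself.

```r
library(data.table)

# Toy data, hypothetical values
DT <- data.table(
  id  = c(1, 1, 1, 2, 2, 2),
  mpg = c(21, 30, 25, 18, 18, 33)
)

# .SD idiom: a sub-table is built and subset for every group
via_SD <- DT[, .SD[order(mpg, decreasing = TRUE)][seq_len(2)], by = id]

# .I idiom: only row numbers are computed per group,
# then the original table is subset once
rows  <- DT[, .I[order(mpg, decreasing = TRUE)][seq_len(2)], by = id]$V1
via_I <- DT[rows]
```

Both approaches should select the same rows here; the `.I` form just avoids materialising `.SD` for each group, which is usually where the time goes with many small groups.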

mgirlich · 2021-03-05T06:58:09Z

Using .I roughly doubles the speed, but it is still about 5x slower than dplyr. The issue seems to be frankv():

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
DT <- rbindlist(rep(list(mtcars), 1000), idcol = "id")
DF <- as_tibble(DT)

bench::mark(
  df = DF %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    ungroup(),
  dt = DT %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    collect(),
  I = DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
  iterations = 2,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 df           72.8ms   74.9ms     13.4     4.46MB     26.7
#> 2 dt          734.1ms  741.7ms      1.35   70.79MB     17.5
#> 3 I           388.7ms  398.4ms      2.51    33.6MB     17.6

Created on 2021-03-05 by the reprex package (v1.0.0)

profvis::profvis({DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1]})
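
A rough way to isolate the per-call cost outside of `[.data.table` is to call both ranking functions once per group on plain vectors. This is only a sketch: the `groups`, `r_frankv` and `r_rank` names are made up here, timings will vary by machine, and no particular result is implied.

```r
library(data.table)

# 1000 small groups of 32 values each, mirroring the benchmark data
groups <- split(
  rep(mtcars$mpg, 1000),
  rep(seq_len(1000), each = nrow(mtcars))
)

# One call per group, as happens when ranking inside `by`
system.time(
  r_frankv <- lapply(groups, function(v) {
    frankv(-v, ties.method = "min", na.last = "keep")
  })
)
system.time(
  r_rank <- lapply(groups, function(v) {
    rank(-v, ties.method = "min", na.last = "keep")
  })
)
```

If the two timings differ substantially while the ranks agree, that points at call overhead in frankv() for small inputs rather than at the grouping machinery.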

mgirlich · 2021-03-05T07:24:58Z

There is indeed an open issue with frank() by group: Rdatatable/data.table#3988
The simplest solution seems to be to use rank() instead. Here are some benchmarks:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

DT <- rbindlist(rep(list(mtcars), 1e3), idcol = "id")
DF <- as_tibble(DT)

# many small groups
bench::mark(
  dplyr = DF %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    ungroup(),
  dtplyr = DT %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    collect(),
  I_frankv = DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
  I_rank = DT[DT[order(mpg, decreasing = TRUE), .I[rank(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
  iterations = 2,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr        76.4ms   77.9ms     12.8     4.46MB     32.1
#> 2 dtplyr      697.5ms  710.3ms      1.41   70.79MB     17.6
#> 3 I_frankv    377.4ms  380.6ms      2.63    33.6MB     19.7
#> 4 I_rank         19ms     21ms     47.6     1.87MB     23.8

# few large groups
DT <- rbindlist(rep(list(mtcars), 1e6))
DT[, id := rep.int(1:4, times = .N / 4)]
DF <- as_tibble(DT)

bench::mark(
  dplyr = DF %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    ungroup(),
  dtplyr = DT %>% 
    group_by(id) %>% 
    slice_max(mpg, n = 2) %>% 
    collect(),
  I_frankv = DT[DT[order(mpg, decreasing = TRUE), .I[frankv(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
  I_rank = DT[DT[order(mpg, decreasing = TRUE), .I[rank(-mpg, ties.method = "min", na.last = "keep") <= 2L], keyby = .(id)]$V1],
  iterations = 2,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr         7.36s    7.46s     0.134    3.17GB    0.268
#> 2 dtplyr        6.88s    7.28s     0.137    5.41GB    0.481
#> 3 I_frankv      4.64s    4.66s     0.215    2.07GB    0.107
#> 4 I_rank         5.2s    5.25s     0.191    2.55GB    0.191

Created on 2021-03-05 by the reprex package (v1.0.0)

The performance benefits of frankv() are relatively small even for bigger groups (8 million rows per group in this case), so I think it makes sense to switch to rank() here.
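
To make the filter itself concrete, here is a small base R sketch of what `rank(-mpg, ties.method = "min", na.last = "keep") <= 2L` selects. The `mpg` vector is made-up data chosen to include a tie and an NA.

```r
# Hypothetical values with a tie for the maximum and one NA
mpg <- c(21.0, 33.9, 30.4, 33.9, NA, 15.2)

# Negating the values makes the largest value rank 1;
# "min" gives tied values the same (lowest) rank;
# na.last = "keep" leaves NA ranks as NA
r <- rank(-mpg, ties.method = "min", na.last = "keep")

# Keep everything ranked 1 or 2; which() drops the NA
top2 <- mpg[which(r <= 2L)]
```

Here both 33.9 values get rank 1 and 30.4 gets rank 3, so `top2` contains only the two tied maxima.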

And ensure it works with character columns. Fixes #216. Fixes #218.

mgirlich mentioned this issue Mar 5, 2021

slice_max() and slice_min() speed #217

Merged

hadley closed this as completed in #217 Mar 5, 2021

hadley pushed a commit that referenced this issue Mar 5, 2021

slice_max() and slice_min() speed (#217)

5a084b0

And ensure it works with character columns. Fixes #216. Fixes #218.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

group_by() + slice_max() quite slow #216

group_by() + slice_max() quite slow #216

mgirlich commented Mar 4, 2021

hadley commented Mar 4, 2021

mgirlich commented Mar 5, 2021

mgirlich commented Mar 5, 2021

group_by() + slice_max() quite slow #216

group_by() + slice_max() quite slow #216

Comments

mgirlich commented Mar 4, 2021

hadley commented Mar 4, 2021

mgirlich commented Mar 5, 2021

mgirlich commented Mar 5, 2021