
.SD[i] could be optimized better and more generally #4886

Open
myoung3 opened this issue Feb 1, 2021 · 11 comments

Comments

myoung3 (Contributor) commented Feb 1, 2021

Is it feasible to change how .SD[i] gets optimized?
Can x[,.SD[1L], by="g"] be optimized to x[x[,.I[1L],by="g"]$V1]?

Could we conceivably optimize all .SD[i], by calls to the nested form above? This would speed up not just the already-optimized .SD[1L] subsets but all arbitrary (unoptimized) .SD[i=<>], by subsets. See also tidyverse/dtplyr#176 for a more general example showing that the nested approach is much faster for unoptimized .SD[i] statements (~100x speedup with a large number of groups).

library(data.table)
n <- 1e6
ncols <- 200
x <- as.data.table(matrix(rnorm(n*ncols),ncol = ncols))
x[, g:=sample(1:(n/2),.N,replace=TRUE)] #~approx 2 rows per group

system.time(y_I <- x[x[,.I[1L],by="g"]$V1])
#>    user  system elapsed 
#>    0.70    0.59    0.93
system.time(y_SD <- x[,.SD[1L], by="g"])
#>    user  system elapsed 
#>    2.48    0.34    2.79
setcolorder(y_I, "g") #to match y_SD
identical(y_I, y_SD)
#> [1] TRUE


system.time(y_I <- x[x[,last(.I),by="g"]$V1])
#>    user  system elapsed 
#>    1.64    0.21    1.19
system.time(y_SD <- x[,last(.SD), by="g"])
#>    user  system elapsed 
#>    2.61    0.17    2.80
setcolorder(y_I, "g") #to match y_SD
identical(y_I, y_SD)
#> [1] TRUE


x[, g:=sample(1:(n/5),.N,replace=TRUE)] #~approx 5 rows per group

system.time(y_I <- x[x[,.I[1L],by="g"]$V1])
#>    user  system elapsed 
#>    0.28    0.12    0.33
system.time(y_SD <- x[,.SD[1L], by="g"])
#>    user  system elapsed 
#>    0.94    0.11    1.02
setcolorder(y_I, "g") #to match y_SD
identical(y_I, y_SD)
#> [1] TRUE


system.time(y_I <- x[x[,last(.I),by="g"]$V1])
#>    user  system elapsed 
#>    0.59    0.06    0.42
system.time(y_SD <- x[,last(.SD), by="g"])
#>    user  system elapsed 
#>    1.16    0.14    1.24
setcolorder(y_I, "g") #to match y_SD
identical(y_I, y_SD)
#> [1] TRUE

Created on 2021-02-01 by the reprex package (v0.3.0)

Abridged verbose output for .SD[1L]:

lapply optimization changed j from '.SD[1L]' to 'list(V1[1L], V2[1L], V3[1L], V4[1L], V5[1L], V6[1L], V7[1L], V8[1L], V9[1L], V10[1L], V11[1L], V12[1L], V13[1L], V14[1L], V15[1L], V16[1L], V17[1L], V18[1L], V19[1L], V20[1L], V21[1L], V22[1L], V23[1L], '
GForce optimized j to 'list(`g[`(V1, 1L), `g[`(V2, 1L), `g[`(V3, 1L), `g[`(V4, 1L), `g[`(V5, 1L), `g[`(V6, 1L), `g[`(V7, 1L), `g[`(V8, 1L), `g[`(V9, 1L), `g[`(V10, 1L), `g[`(V11, 1L), `g[`(V12, 1L), `g[`(V13, 1L), `g[`(V14, '

sessionInfo

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reprex_0.3.0      data.table_1.13.6 dtplyr_1.0.1.9000 dplyr_1.0.3      

loaded via a namespace (and not attached):
 [1] rstudioapi_0.11  whisker_0.4      knitr_1.29       magrittr_2.0.1  
 [5] tidyselect_1.1.0 R6_2.5.0         rlang_0.4.10     fansi_0.4.2     
 [9] tools_4.0.3      xfun_0.20        utf8_1.1.4       cli_2.2.0       
[13] DBI_1.1.0        clipr_0.7.0      htmltools_0.5.0  ellipsis_0.3.1  
[17] digest_0.6.27    assertthat_0.2.1 tibble_3.0.6     lifecycle_0.2.0 
[21] crayon_1.3.4     processx_3.4.5   callr_3.5.1      purrr_0.3.4     
[25] ps_1.3.3         vctrs_0.3.6      fs_1.4.1         evaluate_0.14   
[29] glue_1.4.2       rmarkdown_2.6    compiler_4.0.3   pillar_1.4.7    
[33] generics_0.1.0   pkgconfig_2.0.3 
myoung3 (Author) commented Feb 1, 2021

Need to consider the case where i is supplied, e.g. x[V1 > -1, .SD[1L], by="g"]. A naive chaining approach is already faster:

library(data.table)
n <- 1e6
ncols <- 200
x <- as.data.table(matrix(rnorm(n*ncols),ncol = ncols))
x[, g:=sample(1:(n/2),.N,replace=TRUE)] #~approx 2 rows per group


system.time(y_SD <- x[V1 > -1, .SD[1L],by="g"])
#>    user  system elapsed 
#>    2.18    0.19    2.42
system.time({
  temp <- x[V1 > -1]
  y_I <- temp[temp[,.I[1L], by="g"]$V1]
})
#>    user  system elapsed 
#>    0.94    0.66    0.97

setcolorder(y_I, "g") #to match y_SD
identical(y_SD, y_I)
#> [1] TRUE

Created on 2021-02-01 by the reprex package (v0.3.0)

myoung3 changed the title from “.SD[1L] optimization is slower than x[x[,.I[1L],by="group"]$V1]” to “.SD[i] could be optimized better and more generally” on Feb 1, 2021
myoung3 (Author) commented Feb 1, 2021

Need to consider x[, .SD[i, j],by="g"]. Since this isn't optimized currently, the nested subset followed by a second by statement is much faster (>300x speedup):

library(data.table)
n <- 1e6
ncols <- 200
x <- as.data.table(matrix(rnorm(n*ncols),ncol = ncols))
x[, g:=sample(1:(n/2),.N,replace=TRUE)] #~approx 2 rows per group


system.time(y_SD <- x[V1 > -1, .SD[1L, list(V2sum=sum(V2))],by="g"])
#>    user  system elapsed 
#>  317.45    3.30  329.58
system.time({
  temp <- x[V1 > -1]
  y_I <- temp[temp[,.I[1L], by="g"]$V1,list(V2sum=sum(V2)),by="g"]
})
#>    user  system elapsed 
#>    0.69    0.25    0.61

setcolorder(y_I, "g") #to match y_SD
identical(y_SD, y_I)
#> [1] TRUE

Created on 2021-02-01 by the reprex package (v0.3.0)

myoung3 (Author) commented Feb 1, 2021

Finally, x[, .SD[i2],by="g", .SDcols=vars1] should be optimized to:

temp <- x[, vars1,with=FALSE]
temp[temp[, .I[i2],by="g"]$V1]

and

x[i1, .SD[i2, j],by="g"] should be optimized to:

temp <- x[i1, list(<any vars used in j>)]
temp[temp[, .I[i2],by="g"]$V1, j, by="g"]

skipping the timings to give my laptop a break
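For what it's worth, the equivalence of the rewrite proposed in the earlier comments can be checked on a toy table (a sketch with made-up column names, not the actual optimizer output):

```r
library(data.table)
dt <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)

# original form: filter in i, then .SD[i2, j] per group
y_SD <- dt[v > 1, .SD[1L, list(s = sum(v))], by = "g"]

# proposed rewrite: subset once, locate the first row per group via .I,
# then evaluate j grouped over those rows
temp <- dt[v > 1]
y_I  <- temp[temp[, .I[1L], by = "g"]$V1, list(s = sum(v)), by = "g"]

setcolorder(y_I, "g") # to match y_SD
identical(y_SD, y_I)
#> [1] TRUE
```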

myoung3 (Author) commented Feb 2, 2021

I guess this is related to (or possibly a duplicate of) #735. I started this issue focusing on the fact that the current optimization of .SD[1L] isn't as fast as it could be (which is distinct from #735). But the more timings I did, the more I realized that the current approach to optimizing .SD[i] may be both slower and more work than just using nested .I statements for everything.

jangorecki (Member) commented Feb 2, 2021

I am not sure but there might be a PR that already tries to achieve it.

ColeMiller1 (Contributor) commented:

I attempted a filter_at() equivalent in i instead of subsetting in j. While I could make the PR better, the concept did not seem to be well received. #4139

Relatedly, I did some work on a having argument (#4412), but looking back, that could be improved as well.

myoung3 (Author) commented Feb 2, 2021

Thanks for those links. I think we are thinking along the same lines internally, but I'm suggesting zero interface changes here. If we can get this working, it would eliminate a lot of the caveats about avoiding .SD because it's slow.

"This is generally a bad idea to call [.data.table from inside [.data.table. I recall there was a single(?) place where we already did that, and we wanted to get rid of it. If there is another way, then we should consider to use that instead."
Link

@jangorecki Given your previous statement about avoiding [.data.table inside [.data.table, do you think that's still the case given these timings? Recursion is tricky, but it seems like it might actually be the right tool here.

ColeMiller1 (Contributor) commented:

I actually agree with Jan. If implemented, it should use forder and internal functions to get the indices.

I still think implementation in i is better. We do a lot of work to increase performance to subset in j, but as soon as a user wants to do something like sum(), all the optimizations go away.

Regardless, I will look into implementing indices outside of using [.data.table.
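As a rough illustration of what "outside of [.data.table" could look like: for an unkeyed table, data.table groups by order of first appearance, so the first-row-per-group indices can be computed in plain base R (a conceptual sketch only; a real implementation would presumably go through forder and the other internals):

```r
# Conceptual sketch: for an unkeyed table, which(!duplicated(g)) yields
# the same row indices as x[, .I[1L], by="g"]$V1 -- with no nested
# [.data.table call at all.
g <- c("b", "a", "b", "c", "a")
idx <- which(!duplicated(g))
idx
#> [1] 1 2 4
```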

myoung3 (Author) commented Feb 2, 2021

I like the idea of calling lower-level functions directly to avoid recursion. My only concern is that recursion might be the only way to properly deal with certain pathological inputs, e.g. x[i, .SD[, .SD[?], by2], by] or x[i, .SD[.SD, ?, on], by] (I'm not even sure what these would do or what purpose they would serve, but we'd want to maintain consistency with whatever they do now).

myoung3 (Author) commented Feb 2, 2021

But I could definitely be convinced that this isn't an issue, or that recursion isn't necessary to solve it.

myoung3 (Author) commented Feb 2, 2021

And I guess there's a difference between optimizing all nested .SD[i] calls (harder, because recursion is hard) and optimizing only top-level .SD[i] calls while leaving nested calls unoptimized but unbroken (maybe easier, or maybe just hard in a different way).
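To make the distinction concrete, here is a toy pair (a sketch; the nested query is contrived): the top-level call is the kind that could be rewritten to the .I form, while the nested one would have to be left alone but still work:

```r
library(data.table)
dt <- data.table(g = rep(1:2, each = 4), h = rep(1:2, 4), v = 1:8)

# top-level .SD[i]: a candidate for the nested-.I rewrite
top <- dt[, .SD[1L], by = "g"]

# nested .SD[i]: the inner subset runs inside another grouping; an
# optimizer limited to top-level calls must leave this unoptimized
# but unbroken
nested <- dt[, .SD[, .SD[1L], by = "h"], by = "g"]
```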
