
.SD[i] could be optimized better and more generally #4886

Open
myoung3 opened this issue Feb 1, 2021 · 11 comments

Comments

myoung3 (Contributor) commented Feb 1, 2021

Is it feasible to change how .SD[i] gets optimized?
Can x[,.SD[1L], by="g"] be optimized to x[x[,.I[1L],by="g"]$V1]?

Could we conceivably optimize all .SD[i], by calls to the nested form above? This would speed up not just the already-optimized .SD[1L] subsets but all arbitrary (unoptimized) .SD[i=<>], by subsets. See also tidyverse/dtplyr#176 for a more general example showing that the nested approach is much faster for unoptimized .SD[i] statements (~100x speedup with a large number of groups).

library(data.table)
n <- 1e6
ncols <- 200
x <- as.data.table(matrix(rnorm(n*ncols),ncol = ncols))
x[, g:=sample(1:(n/2),.N,replace=TRUE)] #~approx 2 rows per group

system.time(y_I <- x[x[,.I[1L],by="g"]$V1])
#>    user  system elapsed 
#>    0.70    0.59    0.93
system.time(y_SD <- x[,.SD[1L], by="g"])
#>    user  system elapsed 
#>    2.48    0.34    2.79
setcolorder(y_I, "g") #to match y_SD
identical(y_I, y_SD)
#> [1] TRUE


system.time(y_I <- x[x[,last(.I),by="g"]$V1])
#>    user  system elapsed 
#>    1.64    0.21    1.19
system.time(y_SD <- x[,last(.SD), by="g"])
#>    user  system elapsed 
#>    2.61    0.17    2.80
setcolorder(y_I, "g") #to match y_SD
identical(y_I, y_SD)
#> [1] TRUE


x[, g:=sample(1:(n/5),.N,replace=TRUE)] #~approx 5 rows per group

system.time(y_I <- x[x[,.I[1L],by="g"]$V1])
#>    user  system elapsed 
#>    0.28    0.12    0.33
system.time(y_SD <- x[,.SD[1L], by="g"])
#>    user  system elapsed 
#>    0.94    0.11    1.02
setcolorder(y_I, "g") #to match y_SD
identical(y_I, y_SD)
#> [1] TRUE


system.time(y_I <- x[x[,last(.I),by="g"]$V1])
#>    user  system elapsed 
#>    0.59    0.06    0.42
system.time(y_SD <- x[,last(.SD), by="g"])
#>    user  system elapsed 
#>    1.16    0.14    1.24
setcolorder(y_I, "g") #to match y_SD
identical(y_I, y_SD)
#> [1] TRUE

Created on 2021-02-01 by the reprex package (v0.3.0)

Abridged verbose output for .SD[1L]:

lapply optimization changed j from '.SD[1L]' to 'list(V1[1L], V2[1L], V3[1L], V4[1L], V5[1L], V6[1L], V7[1L], V8[1L], V9[1L], V10[1L], V11[1L], V12[1L], V13[1L], V14[1L], V15[1L], V16[1L], V17[1L], V18[1L], V19[1L], V20[1L], V21[1L], V22[1L], V23[1L], '
GForce optimized j to 'list(`g[`(V1, 1L), `g[`(V2, 1L), `g[`(V3, 1L), `g[`(V4, 1L), `g[`(V5, 1L), `g[`(V6, 1L), `g[`(V7, 1L), `g[`(V8, 1L), `g[`(V9, 1L), `g[`(V10, 1L), `g[`(V11, 1L), `g[`(V12, 1L), `g[`(V13, 1L), `g[`(V14, '

sessionInfo

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reprex_0.3.0      data.table_1.13.6 dtplyr_1.0.1.9000 dplyr_1.0.3      

loaded via a namespace (and not attached):
 [1] rstudioapi_0.11  whisker_0.4      knitr_1.29       magrittr_2.0.1  
 [5] tidyselect_1.1.0 R6_2.5.0         rlang_0.4.10     fansi_0.4.2     
 [9] tools_4.0.3      xfun_0.20        utf8_1.1.4       cli_2.2.0       
[13] DBI_1.1.0        clipr_0.7.0      htmltools_0.5.0  ellipsis_0.3.1  
[17] digest_0.6.27    assertthat_0.2.1 tibble_3.0.6     lifecycle_0.2.0 
[21] crayon_1.3.4     processx_3.4.5   callr_3.5.1      purrr_0.3.4     
[25] ps_1.3.3         vctrs_0.3.6      fs_1.4.1         evaluate_0.14   
[29] glue_1.4.2       rmarkdown_2.6    compiler_4.0.3   pillar_1.4.7    
[33] generics_0.1.0   pkgconfig_2.0.3 
myoung3 (Author) commented Feb 1, 2021

Need to consider the case where i is supplied, e.g. x[V1 > -1, .SD[1L], by="g"]. A naive chaining approach is already faster:

library(data.table)
n <- 1e6
ncols <- 200
x <- as.data.table(matrix(rnorm(n*ncols),ncol = ncols))
x[, g:=sample(1:(n/2),.N,replace=TRUE)] #~approx 2 rows per group


system.time(y_SD <- x[V1 > -1, .SD[1L],by="g"])
#>    user  system elapsed 
#>    2.18    0.19    2.42
system.time({
  temp <- x[V1 > -1]
  y_I <- temp[temp[,.I[1L], by="g"]$V1]
})
#>    user  system elapsed 
#>    0.94    0.66    0.97

setcolorder(y_I, "g") #to match y_SD
identical(y_SD, y_I)
#> [1] TRUE

Created on 2021-02-01 by the reprex package (v0.3.0)

myoung3 changed the title from “.SD[1L] optimization is slower than x[x[,.I[1L],by="group"]$V1]” to “.SD[i] could be optimized better and more generally” on Feb 1, 2021
myoung3 (Author) commented Feb 1, 2021

Need to consider x[, .SD[i, j],by="g"]. Since this isn't optimized currently, the nested subset followed by a second by statement is much faster (>300x speedup):

library(data.table)
n <- 1e6
ncols <- 200
x <- as.data.table(matrix(rnorm(n*ncols),ncol = ncols))
x[, g:=sample(1:(n/2),.N,replace=TRUE)] #~approx 2 rows per group


system.time(y_SD <- x[V1 > -1, .SD[1L, list(V2sum=sum(V2))],by="g"])
#>    user  system elapsed 
#>  317.45    3.30  329.58
system.time({
  temp <- x[V1 > -1]
  y_I <- temp[temp[,.I[1L], by="g"]$V1,list(V2sum=sum(V2)),by="g"]
})
#>    user  system elapsed 
#>    0.69    0.25    0.61

setcolorder(y_I, "g") #to match y_SD
identical(y_SD, y_I)
#> [1] TRUE

Created on 2021-02-01 by the reprex package (v0.3.0)

myoung3 (Author) commented Feb 1, 2021

Finally, x[, .SD[i2],by="g", .SDcols=vars1] should be optimized to:

temp <- x[, vars1,with=FALSE]
temp[temp[, .I[i2],by="g"]$V1]

and

x[i1, .SD[i2, j],by="g"] should be optimized to:

temp <- x[i1, list(<any vars used in j>)]
temp[temp[, .I[i2],by="g"]$V1, j, by="g"]

skipping the timings to give my laptop a break
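For what it's worth, the equivalence of the rewrite proposed in the earlier comments can be checked on a toy table (a sketch with made-up column names, not the actual optimizer output):

```r
library(data.table)
dt <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)

# original form: filter in i, then .SD[i2, j] per group
y_SD <- dt[v > 1, .SD[1L, list(s = sum(v))], by = "g"]

# proposed rewrite: subset once, locate the first row per group via .I,
# then evaluate j grouped over those rows
temp <- dt[v > 1]
y_I  <- temp[temp[, .I[1L], by = "g"]$V1, list(s = sum(v)), by = "g"]

setcolorder(y_I, "g") # to match y_SD
identical(y_SD, y_I)
#> [1] TRUE
```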

myoung3 (Author) commented Feb 2, 2021

I guess this is related to (or possibly a duplicate of) #735. I started this issue focusing on the fact that the current optimization of .SD[1L] isn't as fast as it could be (which is distinct from #735). But the more timings I did, the more I realized that the current approach to optimizing .SD[i] may be both slower and more work than just using nested .I statements for everything.

jangorecki (Member) commented Feb 2, 2021

I am not sure but there might be a PR that already tries to achieve it.

ColeMiller1 (Contributor) commented:

I attempted a filter_at() equivalent in i instead of subsetting in j. While I could make the PR better, the concept did not seem to be well received. #4139

Relatedly, I did some work on a having argument (#4412), but looking back, that could be improved as well.

myoung3 (Author) commented Feb 2, 2021

Thanks for those links. I think we are thinking along the same lines internally, but I'm suggesting zero interface changes here. If we can get this working, it would eliminate a lot of the caveats about avoiding .SD because it's slow.

"This is generally a bad idea to call [.data.table from inside [.data.table. I recall there was a single(?) place where we already did that, and we wanted to get rid of it. If there is another way, then we should consider to use that instead."
Link

@jangorecki Given your previous statement about avoiding [.data.table inside [.data.table, do you think that's still the case given these timings? Recursion is tricky, but it seems like it might actually be the right tool here.

ColeMiller1 (Contributor) commented:

I actually agree with Jan. If implemented, it should use forder and internal functions to get the indices.

I still think implementation in i is better. We do a lot of work to increase performance to subset in j, but as soon as a user wants to do something like sum(), all the optimizations go away.

Regardless, I will look into implementing indices outside of using [.data.table.
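As a rough illustration of what "outside of [.data.table" could look like: for an unkeyed table, data.table groups by order of first appearance, so the first-row-per-group indices can be computed in plain base R (a conceptual sketch only; a real implementation would presumably go through forder and the other internals):

```r
# Conceptual sketch: for an unkeyed table, which(!duplicated(g)) yields
# the same row indices as x[, .I[1L], by="g"]$V1 -- with no nested
# [.data.table call at all.
g <- c("b", "a", "b", "c", "a")
idx <- which(!duplicated(g))
idx
#> [1] 1 2 4
```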

myoung3 (Author) commented Feb 2, 2021

I like the idea of calling lower-level functions directly to avoid recursion. My only concern is that recursion might be the only way to properly deal with certain pathological inputs, e.g. x[i, .SD[, .SD[?], by2], by] or x[i, .SD[.SD, ?, on], by] (I'm not even sure what these would do or what purpose they would serve, but we'd want to maintain consistency with whatever they do now).

myoung3 (Author) commented Feb 2, 2021

But I could definitely be convinced that this isn't an issue, or that recursion isn't necessary to solve it.

myoung3 (Author) commented Feb 2, 2021

And I guess there's a difference between optimizing all nested .SD[i] calls (harder, because recursion is hard) and optimizing only top-level .SD[i] calls while leaving nested calls unoptimized but unbroken (maybe easier, or maybe just hard in a different way).
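To make the distinction concrete, here is a toy pair (a sketch; the nested query is contrived): the top-level call is the kind that could be rewritten to the .I form, while the nested one would have to be left alone but still work:

```r
library(data.table)
dt <- data.table(g = rep(1:2, each = 4), h = rep(1:2, 4), v = 1:8)

# top-level .SD[i]: a candidate for the nested-.I rewrite
top <- dt[, .SD[1L], by = "g"]

# nested .SD[i]: the inner subset runs inside another grouping; an
# optimizer limited to top-level calls must leave this unoptimized
# but unbroken
nested <- dt[, .SD[, .SD[1L], by = "h"], by = "g"]
```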
