.SD[i] could be optimized better and more generally #4886
Need to consider the already-optimized `.SD[1L]` case:

``` r
library(data.table)
n <- 1e6
ncols <- 200
x <- as.data.table(matrix(rnorm(n * ncols), ncol = ncols))
x[, g := sample(1:(n/2), .N, replace = TRUE)]  # approx. 2 rows per group
system.time(y_SD <- x[V1 > -1, .SD[1L], by = "g"])
#>    user  system elapsed
#>    2.18    0.19    2.42
system.time({
  temp <- x[V1 > -1]
  y_I <- temp[temp[, .I[1L], by = "g"]$V1]
})
#>    user  system elapsed
#>    0.94    0.66    0.97
setcolorder(y_I, "g")  # to match y_SD
identical(y_SD, y_I)
#> [1] TRUE
```

Created on 2021-02-01 by the reprex package (v0.3.0)
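For context, a minimal sketch (toy table assumed for illustration, not taken from the reprex above) showing that the same nested-`[` idiom extends beyond `.SD[1L]`, e.g. to the last row per group:

``` r
library(data.table)
x <- data.table(v = 1:4, g = c(1, 1, 2, 2))

# .SD form: last row of each group (not internally optimized here,
# as far as I know)
y_SD <- x[, .SD[.N], by = "g"]

# nested form: collect the target row numbers per group via .I,
# then do a single vectorized row subset -- no per-group .SD
y_I <- x[x[, .I[.N], by = "g"]$V1]
setcolorder(y_I, "g")  # to match y_SD
identical(y_SD, y_I)   # expect TRUE
```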
Need to consider the unoptimized case too, where `.SD[i]` also carries a `j` expression:

``` r
library(data.table)
n <- 1e6
ncols <- 200
x <- as.data.table(matrix(rnorm(n * ncols), ncol = ncols))
x[, g := sample(1:(n/2), .N, replace = TRUE)]  # approx. 2 rows per group
system.time(y_SD <- x[V1 > -1, .SD[1L, list(V2sum = sum(V2))], by = "g"])
#>    user  system elapsed
#>  317.45    3.30  329.58
system.time({
  temp <- x[V1 > -1]
  y_I <- temp[temp[, .I[1L], by = "g"]$V1, list(V2sum = sum(V2)), by = "g"]
})
#>    user  system elapsed
#>    0.69    0.25    0.61
setcolorder(y_I, "g")  # to match y_SD
identical(y_SD, y_I)
#> [1] TRUE
```

Created on 2021-02-01 by the reprex package (v0.3.0)
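The rewrite also covers arbitrary, currently unoptimized `i` expressions inside `.SD`, not just `.SD[1L]`. A hedged sketch on a toy table (names and data assumed for illustration):

``` r
library(data.table)
x <- data.table(v = c(1, 5, 3, 2, 2, 8), g = rep(1:2, each = 3))

# unoptimized today: i inside .SD is an arbitrary logical expression
y_SD <- x[, .SD[v > mean(v)], by = "g"]

# nested rewrite: same result from one grouped .I pass plus one subset
y_I <- x[x[, .I[v > mean(v)], by = "g"]$V1]
setcolorder(y_I, "g")  # to match y_SD
identical(y_SD, y_I)   # expect TRUE
```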
Finally, [code elided] and [code elided], skipping the timings to give my laptop a break.
I guess this is related to (or possibly a duplicate of) #735. I started this issue focusing on the fact that the current optimization of `.SD[1L]` isn't as fast as it could be (which is distinct from #735). But the more timings I did, the more I realized that the current approach to optimizing `.SD[i]` may be both slower and more work than just using nested `[` calls.
I am not sure, but there might be a PR that already tries to achieve this.
Thanks for those links. I think we are thinking along the same lines internally, but I'm suggesting zero interface changes here. If we can get this working, it would eliminate a lot of the caveats about not using `.SD` because it's slow.

> It is generally a bad idea to call `[.data.table` from inside `[.data.table`. I recall there was a single(?) place where we already did that, and we wanted to get rid of it. If there is another way, then we should consider using that instead.

@jangorecki Given your previous statement about avoiding `[.data.table` inside `[.data.table`, do you think that's still the case given these timings? Recursion is tricky, but it seems like it might actually be the right tool here.
I actually agree with Jan. If implemented, it should use […]. I still think implementation in […]. Regardless, I will look into implementing indices outside of using […].
I like the idea of calling lower-order functions directly to avoid recursion. My only concern is that recursion might be the only way to properly deal with certain pathological inputs, e.g. […].
But I could definitely be convinced that this isn't an issue, or that recursion isn't necessary to solve it.
And I guess there's a difference between optimizing all nested `.SD[i]` calls (harder, because recursion is hard) and optimizing only top-level `.SD[i]` calls while leaving nested calls unoptimized but not breaking them (maybe easier, or maybe just hard in a different way); see the sketch below for what such a nested call can look like.
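To make the nested case concrete, here is a hypothetical example (constructed for illustration, not taken from the thread) of an `.SD[i]` whose `i` itself calls `[.data.table` on `.SD`. An internal rewriter would need to either recurse into this expression or detect it and fall back to the unoptimized path:

``` r
library(data.table)
x <- data.table(g = rep(1:2, each = 3), v = c(1, 3, 2, 4, 6, 5))

# per group: all rows up to and including the row with the largest v;
# the inner .SD[, which.max(v)] is itself a [.data.table call
x[, .SD[seq_len(.SD[, which.max(v)])], by = "g"]
# expect rows (g=1, v=1), (g=1, v=3), (g=2, v=4), (g=2, v=6)
```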
Is it feasible to change how `.SD[i]` gets optimized?

- Can `x[, .SD[1L], by = "g"]` be optimized to `x[x[, .I[1L], by = "g"]$V1]`?
- Could we conceivably optimize all `.SD[i], by` calls to the nested form above? This would speed up not just the already-optimized `.SD[1L]` subsets but all arbitrary (unoptimized) `.SD[i = <>], by` subsets. Also see tidyverse/dtplyr#176 for a more general example showing how that nested approach is much faster when dealing with unoptimized `.SD[i]` statements (~100x speedup for a large number of groups). A user-level sketch of the transformation follows this list.

Created on 2021-02-01 by the reprex package (v0.3.0)
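For illustration only, here is what the proposed transformation could look like as a user-level helper. The name `sd_rows()` and its interface are hypothetical; a real implementation would live inside `[.data.table`'s query optimization, not in user code:

``` r
library(data.table)

# hypothetical helper: rewrite x[, .SD[i_expr], by = by_cols] into the
# nested .I form -- one grouped pass to collect row numbers, then one
# vectorized subset
sd_rows <- function(x, i_expr, by_cols) {
  idx <- x[, .I[eval(i_expr)], by = by_cols]$V1  # target rows per group
  res <- x[idx]                                  # single row subset
  setcolorder(res, by_cols)                      # match .SD column order
  res[]
}

x <- data.table(v = 1:6, g = rep(1:3, each = 2))
identical(sd_rows(x, quote(1L), "g"), x[, .SD[1L], by = "g"])  # expect TRUE
```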
Abridged verbose output for `.SD[1L]`: [collapsed output not shown]

sessionInfo: [collapsed output not shown]