Improve translation of grouped filter() #176
Comments
I don't think this has anything to do with select():

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
ncol <- 100
n <- 1e4
dt1 <- data.table::as.data.table(replicate(ncol, rnorm(n)))
dt1[, g:=1:.N]
dt1[, x:=sample(1:2,size=n,replace=TRUE)]
dt1 %>%
lazy_dt() %>%
group_by(g) %>%
filter(sum(x) == 1) %>%
show_query()
#> `_DT1`[, .SD[sum(x) == 1], keyby = .(g)]
bench::mark(
dt1[, .SD[sum(x) == 1], keyby = .(g)],
dt1[dt1[,.I[sum(x) == 1], keyby = .(g)]$V1],
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec`
#> <bch:expr> <bch:t> <bch:> <dbl>
#> 1 dt1[, .SD[sum(x) == 1], keyby = .(g)] 1.25s 1.25s 0.802
#> 2 dt1[dt1[, .I[sum(x) == 1], keyby = .(g)]$V1] 14.69ms 16.3ms 53.8
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

Created on 2021-01-30 by the reprex package (v0.3.0.9001)

Modulo slight differences in the column order, which are (I think) caused by keyby.
This approach would presumably also be more efficient for grouped slices.
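For instance, a grouped slice of the first row per group could in principle use the same .I trick. This is only a sketch of the idea using the dt1 from the reprex above, not dtplyr output:

# .SD-based form, analogous to the current grouped-filter translation
dt1[, .SD[1L], keyby = .(g)]
# hypothetical .I-based form, which avoids materialising .SD for every group
dt1[dt1[, .I[1L], keyby = .(g)]$V1]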
Hmm, you're right it doesn't seem to be related to select at all. I thought it was slow because the .SD subset was taking all columns first, and then subsequently taking the selected columns. But it seems like .SD subsets are just slow when there are lots of groups. Related: Rdatatable/data.table#735

For what it's worth, here's a comparison of combining the filter and select statements into one data.table call versus doing them in two separate steps:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
ncol <- 100
n <- 1e7
dt1 <- as.data.table(replicate(ncol, rnorm(n)))
dt1[, g:=1:.N]
dt1[, x:=sample(1:2,size=n,replace=TRUE)]
gc()
# baseline memory usage: ~8000 MB
system.time(dt1[dt1[,.I[sum(x)==1],by="g"]$V1,list(g,x)])
#user system elapsed
#17.984 0.284 15.243
#peak memory, according to htop: ~8100mb
gc()
system.time(dt1[dt1[,.I[sum(x)==1],by="g"]$V1][,list(g,x)])
#user system elapsed
#19.528 4.132 15.236
#peak memory, according to htop: ~11.7G

(Peak memory readings via htop since I don't trust in-language memory profiling, especially for packages that could potentially do their own memory allocation outside of R's C API.)

It makes sense that the memory usage of those two calls is different, since the second one needs to allocate a copy of the entire filtered table (i.e. for all columns), whereas the first approach never allocates memory for columns that aren't used. Anyway, back to the speed of the
Right, if I could use this approach for grouped filtering, I'd no longer need a separate subsetting step. But using this approach is going to be tricky when the lazy dt has already (e.g.) been mutated. I think it'd need an intermediate assignment, which would somehow need to be hoisted to the top if the result was used in a join. I don't think that's impossible to overcome, but it's definitely a decent amount of work.
To make that concrete:

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
dt1 <- lazy_dt(data.frame(g = 1:2, x = 1:2))
dt1 %>%
mutate(y = 1) %>%
group_by(g) %>%
filter(sum(x + y) > 2) %>%
show_query()
#> copy(`_DT1`)[, `:=`(y = 1)][, .SD[sum(x + y) > 2], keyby = .(g)]

Would need to be translated to (I think):
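Presumably something along these lines (a sketch mirroring the intermediate-assignment form used further down; the temporary name DT2 is illustrative):

DT2 <- copy(`_DT1`)[, `:=`(y = 1)]
DT2[DT2[, .I[sum(x + y) > 2], keyby = .(g)]$V1]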
And if you added a join on the end:

dt1 %>%
mutate(y = 1) %>%
group_by(g) %>%
filter(sum(x + y) > 2) %>%
left_join(dt2 %>% mutate(z = 1)) %>%
show_query()
#> copy(`_DT9`)[, `:=`(z = 1)][copy(`_DT8`)[, `:=`(y = 1)][, .SD[sum(x +
#> y) > 2], keyby = .(g)], on = .(x), allow.cartesian = TRUE]

Then the assignment still has to come first:

DT2 <- copy(`_DT1`)[, `:=`(y = 1)]
copy(`_DT2`)[, `:=`(z = 1)][DT2[DT2[, .I[sum(x + y) > 2], keyby = .(g)]$V1], on = .(x), allow.cartesian = TRUE]

Although if dtplyr supported intermediate assignment, you'd probably want to use another one to improve readability:

DT2 <- copy(`_DT1`)[, `:=`(y = 1)]
DT3 <- DT2[DT2[, .I[sum(x + y) > 2], keyby = .(g)]$V1]
copy(`_DT2`)[, `:=`(z = 1)][DT3, on = .(x), allow.cartesian = TRUE]
Hmmm, maybe this wouldn't be too hard to implement? Would need to think about how to generate the names for the temporary objects, and confirm that we can still collapse a select after a grouped filter/slice.
Have this working:

dt %>%
group_by(cyl) %>%
filter(vs > mean(vs)) %>%
select(mpg:cyl) %>%
show_query()
#> `_DT2`[`_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1, .(mpg, cyl)]

If the grouped filter happens on top of existing computation, it first creates an intermediate variable:

dt %>%
filter(cyl > 4) %>%
group_by(cyl) %>%
filter(vs > mean(vs)) %>%
select(mpg:cyl)
#> _DT2 <- `_DT2`[cyl > 4]
#> `_DT2`[`_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1, .(mpg, cyl)]
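As a quick sanity check (my own sketch, assuming dt is lazy_dt(mtcars), which matches the column names in the queries above), the translated query should return the same rows as plain dplyr once both results are sorted:

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

dt <- lazy_dt(mtcars)

via_dtplyr <- dt %>%
  group_by(cyl) %>%
  filter(vs > mean(vs)) %>%
  select(mpg, cyl) %>%
  as_tibble() %>%
  arrange(cyl, mpg)

via_dplyr <- mtcars %>%
  group_by(cyl) %>%
  filter(vs > mean(vs)) %>%
  ungroup() %>%
  select(mpg, cyl) %>%
  arrange(cyl, mpg)

# expect TRUE: same rows and columns from both pipelines
all.equal(as.data.frame(via_dtplyr), as.data.frame(via_dplyr))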
I guess in this case I don't strictly need the intermediate variable, but using it doesn't seem too harmful.
I think you do in fact need the intermediate variable to avoid doing the cyl > 4 subset twice. The following would not work, because the indices returned by .I refer to the unfiltered table:

`_DT2`[cyl > 4][`_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1, .(mpg, cyl)]

Defining the intermediate variable is still fast.

Interfacing with dtplyr has made me think about how data.table might optimize its own .SD[i] subsets better. More timing here: Rdatatable/data.table#4886

It actually turns out that
Which means that the
@myoung3 oh yeah, good point about the indices changing. I guess you could maybe do something like

`_DT2`[intersect(which(cyl > 4), `_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1), .(mpg, cyl)]

but that doesn't have any benefits over the intermediate approach.
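To illustrate the index mismatch on a small example (a sketch using mtcars, mirroring the `_DT2` queries above): .I is computed against the table it is called on, so indices taken from the unfiltered table don't line up with the rows of an already-filtered one.

library(data.table)
DT <- as.data.table(mtcars)

# row indices of the grouped filter, computed on the *unfiltered* table
idx <- DT[, .I[vs > mean(vs)], keyby = .(cyl)]$V1

# wrong: applying those indices to the filtered table picks the wrong rows
# (some indices may even exceed the filtered table's row count, giving NA rows)
DT[cyl > 4][idx, .(mpg, cyl)]

# right: materialise the filtered table once, then compute the indices on it
DT2 <- DT[cyl > 4]
DT2[DT2[, .I[vs > mean(vs)], keyby = .(cyl)]$V1, .(mpg, cyl)]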
It occurred to me to worry about the intermediate table being a reference copy when the subset resulted in an identity subset, but that doesn't seem to be the case:

library(data.table)
dt <- data.table(x=1:10)
dt2 <- dt[1:10]
dt3 <- dt[1:5]
tracemem(dt)
#> [1] "<0000000015543880>"
tracemem(dt2)
#> [1] "<0000000015BF5510>"
tracemem(dt3)
#> [1] "<0000000015218D18>"
dt2[, test:=1L]
dt
#> x
#> 1: 1
#> 2: 2
#> 3: 3
#> 4: 4
#> 5: 5
#> 6: 6
#> 7: 7
#> 8: 8
#> 9: 9
#> 10: 10

Created on 2021-02-01 by the reprex package (v0.3.0)

For our purposes this is good/simplifies things (since mutability is not the default for dtplyr).
@myoung3 fwiw even if that were a problem, dtplyr objects already contain all the metadata needed to automatically create a single copy when needed.
Hey Hadley,
Thanks for your recent work on this package and for fixing #127. I'm excited about this package and specifically interested in edge cases where dtplyr code doesn't translate to optimal data.table code. I'm documenting the performance difference between how #127 currently gets solved (as of c31d3d0) and the optimal data.table query (the solution posted by ColeMiller1 in #127). It's probably hard to implement the optimal solution currently, but maybe in the future?
Created on 2021-01-30 by the reprex package (v0.3.0)
I can probably think of more examples like this (where the translated query is suboptimal). Let me know if that would be useful. Thanks again for your work on the package.