Improve translation of grouped filter() #176

Closed
myoung3 opened this issue Jan 30, 2021 · 14 comments · Fixed by #177

myoung3 commented Jan 30, 2021

Hey Hadley,
Thanks for your recent work on this package and for fixing #127. I'm excited about this package and specifically interested in edge cases where dtplyr code doesn't translate to optimal data.table code. Below I've documented the performance difference between how #127 currently gets solved (as of c31d3d0) and the optimal data.table query (the solution ColeMiller1 posted in #127). It's probably hard to implement the optimal solution right now, but maybe in the future?

library(dtplyr)
library(data.table)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
ncol <- 100
n <- 1e5
dt1 <- as.data.table(replicate(ncol, rnorm(n)))
dt1[, g:=1:.N]
dt1[, x:=sample(1:2,size=n,replace=TRUE)]

dt2_dtplyr <- dt1 %>%
  lazy_dt() %>%
  group_by(g) %>%
  filter(sum(x) == 1) %>% 
  select(g, x)

show_query(dt2_dtplyr)
#> `_DT1`[, .SD[sum(x) == 1], keyby = .(g)][, .(g, x)]
  
system.time(
  dt2_dtplyr <- dt2_dtplyr %>%
  as.data.table()
)
#>    user  system elapsed 
#>   25.86    0.07   26.50

system.time({
  dt2_dt <- dt1[dt1[,.I[sum(x)==1],by="g"]$V1,list(g,x)]  
  setkey(dt2_dt,g) #to match the key created by dtplyr approach
})
#>    user  system elapsed 
#>     0.2     0.0     0.2

identical(dt2_dt, dt2_dtplyr)
#> [1] TRUE



## here's the dplyr timing for reference:
dt1 <- as_tibble(dt1)
system.time({
  dt2_dplyr <- dt1 %>%
    group_by(g) %>%
    filter(sum(x) == 1) %>% 
    select(g, x)
  
  setDT(dt2_dplyr)
  setattr(dt2_dplyr, "groups",NULL) #drop the tibble grouping so identical test passes
  setkey(dt2_dplyr, g)
  
})
#>    user  system elapsed 
#>    0.72    0.00    0.74

identical(dt2_dplyr, dt2_dt)
#> [1] TRUE

Created on 2021-01-30 by the reprex package (v0.3.0)

I can probably think of more examples like this (where the translated query is suboptimal). Let me know if that would be useful. Thanks again for your work on the package.

hadley commented Jan 30, 2021

I don't think this has anything to do with select(); you're actually proposing a more efficient translation for a grouped filter()?

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
ncol <- 100
n <- 1e4
dt1 <- data.table::as.data.table(replicate(ncol, rnorm(n)))
dt1[, g:=1:.N]
dt1[, x:=sample(1:2,size=n,replace=TRUE)]

dt1 %>%
  lazy_dt() %>%
  group_by(g) %>%
  filter(sum(x) == 1) %>% 
  show_query()
#> `_DT1`[, .SD[sum(x) == 1], keyby = .(g)]

bench::mark(
  dt1[, .SD[sum(x) == 1], keyby = .(g)],
  dt1[dt1[,.I[sum(x) == 1], keyby = .(g)]$V1],
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                                       min median `itr/sec`
#>   <bch:expr>                                   <bch:t> <bch:>     <dbl>
#> 1 dt1[, .SD[sum(x) == 1], keyby = .(g)]          1.25s  1.25s     0.802
#> 2 dt1[dt1[, .I[sum(x) == 1], keyby = .(g)]$V1] 14.69ms 16.3ms    53.8  
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

Created on 2021-01-30 by the reprex package (v0.3.0.9001)

Modulo slight differences in the column order, which are (I think) caused by keyby moving the keys to the left?
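
For example, with a small toy table (just to illustrate the column-order point, not part of the benchmark above):

library(data.table)
dt <- data.table(a = 1:4, g = c(1, 1, 2, 2), x = c(1, 0, 2, 1))

# keyby + .SD moves the grouping column to the front (and keys the result)
names(dt[, .SD[sum(x) == 1], keyby = .(g)])
#> [1] "g" "a" "x"

# the .I subset keeps the original column order (and leaves the result unkeyed)
names(dt[dt[, .I[sum(x) == 1], keyby = .(g)]$V1])
#> [1] "a" "g" "x"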

hadley changed the title from "Speed of select() after grouped-filter()" to "Improve translation of grouped filter()" on Jan 30, 2021

hadley commented Jan 31, 2021

This approach would presumably also be more efficient for grouped slices.
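
i.e. something along these lines for a grouped slice(1) (just a sketch of the two translations, not benchmarked here):

library(data.table)
dt <- data.table(g = rep(1:3, each = 2), x = 1:6)

# current style: materialise .SD for every group, then take its first row
dt[, .SD[1L], keyby = .(g)]

# .I style: compute the row numbers once, then do a single subset
dt[dt[, .I[1L], keyby = .(g)]$V1]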

myoung3 commented Jan 31, 2021

Hmm, you're right, it doesn't seem to be related to select() at all. I thought it was slow because the .SD subset was taking all columns first and then subsequently taking the selected columns, but it seems like .SD subsets are just slow when there are lots of groups. Related: Rdatatable/data.table#735

For what it's worth, combining the filter and select statements into a single [.data.table call has less memory overhead than chaining them, even though the timings are about the same:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
ncol <- 100
n <- 1e7
dt1 <- as.data.table(replicate(ncol, rnorm(n)))
dt1[, g:=1:.N]
dt1[, x:=sample(1:2,size=n,replace=TRUE)]


gc()
# baseline memory usage: ~8000 MB

system.time(dt1[dt1[,.I[sum(x)==1],by="g"]$V1,list(g,x)])
#user  system elapsed 
#17.984   0.284  15.243
# peak memory, according to htop: ~8100 MB
gc()
system.time(dt1[dt1[,.I[sum(x)==1],by="g"]$V1][,list(g,x)])
#user  system elapsed 
#19.528   4.132  15.236 
# peak memory, according to htop: ~11.7 GB

(peak-memory readings via htop, since I don't trust in-language memory profiling, especially for packages that could potentially do their own memory allocation outside of R's C API)

It makes sense that the memory usage of those two calls differs, since the second one needs to allocate a copy of the entire filtered table (i.e. all columns), whereas the first approach never allocates memory for columns that aren't used.

Anyway, back to the speed of the .I approach vs the .SD subset approach: yes, I think you're right that this speedup would apply to slicing as well. Here's someone commenting on the speed difference between the .I approach and the .SD approach for something like slice_min(): https://stackoverflow.com/questions/24070714/extract-row-corresponding-to-minimum-value-of-a-variable-by-group/41838383#41838383
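
For a slice_min()-style operation the same idea would look roughly like this (a small sketch along the lines of that answer):

library(data.table)
dt <- data.table(g = c(1, 1, 2, 2), v = c(3, 1, 5, 2))

# .SD-based: subset within each group
dt[, .SD[which.min(v)], by = g]

# .I-based: collect the winning row numbers, then subset the whole table once
dt[dt[, .I[which.min(v)], by = g]$V1]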

hadley commented Jan 31, 2021

Right, if I could use this approach for grouped filtering, I'd no longer need a separate subsetting step. But using this approach is going to be tricky when the lazy dt has already (e.g.) been mutated. I think it'd need an intermediate assignment, which would somehow need to be hoisted to the top if the result was used in a join. I don't think that's impossible to overcome, but it's definitely a decent amount of work.

hadley commented Jan 31, 2021

To make that concrete:

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

dt1 <- lazy_dt(data.frame(g = 1:2, x = 1:2))

dt1 %>% 
  mutate(y = 1) %>% 
  group_by(g) %>% 
  filter(sum(x + y) > 2) %>% 
  show_query()
#> copy(`_DT1`)[, `:=`(y = 1)][, .SD[sum(x + y) > 2], keyby = .(g)]

Would need to be translated to (I think):

DT2 <- copy(`_DT1`)[, `:=`(y = 1)]
DT2[DT2[, .I[sum(x + y) > 2], keyby = .(g)]$V1]

And if you added a join on the end:

dt1 %>% 
  mutate(y = 1) %>% 
  group_by(g) %>% 
  filter(sum(x + y) > 2) %>% 
  left_join(dt2 %>% mutate(z = 1)) %>%   
  show_query()
#> copy(`_DT9`)[, `:=`(z = 1)][copy(`_DT8`)[, `:=`(y = 1)][, .SD[sum(x + 
#>    y) > 2], keyby = .(g)], on = .(x), allow.cartesian = TRUE]

Then the assignment still has to come first:

DT2 <- copy(`_DT1`)[, `:=`(y = 1)]
copy(`_DT2`)[, `:=`(z = 1)][DT2[DT2[, .I[sum(x + y) > 2], keyby = .(g)]$V1], on = .(x), allow.cartesian = TRUE]

Although if dtplyr supported intermediate assignment, you'd probably want to use another one to improve readability:

DT2 <- copy(`_DT1`)[, `:=`(y = 1)]
DT3 <- DT2[DT2[, .I[sum(x + y) > 2], keyby = .(g)]$V1]
copy(`_DT2`)[, `:=`(z = 1)][DT3, on = .(x), allow.cartesian = TRUE]

hadley commented Jan 31, 2021

Hmmm, maybe this wouldn't be too hard to implement? dtplyr_step() would need a new assignments field which step_assign() would use. The dt_assignments() generic would work like dt_sources() (aggregating across both parents for the two-table verbs), and dt_eval() would execute the assignments before evaluating the dt_call(). The print method would also need to show them.

Would need to think about how to generate the names for the temporary objects, and confirm that we can still collapse a select() after a grouped filter/slice.
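
Very roughly, I'm imagining something like this (just a sketch; dt_assignments() doesn't exist yet and the environment handling is hand-wavy):

dt_eval <- function(x) {
  env <- list2env(dt_sources(x))

  # evaluate any intermediate assignments first, so the main query
  # can refer to the temporary objects by name
  assignments <- dt_assignments(x)
  for (name in names(assignments)) {
    assign(name, eval(assignments[[name]], env), envir = env)
  }

  eval(dt_call(x), env)
}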

hadley commented Feb 1, 2021

Have this working:

dt %>% 
  group_by(cyl) %>%
  filter(vs > mean(vs)) %>% 
  select(mpg:cyl) %>%
  show_query()
#>  `_DT2`[`_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1, .(mpg, cyl)]

If the grouped filter happens on top of existing computation, it first creates an intermediate variable:

dt %>% 
  filter(cyl > 4) %>% 
  group_by(cyl) %>%
  filter(vs > mean(vs)) %>% 
  select(mpg:cyl) 
#> _DT2 <- `_DT2`[cyl > 4]
#> `_DT2`[`_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1, .(mpg, cyl)]

hadley commented Feb 1, 2021

I guess in this case I don't strictly need the intermediate variable, but using it doesn't seem too harmful.

myoung3 commented Feb 1, 2021

I think you do in fact need the intermediate variable to avoid doing the cyl > 4 subset twice. The following would not work, because the indices returned by `_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)] wouldn't correspond to the row positions of `_DT2`[cyl > 4]:

`_DT2`[cyl > 4][`_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1, .(mpg, cyl)]
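
A tiny example of the mismatch (toy data, just to make the point concrete):

library(data.table)
dt <- data.table(cyl = c(4, 6, 6), vs = c(0, 1, 0))

# row numbers computed against the *full* table
idx <- dt[, .I[vs > mean(vs)], keyby = .(cyl)]$V1
idx
#> [1] 2

# applied to the subset, index 2 picks original row 3 (vs = 0),
# not the row that actually passed the grouped filter
dt[cyl > 4][idx]
#>    cyl vs
#> 1:   6  0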

Defining the intermediate variable is still fast. Interfacing with dtplyr has made me think about how data.table might optimize its own .SD[i] subsets better. More timings here: Rdatatable/data.table#4886

It actually turns out that .SD[1L] and last(.SD) with by are optimized, but they still aren't as fast as the .I solution.

myoung3 commented Feb 1, 2021

Which means that the .I approach is the best, regardless of whether it's an optimized .SD[i] statement or not. (I was worried we'd need different dtplyr logic depending on whether .SD[i] was optimized or not but the timings in the data.table PR show it's not a concern.)

hadley commented Feb 1, 2021

@myoung3 oh yeah, good point about the indices changing. I guess you could maybe do something like

`_DT2`[intersect(which(cyl > 4), `_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1), .(mpg, cyl)]

but that doesn't have any benefits over the intermediate approach.

myoung3 commented Feb 1, 2021

It occurred to me to worry about the intermediate table being a reference copy when the subset results in an identity subset (i.e. every row selected), but that doesn't seem to be the case:

library(data.table)
dt <- data.table(x=1:10)
dt2 <- dt[1:10]
dt3 <- dt[1:5]

tracemem(dt)
#> [1] "<0000000015543880>"
tracemem(dt2)
#> [1] "<0000000015BF5510>"
tracemem(dt3)
#> [1] "<0000000015218D18>"

dt2[, test:=1L]
dt
#>      x
#>  1:  1
#>  2:  2
#>  3:  3
#>  4:  4
#>  5:  5
#>  6:  6
#>  7:  7
#>  8:  8
#>  9:  9
#> 10: 10

Created on 2021-02-01 by the reprex package (v0.3.0)

For our purposes this is good and simplifies things (since mutability is not the default for dtplyr).

hadley commented Feb 2, 2021

@myoung3 fwiw even if that was a problem, dtplyr objects already contain all the metadata needed to automatically create a single copy when needed.
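
e.g. you can already see that machinery in the existing translations (the exact `_DT` name will vary by session):

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

lazy_dt(data.frame(x = 1)) %>%
  mutate(y = 2) %>%
  show_query()
#> copy(`_DT1`)[, `:=`(y = 2)]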
