Improve translation of grouped filter() #176

Closed
myoung3 opened this issue Jan 30, 2021 · 14 comments · Fixed by #177

myoung3 commented Jan 30, 2021

Hey Hadley,
Thanks for your recent work on this package and for fixing #127. I'm excited about this package and specifically interested in edge cases where dtplyr code doesn't translate to optimal data.table code. Below I've documented the performance difference between how #127 currently gets solved (as of c31d3d0) and the optimal data.table query (the solution ColeMiller1 posted in #127). It's probably hard to implement the optimal solution right now, but maybe in the future?

library(dtplyr)
library(data.table)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
ncol <- 100
n <- 1e5
dt1 <- as.data.table(replicate(ncol, rnorm(n)))
dt1[, g:=1:.N]
dt1[, x:=sample(1:2,size=n,replace=TRUE)]

dt2_dtplyr <- dt1 %>%
  lazy_dt() %>%
  group_by(g) %>%
  filter(sum(x) == 1) %>% 
  select(g, x)

show_query(dt2_dtplyr)
#> `_DT1`[, .SD[sum(x) == 1], keyby = .(g)][, .(g, x)]
  
system.time(
  dt2_dtplyr <- dt2_dtplyr %>%
  as.data.table()
)
#>    user  system elapsed 
#>   25.86    0.07   26.50

system.time({
  dt2_dt <- dt1[dt1[,.I[sum(x)==1],by="g"]$V1,list(g,x)]  
  setkey(dt2_dt,g) #to match the key created by dtplyr approach
})
#>    user  system elapsed 
#>     0.2     0.0     0.2

identical(dt2_dt, dt2_dtplyr)
#> [1] TRUE



## here's the dplyr timing for reference:
dt1 <- as_tibble(dt1)
system.time({
  dt2_dplyr <- dt1 %>%
    group_by(g) %>%
    filter(sum(x) == 1) %>% 
    select(g, x)
  
  setDT(dt2_dplyr)
  setattr(dt2_dplyr, "groups",NULL) #drop the tibble grouping so identical test passes
  setkey(dt2_dplyr, g)
  
})
#>    user  system elapsed 
#>    0.72    0.00    0.74

identical(dt2_dplyr, dt2_dt)
#> [1] TRUE

Created on 2021-01-30 by the reprex package (v0.3.0)

I can probably think of more examples like this (where the translated query is suboptimal). Let me know if that would be useful. Thanks again for your work on the package.

hadley commented Jan 30, 2021

I don't think this has anything to do with select(); you're actually proposing a more efficient translation for a grouped filter()?

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
ncol <- 100
n <- 1e4
dt1 <- data.table::as.data.table(replicate(ncol, rnorm(n)))
dt1[, g:=1:.N]
dt1[, x:=sample(1:2,size=n,replace=TRUE)]

dt1 %>%
  lazy_dt() %>%
  group_by(g) %>%
  filter(sum(x) == 1) %>% 
  show_query()
#> `_DT1`[, .SD[sum(x) == 1], keyby = .(g)]

bench::mark(
  dt1[, .SD[sum(x) == 1], keyby = .(g)],
  dt1[dt1[,.I[sum(x) == 1], keyby = .(g)]$V1],
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                                       min median `itr/sec`
#>   <bch:expr>                                   <bch:t> <bch:>     <dbl>
#> 1 dt1[, .SD[sum(x) == 1], keyby = .(g)]          1.25s  1.25s     0.802
#> 2 dt1[dt1[, .I[sum(x) == 1], keyby = .(g)]$V1] 14.69ms 16.3ms    53.8  
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

Created on 2021-01-30 by the reprex package (v0.3.0.9001)

Modulo slight differences in the column order, which are (I think) caused by keyby moving the keys to the left?
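
For example, with a small toy table (just to illustrate the column-order point, not part of the benchmark above):

library(data.table)
dt <- data.table(a = 1:4, g = c(1, 1, 2, 2), x = c(1, 0, 2, 1))

# keyby + .SD moves the grouping column to the front (and keys the result)
names(dt[, .SD[sum(x) == 1], keyby = .(g)])
#> [1] "g" "a" "x"

# the .I subset keeps the original column order (and leaves the result unkeyed)
names(dt[dt[, .I[sum(x) == 1], keyby = .(g)]$V1])
#> [1] "a" "g" "x"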

hadley changed the title from "Speed of select() after grouped-filter()" to "Improve translation of grouped filter()" on Jan 30, 2021

hadley commented Jan 31, 2021

This approach would presumably also be more efficient for grouped slices.
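
i.e. something along these lines for a grouped slice(1) (just a sketch of the two translations, not benchmarked here):

library(data.table)
dt <- data.table(g = rep(1:3, each = 2), x = 1:6)

# current style: materialise .SD for every group, then take its first row
dt[, .SD[1L], keyby = .(g)]

# .I style: compute the row numbers once, then do a single subset
dt[dt[, .I[1L], keyby = .(g)]$V1]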

myoung3 commented Jan 31, 2021

Hmm, you're right, it doesn't seem to be related to select() at all. I thought it was slow because the .SD subset was taking all columns first and then subsequently taking the selected columns, but it seems like .SD subsets are just slow when there are lots of groups. Related: Rdatatable/data.table#735

For what it's worth, combining the filter and select statements into a single [.data.table call has less memory overhead than chaining them, even though the timings are about the same:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
ncol <- 100
n <- 1e7
dt1 <- as.data.table(replicate(ncol, rnorm(n)))
dt1[, g:=1:.N]
dt1[, x:=sample(1:2,size=n,replace=TRUE)]


gc()
# baseline memory usage: ~8000 MB

system.time(dt1[dt1[,.I[sum(x)==1],by="g"]$V1,list(g,x)])
#user  system elapsed 
#17.984   0.284  15.243
# peak memory, according to htop: ~8100 MB
gc()
system.time(dt1[dt1[,.I[sum(x)==1],by="g"]$V1][,list(g,x)])
#user  system elapsed 
#19.528   4.132  15.236 
# peak memory, according to htop: ~11.7 GB

(peak-memory readings via htop, since I don't trust in-language memory profiling, especially for packages that could potentially do their own memory allocation outside of R's C API)

It makes sense that the memory usage of those two calls differs, since the second one needs to allocate a copy of the entire filtered table (i.e. all columns), whereas the first approach never allocates memory for columns that aren't used.

Anyway, back to the speed of the .I approach vs the .SD subset approach: yes, I think you're right that this speedup would apply to slicing as well. Here's someone commenting on the speed difference between the .I approach and the .SD approach for something like slice_min(): https://stackoverflow.com/questions/24070714/extract-row-corresponding-to-minimum-value-of-a-variable-by-group/41838383#41838383
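
For a slice_min()-style operation the same idea would look roughly like this (a small sketch along the lines of that answer):

library(data.table)
dt <- data.table(g = c(1, 1, 2, 2), v = c(3, 1, 5, 2))

# .SD-based: subset within each group
dt[, .SD[which.min(v)], by = g]

# .I-based: collect the winning row numbers, then subset the whole table once
dt[dt[, .I[which.min(v)], by = g]$V1]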

hadley commented Jan 31, 2021

Right, if I could use this approach for grouped filtering, I'd no longer need a separate subsetting step. But using this approach is going to be tricky when the lazy dt has already (e.g.) been mutated. I think it'd need an intermediate assignment, which would somehow need to be hoisted to the top if the result was used in a join. I don't think that's impossible to overcome, but it's definitely a decent amount of work.

hadley commented Jan 31, 2021

To make that concrete:

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

dt1 <- lazy_dt(data.frame(g = 1:2, x = 1:2))

dt1 %>% 
  mutate(y = 1) %>% 
  group_by(g) %>% 
  filter(sum(x + y) > 2) %>% 
  show_query()
#> copy(`_DT1`)[, `:=`(y = 1)][, .SD[sum(x + y) > 2], keyby = .(g)]

Would need to be translated to (I think):

DT2 <- copy(`_DT1`)[, `:=`(y = 1)]
DT2[DT2[, .I[sum(x + y) > 2], keyby = .(g)]$V1]

And if you added a join on the end:

dt1 %>% 
  mutate(y = 1) %>% 
  group_by(g) %>% 
  filter(sum(x + y) > 2) %>% 
  left_join(dt2 %>% mutate(z = 1)) %>%   
  show_query()
#> copy(`_DT9`)[, `:=`(z = 1)][copy(`_DT8`)[, `:=`(y = 1)][, .SD[sum(x + 
#>    y) > 2], keyby = .(g)], on = .(x), allow.cartesian = TRUE]

Then the assignment still has to come first:

DT2 <- copy(`_DT1`)[, `:=`(y = 1)]
copy(`_DT2`)[, `:=`(z = 1)][DT2[DT2[, .I[sum(x + y) > 2], keyby = .(g)]$V1], on = .(x), allow.cartesian = TRUE]

Although if dtplyr supported intermediate assignment, you'd probably want to use another one to improve readability:

DT2 <- copy(`_DT1`)[, `:=`(y = 1)]
DT3 <- DT2[DT2[, .I[sum(x + y) > 2], keyby = .(g)]$V1]
copy(`_DT2`)[, `:=`(z = 1)][DT3, on = .(x), allow.cartesian = TRUE]

hadley commented Jan 31, 2021

Hmmm, maybe this wouldn't be too hard to implement? dtplyr_step() would need a new assignments field which step_assign() would use. The dt_assignments() generic would work like dt_sources() (aggregating across both parents for the two-table verbs), and dt_eval() would execute the assignments before evaluating the dt_call(). The print method would also need to show them.

Would need to think about how to generate the names for the temporary objects, and confirm that we can still collapse a select() after a grouped filter/slice.
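
Very roughly, I'm imagining something like this (just a sketch; dt_assignments() doesn't exist yet and the environment handling is hand-wavy):

dt_eval <- function(x) {
  env <- list2env(dt_sources(x))

  # evaluate any intermediate assignments first, so the main query
  # can refer to the temporary objects by name
  assignments <- dt_assignments(x)
  for (name in names(assignments)) {
    assign(name, eval(assignments[[name]], env), envir = env)
  }

  eval(dt_call(x), env)
}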

hadley commented Feb 1, 2021

Have this working:

dt %>% 
  group_by(cyl) %>%
  filter(vs > mean(vs)) %>% 
  select(mpg:cyl) %>%
  show_query()
#>  `_DT2`[`_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1, .(mpg, cyl)]

If the grouped filter happens on top of existing computation, it first creates an intermediate variable:

dt %>% 
  filter(cyl > 4) %>% 
  group_by(cyl) %>%
  filter(vs > mean(vs)) %>% 
  select(mpg:cyl) 
#> _DT2 <- `_DT2`[cyl > 4]
#> `_DT2`[`_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1, .(mpg, cyl)]

hadley commented Feb 1, 2021

I guess in this case I don't strictly need the intermediate variable, but using it doesn't seem too harmful.

myoung3 commented Feb 1, 2021

I think you do in fact need the intermediate variable to avoid doing the cyl > 4 subset twice. The following would not work, because the indices returned by `_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)] wouldn't correspond to the row positions of `_DT2`[cyl > 4]:

`_DT2`[cyl > 4][`_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1, .(mpg, cyl)]
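
A tiny example of the mismatch (toy data, just to make the point concrete):

library(data.table)
dt <- data.table(cyl = c(4, 6, 6), vs = c(0, 1, 0))

# row numbers computed against the *full* table
idx <- dt[, .I[vs > mean(vs)], keyby = .(cyl)]$V1
idx
#> [1] 2

# applied to the subset, index 2 picks original row 3 (vs = 0),
# not the row that actually passed the grouped filter
dt[cyl > 4][idx]
#>    cyl vs
#> 1:   6  0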

Defining the intermediate variable is still fast. Interfacing with dtplyr has made me think about how data.table might optimize its own .SD[i] subsets better. More timings here: Rdatatable/data.table#4886

It actually turns out that .SD[1L] and last(.SD) with by are optimized, but they still aren't as fast as the .I solution.

myoung3 commented Feb 1, 2021

Which means that the .I approach is the best, regardless of whether it's an optimized .SD[i] statement or not. (I was worried we'd need different dtplyr logic depending on whether .SD[i] was optimized or not but the timings in the data.table PR show it's not a concern.)

hadley commented Feb 1, 2021

@myoung3 oh yeah, good point about the indices changing. I guess you could maybe do something like

`_DT2`[intersect(which(cyl > 4), `_DT2`[, .I[vs > mean(vs)], keyby = .(cyl)]$V1), .(mpg, cyl)]

but that doesn't have any benefits over the intermediate approach.

myoung3 commented Feb 1, 2021

It occurred to me to worry about the intermediate table being a reference copy when the subset results in an identity subset (i.e. every row selected), but that doesn't seem to be the case:

library(data.table)
dt <- data.table(x=1:10)
dt2 <- dt[1:10]
dt3 <- dt[1:5]

tracemem(dt)
#> [1] "<0000000015543880>"
tracemem(dt2)
#> [1] "<0000000015BF5510>"
tracemem(dt3)
#> [1] "<0000000015218D18>"

dt2[, test:=1L]
dt
#>      x
#>  1:  1
#>  2:  2
#>  3:  3
#>  4:  4
#>  5:  5
#>  6:  6
#>  7:  7
#>  8:  8
#>  9:  9
#> 10: 10

Created on 2021-02-01 by the reprex package (v0.3.0)

For our purposes this is good and simplifies things (since mutability is not the default for dtplyr).

hadley commented Feb 2, 2021

@myoung3 fwiw even if that was a problem, dtplyr objects already contain all the metadata needed to automatically create a single copy when needed.
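
e.g. you can already see that machinery in the existing translations (the exact `_DT` name will vary by session):

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

lazy_dt(data.frame(x = 1)) %>%
  mutate(y = 2) %>%
  show_query()
#> copy(`_DT1`)[, `:=`(y = 2)]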
