Generate data tables lazily #38

Closed
shapenaji opened this issue Dec 12, 2016 · 15 comments
Labels
feature (a feature request or enhancement), performance 🚀

Comments

@shapenaji

shapenaji commented Dec 12, 2016

I originally posted this as a dplyr issue (since it's technically a different way to do a dplyr calculation: a way to integrate data.table rather than support it), but I'm migrating it here.

I'm an avid user of data.table, but dplyr has a much more accessible syntax, and when I was first learning, it was dplyr that made that possible. However, with larger tables its usage became harder to justify, so I switched to data.table.

If you have two technologies which accomplish the same things, and one is faster, but the other is more readable, you should be able to wrap the fast one to produce the readable one.

So I attempted to make the dplyr verbs construct a data.table call, and added one additional verb, calculate(), which evaluates the current state of the call (mirroring data.table's ability to do several things in a single step).

It's still extremely rough (it only supports the basics), but the actual construction of the call isn't all that messy; dplyr and data.table syntax are really quite close to one another.

I'm wondering if something like this could work (I'm fully aware that I'm overwriting the verb functions; that needs to change, but this was only intended as a proof of concept):

Function definitions:

library(magrittr)
library(data.table)

# Calculate the current version of the call
calculate <- function(.call) {
  .call <- check_call(.call)
  out <- do.call(function(data,...) `[`(data,...), .call)
  out
}

# If it's a data object, turn it into a list containing a copy of that object
check_call <- function(.call) {
  if (is.data.table(.call) || is.data.frame(.call)) {
    .call <- list(data = setDT(copy(.call)))
  }
  .call
}

# Extract the data portion of the call, and pass the arguments down to group_by_
group_by <- function(.call,...) {
  .call <- check_call(.call)

  call <- sys.call()[-2]
  call[1] <- expression(group_by_)
  c(.call,eval(call))
}

# take the group variables, use data.table `.()` syntax, assign to `by`
group_by_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(.)
  list(by = call)
}

# Extract the data portion of the call, and pass the arguments down to mutate_
mutate <- function(.call,...) {
  .call <- check_call(.call)

  call <- sys.call()[-2]
  call[1] <- expression(mutate_)
  c(.call, eval(call))
}

# take the assignment columns, use `:=` syntax, assign to `j`
mutate_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(`:=`)
  list(j = call)
}

# Extract the data portion of the call, and pass the arguments down to summarise_
summarise <- function(.call,...) {
  .call <- check_call(.call)

  call <- sys.call()[-2]
  call[1] <- expression(summarise_)
  c(.call, eval(call))
}

# take the assignment columns, use `.()` syntax, assign to `j`
summarise_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(`.`)
  list(j = call)
}

# Extract the data portion of the call, and pass the arguments down to filter_
filter <- function(.call,...) {
  .call <- check_call(.call)

  call <- sys.call()[-2]
  call[1] <- expression(filter_)
  c(.call, eval(call))
}

# take the filter columns, use `()` syntax, assign to `i`
filter_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(`(`)
  list(i = call)
}

# Evaluates the expression up to this point, then selects columns
select <- function(.call,...) {
  call <- sys.call()[-2]
  call[1] <- expression(`.`)
  do.call(function(...) `[`(calculate(.call),...), list(j = call))
}

# Evaluates the expression up to this point, then arranges rows
arrange <- function(.call,...) {
  call <- sys.call()[-2]
  call[1] <- expression(`order`)
  do.call(function(...) `[`(calculate(.call),...), list(i = call))
}

Example:

mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl, gear) %>%
  summarise(avgMPG = mean(mpg)) %>%
  calculate

Output:

   cyl gear avgMPG
1:   6    4 21.000
2:   4    4 26.925
3:   6    3 21.400
4:   4    3 21.500
5:   4    5 28.200
@asardaes

asardaes commented May 6, 2019

Over the weekend I played with this idea, but using rlang. Funnily enough, I did it before finding this repo, and even used the same name! Here's what I have so far. It seems doable, although it feels a bit hacky at the moment.

EDIT: changed the name to avoid confusion.

@hadley added the feature label May 25, 2019
@hadley changed the title from "Assembling a DT[i, j, by] with dplyr verbs" to "Generate data tables lazily" May 25, 2019
@hadley

hadley commented May 25, 2019

This would make dtplyr behave much more similarly to dbplyr. The advantage of working lazily is that you can get much faster performance, because the query optimiser in the underlying implementation has more information to work with. The disadvantage is that you now need an explicit collect() to trigger computation and return the result.

Changing this behaviour in dtplyr is also likely to break existing code, but I don't think dtplyr is used very frequently, and the break might be worth it because it would yield substantially improved performance.

@hadley

hadley commented Jun 13, 2019

I'm a little surprised at how much difference this makes, even for the simple example above:

library(data.table)

dt <- data.table(mtcars[rep(1:32, times = 1e4),])

bench::mark(
  two = dt[cyl > 5,][, .(mpg = mean(mpg)), by = c("cyl", "gear")],
  one = dt[cyl > 5, .(mpg = mean(mpg)), by = c("cyl", "gear")] 
)[1:6]
#> # A tibble: 2 x 6
#>   expression      min     mean   median      max `itr/sec`
#>   <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl>
#> 1 two          40.9ms   45.5ms   46.8ms   49.3ms      22.0
#> 2 one          19.1ms   22.9ms   23.5ms   24.8ms      43.7

Created on 2019-06-13 by the reprex package (v0.2.1.9000)

That implies switching to a lazy approach is probably more important than I anticipated.

@asardaes

I'm not sure if it's entirely relevant for dtplyr, but just FYI, I did end up releasing table.express. It's not really a one-to-one match to dplyr, but it's based on it. I'm currently working on the joining verbs.

@hadley

hadley commented Jun 14, 2019

@asardaes thanks for sharing! Do you have any thoughts on when you need to start a new [? I was thinking you just need to track i and j and then when both have been filled, you start a new one.

I also realised that laziness is the only way dtplyr will be able to do the minimal number of copies — a pipeline can track if a mutate is used, and if it is, create a single copy just before beginning the transformation.
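
The copy-tracking idea can be sketched in plain R (hypothetical `lazy_collect` helper and step representation for illustration only; this is not dtplyr's actual implementation):

```r
library(data.table)

# Sketch: queue verbs as steps, and copy the input table once,
# only if some step mutates by reference.
lazy_collect <- function(data, steps) {
  mutates <- vapply(steps, function(s) identical(s$verb, "mutate"), logical(1))
  out <- if (any(mutates)) copy(data) else data
  for (s in steps) out <- s$fn(out)
  out
}

dt <- data.table(x = 1:3)
steps <- list(
  list(verb = "mutate", fn = function(d) d[, y := x * 2][]),
  list(verb = "filter", fn = function(d) d[y > 2])
)
res <- lazy_collect(dt, steps)
res$x  # 2 3 (rows where y > 2)
dt$y   # NULL: the mutate ran on the copy, not the original
```

Because `:=` modifies in place, a single up-front copy protects the caller's table while still letting every subsequent step work by reference.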

@asardaes

I keep track of i, j and by, but I start a new [ only when i or j are being replaced, so right before a verb tries to add an expression to a "slot" that's already set.
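
The slot rule described above can be sketched with strings for clarity (hypothetical helper names; table.express itself tracks unevaluated expressions rather than strings):

```r
# One DT[i, j, by] frame with three slots.
new_frame <- function() list(i = NULL, j = NULL, by = NULL)

# Fill a slot; if `i` or `j` is already occupied, start a new `[` frame.
set_slot <- function(frames, slot, value) {
  n <- length(frames)
  if (slot %in% c("i", "j") && !is.null(frames[[n]][[slot]])) {
    frames <- c(frames, list(new_frame()))
    n <- n + 1
  }
  frames[[n]][[slot]] <- value
  frames
}

# Render the accumulated frames as chained data.table code.
render <- function(dataname, frames) {
  for (f in frames) {
    args <- paste(c(if (is.null(f$i)) "" else f$i,
                    f$j,
                    if (is.null(f$by)) NULL else paste0("by = ", f$by)),
                  collapse = ", ")
    dataname <- paste0(dataname, "[", args, "]")
  }
  dataname
}

frames <- list(new_frame())
frames <- set_slot(frames, "i",  "cyl > 5")             # filter()
frames <- set_slot(frames, "by", ".(cyl, gear)")        # group_by()
frames <- set_slot(frames, "j",  ".(mpg = mean(mpg))")  # summarise()
frames <- set_slot(frames, "i",  "mpg > 21")            # filter(): i taken, new `[`
code <- render("dt", frames)
code
# "dt[cyl > 5, .(mpg = mean(mpg)), by = .(cyl, gear)][mpg > 21]"
```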

@hadley

hadley commented Jun 14, 2019

Yeah, that's what I was thinking — thanks for confirming!

@hadley

hadley commented Jun 14, 2019

How do you convert df %>% mutate(x2 = x * 2, x4 = x2 * 2)? dt[, x2 := x * 2][, x4 := x2 * 2][] ?

@asardaes

asardaes commented Jun 14, 2019

That basically wouldn't work in my case. I try to guess as little as possible, so if something requires two [ in data.table notation, it requires two mutate calls. Since it's a modification by reference anyway, I suppose the overhead is negligible, if there is any at all. It could indeed be converted as you showed, but analyzing the expressions to figure out whether they use newly created variables could get messy, I think.
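
For reference, the chained form works because each `[` sees the columns the previous one created, while a single `:=` call cannot reference a column it is itself creating:

```r
library(data.table)

dt <- data.table(x = 1:3)

# Two chained modify-by-reference calls: the second `[` sees x2 because
# the first has already added it. The trailing [] returns the updated
# table (`:=` alone returns it invisibly).
res <- dt[, x2 := x * 2][, x4 := x2 * 2][]
res$x4  # 4 8 12

# The single-call form typically fails, because x2 is not yet a column
# when the expression for x4 is evaluated:
# dt[, `:=`(x2 = x * 2, x4 = x2 * 2)]  # object 'x2' not found
```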

@hadley

hadley commented Jun 14, 2019

I am already doing that in dbplyr, so I could port the approach. But I suspect it's not so important here because the mutate-in-place should avoid most of the performance penalty (whereas generating SQL subqueries is expensive).


@hadley

hadley commented Jun 18, 2019

One other thing to bear in mind: i is executed before j, so you can't collapse a filter after a summarise:

df %>% group_by(g) %>% summarise(x = sum(x)) %>% filter(x > 10)

# becomes
df[, .(x = sum(x)), by = g][x > 10]
# not
df[x > 10, .(x = sum(x)), by = g]

That means you always have to start a new [ after j is filled.
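
The difference is easy to see with a small table (illustrative data):

```r
library(data.table)

df <- data.table(g = c(1, 1, 2, 2), x = c(6, 7, 2, 3))

# j runs after i, so the filter on the summary must go in a second `[`:
a <- df[, .(x = sum(x)), by = g][x > 10]

# Collapsing the filter into i would filter the *input rows* before
# summarising, which answers a different question:
b <- df[x > 10, .(x = sum(x)), by = g]

a  # one row: g = 1, x = 13
b  # empty: no single input row has x > 10
```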

@DavidArenburg

DavidArenburg commented Jun 19, 2019

I guess this will be resolved when (and if?) Rdatatable/data.table#788 is implemented

@hadley

hadley commented Jun 25, 2019

@hadley closed this as completed in 39c630e Jun 27, 2019
@asardaes

asardaes commented Aug 2, 2019

I don't know if this will be an issue for dtplyr, but I think it might, and maybe you can give me some insight. Here's something I found based on asardaes/table.express#20.

If a package defines a data.table method for dplyr verbs, things might break. Take for example:

irisDT <- as.data.table(iris) %>%
  .[, nest(.SD), by = Species]

irisDT[, unnest(.SD, data)]

The data.frame method for tidyr::unnest has a step with the call: dplyr::transmute(dplyr::ungroup(data), !!!quos). Usually, transmute dispatches to the data.frame method from dplyr and it works. If a data.table method for transmute is registered, and that is now chosen within unnest, it will stop working.

If you do options(datatable.verbose = TRUE), you'll see something starting with cedta decided 'tidyr' wasn't data.table aware. Here is call stack with [[1L]] applied:. I don't know the details, but indeed any package that doesn't explicitly import data.table into its namespace cannot execute data.table operations.

So any package that uses dplyr generics which dispatch to data.table won't work if said package doesn't also import data.table?
