Generate data tables lazily #38

Closed
shapenaji opened this issue Dec 12, 2016 · 15 comments
Labels
feature (a feature request or enhancement), performance 🚀

Comments

@shapenaji

shapenaji commented Dec 12, 2016

I originally posted this as a dplyr issue (since it's technically a different way to do a dplyr calculation: a way to integrate data.table rather than support it), but I'm migrating it here.

I'm an avid user of data.table, but dplyr has a much more accessible syntax, and when I was first learning, it was dplyr that made that possible. However, with larger tables its usage became harder to justify, so I switched to data.table.

If you have two technologies which accomplish the same things, and one is faster, but the other is more readable, you should be able to wrap the fast one to produce the readable one.

So I attempted to make the dplyr verbs construct a data.table call, and added one additional verb, calculate(), which evaluates the current state of the call (mirroring data.table's ability to do several things in a single step).

It's still extremely rough (it only supports the basics), but the actual construction of the call isn't all that messy; dplyr and data.table syntax are really quite close to one another.

I'm wondering if something like this could work (I'm fully aware that I'm overwriting the verb functions; that needs to change, but this was only intended as a proof of concept):

Function definitions:

library(magrittr)
library(data.table)

# Calculate the current version of the call
calculate <- function(.call) {
  .call <- check_call(.call)
  out <- do.call(function(data,...) `[`(data,...), .call)
  out
}

# If it's a data object, turn it into a list containing a copy of that object
check_call <- function(.call) {
  if (is.data.table(.call) || is.data.frame(.call)) {
    .call <- list(data = setDT(copy(.call)))
  }
  .call
}

# Extract the data portion of the call, and pass the arguments down to group_by_
group_by <- function(.call,...) {
  .call <- check_call(.call)

  call <- sys.call()[-2]
  call[1] <- expression(group_by_)
  c(.call,eval(call))
}

# take the group variables, use data.table `.()` syntax, assign to `by`
group_by_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(.)
  list(by = call)
}

# Extract the data portion of the call, and pass the arguments down to mutate_
mutate <- function(.call,...) {
  .call <- check_call(.call)

  call <- sys.call()[-2]
  call[1] <- expression(mutate_)
  c(.call, eval(call))
}

# take the assignment columns, use `:=` syntax, assign to `j`
mutate_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(`:=`)
  list(j = call)
}

# Extract the data portion of the call, and pass the arguments down to summarise_
summarise <- function(.call,...) {
  .call <- check_call(.call)

  call <- sys.call()[-2]
  call[1] <- expression(summarise_)
  c(.call, eval(call))
}

# take the assignment columns, use `.()` syntax, assign to `j`
summarise_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(`.`)
  list(j = call)
}

# Extract the data portion of the call, and pass the arguments down to filter_
filter <- function(.call,...) {
  .call <- check_call(.call)

  call <- sys.call()[-2]
  call[1] <- expression(filter_)
  c(.call, eval(call))
}

# take the filter columns, use `()` syntax, assign to `i`
filter_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(`(`)
  list(i = call)
}

# Evaluates the expression up to this point, then selects columns
select <- function(.call,...) {
  call <- sys.call()[-2]
  call[1] <- expression(`.`)
  do.call(function(...) `[`(calculate(.call),...), list(j = call))
}

# Evaluates the expression up to this point, then arranges rows
arrange <- function(.call,...) {
  call <- sys.call()[-2]
  call[1] <- expression(`order`)
  do.call(function(...) `[`(calculate(.call),...), list(i = call))
}

Example:

mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl, gear) %>%
  summarise(avgMPG = mean(mpg)) %>%
  calculate

Output:

   cyl gear avgMPG
1:   6    4 21.000
2:   4    4 26.925
3:   6    3 21.400
4:   4    3 21.500
5:   4    5 28.200
@asardaes

asardaes commented May 6, 2019

Over the weekend I played with this idea, but using rlang. Funnily enough, I did it before finding this repo, and even used the same name! Here's what I have so far. It seems doable, although it feels a bit hacky at the moment.

EDIT: changed the name to avoid confusion.

@hadley added the feature label May 25, 2019
@hadley changed the title from "Assembling a DT[i, j, by] with dplyr verbs" to "Generate data tables lazily" May 25, 2019
@hadley

hadley commented May 25, 2019

This would make dtplyr behave much more similarly to dbplyr. The advantage of working lazily is that you can get much faster performance, because the query optimiser in the underlying implementation has more information to work with. The disadvantage is that you now need an explicit collect() to trigger computation and return the result.

Changing this behaviour in dtplyr is also likely to break existing code, but I don't think dtplyr is used very frequently, and the break might be worth it because it would yield substantially improved performance.

@hadley

hadley commented Jun 13, 2019

I'm a little surprised at how much difference this makes, even for the simple example above:

library(data.table)

dt <- data.table(mtcars[rep(1:32, times = 1e4),])

bench::mark(
  two = dt[cyl > 5,][, .(mpg = mean(mpg)), by = c("cyl", "gear")],
  one = dt[cyl > 5, .(mpg = mean(mpg)), by = c("cyl", "gear")] 
)[1:6]
#> # A tibble: 2 x 6
#>   expression      min     mean   median      max `itr/sec`
#>   <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl>
#> 1 two          40.9ms   45.5ms   46.8ms   49.3ms      22.0
#> 2 one          19.1ms   22.9ms   23.5ms   24.8ms      43.7

Created on 2019-06-13 by the reprex package (v0.2.1.9000)

That implies switching to a lazy approach is probably more important than I anticipated.

@asardaes

I'm not sure if it's entirely relevant for dtplyr, but just FYI, I did end up releasing table.express. It's not really a one-to-one match to dplyr, but it's based on it. I'm currently working on the joining verbs.

@hadley

hadley commented Jun 14, 2019

@asardaes thanks for sharing! Do you have any thoughts on when you need to start a new [? I was thinking you just need to track i and j and then when both have been filled, you start a new one.

I also realised that laziness is the only way dtplyr will be able to do the minimal number of copies — a pipeline can track if a mutate is used, and if it is, create a single copy just before beginning the transformation.
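
The copy-tracking idea can be sketched in plain R (hypothetical `lazy_collect` helper and step representation for illustration only; this is not dtplyr's actual implementation):

```r
library(data.table)

# Sketch: queue verbs as steps, and copy the input table once,
# only if some step mutates by reference.
lazy_collect <- function(data, steps) {
  mutates <- vapply(steps, function(s) identical(s$verb, "mutate"), logical(1))
  out <- if (any(mutates)) copy(data) else data
  for (s in steps) out <- s$fn(out)
  out
}

dt <- data.table(x = 1:3)
steps <- list(
  list(verb = "mutate", fn = function(d) d[, y := x * 2][]),
  list(verb = "filter", fn = function(d) d[y > 2])
)
res <- lazy_collect(dt, steps)
res$x  # 2 3 (rows where y > 2)
dt$y   # NULL: the mutate ran on the copy, not the original
```

Because `:=` modifies in place, a single up-front copy protects the caller's table while still letting every subsequent step work by reference.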

@asardaes

I keep track of i, j and by, but I start a new [ only when i or j are being replaced, so right before a verb tries to add an expression to a "slot" that's already set.
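
The slot rule described above can be sketched with strings for clarity (hypothetical helper names; table.express itself tracks unevaluated expressions rather than strings):

```r
# One DT[i, j, by] frame with three slots.
new_frame <- function() list(i = NULL, j = NULL, by = NULL)

# Fill a slot; if `i` or `j` is already occupied, start a new `[` frame.
set_slot <- function(frames, slot, value) {
  n <- length(frames)
  if (slot %in% c("i", "j") && !is.null(frames[[n]][[slot]])) {
    frames <- c(frames, list(new_frame()))
    n <- n + 1
  }
  frames[[n]][[slot]] <- value
  frames
}

# Render the accumulated frames as chained data.table code.
render <- function(dataname, frames) {
  for (f in frames) {
    args <- paste(c(if (is.null(f$i)) "" else f$i,
                    f$j,
                    if (is.null(f$by)) NULL else paste0("by = ", f$by)),
                  collapse = ", ")
    dataname <- paste0(dataname, "[", args, "]")
  }
  dataname
}

frames <- list(new_frame())
frames <- set_slot(frames, "i",  "cyl > 5")             # filter()
frames <- set_slot(frames, "by", ".(cyl, gear)")        # group_by()
frames <- set_slot(frames, "j",  ".(mpg = mean(mpg))")  # summarise()
frames <- set_slot(frames, "i",  "mpg > 21")            # filter(): i taken, new `[`
code <- render("dt", frames)
code
# "dt[cyl > 5, .(mpg = mean(mpg)), by = .(cyl, gear)][mpg > 21]"
```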

@hadley

hadley commented Jun 14, 2019

Yeah, that's what I was thinking — thanks for confirming!

@hadley

hadley commented Jun 14, 2019

How do you convert df %>% mutate(x2 = x * 2, x4 = x2 * 2)? dt[, x2 := x * 2][, x4 := x2 * 2][] ?

@asardaes

asardaes commented Jun 14, 2019

That basically wouldn't work in my case. I try to guess as little as possible, so if something requires two [ in data.table notation, it requires two mutate calls. Since it's a modification by reference anyway, I suppose the overhead is negligible, if there is any at all. It could indeed be converted as you showed, but analyzing the expressions to figure out whether they use newly created variables could get messy, I think.
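
For reference, the chained form works because each `[` sees the columns the previous one created, while a single `:=` call cannot reference a column it is itself creating:

```r
library(data.table)

dt <- data.table(x = 1:3)

# Two chained modify-by-reference calls: the second `[` sees x2 because
# the first has already added it. The trailing [] returns the updated
# table (`:=` alone returns it invisibly).
res <- dt[, x2 := x * 2][, x4 := x2 * 2][]
res$x4  # 4 8 12

# The single-call form typically fails, because x2 is not yet a column
# when the expression for x4 is evaluated:
# dt[, `:=`(x2 = x * 2, x4 = x2 * 2)]  # object 'x2' not found
```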

@hadley

hadley commented Jun 14, 2019

I am already doing that in dbplyr, so I could port the approach. But I suspect it's not so important here because the mutate-in-place should avoid most of the performance penalty (whereas generating SQL subqueries is expensive).


@hadley

hadley commented Jun 18, 2019

One other thing to bear in mind: i is executed before j, so you can't collapse a filter after a summarise:

df %>% group_by(g) %>% summarise(x = sum(x)) %>% filter(x > 10)

# becomes
df[, .(x = sum(x)), by = g][x > 10]
# not
df[x > 10, .(x = sum(x)), by = g]

That means you always have to start a new [ after j is filled.
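
The difference is easy to see with a small table (illustrative data):

```r
library(data.table)

df <- data.table(g = c(1, 1, 2, 2), x = c(6, 7, 2, 3))

# j runs after i, so the filter on the summary must go in a second `[`:
a <- df[, .(x = sum(x)), by = g][x > 10]

# Collapsing the filter into i would filter the *input rows* before
# summarising, which answers a different question:
b <- df[x > 10, .(x = sum(x)), by = g]

a  # one row: g = 1, x = 13
b  # empty: no single input row has x > 10
```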

@DavidArenburg

DavidArenburg commented Jun 19, 2019

I guess this will be resolved when (and if?) Rdatatable/data.table#788 is implemented

@hadley

hadley commented Jun 25, 2019

@hadley closed this as completed in 39c630e Jun 27, 2019
@asardaes

asardaes commented Aug 2, 2019

I don't know if this will be an issue for dtplyr, but I think it might, and maybe you can give me some insight. Here's something I found based on asardaes/table.express#20.

If a package defines a data.table method for dplyr verbs, things might break. Take for example:

irisDT <- as.data.table(iris) %>%
  .[, nest(.SD), by = Species]

irisDT[, unnest(.SD, data)]

The data.frame method for tidyr::unnest has a step with the call: dplyr::transmute(dplyr::ungroup(data), !!!quos). Usually, transmute dispatches to the data.frame method from dplyr and it works. If a data.table method for transmute is registered, and that is now chosen within unnest, it will stop working.

If you do options(datatable.verbose = TRUE), you'll see something starting with cedta decided 'tidyr' wasn't data.table aware. Here is call stack with [[1L]] applied:. I don't know the details, but indeed any package that doesn't explicitly import data.table into its namespace cannot execute data.table operations.

So any package that uses dplyr generics which dispatch to data.table won't work if said package doesn't also import data.table?
