Generate data tables lazily #38
Comments
Over the weekend I played with this idea but using EDIT: changed the name to avoid confusion. |
This would make dtplyr behave very similarly to dbplyr. The advantage of working lazily is that you can get much faster performance because the query optimiser in the underlying implementation has more information to work with. The disadvantage is that you now need an explicit Changing this behaviour in dtplyr is also likely to break existing code, but I don't think dtplyr is used very frequently, and this break might be worth it because it would yield substantially improved performance. |
I'm a little surprised at how much difference this makes, even for the simple example above:
library(data.table)
dt <- data.table(mtcars[rep(1:32, times = 1e4),])
bench::mark(
two = dt[cyl > 5,][, .(mpg = mean(mpg)), by = c("cyl", "gear")],
one = dt[cyl > 5, .(mpg = mean(mpg)), by = c("cyl", "gear")]
)[1:6]
#> # A tibble: 2 x 6
#> expression min mean median max `itr/sec`
#> <chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl>
#> 1 two 40.9ms 45.5ms 46.8ms 49.3ms 22.0
#> 2 one 19.1ms 22.9ms 23.5ms 24.8ms 43.7
Created on 2019-06-13 by the reprex package (v0.2.1.9000)
That implies switching to a lazy approach is probably more important than I anticipated. |
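To make the "lazy" idea concrete, here is a minimal sketch of how verbs could record their expressions and only build a single `[` call at the end. The helper names (`lazy_query`, `lazy_filter`, `lazy_summarise`, `lazy_collect`) are hypothetical and purely illustrative; they are not dtplyr's API, and the sketch assumes exactly one filter and one summarise have been recorded.

```r
library(data.table)

# Hypothetical helpers for illustration only; these are not dtplyr functions.
# A "query" is just the source table plus the unevaluated i, j and by parts.
lazy_query <- function(dt) list(dt = dt, i = NULL, j = NULL, by = NULL)

lazy_filter <- function(q, cond) {
  q$i <- substitute(cond)   # record the filter expression without evaluating it
  q
}

lazy_summarise <- function(q, expr, by) {
  q$j <- substitute(expr)   # record the summary expression
  q$by <- substitute(by)    # record the grouping expression
  q
}

lazy_collect <- function(q) {
  dt <- q$dt
  # Assemble dt[i, j, by = by] as one call, then evaluate it once.
  call <- as.call(list(as.name("["), quote(dt), q$i, q$j, by = q$by))
  eval(call)
}

dt <- data.table(mtcars)
q <- lazy_query(dt)
q <- lazy_filter(q, cyl > 5)
q <- lazy_summarise(q, .(mpg = mean(mpg)), by = c("cyl", "gear"))
res <- lazy_collect(q)
```

Because nothing is evaluated until `lazy_collect()`, the filter and the summary end up in the same `[` call, which corresponds to the faster `one` form in the benchmark above.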
I'm not sure if it's entirely relevant for |
@asardaes thanks for sharing! Do you have any thoughts on when you need to start a new I also realised that laziness is the only way dtplyr will be able to do the minimal number of copies — a pipeline can track if a mutate is used, and if it is, create a single copy just before beginning the transformation. |
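As a rough illustration of the "single copy" point, the sketch below makes one explicit `copy()` and then applies every subsequent `:=` in place. The table and column names are made up for the example; how a lazy pipeline would actually decide when to copy is an assumption here, not something settled in this thread.

```r
library(data.table)

dt <- data.table(x = 1:3, g = c("a", "a", "b"))

# One copy up front, then every mutate-like step modifies that copy in place.
out <- copy(dt)[, y := x * 2][, z := y + 1]

# The original table is untouched because only the copy was modified.
names(dt)
names(out)
```

Without the initial `copy()`, the `:=` steps would add `y` and `z` to `dt` itself, which is the side effect a dplyr-style interface would want to avoid.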
I keep track of |
Yeah, that's what I was thinking — thanks for confirming! |
How do you convert |
That basically wouldn't work in my case. I try to guess as little as possible, so if something requires two |
I am already doing that in dbplyr, so I could port the approach. But I suspect it's not so important here because the mutate-in-place should avoid most of the performance penalty (whereas generating SQL subqueries is expensive). |
One other thing to bear in mind:
df %>% group_by(g) %>% mutate(x = sum(x)) %>% filter(x > 10)
# becomes
df[, .(x = sum(x)), by = g][x > 10]
# not
df[x > 10, .(x = sum(x)), by = g]
That means you always have to start a new |
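A small worked example shows why the two data.table forms above are not interchangeable. The data is invented for illustration: in the first form the condition applies to the per-group sums, in the second it applies to the raw rows before grouping.

```r
library(data.table)

df <- data.table(g = c("a", "a", "b"), x = c(6, 6, 4))

# Aggregate first, then filter on the aggregated x: group "a" sums to 12 and is kept.
after <- df[, .(x = sum(x)), by = g][x > 10]

# Filter first, then aggregate: no raw row has x > 10, so nothing is left to sum.
before <- df[x > 10, .(x = sum(x)), by = g]
```

Here `after` has one row and `before` has none, so merging the filter into the same `[` as the aggregation would silently change the result.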
I guess this will be resolved when (and if) Rdatatable/data.table#788 is implemented |
See progress in http://dtplyr.tidyverse.org/articles/translation.html |
I don't know if this will be an issue for If a package defines a
irisDT <- as.data.table(iris) %>%
  .[, nest(.SD), by = Species]
irisDT[, unnest(.SD, data)]
The If you do So any package that uses |
Posted this as a dplyr issue (since it's technically a different way to do a dplyr calculation, a way to integrate data.tables rather than supporting data.tables) but migrating it here:
I'm an avid user of data.table, but dplyr has a syntax which is much more accessible, and when I first learned, it was dplyr that made that possible. However, with larger tables it became harder to justify its usage, so I switched and started using data.table.
If you have two technologies which accomplish the same things, and one is faster, but the other is more readable, you should be able to wrap the fast one to produce the readable one.
So I attempted to make dplyr verbs construct a data.table call, and added one additional verb, calculate(), which evaluates the current state of the call (to mirror data.table's functionality of doing several things at the same time).
It's still extremely rough (supporting the basics), but the actual construction of the call isn't all that messy. dplyr and data.table syntax are really quite close to one another.
I'm wondering if something like this could work (fully aware I'm overwriting the verb_ functions, that needs to change, this was just intended to be proof of concept):
Functions Definition:
Example:
Output: