Dynamic branching #685
One implementation idea that seems as though it could be "simple" to implement (says the guy who hasn't implemented anything in My thought is:
|
A similar idea was proposed in #304. I would actually prefer to avoid nested plans because of the complexity and pre-planning they would require. |
Fair enough. If I think of something else, I'll post it. |
I want to point out the way Snakemake currently handles this, as a possible inspiration:
The second part is quite idiosyncratic to Python, so I wouldn't suggest it be implemented in the same way, but it seems easier to make the user explicitly mark the cases where dynamic branching needs to happen, than to try to detect it from the structure of their dependencies. Using your example:
library(dplyr)
library(drake)
drake_plan(
  summaries = mtcars %>%
    group_by(cyl) %>%
    summarize(mean_mpg = mean(mpg)),
  individual_summary = target(
    filter(summaries, cyl == cyl_value),
    transform = cross(cyl_value = dynamic(summaries$cyl))
  )
)
One clear difference is that, using an R-based framework rather than a file-based framework, the output of As I wrote that example, I realized that it's actually much more similar to how Snakemake used to do it. In the end, they decided against that way, so maybe it would be good to know why and learn from that. Perhaps it's not relevant in |
Discussion from #233 carries over to this thread. |
Users really want this flexibility, and often just assume The more I think about it, the more wisdom I see in @krlmlr's thinking behind #304. Possible compromise: a new |
Update: we now have |
A more expedient approach
After talking with @dgkf at SDSS last week, I am no longer as reluctant as in #685 (comment). We can avoid a mess if we give dynamic branching its own DSL that works in tandem with the existing transformation DSL. This new dynamic DSL is just the transformation DSL invoked at runtime.
Proposal
library(drake)
plan <- drake_plan(
vector_of_settings = target(
f(x),
transform = map(x = c(1, 2))
),
analysis = target(
g(x, y),
transform = map(x),
dynamic = map(y = vector_of_settings)
)
)
print(plan)
#> # A tibble: 4 x 3
#> target command dynamic
#> <chr> <expr> <list>
#> 1 vector_of_settings_1 f(1) <lgl [1]>
#> 2 vector_of_settings_2 f(2) <lgl [1]>
#> 3 analysis_1 g(1, y) <language>
#> 4 analysis_2 g(2, y) <language>
print(plan$dynamic)
#> [[1]]
#> [1] NA
#>
#> [[2]]
#> [1] NA
#>
#> [[3]]
#> map(y = vector_of_settings_1)
#>
#> [[4]]
#> map(y = vector_of_settings_2)
drake_plan_source(plan)
#> drake_plan(
#> vector_of_settings_1 = f(1),
#> vector_of_settings_2 = f(2),
#> analysis_1 = target(
#> command = g(1, y),
#> dynamic = map(y = vector_of_settings_1)
#> ),
#> analysis_2 = target(
#> command = g(2, y),
#> dynamic = map(y = vector_of_settings_2)
#> )
#> )
Created on 2019-06-03 by the reprex package (v0.3.0)
When we create new targets, we probably do not need to register them in
cache <- drake_cache() # successor of get_cache()
cache$get_hash("analysis_1_a")
#> [1] "1d5108bacae437a0"
cache$get_hash("analysis_1_b")
#> [1] "17b1fbe1609400b9"
readd(analysis_1)
#> target hash
#> 1 analysis_1_a 1d5108bacae437a0
#> 2 analysis_1_b 17b1fbe1609400b9
Remarks
Thanks
This idea, along with the original DSL, was inspired by @krlmlr in #233 |
Hmm... what about targets downstream of |
Easy, actually: just give a special attribute (maybe an S3 class) to the |
We also need to think about how the new target names and splits are constructed. If |
Come to think of it, we probably need a trace (drake_plan(trace = TRUE)) in those special data frames so that combine(.by) still works. |
The useful cases seem to be:
If users want the hash behavior, then they can use (and perhaps
name_by_hash <- function(x, ...) {
  # Name each element of x by the digest of its value.
  n <- vapply(x, digest::digest, character(1), ...)
  names(x) <- n
  x
}
Alternatively, always default to integer indices, and if the user wants something smarter, they can specify it with |
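The two naming schemes under discussion can be compared side by side. The sketch below is illustrative only: name_by_index() is a hypothetical helper introduced here, name_by_hash() repeats the helper from this comment, and the code assumes the digest package is installed.

```r
library(digest)

# Same idea as the helper above: name each element by the hash of its value.
name_by_hash <- function(x, ...) {
  names(x) <- vapply(x, digest::digest, character(1), ...)
  x
}

# Hypothetical alternative: name each element by its position.
name_by_index <- function(x) {
  names(x) <- as.character(seq_along(x))
  x
}

settings <- list("a", "b", "c")
reordered <- settings[c(3, 1, 2)]

# Hash names travel with the values, so reordering only reorders the names.
hash_names_match <- setequal(
  names(name_by_hash(settings)),
  names(name_by_hash(reordered))
)

# Index names stay fixed while the values underneath them move.
index_first_before <- name_by_index(settings)[["1"]]
index_first_after <- name_by_index(reordered)[["1"]]
```

With hash names, a reordered input maps onto the same set of names. With integer names, the name "1" points at a different value after reordering, which is the trade-off between the two defaults.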
Hi, I just wanted to describe another use case that would greatly benefit from dynamic branching. In my case, I have a very large data frame somewhat like this:
I'd like to be able to split It would be a game-changer to be able to use drake like this, since more often than not I can think of a splitting scheme that would effectively partition the data into stale and up-to-date splits. |
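One way to picture this use case: split the data frame by a grouping column, hash each piece, and compare against the hashes from a previous run. This is a sketch of the idea only, not how drake tracks changes. The column names and the split_hashes() helper are made up for illustration, and the code assumes the digest package is installed.

```r
library(digest)

# Hypothetical helper: hash each group of a data frame separately.
split_hashes <- function(data, by) {
  pieces <- split(data, data[[by]])
  vapply(pieces, digest::digest, character(1))
}

old <- data.frame(
  group = c("a", "a", "b", "b", "c"),
  value = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)
new <- old
new$value[new$group == "b"] <- c(30, 40) # only group "b" changes

old_hashes <- split_hashes(old, "group")
new_hashes <- split_hashes(new, "group")

# Groups whose hash changed are stale; the rest could be skipped.
stale <- names(new_hashes)[new_hashes != old_hashes[names(new_hashes)]]
```

Here only the split for group "b" would need to be recomputed.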
@dpmccabe, I see what you mean. I just encountered a very similar situation for a project at work. I am realizing that #685 (comment) has serious problems:
An alternative is @brendanf's suggestion of checkpointing (#685 (comment)). For
drake_plan(
vector_of_settings_1 = f(1),
vector_of_settings_2 = f(2),
analysis_1 = target(
command = g(1, y),
transform = map(y = vector_of_settings_1)
),
analysis_2 = target(
command = g(2, y),
transform = map(y = vector_of_settings_2)
)
)
It is already natural for users to think about it as two separate plans:
drake_plan(
vector_of_settings_1 = f(1),
vector_of_settings_2 = f(2)
)
drake_plan(
analysis_1 = target(
command = g(1, y),
transform = map(y = vector_of_settings_1)
),
analysis_2 = target(
command = g(2, y),
transform = map(y = vector_of_settings_2)
)
)
Maybe |
A couple notes:
|
As I attempt an implementation, I am finding that because I am trying to avoid saving metadata lists, I have to reinvent a lot of internal machinery. Maybe it's better to save that metadata for dynamic sub-targets. The internal overhaul may not be as catastrophic, and we still gain efficiency because we do not need to actually check the metadata as often. |
Yeah, we will need metadata for things like seeds and warnings. But we will still see performance gains in other ways. |
It is coming time to work on |
On second thought, let's hold off on |
Thoughts on dynamic triggering:
|
On second thought, let's leave the
drake_plan(
  x = seq_len(4),
  y = target(x, trigger = trigger(condition = x > 2), dynamic = map(x)),
  z = target(x, trigger = trigger(change = x), dynamic = map(x))
)
|
To avoid duplicating code over various HPC backends, let's have |
Registering dynamic sub-targets requires us to modify |
Unfortunately, dynamic branching is currently slower than static branching when it comes to actually building targets.
library(drake)
plan_dynamic <- drake_plan(
x = seq_len(1e4),
y = target(x, dynamic = map(x))
)
plan_static <- drake_plan(
z = target(w, transform = map(w = !!seq_len(1e4)))
)
cache_dynamic <- storr::storr_rds(tempfile())
cache_static <- storr::storr_rds(tempfile())
system.time(
config_dynamic <- drake_config(
plan_dynamic,
cache = cache_dynamic,
verbose = 0L
)
)
#> user system elapsed
#> 0.026 0.003 0.030
system.time(
config_static <- drake_config(
plan_static,
cache = cache_static,
verbose = 0L
)
)
#> user system elapsed
#> 1.904 0.004 1.910
system.time(
suppressWarnings( # different issue
make(config = config_dynamic)
)
)
#> user system elapsed
#> 78.014 3.630 81.767
system.time(
suppressWarnings(
make(config = config_static)
)
)
#> user system elapsed
#> 32.712 3.195 36.049
Created on 2019-11-02 by the reprex package (v0.3.0) |
The good news is that
library(drake)
library(profile)
library(jointprof)
plan_dynamic <- drake_plan(
x = seq_len(1e4),
y = target(x, dynamic = map(x))
)
plan_static <- drake_plan(
z = target(w, transform = map(w = !!seq_len(1e4)))
)
cache_dynamic <- storr::storr_rds(tempfile())
cache_static <- storr::storr_rds(tempfile())
system.time(
config_dynamic <- drake_config(
plan_dynamic,
cache = cache_dynamic,
verbose = 0L
)
)
#> user system elapsed
#> 0.027 0.003 0.032
system.time(
config_static <- drake_config(
plan_static,
cache = cache_static,
verbose = 0L
)
)
#> user system elapsed
#> 3.525 0.004 3.530
Rprof(filename = "dynamic.rprof")
suppressWarnings(
system.time(make(config = config_dynamic), gcFirst = FALSE)
)
#> user system elapsed
#> 99.096 3.656 102.928
Rprof(NULL)
data <- read_rprof("dynamic.rprof")
write_pprof(data, "dynamic.pprof")
Rprof(filename = "static.rprof")
suppressWarnings(
system.time(make(config = config_static), gcFirst = FALSE)
)
#> user system elapsed
#> 52.112 3.708 55.916
Rprof(NULL)
data <- read_rprof("static.rprof")
write_pprof(data, "static.pprof")
suppressWarnings(
system.time(make(config = config_dynamic), gcFirst = FALSE)
)
#> user system elapsed
#> 3.239 0.164 3.418
suppressWarnings(
system.time(make(config = config_static), gcFirst = FALSE)
)
#> user system elapsed
#> 13.847 0.472 14.347
file.copy("dynamic.pprof", "~/Downloads")
#> [1] TRUE
file.copy("static.pprof", "~/Downloads")
#> [1] TRUE
Created on 2019-11-02 by the reprex package (v0.3.0) |
I used those It looks like the main hangup is loading sub-target dependencies and registering sub-targets. Not too surprising. Speeding this up is going to be another slow-going long-term project. If you have more examples that demonstrate slowness, please post them. It took a long time to get static branching as fast as it is now, and I expect the same for dynamic branching. |
Corrections to #685 (comment)
The implementation in #1042 is different from #685 (comment). In particular, the flowchart in https://user-images.githubusercontent.com/1580860/66722470-27ede180-eddc-11e9-97ea-930c5a93d287.png.
Procedure for sub-targets
The procedure for sub-targets is actually simpler than I had originally planned.
Procedure for dynamic targets as a whole
Each dynamic target has its own value alongside the values of the sub-targets. We recompute this value if
Why (2)? Because in some situations, we already have all the sub-targets, but we use fewer of them.
library(drake)
plan <- drake_plan(
x = seq_len(3),
y = target(x, dynamic = map(x))
)
make(plan)
#> target x
#> subtarget y_0b3474bd
#> subtarget y_b2a5c9b8
#> subtarget y_71f311ad
# readd() and loadd() understand dynamic targets.
readd(y)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
# But a dynamic target is really just a vector of hashes.
cache <- drake_cache()
cache$get("y")
#> [1] "3908fe5069df3c28" "16b3cb68bd4872ed" "1a3b3c0d06147d80"
#> attr(,"class")
#> [1] "drake_dynamic"
# What if we shorten y?
plan <- drake_plan(
x = seq_len(2),
y = target(x, dynamic = map(x))
)
# y needs to change, but we leave the sub-targets alone.
make(plan)
#> target x
# readd() and loadd() understand dynamic targets.
readd(y)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
# But a dynamic target is really just a vector of hashes.
cache$get("y")
#> [1] "3908fe5069df3c28" "16b3cb68bd4872ed"
#> attr(,"class")
#> [1] "drake_dynamic"
Created on 2019-11-02 by the reprex package (v0.3.0)
Why the cryptic sub-target names?
The sub-target names are ugly (e.g.
library(drake)
plan <- drake_plan(
x = c("a", "b"),
y = target(x, dynamic = map(x))
)
make(plan)
#> In drake, consider r_make() instead of make(). r_make() runs make() in a fresh R session for enhanced robustness and reproducibility.
#> target x
#> subtarget y_89ca58a1
#> subtarget y_38e75e51
plan <- drake_plan(
x = c("a", "inserted_element", "b"),
y = target(x, dynamic = map(x))
)
# Only one sub-target needs to build.
make(plan)
#> target x
#> subtarget y_06d53fef
# Permute x.
plan <- drake_plan(
x = c("inserted_element", "b", "a"),
y = target(x, dynamic = map(x))
)
# All sub-targets are still up to date!
make(plan)
#> target x
Created on 2019-11-02 by the reprex package (v0.3.0) |
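The behavior in the example above follows from naming each sub-target after a hash of its sub-value. The simulation below mimics that bookkeeping without drake. It is a simplified sketch: subtarget_names(), build(), and the rule of a y_ prefix plus the first 8 characters of a digest are assumptions for illustration, and drake's actual hashing may differ in detail. It assumes the digest package is installed.

```r
library(digest)

# Hypothetical naming rule: prefix plus the first 8 characters of the value's hash.
subtarget_names <- function(prefix, values) {
  hashes <- vapply(values, digest::digest, character(1), USE.NAMES = FALSE)
  paste0(prefix, "_", substr(hashes, 1, 8))
}

built <- character(0) # names of sub-targets already in the pretend cache

# Return the sub-targets that still need to build, then record them as built.
build <- function(values) {
  needed <- setdiff(subtarget_names("y", values), built)
  built <<- union(built, needed)
  needed
}

first <- build(c("a", "b"))                       # both values are new
second <- build(c("a", "inserted_element", "b"))  # only the inserted value is new
third <- build(c("inserted_element", "b", "a"))   # a permutation: nothing is new
```

Because the names depend only on the values, inserting an element adds exactly one new name and permuting the vector adds none, which matches the make() output shown above.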
Implemented in #1042. |
Also noteworthy: mapping over rows: #1042 (comment) |
New chapter in the manual: https://ropenscilabs.github.io/drake-manual/dynamic.html |
One source of overhead I overlooked: computing the hashes of sub-values that go into the names of sub-targets. Unavoidable, but not terrible. |
Dynamic parent targets are already vectors of hashes, so we can avoid this overhead if the dynamic dependency is itself dynamic: 5a07f67. Otherwise, we need to compute the hashes of all the sub-values. |
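The shortcut described here can be sketched as a single branch on the class of the dependency. The drake_dynamic class name comes from the cache output earlier in this thread; the sub_value_hashes() helper is hypothetical and is not drake's internal code. It assumes the digest package is installed.

```r
library(digest)

# Hypothetical helper: get one hash per sub-value of a dynamic dependency.
sub_value_hashes <- function(dep) {
  if (inherits(dep, "drake_dynamic")) {
    # A dynamic parent is already a vector of hashes, so reuse it as is.
    return(as.character(unclass(dep)))
  }
  # Otherwise, hash every sub-value.
  vapply(dep, digest::digest, character(1), USE.NAMES = FALSE)
}

dynamic_parent <- structure(
  c("3908fe5069df3c28", "16b3cb68bd4872ed"),
  class = "drake_dynamic"
)
from_dynamic <- sub_value_hashes(dynamic_parent)
from_static <- sub_value_hashes(seq_len(2))
```

The first call does no hashing at all, while the second pays for one digest per element.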
Update: dynamic branching just got a huge speed boost in #1089 thanks to help from @billdenney and @eddelbuettel. With improvements both in development |
We want to declare targets and modify the dependency graph while make() is running. Sometimes, we do not know what the targets should be until we see the values of previous targets. The following plan sketches the idea.
Issues:
outdated() work now? Do we have to read the targets back into memory to check if the downstream stuff is up to date?
drake has faced. Hopefully the work will migrate to the workers package.