Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Meta-make #304

Closed
krlmlr opened this issue Mar 6, 2018 · 32 comments
Closed

Meta-make #304

krlmlr opened this issue Mar 6, 2018 · 32 comments

Comments

@krlmlr
Copy link
Collaborator

krlmlr commented Mar 6, 2018

I'd like to propose three very simple verbs that might get us halfway towards the goal of a DSL (#233). This proposal addresses the low-level technical part, which I think is required for any DSL.

The proposal is too long for an issue (and issues aren't version-controlled), I've added it to the wiki: https://github.com/ropensci/drake/wiki/Meta-make. Perhaps a 5 to 10 minute read.

I haven't closely followed previous related discussions, there might be overlap I'm not aware of. In particular, #79 might be a similar approach. Sharing this early for feedback and discussion, CC @AlexAxthelm @dapperjapper @rkrug @kendonB.

@rkrug
Copy link
Contributor

rkrug commented Mar 7, 2018

I like the idea of having a wiki page to discuss possible verbs.

I am not to clear about your verbs - could you please add the tibble / plan you would expect at the end?

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 7, 2018

Thanks for your feedback. I updated the wiki and pinged you in the commit message, would be interested to learn if you received a notification. Please let me know if things are clearer now.

@rkrug
Copy link
Contributor

rkrug commented Mar 7, 2018

Thanks. Makes things much easier.

The pack() and unpack() sound extremely useful.

Looking at meta_plan(), it would address one question I raised at the workshop: the iterative plans. But now there might be one important issue coming in: the uniqueness of target names. Unique to cache, directory, or plan?

meta_plan() would be extremely powerful, but also very error prone due to the uniqueness of the target names. Don't get me wrong - I would love a feature like that, but there needs to be some mechanism to deal with the same target names in the meta_plan - maybe adding a suffix to the target names of the meta-plans?

@rkrug
Copy link
Contributor

rkrug commented Mar 7, 2018

In the classical make, the target name has to be unique to the makefile - and I think drake should really aim at achieving the same.

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 7, 2018

Done adding reprexes. Switching to rmarkdown would have been smarter ;-)

I'd suggest that drake just refuses to run plans with duplicate names, just the way it works currently. That feels like the safest option.

Iterative plans will be covered by the DSL. This is just about low-level tooling to make the DSL possible.

@ropensci ropensci deleted a comment from rkrug Mar 7, 2018
@ropensci ropensci deleted a comment from rkrug Mar 7, 2018
@rkrug
Copy link
Contributor

rkrug commented Mar 7, 2018

@krlmlr OK - so meta_plan() does not know anything about the main plan - makes perfect sense.

But than the creation of the plan should include the check, not the make(). If make() gets a plan with duplicate target names, it should refuse to run it - that is obvious. B

But the error message should come during creation of the plan in the drake_plan() step. It is a helper function anyway, to populate a simple tibble / data.frame.

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 7, 2018

Yes, meta_plan() will usually receive a tracked target, or something computed from tracked targets, as argument.

@wlandau-lilly
Copy link
Collaborator

Thanks so much, Kirill! I read your proposal, but it will take me more time to come up with feedback.

@AlexAxthelm
Copy link
Collaborator

I'm really liking the modularity of the design here, and I want to see if I'm understanding correctly. If I have a problem along the lines of "take a big data set, split by an unknown set of factors, and run a arbitrart code on each" (something for which I would use dplyr::do), woudl that be something along the lines of:

drake_plan(
  data = generic_data_importing_function(file_in("data.csv")),
  factor_levels = unique(pull(data, "factor_column") ),
  meta_plan = drake_plan(
    this_factor = unpack(factor_levels),
    data_subset = filter(data, factor_column == this_factor),
    analysis = analysis_function(data_subset)
  ),
  results = meta_make(meta_plan)
)

or would this require some sort of 4th verb (along the lines of a drake_foreach)?

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 7, 2018

Thanks. I don't intend to change the semantics of drake_plan() for this proposal. Your analysis might work with the following plan:

drake_plan(
  data = generic_data_importing_function(file_in("data.csv")),
  splits = data %>% nest(-factor_column) %>% deframe(),
  meta_plan = tibble(
    target = paste0("analyzed_", names(splits)),
    command = paste0("analysis_function(splits[[\"", names(splits), "\"]])")
  ),
  results = meta_make(meta_plan)
)

This will create targets of the form analyzed_xxx in the meta_plan, where xxx come from the factor levels. If you'd like to test the plan on a real example (iris?), you can double-check the embedded meta_plan. Does this make sense?

The targets in meta_plan will only be available as components of results, they are only brought into the main plan with unpack(), if necessary.

The command to create meta_plan is utterly awkward, that's where a DSL might help.

@wlandau-lilly
Copy link
Collaborator

Should make() just be "meta" natively? I have played around with

plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = drake_plan(
    a = x,
    b = y
  ),
  results = drake::make(meta_plan)
)
make(plan)

It fails because target a could not find its dependency x. Solving this problem might be simpler and cleaner than implementing new meta_make(), especially since meta_make() may need to deal with self-reference anyway.

plan1 <- drake_plan(
  x1 = 1,
  results1 = make(plan1),
  plan1 = drake_plan(
    x2 = x1,
    results2 = make(plan2),
    plan2 = drake_plan(
      x3 = x2,
      results3 = make(plan3),
      plan3 = drake_plan(
        ...
      )
    )
  )
)

In any case, I do like where this is headed. The "meta" piece will make the hardest parts of the DSL possible.

@wlandau-lilly
Copy link
Collaborator

I think pack() is similar to gather_plan(). pack() and unpack() are great choices of words, but I also wonder if the standard gather() and spread() would be more recognizable.

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 8, 2018

Thanks. I'd rather give an informative error message if make() is called from within a plan. We can see later if we can let make() converge towards meta_make(), or the other way round.

meta_make() is declarative only, it returns a data structure that contains the new plan but doesn't execute anything. Therefore, nesting shouldn't be a problem. There should be only one scheduler, even when some plans are constructed dynamically as part of the execution of the main plan.

The way spread() and gather() are defined in tidyr seems to be fairly different from the intended semantics of pack() and unpack(). I'm entirely open to other names.

@wlandau
Copy link
Member

wlandau commented Mar 8, 2018

meta_make() is declarative only, it returns a data structure that contains the new plan but doesn't execute anything. Therefore, nesting shouldn't be a problem. There should be only one scheduler, even when some plans are constructed dynamically as part of the execution of the main plan.

Ah, so make() and meta_make() are fundamentally different anyway.

The way spread() and gather() are defined in tidyr seems to be fairly different from the intended semantics of pack() and unpack(). I'm entirely open to other names.

Sure, that's fine. If spread() and gather() are not the right words, I think pack() and unpack() are great.

@violetcereza
Copy link
Contributor

I think it would be nice to unify the making interface under one function make, but I agree with krlmlr that they serve different purposes for now...

I'm confused about the return value of meta_make(). What does the data structure look like? When do the meta-targets (?) get made? And how might one loadd a meta-target?

Maybe I need some clarity around what declarative means and how this affects working with drake targets both within plans and outside of them.

Going off of your analogy in the wiki, meta_make targets return a list of made targets when loadded. Thus you would access meta-targets like drake::readd(results)$a. However, with the targets I use, it would use wayyy too much memory to load every target within a meta_make just to access one of them.

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 8, 2018

@dapperjapper: Thanks. I've added an implementation sketch to the wiki (https://github.com/ropensci/drake/wiki/Meta-make#implementation-sketch), and also tweaked a bit -- see differences. Does this help answer your questions?

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 8, 2018

If we follow the implementation sketch, it feels easiest to start with pack(), because meta_make() will apply very similar ideas (or maybe even call a version of pack() under the hood), and unpack() alone doesn't feel that useful without meta_make().

@rkrug
Copy link
Contributor

rkrug commented Mar 9, 2018

@krlmlr I am confused a little bit now - if meta_make() does not execute anything and returns a plan, what is than the difference to drake_plan()? If there is a difference (I assume the execution in an isolated environment), why not simply add an argument subplan = TRUE to drake_plan() which is FALSE by default and can be set when used in this context here?

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 9, 2018

@rkrug: Great question. I think we want to give users a way to create "sub-plans" from home-grown tibbles created outside of drake_plan(), as in

meta_make(tibble(
  target = ...
  command = ...
))

@rkrug
Copy link
Contributor

rkrug commented Mar 9, 2018

So meta_make() would take a tibble (which might be homegrown or created by drake_plan() and turns it into a sub-plan - correct? But in then why is it called _make, as it is doing the same as drake_plan() - wouldn't drake_subplan() than be the better word?

Actually - do we really need this? Isnt't it possible to simply use drake_plan() inside drake_plan() which than automatically creates a sub-plan? I mean a n argument subplan could be set automatically when the function is called iteratively?

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 9, 2018

I want to implement this in a transparent way, using as little magic as possible. I don't see how meta_make() does the same as drake_plan() does, it takes a drake plan as input.?

I'm open to other names for the meta_make() operation, but it should match the effect, which is that it effectively executes commands defined dynamically when used inside make(). (At least this is the perceived effect; the real effect is that it creates a proxy object that is interpreted differently from normal targets by make(). It feels that we need to start writing documentation or tutorials before we can really come up with a good way to teach/communicate this idea, and the naming choice could be based on that.)

@rkrug
Copy link
Contributor

rkrug commented Mar 9, 2018

I don't see how meta_make() does the same as drake_plan() does, it takes a drake plan as input?

True - but the output is identical to a plan, only that it has an attribute sub-plan?

....[quite a bit of thinking in the background ....]

Ahhhh - I think I see what you are aiming at. This would make the assembling of complex plans much easier, as they can be split into sub-plans and assembled at a later stage. OK - that is a different thing (and you are right - we have to talk about the naming at the later stage).

Using these sub-plans is very powerful, as you say, but one question needs to be addressed, which is the naming of the targets.

target names have to be unique within each plan, within each sub-plan (which is obvious), but I think a sub-plan in a plan should be able to have the same names.

plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = drake_plan(
    x = x * 2,
    y = y / x,
    z = p(x)
  ),
  results = meta_make(meta_plan)
)

should be possible. The targets would be x, y, results$x, results$y, result$z.
In addition, the x in the subplan should refer to the x in the sub-plan.
While p(x) would refer to the target x in the parent plan.

@rkrug
Copy link
Contributor

rkrug commented Mar 9, 2018

It seems easiest if we impose the restriction that the meta-plan can only access outside targets.

I think that would limit the usefulness quite a bit and lead to the combination of many targets into a single one in the meta-plan. To have be able to have inside the meta-plan a tree structure of dependent targets (inside as well as parent plan) would be, as I see it, the main advantage of a meta-plan. One could encapsulate e.g. in one meta-plan the simulation, cleaning and different analysis of the results, while the plan contains the generation of the parameter, which are used by the meta-plan.
So changing the parameter, would only require changing the (relatively simple) plan, and not the (possibly complex) meta-plan.

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 9, 2018

I hear you, but I'd prefer to keep complexity as low as possible at this point -- even the simplest solution will take some time and careful thought to implement.

Could you please share a concrete example, maybe a plan that would work for one particular parameter value? We can use this to discuss how this can be implemented based on the current proposal.

@rkrug
Copy link
Contributor

rkrug commented Mar 9, 2018

OK - here comes one (I hope this is what you are looking for - if yes, I could throw in the missing function definitions):

Assuming this one function to create a parameter set:

parms <- function(n0,  r) {
  return(
    list(
      n0 = n0,
      r = r,
      K = 100
    )
  )
}

to simulate one has to do:

all1 <- drake_plan(
  parameter = parms(10, 1.5),
  sim = sim(parameter),
  analysis1 = analysis1(sim),
  analysis2 = analysis2(sim),
  compare = comp(analysis1, analysis2)    
)

So if one want's to chage the parameter, one has to edit a longish plan.

In contrast:

simAndAnalysis <- drake_plan(
  sim = sim(parameter),
  analysis1 = analysis1(sim),
  analysis2 = analysis2(sim),
  compare = comp(analysis1, analysis2)    
)

all2 <- drake_plan(
  parameter = parms(10, 1.5),
  result = meta_make(simAndAnalysis)
)

here one only has to edit the all2 plan, while the logic of the simulation and analysis stays untouched, therefore much less errorprone.

But the requirement is to have access to the all the targets.

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 9, 2018

No need to provide implementations, we can also use functions from cooking ;-)

With the current proposal, implementing this plan with multiple parameters is likely to involve multiple meta_make() targets constructed in a very ugly way:

drake_plan(
  parameters = list(paramA = parms(...), paramB = parms(...), ...),
  sim = meta_make(sim_from_parameters(parameters)),
  analysis1 = meta_make(analysis1_from_sim(sim)),
  analysis2 = meta_make(analysis2_from_sim(sim)),
  compare = meta_make(compare_from_analyses(analysis1, analysis2))
)

sim_from_parameters <- function(parameters) {
  tibble(
    target = paste0("sim_", names(parameters)),
    command = paste0("sim(parameters[[\"", names(parameters), "\"]])")
  )
}

analysis1_from_sim <- function(sim) {
  tibble(
    target = paste0("analysis1_", names(sim)),
    command = paste0("analysis1(sim[[\"", names(sim), "\"]])")
  )
}

...

Only the DSL (#233) aims at providing a nice syntax.

@krlmlr
Copy link
Collaborator Author

krlmlr commented Mar 22, 2018

We could also use meta-plans to handle knitr_in()! Exploring the dependencies becomes an intermediate step:

drake_plan(
  rmarkdown::render(knitr_in("report.Rmd"), ...)
)

then becomes (through an internal transformation)

  `_knitr_deps_report.Rmd` = analyze_knitr_in("report.Rmd"),
  { pack(`_knitr_deps_report.Rmd`); rmarkdown::render(file_in("report.Rmd"), ...) }

Would that simplify or complicate things?

@wlandau
Copy link
Member

wlandau commented Mar 22, 2018

Yes, I think this is one of the problems the DSL is meant to solve. One edge case I have been thinking about is if the user has commands to write the Rmd source.

drake_plan(
  write_my_report(); file_out("report.Rmd"),
  rmarkdown::render(knitr_in("report.Rmd"), file_out("report.html"))
)

If report.Rmd does not already exist before make() is called, drake cannot analyze the report for dependencies. And once the first make() creates the report and those dependencies are found, the target "report.html" is invalidated. In other words, both the first and second make()s do stuff. If the DSL's internal transformation is delayed, this should no longer be a problem.

@wlandau
Copy link
Member

wlandau commented Jan 17, 2019

I wonder, do we still need this nested plan structure? Could a lazy application of the transformation/grouping interface from #233 (comment) accomplish the dynamic branching we want?

@wlandau
Copy link
Member

wlandau commented Jan 18, 2019

We may come back to meta-make eventually, but I think #233 (comment) brings us closer to dynamic branching and job groups.

@wlandau wlandau closed this as completed Jan 18, 2019
@wlandau
Copy link
Member

wlandau commented Jan 18, 2019

See #685.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants