Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dependencies in formula not recognised #525

Closed
richardjtelford opened this issue Sep 30, 2018 · 6 comments
Closed

Dependencies in formula not recognised #525

richardjtelford opened this issue Sep 30, 2018 · 6 comments

Comments

@richardjtelford
Copy link

Drake is not recognising dependencies in a formula. In many regression models, one shouldn't have dependencies in the formula, but with a CCA and some other regression methods the response will always be a dependency (but the predictors should all be found in the data argument). There might be some methods like a PLS regression with a dependency as the predictor.

In my case, I can use the non-formula version of cca, but that wouldn't always work.

library("drake")
library("vegan")

dp <- drake_plan(
  dune = get(data("dune")),
  dune.env = get(data("dune.env")),
  CCA = cca(dune ~ ., data = dune.env),
  strings_in_dots = "literals"
)

make(dp)

fail CCA
Error: Target CCA failed. Call diagnose(CCA) for details. Error message:
object 'dune' not found

@wlandau
Copy link
Member

wlandau commented Sep 30, 2018

Thanks for the reminder on this. I gave up on formulas back when all drake's code analysis was done by codetools (ref: https://github.com/ropenscilabs/drake-manual/blob/a49be0b8428bf202f37401cf79878c520cc06608/caution.Rmd#commands-are-not-perfectly-flexible). But now, since drake does so much custom code analysis, this may yet be possible. I will see what I can do.

However, there is a drawback: if your formula's symbols are columns in your data frame, the dependency relationships may be wrong. In the following example, drake will think x depends the target named cyl (which is 1234) and not a column of mtcars.

dp <- drake_plan(
  cyl = 1234,
  x = fit_my_model(mpg ~ cyl, data = mtcars)
)

I still think your use case is worth revisiting formulas, but please keep the above in mind.

@wlandau
Copy link
Member

wlandau commented Sep 30, 2018

Sorry, but I am unlikely to change drake's behavior here

After further reflection, I do not actually think drake should detect dependencies from formula symbols. These symbols are really local variables that are specific to the formula, and drake has no way to dive into the data argument of cca() beforehand to know which variables are in arguments like dune.env. In my experience, overaggressive dependency detection tends to make workflows brittle and unpredictable. I think it is better to fail early and predictably.

What you can do now

For your project, I recommend putting dune in dune.env.

library(drake)
dp <- drake_plan(
  dune = get(data("dune")),
  dune.env = {
    out <- get(data("dune.env"))
    dune.env$dune <- dune
    out
  },
  CCA = cca(dune ~ ., data = dune.env),
  strings_in_dots = "literals"
)
config <- drake_config(dp)
vis_drake_graph(config)

capture

What it would take to depend on symbols from formulas

To analyze code for dependencies, drake uses an internal function called code_dependencies(), which is based on recurse_call() from http://adv-r.had.co.nz/Expressions.html#ast-funs. At this point in code_dependencies(), results$globals has all the symbols from any formulas in the expression. But unfortunately, that leaves us with too many symbols. (For example, if the expression is a function or a for loop, results$globals contains function arguments and locally-defined variables). Those local variables could conflict with targets and imports with the same names, which could potentially fool drake into creating false dependency relationships. That's why drake uses codetools::findGlobals() (via drake:::find_globals() in the very next line) to get rid of those local variables.

I think we need #41 to do all this properly and still keep local variables from formulas. If we recurse over the expression just once using CodeDepends, we might be able to take more control over the code analysis and keep symbols we find in formulas.

@wlandau wlandau closed this as completed Sep 30, 2018
@bart1
Copy link

bart1 commented Oct 2, 2018

I just encountered an issue slightly related to this, I have some covariates that are quite complicated to calculate. They are smooths were the position of the knots depends on the data. I encountered the problem that formulas can contain dependencies. This is a minimal example that reproduces the problem:

make(drake_plan(m=lm(y~x, data=d),xx=d$x, d=data.frame(x=rnorm(10),y=rnorm(10))))

This is more like the situation I'm encountering where asdf then again depents on other variables

require(mgcv)
make(drake_plan(m=gam(y~s(dd, k=asdf),data=d), 
                           d=data.frame(y=rnorm(10), dd=rnorm(10)), 
                           asdf=3))
Error: Target `m` failed. Call `diagnose(m)` for details. Error message:
  object 'asdf' not found

I guess this might be a work around but is not very elegant:

make(drake_plan(m={asdf2<-asdf;gam(y~s(dd, k=asdf2),data=d)}, 
           d=data.frame(y=rnorm(10), dd=rnorm(10)), 
           asdf=3))

@wlandau
Copy link
Member

wlandau commented Oct 2, 2018

Now that's a case where it is not enough to put asdf inside d. Here, you could just mention asdf somewhere outside the formula in the command of m. That is enough to create a dependency relationship.

plan <- drake_plan(
  m = {
    asdf
    gam(y ~ s(dd, k = asdf), data = d)
  }, 
  d = data.frame(y = rnorm(10), dd = rnorm(10)),
  asdf = 3
)

@bart1
Copy link

bart1 commented Oct 3, 2018

Thanks that makes things a bit cleaner.

@wlandau
Copy link
Member

wlandau commented Dec 9, 2018

Update: in #573, I am trying to move drake away from codetools, so later we might get to keep symbols from formulas after all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants