
How to add .R files to drake_plan() #193

Closed
pat-s opened this issue Jan 22, 2018 · 21 comments

Comments

@pat-s
Member

pat-s commented Jan 22, 2018

Hi there,

maybe I overlooked something, but I fail to properly add .R files to drake_plan() so that the dependencies are detected.

Reprex:

analyze <- drake_plan(
  test.md = knit('03_scripts/01_test.Rmd', quiet = TRUE),
  test2 = source('03_scripts/test2.R')
)

And in test2.R I have the following:

loadd(test.md)

However, when visualizing with

config <- drake_config(analyze)
vis_drake_graph(config, width = "100%", height = "500px")

dependencies are not detected for the .R file.
If I use a .Rmd file, everything works as expected.

Tried a few different approaches but could not successfully add a .R script. Help 😄

@wlandau-lilly
Collaborator

wlandau-lilly commented Jan 22, 2018

Great question, @pat-s! Questions like these are starting to come up a lot, and yours is the first in an FAQ I am starting.

EDIT

The following examples show how to set up the files for drake projects:

Get the code with

drake_example("basic")
drake_example("gsp")
drake_example("packages")

Each of the above writes a folder with code files. To make sure drake_example() outputs what you need, please be sure to use drake version 5.0.1.9000 or later.

Original response

The feature you want is designed for knitr source documents only, such as *.Rmd and *.Rnw files (although the file extension itself does not matter). Drake decides to analyze a file when it sees knit() or render() in the command, so nothing special happens if you use source() or your file has a .R extension.
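To illustrate the distinction, here is a minimal sketch (using the drake 5.x-era API discussed in this thread; the file names are hypothetical). The first command gets scanned because drake sees knit() in it; the second does not, because source() receives no special treatment:

```r
library(drake)
library(knitr)

plan <- drake_plan(
  # drake sees knit(), so it scans report.Rmd for loadd()/readd() calls
  # and treats those targets as dependencies of report.md:
  report.md = knit('report.Rmd', quiet = TRUE),
  # source() is ordinary R code to drake -- helper.R is never scanned:
  helper = source('helper.R')
)
```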

Why not detect loadd()/readd() dependencies in R scripts source()'d from commands? Because with drake, the focus is on your R session, not your R scripts. The idea is to source() the files beforehand so that you have a bunch of functions and small data objects in your workspace. Drake treats these objects as "imports", and it detects and analyzes them once they are in your environment. For example:

ls()

## character(0)

source("my_functions.R")

ls() # The simulate() function is defined in my_functions.R and treated as an import.

## [1] "simulate"

Drake looks inside the bodies of imported functions for non-file dependencies.

simulate

## function(n) {
##   data.frame(x = stats::rnorm(n), y = rpois(n, 1))
## }

deps(simulate)

## [1] "data.frame"   "rpois"        "stats::rnorm"

Now that you loaded your imports with source("my_functions.R"), you can make() your targets.

analyze <- data.frame(
  target = c("test_dataset", "'test.md'"),
  command = c(
    "simulate(5)",
    "knit('03_scripts/01_test.Rmd', output = \"test.md\", quiet = TRUE)"
  )
)

analyze

##         target                                                          command
## 1 test_dataset                                                      simulate(5)
## 2    'test.md' knit('03_scripts/01_test.Rmd', output = "test.md", quiet = TRUE)

make(analyze)

where 01_test.Rmd might look like this.


---
title: "Test Report"
author: Patrick Schratz
output: html_document
---

This report depends on `test_dataset`.

```{r example_chunk}
drake::readd(test_dataset)
```

A second make(analyze) will rebuild test.md if you have changed

  • the simulate() function (except for changes to whitespace or comments) or
  • either command in the analyze data frame (except for whitespace or comments) or
  • the contents of 03_scripts/01_test.Rmd.

Besides knitr reports, the loadd() and readd() functions are only meant for informally exploring your results. If they are embedded in your commands or imported functions, this could create a dangerous circularity in your workflow.
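As a hedged sketch of the anti-pattern being warned against (with hypothetical target names): embedding loadd() in a command hides a dependency from drake and can create exactly that circularity.

```r
# Anti-pattern: drake cannot see that summary_stats depends on test_dataset,
# because a loadd() call buried in a command is not tracked as a dependency.
bad_plan <- drake_plan(
  summary_stats = {
    loadd(test_dataset)   # hidden dependency
    summary(test_dataset)
  }
)

# Preferred: reference the target as a symbol so drake tracks it.
good_plan <- drake_plan(
  summary_stats = summary(test_dataset)
)
```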

Does that help? Does this way of doing things meet your needs, or does your use case require you to source() an R file from a command?

@pat-s
Member Author

pat-s commented Jan 22, 2018

Hi @wlandau-lilly, thanks for the quick and extensive answer!

So if I understand correctly, all analysis scripts should be Rmarkdown files while .R scripts should only contain functions that are used as imports and then analyzed for their dependencies?

Does that help? Does this way of doing things meet your needs, or does your use case require you to source() an R file from a command?

Right, in my case, I have a mixture of .Rmd and .R files.
The .Rmd files contain all the preprocessing and EDA stuff including post-analysis of modeling results.

However, the modeling stuff is stored in .R scripts as these can easily be run from the command line (e.g. using Rscript), also avoiding all the knitr output/overhead.
The modeling scripts are running on a server in parallel.
While I use .Rmd reports a lot, I still find the good old .R scripts easier to handle on servers.

As I said, these files contain modeling code and therefore depend on the preprocessing .Rmd file.
Hence, I wanted to insert something like load(preprocessing) so that drake recognizes that the scripts depend on the preprocessing script. Subsequently, if something changes in the preprocessing script, it should first rerun preprocessing.Rmd.

Afterwards, there are again additional .Rmd reports (for the post analysis) that depend on the .R files.
So my current approach looks like this:

analyze <- drake_plan(
  preprocessing.md = knit('03_scripts/01_preprocessing.Rmd', quiet = TRUE), # 1

  EDA.md = knit('03_scripts/02_EDA.Rmd', quiet = TRUE), # 2
  study_area.md = knit('03_scripts/03_study_area.Rmd', quiet = TRUE), # 2

  brt_sp_non = source('03_scripts/server/brt_sp_non.R'), # 3
  [...]
  brt_nsp_nsp = source('03_scripts/server/brt_nsp_nsp.R'), # 3

  cv_vis.md = knit('03_scripts/04_spcv_vis.Rmd', quiet = TRUE) # 4
)

where

  • #4 depends on #3 and #1
  • #3 depends on #1
  • #2 depends on #1

Hope that does not confuse you too much 😄 .
I think there will be plenty of users out there using either only .R files or a mixture of .Rmd and .R for their analysis.
Would be great if both file types could be used together.

@tiernanmartin
Contributor

Just wanted to chime in with my support for accommodating a mixture of .R and .Rmd files.

Similar to @pat-s, the distinction for my projects is usually:

  • .R for computationally expensive processes (in my use case these are operations on medium-size spatial data, some of which need to be run on a server)
  • .Rmd for simple data operations, reporting, and documentation

drake has helped me overcome a major pain point in my project workflow, so thanks a lot for your effort on this package! 👏

@wlandau-lilly
Collaborator

wlandau-lilly commented Jan 23, 2018

@pat-s and @tiernanmartin, thank you for clarifying. I think I can do a better job of explaining now.

So if I understand correctly, all analysis scripts should be Rmarkdown files while .R scripts should only contain functions that are used as imports and then analyzed for their dependencies?

You don't need any analysis scripts. R Markdown files are just a special accommodation, and *.R files are just a convenient way to store code. Drake focuses on your R session rather than your files. To drake, your "script" is all the functions in your environment and the commands in your workflow plan data frame. How you load things into your environment is up to you. This is a very unusual, even disconcerting way of thinking, but I think it makes projects cleaner and smoother in the long run.

For example, instead of

analyze <- drake_plan(
  ...
  preprocessing.md = knit('03_scripts/01_preprocessing.Rmd', quiet = TRUE),
  brt_sp_non = source('03_scripts/server/brt_sp_non.R'),
  ...
)
make(analyze)

you might try something like

source("03_scripts/server/brt_sp_non.R")
analyze <- drake_plan(
  ...,
  preprocessed_data = preprocess_data(read_my_data('data.csv')),
  brt_sp_non = build_brt_sp_non(preprocessed_data),
  ...
)
make(analyze)

and define the functions preprocess_data(), read_my_data(), and build_brt_sp_non() in "03_scripts/server/brt_sp_non.R". Drake knows that brt_sp_non depends on preprocessed_data and build_brt_sp_non() because these symbols are part of the command.

The commands in your workflow plan data frame are just arbitrary chunks of R code that return values. Other than the special accommodations for knit(), render(), and single-quoted file targets, you can treat commands like ordinary bits of R code.

Here is an example of a workflow that juggles a bunch of numbers.

my_plan <- drake_plan(
  a = 1 + 1,
  b = {
    x <- pi + a
    y <- sqrt(x)
    rand <- rnorm(10, sd = y)
    mean(rand)
  },
  c = a - 5,
  d = c(b, c)
)
config <- drake_config(my_plan)
vis_drake_graph(config)

[vis_drake_graph output]

make(my_plan)
readd(d)

## [1]  1.886363 -3.000000

Now, if you rely on R Markdown reports for sharing results, you have the option to create one and then loadd()/readd() non-file targets or imports into code chunks. But that is up to you.

@wlandau-lilly
Collaborator

wlandau-lilly commented Jan 23, 2018

I forgot to mention: drake analyzes functions and commands for dependencies, and it knows those functions can be nested. For example, drake knows g() is nested inside f().

library(drake)

my_plan <- drake_plan(
  a = f(1 + 1)
)

f <- function(x){
  g(x + 1)
}

g <- function(x){
  x + sqrt(4)
}

config <- drake_config(my_plan)
vis_drake_graph(config)

[vis_drake_graph output]

make(my_plan)

## ...
## target a

readd(a)

## [1] 5

make(my_plan)

## ...
## All targets are already up to date.

# Let's change function g().
g <- function(x){
  x + 3
}

# f() depends on g(), and target `a` depends on f().
# So target `a` is out of date, and `make()` recomputes it.
make(my_plan)

## ...
## target a

# The value of `a` changed.
readd(a)

## [1] 6

@pat-s
Member Author

pat-s commented Jan 23, 2018

Thanks @wlandau-lilly, I tried your suggestion with the following setup now:

source("03_scripts/01_preprocessing.R")
source("03_scripts/server/modeling_functions.R")
methods <- drake_plan(
  preprocessed_data = preprocess(
    pathogens = "/data/patrick/raw/survey_data/diseases240112_mod.csv", 
    points = "/data/patrick/mod/survey_data/points_mod.shp", 
    slope = "/data/patrick/mod/DEM/slope/slope_5m.tif", 
    ph = new("GDALReadOnlyDataset", "/data/patrick/raw/ph_europe/ph_cacl2"), 
    lithology = "/data/patrick/raw/lithology/CT_LITOLOGICO_25000_ETRS89.shp", 
    hail = "/data/patrick/raw/hail/Prob_GAM_square_area.tif", 
    elevation = "/data/patrick/mod/DEM/dem_5m.tif", 
    soil = "/data/patrick/raw/soil/ISRIC_world_soil_information/TAXNWRB_250m_ll.tif", 
    study_area = "/data/patrick/raw/boundaries/basque-country/Study_area.shp"),
  
  brt_sp_non = brt_sp_nsp(data = preprocessed_data, iterations = 200),
  preprocess.md = knit("03_scripts/01_preprocessing.Rmd"),
  strings_in_dots = "literals"
)

I wrapped the contents of the .Rmd file 01_preprocessing.Rmd in a function called preprocess() that gets imported via source("03_scripts/01_preprocessing.R").

I generate preprocessed_data which serves as input for brt_sp_nsp() which itself is defined in source("03_scripts/server/modeling_functions.R").

However, following your quick example, to generate a report for all the preprocessing I would have to add something like preprocess.md = knit("03_scripts/01_preprocessing.Rmd") that includes readd() or loadd() for the datasets (e.g. readd(pathogens)), because they already exist in the cache.
If I didn't, the dependencies would not be detected, right?

So currently I would need an .R file defining my function (e.g. 01_preprocessing.R) and an additional .Rmd file for the report, both with 99% redundant code?

Hm, this really gets kind of complicated now, and it seems that I have to modify all my scripts.
Maybe I am overcomplicating things here?
Actually, my analysis structure is "very simple", and I only need a tool that takes care of the following structure.

  • 01_preprocessing.Rmd #1
  • 02_EDA.Rmd (depends on #1)
  • 03_study_area.Rmd (depends on #1)
  • 04_scripts/20 files here.R (depend on #1)
  • 05_spcvis.Rmd (depends on #1 and #4)
  • 06_post_analysis.Rmd (depends on #1 and #4)

drake is very function oriented as far as I can see, and less focused on resolving simple dependencies between scripts? I understand that there needs to be a clear link between all the scripts to be able to connect them.
However, wouldn't something like the example below also be desirable?

drake_plan(
  1 = report("01_preprocessing.Rmd", depends_on = NULL),
  2 = report("02_EDA.Rmd", depends_on = 1),
  3 = report("03_study_area.Rmd", depends_on = 1),
  4 = script(list("04_scripts/20 files here.R"), depends_on = 1),
  5 = report("05_spcvis.Rmd", depends_on = c(1, 4))
)

@wlandau-lilly
Collaborator

I think we're making progress here.

However, following your quick example, to generate a report for all the preprocessing I would have to add something like preprocess.md = knit("03_scripts/01_preprocessing.Rmd") that includes readd() or loadd() for the datasets (e.g. readd(pathogens)), because they already exist in the cache.
If I didn't, the dependencies would not be detected, right?

So currently I would need an .R file defining my function (e.g. 01_preprocessing.R) and an additional .Rmd file for the report, both with 99% redundant code?

No additional .Rmd report should be necessary. In the quickstart vignette you mentioned, the report.Rmd file is for post-processing, not preprocessing. Its purpose is to display and inspect the results at the very end of the analysis. I intend to make that clearer very soon. I recommend against .Rmd files for preprocessing, since all you get is a final .md file rather than targets that would be useful to future commands. If I were you, I would completely let go of 01_preprocessing.Rmd.

drake is very function oriented as far as I can see, and less focused on resolving simple dependencies between scripts?

Yes, drake resolves dependencies of functions and commands, not scripts. R Markdown files are a special exception because people like to use them as final reports. I would encourage you to write your workflow kind of like this.

drake_plan(
  preprocessed_data = preprocessing_01(),
  EDA = EDA_02(preprocessed_data),
  study_area = study_area_03(preprocessed_data),
  other_work = other_work_04(preprocessed_data),
  spcvis = spcvis_05(preprocessed_data, other_work)
)

You don't need anything like depends_on because drake knows the targets depend on the arguments supplied to functions in commands. For example, drake knows that spcvis depends on preprocessed_data and other_work because these two targets are passed as arguments to your spcvis_05() function.
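As a sketch, this detection can be inspected with deps(), the helper shown earlier in this thread (later drake versions renamed it deps_code()). The function and target names below come from the hypothetical plan above:

```r
library(drake)

# deps() reports the symbols drake detects inside a command:
deps("spcvis_05(preprocessed_data, other_work)")
# Expect something like "other_work", "preprocessed_data", "spcvis_05" --
# the two upstream targets plus the imported function.
```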

wlandau-lilly added a commit that referenced this issue Jan 23, 2018
There was some confusion about the role of
R Markdown reports in drake workflows.
They are not actually necessary, and
the only targets they generate are report files
such as `.md` and `.html` files. I have changed
the report in the basic example to try to explain.
@pat-s, I hope this helps.
@wlandau-lilly
Collaborator

Something else I forgot to clarify: if one of your commands is knit('some_report.Rmd') or render('some_report.Rmd'), your target will be another file such as some_report.md or some_report.html. You will not be able to extract any of the data objects you make inside the code chunks of some_report.Rmd. In other words, some_report.Rmd is a dead end.

wlandau-lilly added a commit that referenced this issue Jan 23, 2018
@wlandau-lilly
Collaborator

wlandau-lilly commented Jan 23, 2018

@pat-s please see the new best practices guide. I try to lay out the issue there as best I can, and I will likely add more in the future. I think I have done enough to close here, but let's keep talking on the thread.

@wlandau
Member

wlandau commented Jan 26, 2018

I have been thinking a lot more about this thread, and I have restructured drake_example("basic"), drake_example("gsp"), and drake_example("packages") to demonstrate how to set up the files. It didn't make it into CRAN v5.0.0, but there's always next time. The files are on GitHub now. If I were you, I would peruse those examples to learn more about setting up the files for drake projects.

@pat-s
Member Author

pat-s commented Jan 26, 2018

Always good to hear when a discussion ends fruitfully! I haven't had time in the last few days to take a detailed look; I will do so later and report my thoughts!

@wlandau
Member

wlandau commented Jan 26, 2018

And I look forward to learning what you think. This is such an important piece for drake's docs to get right.

@pat-s
Member Author

pat-s commented Feb 4, 2018

I read it, and briefly want to say that it's clearer to me now than before. Maybe you can still add a section focusing only on the differences between the function-oriented and the script-oriented approach.

Imo this would fit into the "Get started" page so that new readers understand the difference right from the start.

Congrats on being accepted at rOpenSci 👍

@wlandau
Member

wlandau commented Feb 5, 2018

Thanks, @pat-s! And I am glad this is making more sense. I would like to keep the "Get started" and README.md documents short, but I think there is room to mention the issue and cross reference the best practices vignette.

@henningsway

henningsway commented May 1, 2018

It seems nice to me to be able to have the output of .Rmd files as possible targets. I often use .Rmd to query databases via the {sql} engine. It would also be nice to have .Rmd reports in the middle of a pipeline, to get a more streamlined "report" on some intermediate results.

With the current functionality and design, how would I best incorporate existing R Markdown scripts into the drake_plan()?

(I have just started experimenting with the package, though, and so far I am often shy about putting my scripts into functions, which will probably change :))

@wlandau
Member

wlandau commented May 1, 2018

With the current functionality and design, how would I best incorporate existing R Markdown scripts into the drake_plan()?

For existing R Markdown scripts, make sure those files exist, and then use knitr_in() to declare them in your drake_plan() (or imported functions, if applicable).

plan <- drake_plan(
  rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  ),
  ...
)
make(plan)

Inside the active code chunks in your report, you can use loadd() and readd() to (1) load other drake targets into your report, and (2) tell drake that the output file depends on those targets. See this example for details. In the GitHub development version of drake, you can get the code files with drake_example("main").

It would also be nice to have .Rmd reports in the middle of a pipeline, to get a more streamlined "report" on some intermediate results.

You can generate .Rmd files as output targets, but then drake will not analyze the active code chunks for dependencies. For that, I think we need #304 (ref: #304 (comment)). Right now, drake figures out all the dependencies of everything at the beginning, so it can only analyze the .Rmd reports that already exist before you call make(). Definitely a goal, but unfortunately not likely to happen anytime soon.

@ablack3

ablack3 commented Oct 28, 2018

Hi @wlandau,
Perhaps this question has been answered elsewhere, but this thread seems like a fitting place for it. What are your thoughts on all the scratch work that goes into an analysis, and how does that fit in with using drake? Should all my trial-and-error code ultimately be left out of the drake plan? R Markdown files are nice for trial and error because I can keep a readable record of all the things I tried while working on an analysis. drake reminds me a bit of writing math proofs: you do all your scratch/development work on the side (i.e. .Rmd files) and then eventually submit a pristine proof that cleanly goes from start to finish (i.e. a drake plan). What do you think?

@wlandau
Member

wlandau commented Oct 28, 2018

@ablack3 great question. In the latest release (version 6.1.0), you can actually start with scratch work in a script or notebook and then use code_to_plan() to convert it into a plan when you are ready. Let's take a notebook called scratch.Rmd.

[screenshot 1: initial scratch.Rmd]

We fix our code (the variable names should have capital letters).

[screenshot 2: corrected scratch.Rmd]

And maybe we add some more code. You can put multi-line commands in curly braces.

[screenshot 3: scratch.Rmd with additional multi-line code]

And then when we are ready, we can convert it to a drake plan. First we need to make sure we have the CodeDepends package.

install.packages("BiocManager")
BiocManager::install("CodeDepends")

Then we call code_to_plan().

library(drake)
plan <- code_to_plan("scratch.Rmd")
config <- drake_config(plan)
vis_drake_graph(config)

make(plan)
#> target model
#> target coef
#> target pos_coef

Created on 2018-10-28 by the reprex package (v0.2.1)

If we find a mistake later, we can convert the plan into a notebook or a script to go back and tinker with stuff again.

# install.packages("styler")
plan_to_notebook(plan, "my_notebook.Rmd")
plan_to_code(plan, "my_script.R")

@wlandau
Member

wlandau commented Oct 28, 2018

A couple alternatives to this approach are drake_build() and drake_debug(). drake_build() builds a single target in an already established plan (after loading the dependencies). drake_debug() is similar, but it runs the command in debug mode (see ?debug for details).
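A minimal sketch of those two helpers (assuming the drake 6.x API; the plan and the target name `fit` are hypothetical):

```r
library(drake)

plan <- drake_plan(fit = lm(mpg ~ wt, data = mtcars))
config <- drake_config(plan)

# Build only the `fit` target, loading its dependencies first:
drake_build(fit, config = config)

# Run the same command under the interactive debugger (see ?debug):
drake_debug(fit, config = config)
```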

@pat-s
Member Author

pat-s commented Dec 5, 2018

I finally converted my projects to use drake. Awesome work you did there. 🎉
I've rarely seen a pkg that got so much love.

I think the code_to_plan() function is really a breakthrough for most people not using the "function" approach (even though I also do it now to reduce the nodes of my graph).

I think it would be worth having a dedicated section (in the manual?) describing its behavior, because I think people might be scared when first reading this long issue.
A lot of people might arrive here after searching the issue tracker for "R scripts". Not sure what the best way to redirect is (maybe a new, locked issue with just a link to the respective manual page?).

@wlandau
Member

wlandau commented Dec 6, 2018

So glad to hear drake and code_to_plan() are working well for you!

I think the code_to_plan() function is really a breakthrough for most people not using the "function" approach (even though I also do it now to reduce the nodes of my graph).

Even so, part of my intent is to nudge people to use functions because it makes data analysis code cleaner and easier to maintain.

I think it would be worth having a dedicated section (in the manual?) describing its behavior, because I think people might be scared when first reading this long issue.

Is this section sufficient?

A lot of people might arrive here after searching the issue tracker for "R scripts". Not sure what the best way to redirect is (maybe a new, locked issue with just a link to the respective manual page?).

I tagged this issue with the "frequently asked question" label. Whenever the manual is deployed, the build script scrapes the issue tracker and lists all these labeled issues in this FAQ. Details here.
