
How to add .R files to drake_plan() #193

Closed
pat-s opened this issue Jan 22, 2018 · 21 comments

Comments

@pat-s
Member

pat-s commented Jan 22, 2018

Hi there,

maybe I overlooked something, but I fail to properly add .R files to drake_plan() so that the dependencies are detected.

Reprex:

analyze <- drake_plan(
  test.md = knit('03_scripts/01_test.Rmd', quiet = TRUE),
  test2 = source('03_scripts/test2.R')
)

And in test2.R I have the following:

loadd(test.md)

However, when visualizing with

config <- drake_config(analyze)
vis_drake_graph(config, width = "100%", height = "500px")

dependencies are not detected for the .R file.
If I use a .Rmd file, everything works as expected.

Tried a few different approaches but could not successfully add a .R script. Help 😄

@wlandau-lilly
Collaborator

wlandau-lilly commented Jan 22, 2018

Great question, @pat-s! Questions like these are starting to come up a lot, and yours is the first in an FAQ I am starting.

EDIT

The following examples show how to set up the files for drake projects:

Get the code with

drake_example("basic")
drake_example("gsp")
drake_example("packages")

Each of the above writes a folder with code files. To make sure drake_example() outputs what you need, please be sure to use drake version 5.0.1.9000 or later.

Original response

The feature you want is designed for knitr source documents only, such as *.Rmd and *.Rnw files (although the file extension itself does not matter). Drake decides to analyze a file when it sees knit() or render() in the command, so nothing special happens if you use source() or your file has a .R extension.
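To illustrate the distinction, here is a minimal sketch (using the drake 5.x-era API discussed in this thread; the file names are hypothetical). The first command gets scanned because drake sees knit() in it; the second does not, because source() receives no special treatment:

```r
library(drake)
library(knitr)

plan <- drake_plan(
  # drake sees knit(), so it scans report.Rmd for loadd()/readd() calls
  # and treats those targets as dependencies of report.md:
  report.md = knit('report.Rmd', quiet = TRUE),
  # source() is ordinary R code to drake -- helper.R is never scanned:
  helper = source('helper.R')
)
```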

Why not detect loadd()/readd() dependencies in R scripts source()'d from commands? Because with drake, the focus is on your R session, not your R scripts. The idea is to source() the files beforehand so that you have a bunch of functions and small data objects in your workspace. Drake treats these objects as "imports", and it detects and analyzes them once they are in your environment. For example:

ls()

## character(0)

source("my_functions.R")

ls() # The simulate() function is defined in my_functions.R and treated as an import.

## [1] "simulate"

Drake looks inside the bodies of imported functions for non-file dependencies.

simulate

## function(n) {
##   data.frame(x = stats::rnorm(n), y = rpois(n, 1))
## }

deps(simulate)

## [1] "data.frame"   "rpois"        "stats::rnorm"

Now that you loaded your imports with source("my_functions.R"), you can make() your targets.

analyze <- data.frame(
  target = c("test_dataset", "'test.md'"),
  command = c(
    "simulate(5)",
    "knit('03_scripts/01_test.Rmd', output = \"test.md\", quiet = TRUE)"
  )
)

analyze

##         target                                                          command
## 1 test_dataset                                                      simulate(5)
## 2    'test.md' knit('03_scripts/01_test.Rmd', output = "test.md", quiet = TRUE)

make(analyze)

where 01_test.Rmd might look like this.


---
title: "Test Report"
author: Patrick Schratz
output: html_document
---

This report depends on `test_dataset`.

```{r example_chunk}
drake::readd(test_dataset)
```

A second make(analyze) will rebuild test.md if you have changed

  • the simulate() function (except for changes to whitespace or comments) or
  • either command in the analyze data frame (except for whitespace or comments) or
  • the contents of 03_scripts/01_test.Rmd.

Besides knitr reports, the loadd() and readd() functions are only meant for informally exploring your results. If they are embedded in your commands or imported functions, this could create a dangerous circularity in your workflow.
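As a hedged sketch of the anti-pattern being warned against (with hypothetical target names): embedding loadd() in a command hides a dependency from drake and can create exactly that circularity.

```r
# Anti-pattern: drake cannot see that summary_stats depends on test_dataset,
# because a loadd() call buried in a command is not tracked as a dependency.
bad_plan <- drake_plan(
  summary_stats = {
    loadd(test_dataset)   # hidden dependency
    summary(test_dataset)
  }
)

# Preferred: reference the target as a symbol so drake tracks it.
good_plan <- drake_plan(
  summary_stats = summary(test_dataset)
)
```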

Does that help? Does this way of doing things meet your needs, or does your use case require you to source() an R file from a command?

@pat-s
Member Author

pat-s commented Jan 22, 2018

Hi @wlandau-lilly, thanks for the quick and extensive answer!

So if I understand correctly, all analysis scripts should be Rmarkdown files while .R scripts should only contain functions that are used as imports and then analyzed for their dependencies?

Does that help? Does this way of doing things meet your needs, or does your use case require you to source() an R file from a command?

Right, in my case, I have a mixture of .Rmd and .R files.
The .Rmd files contain all the preprocessing and EDA stuff including post-analysis of modeling results.

However, the modeling stuff is stored in .R scripts as these can easily be run from the command line (e.g. using Rscript), also avoiding all the knitr output/overhead.
The modeling scripts are running on a server in parallel.
While I use .Rmd reports a lot, I still find the good old .R scripts easier to handle on servers.

As I said, these files contain modeling code and therefore depend on the preprocessing .Rmd file.
Hence, I wanted to insert something like load(preprocessing) so that drake recognizes that the scripts depend on the preprocessing script. Subsequently, if something changes in the preprocessing script, it should first rerun preprocessing.Rmd.

Afterwards, there are again additional .Rmd reports (for the post analysis) that depend on the .R files.
So my current approach looks like this:

analyze <- drake_plan(
  preprocessing.md = knit('03_scripts/01_preprocessing.Rmd', quiet = TRUE), # 1

  EDA.md = knit('03_scripts/02_EDA.Rmd', quiet = TRUE), # 2
  study_area.md = knit('03_scripts/03_study_area.Rmd', quiet = TRUE), # 2

  brt_sp_non = source('03_scripts/server/brt_sp_non.R'), # 3
  [...]
  brt_nsp_nsp = source('03_scripts/server/brt_nsp_nsp.R'), # 3

  cv_vis.md = knit('03_scripts/04_spcv_vis.Rmd', quiet = TRUE) # 4
)

where

  • #4 depends on #3 and #1
  • #3 depends on #1
  • #2 depends on #1

Hope that does not confuse you too much 😄 .
I think there will be plenty of users out there using either only .R files or a mixture of .Rmd and .R for their analysis.
Would be great if both file types could be used together.

@tiernanmartin
Contributor

Just wanted to chime in with my support for accommodating a mixture of .R and .Rmd files.

Similar to @pat-s, the distinction for my projects is usually:

  • .R for computationally expensive processes (in my use case these are operations on medium-size spatial data, some of which need to be run on a server)
  • .Rmd for simple data operations, reporting, and documentation

drake has helped me overcome a major pain point in my project workflow, so thanks a lot for your effort on this package! 👏

@wlandau-lilly
Collaborator

wlandau-lilly commented Jan 23, 2018

@pat-s and @tiernanmartin, thank you for clarifying. I think I can do a better job of explaining now.

So if I understand correctly, all analysis scripts should be Rmarkdown files while .R scripts should only contain functions that are used as imports and then analyzed for their dependencies?

You don't need any analysis scripts. R Markdown files are just a special accommodation, and *.R files are just a convenient way to store code. Drake focuses on your R session rather than your files. To drake, your "script" is all the functions in your environment and the commands in your workflow plan data frame. How you load things into your environment is up to you. This is a very unusual, even disconcerting way of thinking, but I think it makes projects cleaner and smoother in the long run.

For example, instead of

analyze <- drake_plan(
  ...
  preprocessing.md = knit('03_scripts/01_preprocessing.Rmd', quiet = TRUE),
  brt_sp_non = source('03_scripts/server/brt_sp_non.R'),
  ...
)
make(analyze)

you might try something like

source("03_scripts/server/brt_sp_non.R")
analyze <- drake_plan(
  ...,
  preprocessed_data = preprocess_data(read_my_data('data.csv')),
  brt_sp_non = build_brt_sp_non(preprocessed_data),
  ...
)
make(analyze)

and define the functions preprocess_data(), read_my_data(), and build_brt_sp_non() in "03_scripts/server/brt_sp_non.R". Drake knows that brt_sp_non depends on preprocessed_data and build_brt_sp_non() because these symbols are part of the command.

The commands in your workflow plan data frame are just arbitrary chunks of R code that return values. Other than the special accommodations for knit(), render(), and single-quoted file targets, you can treat commands like ordinary bits of R code.

Here is an example of a workflow that juggles a bunch of numbers.

my_plan <- drake_plan(
  a = 1 + 1,
  b = {
    x <- pi + a
    y <- sqrt(x)
    rand <- rnorm(10, sd = y)
    mean(rand)
  },
  c = a - 5,
  d = c(b, c)
)
config <- drake_config(my_plan)
vis_drake_graph(config)

[vis_drake_graph output]

make(my_plan)
readd(d)

## [1]  1.886363 -3.000000

Now, if you rely on R Markdown reports for sharing results, you have the option to create one and then loadd()/readd() non-file targets or imports into code chunks. But that is up to you.

@wlandau-lilly
Collaborator

wlandau-lilly commented Jan 23, 2018

I forgot to mention: drake analyzes functions and commands for dependencies, and it knows those functions can be nested. For example, drake knows g() is nested inside f().

library(drake)

my_plan <- drake_plan(
  a = f(1 + 1)
)

f <- function(x){
  g(x + 1)
}

g <- function(x){
  x + sqrt(4)
}

config <- drake_config(my_plan)
vis_drake_graph(config)

[vis_drake_graph output]

make(my_plan)

## ...
## target a

readd(a)

## [1] 5

make(my_plan)

## ...
## All targets are already up to date.

# Let's change function g().
g <- function(x){
  x + 3
}

# f() depends on g(), and target `a` depends on f().
# So target `a` is out of date, and `make()` recomputes it.
make(my_plan)

## ...
## target a

# The value of `a` changed.
readd(a)

## [1] 6

@pat-s
Member Author

pat-s commented Jan 23, 2018

Thanks @wlandau-lilly, I tried your suggestion with the following setup now:

source("03_scripts/01_preprocessing.R")
source("03_scripts/server/modeling_functions.R")
methods <- drake_plan(
  preprocessed_data = preprocess(
    pathogens = "/data/patrick/raw/survey_data/diseases240112_mod.csv", 
    points = "/data/patrick/mod/survey_data/points_mod.shp", 
    slope = "/data/patrick/mod/DEM/slope/slope_5m.tif", 
    ph = new("GDALReadOnlyDataset", "/data/patrick/raw/ph_europe/ph_cacl2"), 
    lithology = "/data/patrick/raw/lithology/CT_LITOLOGICO_25000_ETRS89.shp", 
    hail = "/data/patrick/raw/hail/Prob_GAM_square_area.tif", 
    elevation = "/data/patrick/mod/DEM/dem_5m.tif", 
    soil = "/data/patrick/raw/soil/ISRIC_world_soil_information/TAXNWRB_250m_ll.tif", 
    study_area = "/data/patrick/raw/boundaries/basque-country/Study_area.shp"),
  
  brt_sp_non = brt_sp_nsp(data = preprocessed_data, iterations = 200),
  preprocess.md = knit("03_scripts/01_preprocessing.Rmd"),
  strings_in_dots = "literals"
)

I wrapped the contents of the .Rmd file 01_preprocessing.Rmd in a function called preprocess() that gets imported via source("03_scripts/01_preprocessing.R").

I generate preprocessed_data which serves as input for brt_sp_nsp() which itself is defined in source("03_scripts/server/modeling_functions.R").

However, following your quick example, to generate a report for all the preprocessing I would have to add something like preprocess.md = knit("03_scripts/01_preprocessing.Rmd") that includes readd() or loadd() for the datasets (e.g. readd(pathogens)), because they already exist in the cache.
If I didn't, the dependencies would not be detected, right?

So currently I would need an .R file defining my function (e.g. 01_preprocessing.R) and an additional .Rmd file for the report, both with 99% redundant code?

Hm, this really gets kind of complicated now, and it seems that I have to modify all my scripts.
Maybe I am overcomplicating things here?
Actually, my analysis structure is "very simple", and I only need a tool that takes care of the following structure.

  • 01_preprocessing.Rmd #1
  • 02_EDA.Rmd (depends on #1)
  • 03_study_area.Rmd (depends on #1)
  • 04_scripts/20 files here.R (depend on #1)
  • 05_spcvis.Rmd (depends on #1 and #4)
  • 06_post_analysis.Rmd (depends on #1 and #4)

drake is very function oriented as far as I can see, and less focused on resolving simple dependencies between scripts? I understand that there needs to be a clear link between all the scripts to be able to connect them.
However, wouldn't something like the example below also be desirable?

drake_plan(
  1 = report("01_preprocessing.Rmd", depends_on = NULL),
  2 = report("02_EDA.Rmd", depends_on = 1),
  3 = report("03_study_area.Rmd", depends_on = 1),
  4 = script(list("04_scripts/20 files here.R"), depends_on = 1),
  5 = report("05_spcvis.Rmd", depends_on = c(1, 4))
)

@wlandau-lilly
Collaborator

I think we're making progress here.

However, following your quick example, to generate a report for all the preprocessing I would have to add something like preprocess.md = knit("03_scripts/01_preprocessing.Rmd") that includes readd() or loadd() for the datasets (e.g. readd(pathogens)), because they already exist in the cache.
If I didn't, the dependencies would not be detected, right?

So currently I would need an .R file defining my function (e.g. 01_preprocessing.R) and an additional .Rmd file for the report, both with 99% redundant code?

No additional .Rmd report should be necessary. In the quickstart vignette you mentioned, the report.Rmd file is for post-processing, not preprocessing. Its purpose is to display and inspect the results at the very end of the analysis. I intend to make that clearer very soon. I recommend against .Rmd files for preprocessing, since all you get is a final .md file rather than targets that would be useful to future commands. If I were you, I would completely let go of 01_preprocessing.Rmd.

drake is very function oriented as far as I can see, and less focused on resolving simple dependencies between scripts?

Yes, drake resolves dependencies of functions and commands, not scripts. R Markdown files are a special exception because people like to use them as final reports. I would encourage you to write your workflow kind of like this.

drake_plan(
  preprocessed_data = preprocessing_01(),
  EDA = EDA_02(preprocessed_data),
  study_area = study_area_03(preprocessed_data),
  other_work = other_work_04(preprocessed_data),
  spcvis = spcvis_05(preprocessed_data, other_work)
)

You don't need anything like depends_on because drake knows the targets depend on the arguments supplied to functions in commands. For example, drake knows that spcvis depends on preprocessed_data and other_work because these two targets are passed as arguments to your spcvis_05() function.
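As a sketch, this detection can be inspected with deps(), the helper shown earlier in this thread (later drake versions renamed it deps_code()). The function and target names below come from the hypothetical plan above:

```r
library(drake)

# deps() reports the symbols drake detects inside a command:
deps("spcvis_05(preprocessed_data, other_work)")
# Expect something like "other_work", "preprocessed_data", "spcvis_05" --
# the two upstream targets plus the imported function.
```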

wlandau-lilly added a commit that referenced this issue Jan 23, 2018
There was some confusion about the role of
R Markdown reports in drake workflows.
They are not actually necessary, and
the only targets they generate are report files
such as `.md` and `.html` files. I have changed
the report in the basic example to try to explain.
@pat-s, I hope this helps.
@wlandau-lilly
Collaborator

Something else I forgot to clarify: if one of your commands is knit('some_report.Rmd') or render('some_report.Rmd'), your target will be another file such as some_report.md or some_report.html. You will not be able to extract any of the data objects you make inside the code chunks of some_report.Rmd. In other words, some_report.Rmd is a dead end.

wlandau-lilly added a commit that referenced this issue Jan 23, 2018
@wlandau-lilly
Collaborator

wlandau-lilly commented Jan 23, 2018

@pat-s please see the new best practices guide. I try to lay out the issue there as best I can, and I will likely add more in the future. I think I have done enough to close here, but let's keep talking on the thread.

@wlandau
Member

wlandau commented Jan 26, 2018

I have been thinking a lot more about this thread, and I have restructured drake_example("basic"), drake_example("gsp"), and drake_example("packages") to demonstrate how to set up the files. It didn't make it into CRAN v5.0.0, but there's always next time. The files are on GitHub now. If I were you, I would peruse those examples to learn more about setting up the files for drake projects.

@pat-s
Member Author

pat-s commented Jan 26, 2018

Always good to hear when a discussion ends fruitfully! I haven't had time in the last few days to take a detailed look; I will do so later and report my thoughts!

@wlandau
Member

wlandau commented Jan 26, 2018

And I look forward to learning what you think. This is such an important piece for drake's docs to get right.

@pat-s
Member Author

pat-s commented Feb 4, 2018

I read it, and briefly want to say that it's clearer to me now than before. Maybe you can still add a section focusing only on the differences between the function-oriented and the script-oriented approach.

Imo this would fit into the "Get started" page so that new readers understand the difference right from the start.

Congrats on being accepted at rOpenSci 👍

@wlandau
Member

wlandau commented Feb 5, 2018

Thanks, @pat-s! And I am glad this is making more sense. I would like to keep the "Get started" and README.md documents short, but I think there is room to mention the issue and cross reference the best practices vignette.

@henningsway

henningsway commented May 1, 2018

It seems nice to me to be able to have the output of .Rmd files as possible targets. I often use .Rmd to query databases via the {sql} engine. It would also be nice to have .Rmd reports in the middle of a pipeline, to get a more streamlined "report" on some intermediate results.

With the current functionality and design, how would I best incorporate existing R Markdown scripts into the drake_plan()?

(I have just started experimenting with the package, though, and so far I am often shy about putting my scripts into functions, which will probably change :))

@wlandau
Member

wlandau commented May 1, 2018

With the current functionality and design, how would I best incorporate existing R Markdown scripts into the drake_plan()?

For existing R Markdown scripts, make sure those files exist, and then use knitr_in() to declare them in your drake_plan() (or imported functions, if applicable).

plan <- drake_plan(
  rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  ),
  ...
)
make(plan)

Inside the active code chunks in your report, you can use loadd() and readd() to (1) load other drake targets into your report, and (2) tell drake that the output file depends on those targets. See this example for details. In the GitHub development version of drake, you can get the code files with drake_example("main").

It would also be nice to have .Rmd reports in the middle of a pipeline, to get a more streamlined "report" on some intermediate results.

You can generate .Rmd files as output targets, but then drake will not analyze the active code chunks for dependencies. For that, I think we need #304 (ref: #304 (comment)). Right now, drake figures out all the dependencies of everything at the beginning, so it can only analyze the .Rmd reports that already exist before you call make(). Definitely a goal, but unfortunately not likely to happen anytime soon.

@ablack3

ablack3 commented Oct 28, 2018

Hi @wlandau,
Perhaps this question has been answered elsewhere, but this thread seems like a fitting place for it. What are your thoughts on all the scratch work that goes into an analysis, and how does that fit in with using drake? Should all my trial-and-error code ultimately be left out of the drake plan? R Markdown files are nice for trial and error because I can keep a readable record of all the things I tried while working on an analysis. drake reminds me a bit of writing math proofs: you do all your scratch/development work on the side (i.e. .Rmd files) and then eventually submit a pristine proof that cleanly goes from start to finish (i.e. a drake plan). What do you think?

@wlandau
Member

wlandau commented Oct 28, 2018

@ablack3 great question. In the latest release (version 6.1.0), you can actually start with scratch work in a script or notebook and then use code_to_plan() to convert it into a plan when you are ready. Let's take a notebook called scratch.Rmd.

[screenshot 1: initial scratch.Rmd]

We fix our code (the variable names should have capital letters).

[screenshot 2: corrected scratch.Rmd]

And maybe we add some more code. You can put multi-line commands in curly braces.

[screenshot 3: scratch.Rmd with additional multi-line code]

And then when we are ready, we can convert it to a drake plan. First we need to make sure we have the CodeDepends package.

install.packages("BiocManager")
BiocManager::install("CodeDepends")

Then we call code_to_plan().

library(drake)
plan <- code_to_plan("scratch.Rmd")
config <- drake_config(plan)
vis_drake_graph(config)

make(plan)
#> target model
#> target coef
#> target pos_coef

Created on 2018-10-28 by the reprex package (v0.2.1)

If we find a mistake later, we can convert the plan into a notebook or a script to go back and tinker with stuff again.

# install.packages("styler")
plan_to_notebook(plan, "my_notebook.Rmd")
plan_to_code(plan, "my_script.R")

@wlandau
Member

wlandau commented Oct 28, 2018

A couple alternatives to this approach are drake_build() and drake_debug(). drake_build() builds a single target in an already established plan (after loading the dependencies). drake_debug() is similar, but it runs the command in debug mode (see ?debug for details).
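A minimal sketch of those two helpers (assuming the drake 6.x API; the plan and the target name `fit` are hypothetical):

```r
library(drake)

plan <- drake_plan(fit = lm(mpg ~ wt, data = mtcars))
config <- drake_config(plan)

# Build only the `fit` target, loading its dependencies first:
drake_build(fit, config = config)

# Run the same command under the interactive debugger (see ?debug):
drake_debug(fit, config = config)
```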

@pat-s
Member Author

pat-s commented Dec 5, 2018

I finally converted my projects to use drake. Awesome work you did there. 🎉
I've rarely seen a pkg that got so much love.

I think the code_to_plan() function is really a breakthrough for most people not using the "function" approach (even though I also do it now to reduce the nodes of my graph).

I think it would be worth having a dedicated section (in the manual?) describing its behavior, because I think people might be scared when first reading this long issue.
A lot of people might arrive here after searching the issue tracker for "R scripts". Not sure what the best way to redirect is (maybe a new, locked issue with just a link to the respective manual page?).

@wlandau
Member

wlandau commented Dec 6, 2018

So glad to hear drake and code_to_plan() are working well for you!

I think the code_to_plan() function is really a breakthrough for most people not using the "function" approach (even though I also do it now to reduce the nodes of my graph).

Even so, part of my intent is to nudge people to use functions because it makes data analysis code cleaner and easier to maintain.

I think it would be worth having a dedicated section (in the manual?) describing its behavior, because I think people might be scared when first reading this long issue.

Is this section sufficient?

A lot of people might arrive here after searching the issue tracker for "R scripts". Not sure what the best way to redirect is (maybe a new, locked issue with just a link to the respective manual page?).

I tagged this issue with the "frequently asked question" label. Whenever the manual is deployed, the build script scrapes the issue tracker and lists all these labeled issues in this FAQ. Details here.
