[FR] `*.knit.md` should have a unique name to allow parallelization #2454

salim-b · 2023-02-22T16:43:56Z

rmarkdown / knitr / Pandoc can be slow when generating 1000s of PDFs, especially when using xelatex instead of the default pdflatex renderer. An easy and straightforward way to speed things up on multicore systems is to rely on the means provided by the awesome future framework, especially the furrr package, a drop-in replacement for purrr.

Now if one tries to combine rmarkdown::render() with, say, future_pwalk() and the multisession backend, the rendering will most likely fail because the individual workers interfere with each other by reading from and writing to the same intermediate *.knit.md file.

This can be worked around by giving the input file a unique name in the map/walk function before feeding it to rmarkdown::render() (or by setting a different intermediates_dir; but varying the intermediates dir is discouraged as far as I understand, and it introduced additional bugs in my tests).
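A minimal sketch of the unique-name workaround described above. The helper names `unique_input_path()` and `render_unique()`, the file `report.Rmd` and the `id` parameter are made up for illustration; only `rmarkdown::render()`, `future::plan()` and `furrr::future_pwalk()` are real functions, and the parallel part is left commented out because it needs those packages and an actual input file.

```r
# Build a unique sibling file name for `input`, e.g. "report.Rmd" ->
# "report-<random>.Rmd". tempfile() only generates the name; it does not
# create the file.
unique_input_path <- function(input) {
  ext <- tools::file_ext(input)
  stem <- tools::file_path_sans_ext(basename(input))
  tempfile(pattern = paste0(stem, "-"), tmpdir = dirname(input),
           fileext = paste0(".", ext))
}

# Render one job from a uniquely named copy of the input, so that the
# intermediate files of concurrent jobs are derived from different stems.
render_unique <- function(input, output_file, params = list()) {
  tmp_input <- unique_input_path(input)
  file.copy(input, tmp_input)
  on.exit(unlink(tmp_input), add = TRUE)  # remove the copy afterwards
  rmarkdown::render(tmp_input, output_file = output_file,
                    params = params, envir = new.env(), quiet = TRUE)
}

# Hypothetical usage (requires the future, furrr and rmarkdown packages):
# future::plan(future::multisession)
# jobs <- list(
#   input       = rep("report.Rmd", 3),
#   output_file = sprintf("report-%d.pdf", 1:3),
#   params      = list(list(id = 1), list(id = 2), list(id = 3))
# )
# furrr::future_pwalk(jobs, render_unique)
```

The copy lives next to the original input so that relative paths inside the document keep resolving; whether that is enough for every output format is not something this sketch verifies.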

A solution at the core of rmarkdown / knitr for this issue would be to ~~give the intermediate *.knit.md a unique filename from the beginning. Ideally one that is based on its actual content (a fast hash function like the ones from the xxHash family would be suitable I guess; rlang::hash_file() provides XXH128).~~ add a sufficiently large random string to all intermediate output filenames (per rmarkdown::render() invocation).

What do you think?

As I realized after submission, this issue could be considered a duplicate of #499.

cderv · 2023-05-24T14:14:21Z

This seems like a good idea. I wonder what this could break in other tools' pipelines 🤔 (also having Quarto in mind)

But is renaming the *.knit.md enough? There will still be the problem of the shared intermediates_dir, won't there? It is built upon the original input file name.

In your testing, was changing the knit output name the only thing you had to do?

A possible small hack for now: the knit output extension can be changed using the rmarkdown.knit.ext option, but not the meta part.

If a hash is enough, we could add something like a rmarkdown.knit.meta_ext_fun option, through which a user (or an external tool) could provide the function they want:

diff --git a/R/render.R b/R/render.R
index 9d73610d..60045dee 100644
--- a/R/render.R
+++ b/R/render.R
@@ -395,8 +395,12 @@ render <- function(input,
   # as `.md~` will be ignored.
   input <- basename(input)
   knit_input <- input
+  # input_based meta
+  meta_ext_fun <- getOption("rmarkdown.knit.meta_ext_fun", function(input) "knit")
+  if (!is.function(meta_ext_fun))
+    stop2("'rmarkdown.knit.meta_ext_fun' has been provided but it should be a function with 'input' as argument")
   knit_output <- intermediates_loc(
-    file_with_meta_ext(input, "knit", getOption("rmarkdown.knit.ext", "md"))
+    file_with_meta_ext(input, meta_ext_fun(input), getOption("rmarkdown.knit.ext", "md"))
   )
   intermediates <- c(intermediates, knit_output)

This way you could do options(rmarkdown.knit.meta_ext_fun = rlang::hash_file) to get a knit output file with a hash in the name (e.g. mydoc.Rmd -> mydoc.61675d71e5e752429bd53f0824874f3d.md)

Just an idea, if changing the knit output filename really helps.

salim-b · 2023-05-24T15:39:25Z

> But is renaming the *.knit.md enough? There will still be the problem of the shared intermediates_dir, won't there? It is built upon the original input file name.

I actually didn't test only renaming the *.knit.md. Instead I gave the input file a unique name which worked. You're probably right about the rest of the intermediate files.

> In your testing, was changing the knit output name the only thing you had to do?

No (didn't test that as stated above).

> if hash is enough (...)

Actually, I don't think using a hash that is based solely on the input file content is a good idea (sorry for my initial suggestion). If we choose to render the exact same input file multiple times in parallel (e.g. with different params), we would run into the same interference again. So instead I'd recommend using a sufficiently large random string derived from R's PRNG state, so there won't be any clashes in the first place.
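To illustrate the clash, here is a small sketch in base R. It uses `tools::md5sum()` purely as a stand-in for `rlang::hash_file()`, and the input file and its `id` parameter are invented for the example.

```r
# Write a small stand-in input file (hypothetical content).
input <- tempfile(fileext = ".Rmd")
writeLines(c("---", "params:", "  id: 1", "---", "", "Report body"), input)

# Two render jobs for the same file that differ only in their parameters.
jobs <- list(
  list(input = input, params = list(id = 1)),
  list(input = input, params = list(id = 2))
)

# A content hash ignores the parameters, so both jobs get the same tag and
# would therefore still share one intermediate file name.
content_tags <- vapply(
  jobs,
  function(job) unname(tools::md5sum(job$input)),
  character(1)
)
identical(content_tags[1], content_tags[2])  # TRUE
```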

Thanks for the code suggestion with a new rmarkdown.knit.meta_ext_fun option. But as written above, changing only the knit output filename will most likely not be enough. Instead, all intermediate output files should contain the random part. I've updated my initial post accordingly.

cderv · 2023-05-24T17:02:59Z

> If we choose to render the exact same input file multiple times in parallel (e.g. with different params), we would run into the same interference again.

Ok - I agree with that and was surprised by the suggestion. Thanks for confirming it is not enough.

> So instead I'd recommend using a sufficiently large random string derived from R's PRNG state, so there won't be any clashes in the first place.

We could try that, but I believe the generation of all the intermediate directories would have to be changed. This is what #499 is about, and it is not easy to do: this part is really sensitive to breaking changes. There is also the specific case of LaTeX, which expects a lot of its intermediates to be local to the input file.

Though I agree this would be very interesting for solving issues with parallel rendering.
Currently, the way rmarkdown works means that parallel use requires writing a wrapper around render() that copies assets, creates some directories, etc.
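Such a wrapper could look roughly like the sketch below. `make_work_dir()` and `render_isolated()` are hypothetical helpers, not part of rmarkdown, and the sketch assumes that every file the document needs is listed in `assets` and referenced by a path relative to the input file.

```r
# Create a fresh, uniquely named working directory for one render job.
make_work_dir <- function(prefix = "render-") {
  dir <- tempfile(pattern = prefix)
  dir.create(dir)
  dir
}

# Render `input` inside its own directory so that all intermediate files
# stay separate from those of other jobs, then copy the result out.
render_isolated <- function(input, output_file, assets = character(),
                            params = list()) {
  work_dir <- make_work_dir()
  on.exit(unlink(work_dir, recursive = TRUE), add = TRUE)

  # Copy the input and its assets (files or directories) next to each other.
  file.copy(c(input, assets), work_dir, recursive = TRUE)

  # render() returns the path of the generated output file.
  out <- rmarkdown::render(file.path(work_dir, basename(input)),
                           params = params, envir = new.env(), quiet = TRUE)

  dir.create(dirname(output_file), showWarnings = FALSE, recursive = TRUE)
  file.copy(out, output_file, overwrite = TRUE)
  invisible(output_file)
}
```

Copying everything for every job costs extra I/O, and documents that reach outside their own directory would still need special handling.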

Quite a challenge that requires some dedicated time on this specifically, and probably lots of testing.

salim-b · 2023-05-24T17:12:53Z

> We could try that, but I believe the generation of all the intermediate directories would have to be changed. This is what #499 is about, and it is not easy to do: this part is really sensitive to breaking changes. There is also the specific case of LaTeX, which expects a lot of its intermediates to be local to the input file.

Well, instead of using a unique/isolated intermediates_dir per invocation of rmarkdown::render(), we could stick to the default one (the parent directory of the input file) but still append a random part to all intermediate files.

I.e. rendering my_file.Rmd would produce the intermediate files my_file-XXXXXXXX.md, my_file-XXXXXXXX.tex, my_file-XXXXXXXX.log, my_file-XXXXXXXX.aux and whatnot, where XXXXXXXX would correspond to the random string generated for that particular invocation of rmarkdown::render().

No?
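The naming scheme proposed above can be sketched in a few lines of base R. Both helpers are hypothetical and only compute names; nothing here changes how rmarkdown itself names its intermediates, and the list of extensions is just the one from the example.

```r
# Draw an alphanumeric suffix for one render invocation. This uses R's
# PRNG, so workers that were seeded identically would draw the same suffix.
random_suffix <- function(n = 8) {
  paste(sample(c(letters, LETTERS, 0:9), n, replace = TRUE), collapse = "")
}

# List the intermediate file names one invocation would use for `input`.
intermediate_names <- function(input, suffix = random_suffix(),
                               exts = c("md", "tex", "log", "aux")) {
  stem <- tools::file_path_sans_ext(basename(input))
  paste0(stem, "-", suffix, ".", exts)
}

# With a fixed suffix this reproduces the names from the example:
intermediate_names("my_file.Rmd", suffix = "XXXXXXXX")
```

Since the .tex file carries the suffix too, LaTeX would presumably derive its .log and .aux names from it, but that is an assumption that would need testing.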

cderv · 2023-05-24T17:55:03Z

Probably. I wonder how much we rely on the stem of the input file. Like myFile.Rmd -> using myFile for a lot of intermediate resources.

But we could probably track every occurrence and change this. I just don't know how LaTeX would behave with this. Probably ok, as the .tex file would have the id in its name...

This is worth trying for sure. We just think this is not an easy change; @yihui has more experience with path handling over the years in R Markdown.

cderv mentioned this issue on Mar 9, 2023: Parallel creation of parameterized reports fails due to file conflicts quarto-dev/quarto-cli#4730 (Open, 4 tasks)

cderv added the "theme: paths" label (path related improvement / issue) on May 24, 2023

cderv mentioned this issue on Jul 26, 2023: Shiny prerendered dependencies are sometimes written twice into prerendered HTML #2499 (Closed)

cderv mentioned this issue on Jul 3, 2024: rmarkdown::render crashes in parallel with parallel::makeForkCluster on macOS and Apple Silicon #2561 (Open, 5 tasks)
