Batch the metadata and console messages #1055

Closed

wlandau opened this issue Apr 27, 2023 · 6 comments

wlandau commented Apr 27, 2023

Proposal

I often use this _targets.R file for benchmarking/profiling studies:

# _targets.R file
library(targets)
list(
  tar_target(x, seq_len(1000)),        # upstream target: the integers 1 through 1000
  tar_target(y, x, pattern = map(x))   # dynamic branching: one branch of y per element of x
)

I profile it with proffer:

# R console
px <- proffer::pprof(
  targets::tar_make(callr_function = NULL, reporter = "silent"),
  host = "0.0.0.0",
  browse = FALSE
)

As of 4757d06, the flame graph on my local MacBook looks like this:

[flame graph screenshot: profiling]

There is a clear bottleneck in how targets writes metadata and progress data one line at a time to the text files in _targets/meta/. base::write() itself isn't the bottleneck: I tried a C implementation, which was only slightly faster. I also tried supplying a persistent file connection instead of the file path, but that too was only slightly faster.
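
For reference, a minimal sketch of the persistent-connection experiment mentioned above (the path and usage here are illustrative, not the actual targets internals):

# R console (illustrative)
# Keep one append-mode connection open instead of reopening the file for
# every line. In practice this was only slightly faster than passing the
# file path each time.
con <- file("_targets/meta/meta", open = "a")
writeLines("one line of metadata", con)  # repeated once per completed target
close(con)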

I think I will try batching metadata lines into groups and writing them to disk only at regular intervals, e.g. a few times per second, as sketched below. A simpler version of this approach already made the console messages for skipped targets print much faster.
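
To make the batching idea concrete, here is a rough sketch. The buffer_line() and flush_lines() helpers are hypothetical names, not the actual targets implementation:

# Hypothetical sketch: accumulate lines in memory and append them to the
# file only when enough time has passed since the last flush.
buffer <- new.env(parent = emptyenv())
buffer$lines <- character(0L)
buffer$last_flush <- proc.time()[["elapsed"]]

buffer_line <- function(line, path, seconds_interval = 0.25) {
  buffer$lines <- c(buffer$lines, line)
  if (proc.time()[["elapsed"]] - buffer$last_flush >= seconds_interval) {
    flush_lines(path)
  }
}

flush_lines <- function(path) {
  if (length(buffer$lines) > 0L) {
    # One append-mode write for the whole batch instead of one per line.
    write(buffer$lines, file = path, append = TRUE)
    buffer$lines <- character(0L)
  }
  buffer$last_flush <- proc.time()[["elapsed"]]
}

A final unconditional flush_lines() call at the end of the pipeline would also be needed so the last partial batch is not lost.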

wlandau commented Apr 27, 2023

I got this working for the metadata files in https://github.com/ropensci/targets/tree/1055, and those specific bottlenecks are removed (as long as the new seconds_interval argument is not too low). I just need to do the same thing for the reporters to eliminate the lag from hitting stdout too often.
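
Assuming seconds_interval is exposed through tar_make() as in that branch (an assumption; check the branch for the exact interface), usage might look like:

# R console (interface from the development branch; may change)
targets::tar_make(seconds_interval = 1)  # flush metadata at most ~once per second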

ginolhac commented

What you are doing, tirelessly, is just amazing. Thanks so much, William.

wlandau commented Apr 28, 2023

Thanks for the kind words.

I just merged the current updates into main. Next I will work on the reporters.

wlandau commented Apr 28, 2023

Should be fixed now. Really glad these bottlenecks are removed! #1056 is all the more noticeable now, and I will tackle that next.

wlandau closed this as completed Apr 28, 2023

wlandau commented Aug 17, 2023

When I originally implemented batched updating for metadata files, the saved metadata lagged behind the current in-memory metadata, which caused #1063. At the time, I patched #1063 by disabling batched writes for new rows in _targets/meta/meta (i.e. I reverted to writing one row at a time for newly built targets). Looking back, I am not satisfied with this solution. Frequent hits to _targets/meta/meta not only degrade the performance of a single pipeline but may also slow down large network file mounts on shared systems (cc @rpodcast).

As far as I know, literate programming integration is the only scenario where #1063 comes up (although I will need to trawl tarchetypes to be sure). So it makes sense to think about workarounds that would allow batched metadata updates there. Ideas:

  1. For tarchetypes::tar_render(), my first thought is to ship the metadata in memory with the target object (either inside it or alongside it).
  2. Add a new update_meta or needs_meta argument to tar_target() which, if TRUE, forces an update to _targets/meta/meta and _targets/meta/progress just before the target runs (in the prepare() method; I need to check whether that method always runs locally). It should default to FALSE. See the sketch after the next paragraph.

(1) would be efficient, but I do not think it is feasible with tar_quarto() because Quarto runs in a separate process. (2) might be less efficient, but it should make Quarto work, and the efficiency loss / IOPS burden won't be so bad because Quarto targets write files to disk anyway.
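
To make (2) concrete, here is a hypothetical sketch of the interface. Neither update_meta nor needs_meta exists yet; the name, placement, and default below are assumptions:

# _targets.R file (hypothetical interface for idea (2))
library(targets)
list(
  tar_target(data, simulate_data()),  # simulate_data() is a placeholder
  tar_target(
    report,
    render_report(data),  # placeholder command that reads _targets/meta/meta
    needs_meta = TRUE     # hypothetical flag: flush _targets/meta/meta and
                          # _targets/meta/progress just before this target runs
  )
)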

wlandau reopened this Aug 17, 2023

wlandau commented Aug 21, 2023

I also need to worry about hacks like #830 (comment) which require tar_read(). I don't generally encourage tar_read() inside a target, but sometimes there is no other way.
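
For context, a minimal illustration (not the exact code from #830) of tar_read() inside a target, which only works when the on-disk metadata is current at the moment the downstream target runs:

# _targets.R file (illustrative only)
library(targets)
list(
  tar_target(upstream, data.frame(x = 1:3)),
  tar_target(
    downstream,
    # Reads upstream's stored value straight from the data store, so it
    # depends on _targets/meta/meta being up to date on disk.
    tar_read(upstream)
  )
)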
