Major improvements to robustness, speed, and {crew} integration

@wlandau released this 22 May 17:04

targets 1.1.0

Bug fixes

  • Send targets to the appropriate controller in a controller group when crew is used.

General improvements

  • Call gc() more appropriately when garbage_collection is TRUE in tar_target().
  • Add garbage_collection arguments to tar_make(), tar_make_clustermq(), and tar_make_future() to add optional garbage collection before targets are sent to workers. These arguments are different and independent from the garbage_collection argument of tar_target(). In high-performance computing scenarios, the garbage_collection argument of tar_make() and its variants controls what happens on the main controlling process, whereas the garbage_collection argument of tar_target() controls what happens on the worker.
  • Add garbage_collection and seconds_interval arguments to tar_make(), tar_make_clustermq(), tar_make_future(), and tar_config_set().
  • Downsize the tar_runtime object.
  • Remove the 100 Kb file size cutoff for determining whether to trust the file timestamp or recompute the hash when checking if a file is up to date (#1062). Introduce the "file_fast" format and the trust_object_timestamps option in tar_option_set() as safer alternatives.
  • Consolidate store constructors.
  • Allow crew controller groups (#1065, @mglev1n).
  • Expose more exponential backoff configuration parameters through tar_backoff(). The backoff argument of tar_option_set() now accepts output from tar_backoff(), and supplying a numeric is deprecated.
  • Fix the exponential backoff rules in the crew scheduling algorithm.
  • Implement tar_resources_network() to configure retries and timeouts for internal HTTP/HTTPS requests in specialized targets with format = "url", repository = "aws", and repository = "gcp". These settings also apply to syncing target files across network file systems when storage = "worker" or format = "file", which previously used hard-coded values of seconds_interval = 0.1 and seconds_timeout = 60.
  • Deprecate seconds_interval and seconds_timeout in tar_resources_url() in favor of the new equivalent arguments of tar_resources_network().
  • Safely withhold a target from its crew controller when the controller is saturated (#1074, @mglev1n).
  • Use exponential backoff when appending a target back to the queue in the case of a saturated crew controller.
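
The backoff, network, and garbage collection settings listed above might be combined in a _targets.R file roughly as follows. This is a sketch rather than anything taken from the release itself: the target names and all numeric values are illustrative assumptions, and the argument names shown for tar_backoff() (min, max, rate) and tar_resources_network() (max_tries, seconds_interval, seconds_timeout) should be checked against the documentation of the installed version.

```r
# _targets.R (illustrative sketch; names and values are assumptions)
library(targets)

tar_option_set(
  # Pass the output of tar_backoff() instead of the deprecated numeric.
  backoff = tar_backoff(min = 0.001, max = 0.1, rate = 1.5),
  # Retry and timeout settings for network-related operations.
  resources = tar_resources(
    network = tar_resources_network(
      max_tries = 5,
      seconds_interval = 0.5,
      seconds_timeout = 30
    )
  )
)

list(
  tar_target(
    name = raw_data,                     # hypothetical target name
    command = data.frame(x = seq_len(10)),
    garbage_collection = TRUE            # tar_target()-level setting (worker side)
  ),
  tar_target(
    name = total,                        # hypothetical target name
    command = sum(raw_data$x)
  )
)
```

With this file in place, calling tar_make(garbage_collection = TRUE) would additionally request garbage collection on the main controlling process before targets are sent to workers, independently of the tar_target() setting above.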

Speedups

  • Cache info about all of _targets/objects/ in tar_callr_inner_try() and update the cache as targets are saved to _targets/objects/ to avoid the overhead of repeated calls to file.exists() and file.info() (#1056).
  • Trust the timestamps by default when checking whether files in _targets/objects/ are up to date (#1062). tar_option_set(trust_object_timestamps = FALSE) ignores the timestamps and recomputes the hashes.
  • Write to _targets/meta/meta and _targets/meta/progress in timed batches instead of line by line (#1055).
  • Reporters now print progress messages in timed batches instead of line by line (#1055).
  • The summary and forecast reporters are much faster because they avoid going through data frames.
  • Avoid tempfile() when working with the scratch directory.
  • Use nanonext::mclock() instead of proc.time() when there is no risk of forked processes.
  • Replace withr with slightly faster/leaner base R alternatives.
  • Efficiently catch changes to the working directory instead of overburdening the pipeline with calls to setwd() (#1057).
  • Invoke tar_options methods in the internals instead of tar_option_get().
  • Avoid gsub() in store_init().
  • Avoid repeated calls to meta$get_record() in builder_should_run().
  • Mock the store object when creating a record from a metadata row.
  • Avoid cli::col_none() to reduce the number of ANSI characters printed to the R console.
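
For pipelines where file timestamps in _targets/objects/ cannot be relied upon, the timestamp shortcut described above can be switched off. A minimal sketch of a _targets.R file, assuming the option name given in these notes (the target itself is a hypothetical placeholder):

```r
# _targets.R (illustrative sketch)
library(targets)

# Ignore timestamps and recompute hashes when checking
# whether files in _targets/objects/ are up to date.
tar_option_set(trust_object_timestamps = FALSE)

list(
  tar_target(name = example_value, command = 1 + 1)  # hypothetical target
)
```

This setting favors safety over speed; leaving the option at its default keeps the faster timestamp-based check.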