Backtracking for missing Digests (aka: remove "eager fetch") #11331

Closed
stuhood opened this issue Dec 17, 2020 · 7 comments · Fixed by #15850


stuhood commented Dec 17, 2020

When a Digest is missing from the Store, we don't currently have a way to backtrack far enough to re-compute it.

Backtracking the appropriate amount is challenging to do via exception handling -- which @rules don't have currently, but which is still useful to discuss. No particular @rule has enough information on its stack to know the true source of a Digest (which might be deeply nested below one of its await Gets or arguments), and even if it did determine which dependency was the source of the bad Digest, the @rule would need an API to invalidate the relevant subgraph.

Instead, we should lean further into the fact that the NodeOutput type can report the Digests that it computed, and trigger Graph::invalidate_from_roots for the Node(s) that produced the missing Digest. This will naturally dirty the nodes that depend on those Digests, causing them to be canceled and re-run.

From an implementation perspective, some alternatives:

  • We could insert error handling, either in our generic Node wrapping code or around calls to Intrinsics, that would consume a struct MissingDigests(Vec<Digest>) variant of Failure produced inside the relevant intrinsics. The error handling would call Graph::invalidate_from_roots to invalidate all Nodes with matching digests, and then return Failure::Invalidated to cancel the current Node (see the sketch after this list).
  • It might be possible for the Graph to natively support invalidating arbitrary other Nodes when a Node fails. This might look like adding a bit more structure to NodeError and/or NodeContext to decide which errors should trigger additional logic to match and invalidate Nodes.
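
A minimal sketch of that first alternative, using simplified stand-ins for the engine's types (the real `Failure`, `Node`, and `Graph::invalidate_from_roots` are more involved):

```rust
// Simplified stand-ins: the real engine types carry much more information.
use std::collections::HashSet;

type Digest = (u64, usize); // stand-in for (fingerprint, size)

enum Failure {
    MissingDigests(Vec<Digest>),
    Invalidated,
}

struct Node {
    // The Digests that this Node reported producing in its NodeOutput.
    produced_digests: HashSet<Digest>,
}

struct Graph {
    nodes: Vec<Node>,
}

impl Graph {
    // Stand-in for Graph::invalidate_from_roots: dirty every Node matching the
    // predicate, which transitively dirties (and cancels) its dependents.
    fn invalidate_from_roots(&mut self, predicate: impl Fn(&Node) -> bool) {
        for node in self.nodes.iter_mut() {
            if predicate(node) {
                // ... mark dirty and propagate to dependents ...
            }
        }
    }
}

// The error handling that would live in the generic Node-wrapping code (or
// around Intrinsic calls).
fn handle_failure(graph: &mut Graph, failure: Failure) -> Failure {
    match failure {
        Failure::MissingDigests(missing) => {
            // Invalidate every Node that reported producing one of the missing
            // Digests; their dependents are dirtied, canceled, and re-run.
            graph.invalidate_from_roots(|node| {
                missing.iter().any(|d| node.produced_digests.contains(d))
            });
            // The current Node cancels itself; it will re-run against the
            // freshly recomputed outputs.
            Failure::Invalidated
        }
        other => other,
    }
}
```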

TODO: A blocker for this issue being a solution to the "fall back to local execution" problem is determining what will happen differently on the second run of a Process, and how that different behavior will be triggered. Simply re-running the original Node again is likely to result in the exact same output: for example, if the Digest was produced by a cache lookup, re-looking up in the cache will probably produce the same Digest again.

So when invalidating, we'd also need to decide how to affect the second run of the Process. A few potential options there:

  • Could record per-Process attempt histories (or even just an "attempt count") somewhere, most likely on the Session or in the CommandRunner itself, and let CommandRunners consult the history to decide whether to skip caches if need be. For example, a cache CommandRunner might skip the cache on a second attempt.
    • This is potentially complicated, because it would mash up our "stacked/nested CommandRunners" with the history, meaning that what happens in CommandRunner::run would depend both on how the runners had been stacked/composed and on the history.
    • This might be made less complicated if we moved the "stacking/nesting" of CommandRunners that we currently do during construction into a method that runs once per attempt of a Process. While that method would itself be complicated, it would be less complicated than reasoning about both the interactions of many nested runners and an attempt history (a rough sketch follows this list).
  • Could globally reconfigure the CommandRunners to "drop back to local only" when we encounter a missing Digest.
    • This is probably too large a hammer: while it would likely work, it would completely disable remoting/remote caching after the first error, and relying on zero-error runs would be fragile.
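
A rough sketch of that per-attempt composition, using hypothetical simplified types (none of these are the engine's actual API): the CommandRunner stack is built per attempt, so a retry after a missing Digest simply omits the caching layers.

```rust
struct Process {
    argv: Vec<String>,
}

trait CommandRunner {
    fn run(&self, process: &Process) -> Result<String, String>;
}

struct LocalRunner;
impl CommandRunner for LocalRunner {
    fn run(&self, _process: &Process) -> Result<String, String> {
        Ok("ran locally".to_string())
    }
}

// A caching wrapper that consults a cache before delegating to `inner`.
struct CacheRunner<R: CommandRunner> {
    inner: R,
}
impl<R: CommandRunner> CommandRunner for CacheRunner<R> {
    fn run(&self, process: &Process) -> Result<String, String> {
        // ... check the cache; on a miss, delegate ...
        self.inner.run(process)
    }
}

// Rather than composing the runners once at construction time, compose them per
// attempt: the first attempt may use caches, while later attempts skip them so
// that a partial/poisoned cache entry cannot reproduce the missing Digest. The
// attempt count itself might live on the Session or in the CommandRunner.
fn runner_for_attempt(attempt: usize) -> Box<dyn CommandRunner> {
    if attempt == 0 {
        Box::new(CacheRunner { inner: LocalRunner })
    } else {
        Box::new(LocalRunner)
    }
}

fn main() {
    let process = Process { argv: vec!["echo".to_string(), "hi".to_string()] };
    for attempt in 0..2 {
        let runner = runner_for_attempt(attempt);
        let _ = runner.run(&process);
    }
}
```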

stuhood commented Dec 28, 2020

The minor caveat that I had added in my previous comment is actually fairly substantial: I've expanded the section of the ticket description below the line with a big TODO. #11330 is probably not a prerequisite for this one, but I'm confident that it is much smaller, and so a good idea to do first.

Eric-Arellano added a commit that referenced this issue Jan 7, 2021
Closes #11330. As explained there, lazy fetching is best for performance, but as currently factored, it can result in unrecoverable errors when remote caching fails, e.g. when a digest is missing.

By eagerly fetching, we limit the surface area where remote caching is used; this allows our error-catching logic to always be used, which gracefully falls back to re-running the process locally.

This is likely a temporary solution, and this option will likely be deprecated in the future. The more robust solution is to implement #11331.

stuhood commented Jul 27, 2021

It would also be good to justify implementing backtracking by confirming that it reduces the total number of bytes downloaded by clients.

stuhood changed the title from "Backtracking for missing Digests" to "Backtracking for missing Digests (aka: remove "eager fetch")" on Mar 14, 2022

stuhood commented Mar 14, 2022

> It would also be good to justify implementing backtracking by confirming that it reduces the total number of bytes downloaded by clients.

Based on production experience, this aspect is unlikely to be a significant benefit, since users are seeing relatively little actually being downloaded in most cases. The larger benefit will be reducing the total number of requests, which are concurrency-bounded and introduce round-trip time.

And in particular, it looks like the concurrency limit for access to the Store does actually impact total runtime for fully cached use cases, because the eager fetch of cache entries ends up taking a non-trivial amount of time.


stuhood commented May 18, 2022

To gauge the benefit of this change, I set remote_cache_eager_fetch=false in pantsbuild/pants CI and another private repository.

With eager_fetch=False:

  • checking the remote cache consistently requires only 1 load_bytes_with call (to load the Tree for the result), which meant ~20% and ~40% fewer calls overall in the two repositories (though this reduction is lower than it could be, for the reason below)
  • in the pantsbuild/pants repo, we downloaded 80% fewer bytes from the remote (~84MB vs ~14MB), but in the private repo, we ended up downloading more data, which was strange (see below).

Investigating the strange number of bytes downloaded in the private repo, I determined that the reason for the increased byte count is that we don't dedupe load_bytes_with calls (which we did for uploads in #12087). I've opened #15524 about that.
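
For reference, the kind of deduplication meant here would look roughly like the following: concurrent loads of the same digest share one in-flight future instead of each issuing its own call. This is an illustrative sketch using the `futures` crate (the `DedupedFetcher` type is hypothetical), not necessarily what #15524 will implement:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use futures::future::{BoxFuture, FutureExt, Shared};

// Hypothetical digest key and payload types standing in for the real engine types.
type Digest = [u8; 32];
type Payload = Arc<Vec<u8>>;

#[derive(Clone, Default)]
struct DedupedFetcher {
    // Fetches currently in flight, keyed by digest. A second caller for the same
    // digest awaits the existing future rather than issuing a new network call.
    in_flight: Arc<Mutex<HashMap<Digest, Shared<BoxFuture<'static, Payload>>>>>,
}

impl DedupedFetcher {
    async fn load(&self, digest: Digest) -> Payload {
        let fut = {
            let mut in_flight = self.in_flight.lock().unwrap();
            in_flight
                .entry(digest)
                .or_insert_with(|| Self::fetch_remote(digest).boxed().shared())
                .clone()
        };
        let payload = fut.await;
        // Drop the entry once resolved so a later caller can re-fetch if needed.
        self.in_flight.lock().unwrap().remove(&digest);
        payload
    }

    async fn fetch_remote(digest: Digest) -> Payload {
        // Placeholder for the real `load_bytes_with` network call.
        Arc::new(digest.to_vec())
    }
}
```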

Lazily fetching bytes means that we won't fetch the outputs of processes, but we still do need to fetch the inputs of processes in order to speculate... and in cases where lots of processes depend on a large input, that could mean lots of redundant fetching (whereas with eager fetching, all downstream processes wait for the producing process to complete and download its output). It also means more total calls to load_bytes_with, even though the number of unique load_bytes_with calls with eager_fetch=False is ~50% lower.

EDIT: Oh, and perhaps most importantly: pantsbuild/pants CI completed 15% faster with eager_fetch=False.

stuhood self-assigned this Jun 2, 2022
stuhood added a commit that referenced this issue Jun 7, 2022
In order to begin lazily fetching digests for #11331, we need to be able to differentiate errors caused by missing digests.

This change introduces `{StoreError,ProcessError,Failure}::MissingDigest`, which are bubbled up to the level of graph `Node`s, and which will allow us to introduce retry in a followup change. `MissingDigest` is produced by `load_bytes_with` and a few other low-level `Store` methods (which now return an `Err(MissingDigest)` rather than `Option`).

There should not be any behavior changes.

[ci skip-build-wheels]
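
As an illustration of the shape of that change (simplified stand-ins, with `load_bytes` standing in for the real `load_bytes_with`): low-level `Store` methods return a typed `MissingDigest` error instead of `Option`, so callers can distinguish missing content from other failures.

```rust
use std::collections::HashMap;

type Digest = (u64, usize); // stand-in for (fingerprint, size)

#[derive(Debug)]
enum StoreError {
    MissingDigest(Digest),
    Unclassified(String),
}

struct Store {
    blobs: HashMap<Digest, Vec<u8>>,
}

impl Store {
    // Previously this would have returned `Option<Vec<u8>>`; returning a typed
    // error lets `MissingDigest` bubble up to the Node level intact.
    fn load_bytes(&self, digest: Digest) -> Result<Vec<u8>, StoreError> {
        self.blobs
            .get(&digest)
            .cloned()
            .ok_or(StoreError::MissingDigest(digest))
    }
}
```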

stuhood commented Jun 7, 2022

#15761 takes a first step here. The rest of the change should be roughly as described above, with one other detail: in order to retry the synchronous methods of interface.rs, which are used in @goal_rules, we will actually need to propagate MissingDigest through Python code and re-catch it at the @rule boundary. Should be possible, but I didn't immediately see how to add a field to a custom exception type with pyo3.

stuhood added a commit that referenced this issue Jun 14, 2022
The `WrappedNode::run_wrapped_node` method was never actually called "via" `trait WrappedNode`, because it was only called immediately after matching a particular `NodeKey` implementation. So having the method was a bit misleading, but it also meant that:
1. all node implementations needed to have the same shape for their implementation method
    * 9 out of 11 of them had an unused parameter.
    * #11331 will introduce another parameter for the `ExecuteProcess` node, which would have been unused for 10/11 nodes
2. `run_wrapped_node` was unnecessarily boxed due to `WrappedNode` needing to be marked `async_trait`. Removing the boxing means one fewer allocation per node.

This change removes `WrappedNode::run_wrapped_node`, and replaces it with a convention of having a `run_node` method.

[ci skip-build-wheels]
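
An illustrative sketch of the resulting convention, with simplified stand-in types: each node gets an inherent `run_node` method with exactly the signature it needs, and dispatch happens directly at the `NodeKey` match site rather than through a boxed trait method.

```rust
struct Context;
struct ExecuteProcess;
struct ReadLink;

enum NodeKey {
    ExecuteProcess(ExecuteProcess),
    ReadLink(ReadLink),
}

enum NodeOutput {
    ProcessResult(String),
    LinkDest(String),
}

impl ExecuteProcess {
    // Free to take extra parameters (e.g. a backtracking attempt count) without
    // forcing the same unused parameter onto every other node type.
    async fn run_node(self, _context: &Context, attempt: usize) -> NodeOutput {
        NodeOutput::ProcessResult(format!("attempt {attempt}"))
    }
}

impl ReadLink {
    async fn run_node(self, _context: &Context) -> NodeOutput {
        NodeOutput::LinkDest("target".to_string())
    }
}

impl NodeKey {
    async fn run(self, context: &Context, attempt: usize) -> NodeOutput {
        // Dispatch happens immediately after matching the variant, so no trait
        // object or boxed future is needed.
        match self {
            NodeKey::ExecuteProcess(n) => n.run_node(context, attempt).await,
            NodeKey::ReadLink(n) => n.run_node(context).await,
        }
    }
}
```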

stuhood added a commit that referenced this issue Jun 17, 2022
To prepare for #11331, it's necessary to clarify the building of the "stacks" of command runners, in order to progressively remove caches on each backtracking attempt.

Both for performance reasons (to allow an attempt to execute a process to proceed concurrently with checking the cache), and to allow us to more easily conditionally remove remote caching from remote execution, this change removes the synchronous cache lookup from `remote::CommandRunner` by instead wrapping it in `remote_cache::CommandRunner`.

Additionally, this change refactors the `CommandRunner` stacking, and produces a stack per backtrack attempt for #11331.
stuhood added a commit that referenced this issue Jun 17, 2022
…ore resilient (#15850)

As described in #11331, in order to avoid having to deal with missing remote content later in the pipeline, `--remote-cache-eager-fetch` currently defaults to true. This means that before calling a cache hit a hit, we fully download the output of the cache entry.

In warm-cache situations, this can mean downloading a lot more than is strictly necessary. In theory, you could imagine `eager_fetch=False` downloading only stdio and no file content at all for a 100% cache-hit-rate run of tests. In practice, high-hit-rate runs see about 80% fewer bytes downloaded and 50% fewer RPCs (see the May 18 comment above on #11331) than with `eager_fetch=True`.

To begin moving toward disabling `eager_fetch` by default (and eventually, ideally, removing the flag entirely), this change begins "backtracking" when missing digests are encountered. Backtracking is implemented by "catching" `MissingDigest` errors (introduced in #15761), and invalidating their source `Node` in the graph. When a `Node` that produced a missing digest re-runs, it does so using progressively fewer caches (as introduced in #15854), in order to cache bust both local and remote partial cache entries.

`eager_fetch=False` was already experimental, in that any `MissingDigest` error encountered later in the run would kill the entire run. Backtracking makes `eager_fetch=False` less experimental, in that we are now very likely to recover from a `MissingDigest` error. But it is still the case with `eager_fetch=False` that persistent remote infrastructure errors (those that last longer than our retry budget or timeout) could kill a run. Given that, we will likely want to gain more experience and further tune timeouts and retries before changing the default.

Fixes #11331.

[ci skip-build-wheels]
stuhood added a commit that referenced this issue Jun 23, 2022
…te stdio-specific methods (#15890)

As described in #15887: `StreamingWorkunit` plugins have never been able to set thread-local `WorkunitStore` state, but that apparently didn't matter until #11331 made it possible for the `StreamingWorkunitContext` file-fetching methods to encounter data which had not yet been fetched (and thus needed to create a workunit for the fetching).

This change updates and "deprecates" the existing `stdio_thread_[gs]et_destination` methods (although it doesn't have access to a decorator to do that), and introduces generic thread-local state methods which include all thread-local state required by engine APIs.

Fixes #15887.

[ci skip-build-wheels]
stuhood added a commit to stuhood/pants that referenced this issue Jun 23, 2022
…te stdio-specific methods (pantsbuild#15890)
stuhood added a commit that referenced this issue Jun 23, 2022
…te stdio-specific methods (Cherry-pick of #15890) (#15907)
stuhood added a commit to stuhood/pants that referenced this issue Jun 24, 2022
…te stdio-specific methods (pantsbuild#15890)
stuhood added a commit that referenced this issue Jun 24, 2022
…te stdio-specific methods (Cherry-pick of #15890) (#15916)