Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differentiate and propagate missing digest errors #15761

Merged

Conversation

stuhood
Copy link
Member

@stuhood stuhood commented Jun 7, 2022

In order to begin lazily fetching digests for #11331, we need to be able to differentiate errors caused by missing digests.

This change introduces {StoreError,ProcessError,Failure}::MissingDigest, which are bubbled up to the level of graph Nodes, and which will allow us to introduce retry in a followup change. MissingDigest is produced by load_bytes_with and a few other low-level Store methods (which now return an Err(MissingDigest) rather than Option).

There should not be any behavior changes.

…ocess_execution` and `nodes`.

[ci skip-build-wheels]
@stuhood stuhood added the category:internal CI, fixes for not-yet-released features, etc. label Jun 7, 2022
src/rust/engine/fs/store/src/lib.rs Outdated Show resolved Hide resolved
src/rust/engine/fs/store/src/lib.rs Outdated Show resolved Hide resolved
@stuhood stuhood enabled auto-merge (squash) June 7, 2022 16:09
@stuhood stuhood merged commit 1c495dd into pantsbuild:main Jun 7, 2022
@stuhood stuhood deleted the stuhood/propagate-missing-digest-errors branch June 7, 2022 17:05
stuhood added a commit that referenced this pull request Jun 17, 2022
…ore resilient (#15850)

As described in #11331, in order to avoid having to deal with missing remote content later in the pipeline, `--remote-cache-eager-fetch` currently defaults to true. This means that before calling a cache hit a hit, we fully download the output of the cache entry.

In warm-cache situations, this can mean downloading a lot more than is strictly necessary. In theory, you could imagine `eager_fetch=False` downloading only stdio and no file content at all for a 100% cache hit rate run of tests. In practice, high hitrate runs [see about 80% fewer bytes downloaded, and 50% fewer RPCs](#11331 (comment)) than with `eager_fetch=True`.

To begin moving toward disabling `eager_fetch` by default (and eventually, ideally, removing the flag entirely), this change begins "backtracking" when missing digests are encountered. Backtracking is implemented by "catching" `MissingDigest` errors (introduced in #15761), and invalidating their source `Node` in the graph. When a `Node` that produced a missing digest re-runs, it does so using progressively fewer caches (as introduced in #15854), in order to cache bust both local and remote partial cache entries.

`eager_fetch=False` was already experimental, in that any `MissingDigest` error encountered later in the run would kill the entire run. Backtracking makes `eager_fetch=False` less experimental, in that we are now very likely to recover from a `MissingDigest` error. But it is still the case with `eager_fetch=False` that persistent remote infrastructure errors (those that last longer than our retry budget or timeout) could kill a run. Given that, we will likely want to gain more experience and further tune timeouts and retries before changing the default.

Fixes #11331.

[ci skip-build-wheels]
stuhood added a commit that referenced this pull request Jun 29, 2022
…15979)

As reported in #15954, backtracking doesn't currently work when a synchronous method like `Workspace.write_digests` is the source of a `MissingDigest` error. This was due to a TODO left behind in #15761, where we did not propagate the `Failure` type "through" `@rule` bodies.

To fix this, we add a conversion from `Failure` to `PrErr` which wraps in a well known exception type, and then look for that type when converting back from `PyErr` to `Failure`.

Fixes #15954.
tdyas pushed a commit to tdyas/pants that referenced this pull request Jun 29, 2022
…antsbuild#15979)

As reported in pantsbuild#15954, backtracking doesn't currently work when a synchronous method like `Workspace.write_digests` is the source of a `MissingDigest` error. This was due to a TODO left behind in pantsbuild#15761, where we did not propagate the `Failure` type "through" `@rule` bodies.

To fix this, we add a conversion from `Failure` to `PrErr` which wraps in a well known exception type, and then look for that type when converting back from `PyErr` to `Failure`.

Fixes pantsbuild#15954.
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
stuhood added a commit to stuhood/pants that referenced this pull request Jun 29, 2022
…antsbuild#15979)

As reported in pantsbuild#15954, backtracking doesn't currently work when a synchronous method like `Workspace.write_digests` is the source of a `MissingDigest` error. This was due to a TODO left behind in pantsbuild#15761, where we did not propagate the `Failure` type "through" `@rule` bodies.

To fix this, we add a conversion from `Failure` to `PrErr` which wraps in a well known exception type, and then look for that type when converting back from `PyErr` to `Failure`.

Fixes pantsbuild#15954.
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
stuhood added a commit that referenced this pull request Jun 29, 2022
…Cherry-pick of #15979) (#16001)

As reported in #15954, backtracking doesn't currently work when a synchronous method like `Workspace.write_digests` is the source of a `MissingDigest` error. This was due to a TODO left behind in #15761, where we did not propagate the `Failure` type "through" `@rule` bodies.

To fix this, we add a conversion from `Failure` to `PrErr` which wraps in a well known exception type, and then look for that type when converting back from `PyErr` to `Failure`.

Fixes #15954.

[ci skip-build-wheels]

Co-authored-by: Stu Hood <stuhood@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category:internal CI, fixes for not-yet-released features, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants