Remove synchronous remote cache lookup from remote execution #15854
[ci skip-build-wheels]
Commits are useful to review independently.
bf65f48 to 679a2c8
Overall looks good.
For the 4th commit, did the actual behavior change? Will things still work for remote execution users if they don't enable remote caching, per our deprecation policy?
For the 5th commit, I'm not following why it's necessary. Maybe that would be best in a dedicated PR?
// Use all enabled caches on the first attempt.
maybe_local_cached_runner,
// Remove local caching on the second attempt.
maybe_remote_cached_runner,
// Remove all caching on the third attempt.
Some(leaf_runner),
I don't think I fully understand this. The same `leaf_runner` will be used multiple times, right? But because we use `Arc`, it's the same value every time? If so, maybe highlighting the role of `Arc` in a comment would help.
That's correct.
Okay, I do think this would be helpful to say explicitly in a comment. Totally fine in a follow-up PR.
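To make the role of `Arc` in this thread concrete, here is a minimal sketch of three per-attempt stacks that all share one leaf runner. The trait, the `Leaf` and `Cached` types, and `build_stacks` are hypothetical stand-ins, not the actual types in this PR; only the ordering of the stacks follows the snippet under review.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the real runner trait; names are illustrative only.
trait CommandRunner: Send + Sync {
    fn describe(&self) -> String;
}

// The runner that actually executes processes (no caching).
struct Leaf;

impl CommandRunner for Leaf {
    fn describe(&self) -> String {
        "leaf".to_string()
    }
}

// A caching layer that wraps an inner runner and shares ownership of it.
struct Cached {
    label: &'static str,
    inner: Arc<dyn CommandRunner>,
}

impl CommandRunner for Cached {
    fn describe(&self) -> String {
        format!("{}({})", self.label, self.inner.describe())
    }
}

// Build one runner per backtrack attempt, each with fewer caches than the last.
// `Arc::clone` only bumps a reference count, so every stack points at the
// same leaf value rather than at a copy of it.
fn build_stacks(leaf: Arc<dyn CommandRunner>) -> Vec<Arc<dyn CommandRunner>> {
    let remote_cached: Arc<dyn CommandRunner> = Arc::new(Cached {
        label: "remote_cache",
        inner: Arc::clone(&leaf),
    });
    let local_cached: Arc<dyn CommandRunner> = Arc::new(Cached {
        label: "local_cache",
        inner: Arc::clone(&remote_cached),
    });
    vec![local_cached, remote_cached, leaf]
}
```

Calling `build_stacks(Arc::new(Leaf))` yields stacks for attempts 0, 1, and 2. Because each layer holds an `Arc` rather than an owned runner, reusing the leaf in several stacks does not duplicate any state it carries.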
Will the user have to set
No: see the new help string.
Things will work, but they will be slower: 1. we'll eager fetch to local disk during remote execution, 2. we won't remote cache by default. The first item seems fine as a deprecation, because whether or not we fetched to local disk during remote execution was undefined before. For the second item I could maybe see an argument that
Sure.
TBH - I don't think adding those two options is needed. I know it defaults to the store address if it is not set, but still, it is another option which adds to the noise.
Yes: it is used in the followup PR, which is posted as a draft: #15850. We have separate stub servers for the CAS and for the cache, and there isn't any particular reason that they need to be hosted on the same host/domain/port in production either. I'll split that commit out as a separate PR though, as @Eric-Arellano mentioned.
679a2c8 to 8278eca
I don't agree that a pants test/integration test is a "real world" scenario, hence I think my comment still stands.
I've split out the new
Note that we already require you to set And if you do want to still land it, I think it would be best as a dedicated PR marked "User API change". Otherwise this PR looks good.
Setting Some more details on the deprecations (which I can put in the messages if it makes sense):
So: in both cases, these are deprecating particular permutations of flags in order to be explicit, and leave open the possibility of allowing those combinations of flags in the future... rather than conditionally computing values for them (i.e. conditionally setting eager_fetch based on
Okay, thanks for explaining that. Sounds reasonable. For I also still think it's better for the deprecations to be a dedicated PR marked "User API change". Otherwise, people reading the changelog won't see this change.
The behavior is changing in this PR: if this were split into two PRs, then between the first and second PR, you would have the behavior change, but no warning of the behavior change. I can change the title and label to make it clearer that this is a user API change?
Ok... that's probably reasonable.
If you do that, then there is no behavior change. Right? My proposal is to land this PR as solely a performance change -- add special casing so RE implies RC, as before. Then, a quick followup w/ the deprecation.
…enable remote caching. [ci skip-build-wheels]
8278eca to fc9c014
I've pushed a change to do this. But FWIW, this feels like artificially contorting logical units of work for the purposes of the changelog. I like having an automated changelog as much as the rest of us, but when the result is two PRs which cannot be reverted independently of one another, it's not clear that it's a win.
I haven't put any time into the proposal, but I actually am not a fan of the automated changelog because it indeed contorts changes like this. I would rather have a changelog that we manually update with each PR, like PyO3 does.
…ore resilient (#15850)

As described in #11331, in order to avoid having to deal with missing remote content later in the pipeline, `--remote-cache-eager-fetch` currently defaults to true. This means that before calling a cache hit a hit, we fully download the output of the cache entry.

In warm-cache situations, this can mean downloading a lot more than is strictly necessary. In theory, you could imagine `eager_fetch=False` downloading only stdio and no file content at all for a 100% cache hit rate run of tests. In practice, high hitrate runs [see about 80% fewer bytes downloaded, and 50% fewer RPCs](#11331 (comment)) than with `eager_fetch=True`.

To begin moving toward disabling `eager_fetch` by default (and eventually, ideally, removing the flag entirely), this change begins "backtracking" when missing digests are encountered. Backtracking is implemented by "catching" `MissingDigest` errors (introduced in #15761), and invalidating their source `Node` in the graph. When a `Node` that produced a missing digest re-runs, it does so using progressively fewer caches (as introduced in #15854), in order to cache bust both local and remote partial cache entries.

`eager_fetch=False` was already experimental, in that any `MissingDigest` error encountered later in the run would kill the entire run. Backtracking makes `eager_fetch=False` less experimental, in that we are now very likely to recover from a `MissingDigest` error. But it is still the case with `eager_fetch=False` that persistent remote infrastructure errors (those that last longer than our retry budget or timeout) could kill a run. Given that, we will likely want to gain more experience and further tune timeouts and retries before changing the default.

Fixes #11331.

[ci skip-build-wheels]
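The backtracking described in this commit message can be pictured as a retry loop in which each attempt selects a stack with fewer caches. The sketch below is a deliberate simplification: the real change invalidates a `Node` in the graph rather than looping, and `RunError`, `run_with_backtracking`, and the closure signature are hypothetical names. Only `MissingDigest` comes from the text above.

```rust
// Hypothetical error type: only the `MissingDigest` name comes from the PR text.
#[derive(Debug, PartialEq)]
enum RunError {
    // Content referenced by a cache entry could not be found.
    MissingDigest(String),
    // Any other failure; never retried in this sketch.
    Fatal(String),
}

// Re-run `node` with an increasing attempt index whenever it reports a missing
// digest. The attempt index is what would select a stack with fewer caches.
// The node always runs at least once, and the last error is returned as-is
// when the attempts are exhausted.
fn run_with_backtracking<F>(max_attempts: usize, mut node: F) -> Result<String, RunError>
where
    F: FnMut(usize) -> Result<String, RunError>,
{
    let mut attempt = 0;
    loop {
        match node(attempt) {
            // Backtrack: try again with the next (less cached) stack.
            Err(RunError::MissingDigest(_)) if attempt + 1 < max_attempts => attempt += 1,
            // Success, a fatal error, or an exhausted budget: stop here.
            other => return other,
        }
    }
}
```

With three attempts, a node that hits missing digests on attempts 0 and 1 succeeds on attempt 2, where no cache is consulted at all. A missing digest that persists through every attempt still fails the run, which matches the caveat above about persistent infrastructure errors.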
…}` with `--remote-execution` (#15900)

#15854 moved to using the `remote_cache` runner wrapped around the `remote` runner. That opened the door to using different values of various `--remote-cache` settings, but to preserve existing behavior, we implicitly enabled those settings with remote execution.

This change deprecates the implicit settings, asking that users set the previous defaults manually.

[ci skip-build-wheels]
To prepare for #11331, it's necessary to clarify the building of the "stacks" of command runners, in order to progressively remove caches for each attempt to backtrack.
Both for performance reasons (to allow an attempt to execute a process to proceed concurrently with checking the cache), and to allow us to more easily conditionally remove remote caching from remote execution, this change removes the synchronous cache lookup from the `remote::CommandRunner` by wrapping in `remote_cache::CommandRunner`.

Additionally, refactors the `CommandRunner` stacking, and produces stacks per backtrack attempt for #11331.
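As a rough illustration of what "concurrently with checking the cache" buys over a synchronous lookup, the sketch below races a cache lookup against an execution and returns whichever result arrives first. It uses plain threads and a channel purely to stay self-contained; it does not show how `remote_cache::CommandRunner` is implemented, and `Source`, `race`, and both closures are hypothetical. Unlike a real runner, it also makes no attempt to cancel the losing side.

```rust
use std::sync::mpsc;
use std::thread;

// Which side of the race produced the result (hypothetical type).
#[derive(Debug, PartialEq)]
enum Source {
    CacheHit,
    Executed,
}

// Start the cache lookup and the execution at the same time, and return the
// first result to arrive. A cache miss sends nothing, so the execution result
// is used in that case.
fn race(
    cache_lookup: impl FnOnce() -> Option<String> + Send + 'static,
    execute: impl FnOnce() -> String + Send + 'static,
) -> (Source, String) {
    let (tx, rx) = mpsc::channel();
    let cache_tx = tx.clone();

    thread::spawn(move || {
        if let Some(hit) = cache_lookup() {
            // Ignore send errors: the receiver is gone if execution won.
            let _ = cache_tx.send((Source::CacheHit, hit));
        }
    });
    thread::spawn(move || {
        // Ignore send errors: the receiver is gone if the cache won.
        let _ = tx.send((Source::Executed, execute()));
    });

    rx.recv().expect("the execution thread always sends a result")
}
```

With a synchronous lookup, a slow cache miss delays the start of execution by the full lookup time. In the racing shape, a miss costs nothing beyond the execution itself, and a fast hit short-circuits a slow execution.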