Remove synchronous remote cache lookup from remote execution #15854

stuhood · 2022-06-16T19:22:34Z

To prepare for #11331, it's necessary to clarify the building of the "stacks" of command runners, in order to progressively remove caches for each attempt to backtrack.

Both for performance reasons (to allow an attempt to execute a process to proceed concurrently with checking the cache), and to allow us to more easily conditionally remove remote caching from remote execution, this change removes the synchronous cache lookup from the remote::CommandRunner by wrapping in remote_cache::CommandRunner.

Additionally, refactors the CommandRunner stacking, and produces stacks per backtrack attempt for #11331.

[ci skip-build-wheels]

stuhood · 2022-06-16T19:33:07Z

Commits are useful to review independently.

[ci skip-build-wheels]

Eric-Arellano

Overall looks good.

For the 4th commit, did the actual behavior change? Will things still work for remote execution users if they don't enable remote caching, per our deprecation policy?

For the 5th commit, I'm not following why it's necessary. Maybe that would be best in a dedicated PR?

Eric-Arellano · 2022-06-16T20:00:50Z

src/rust/engine/src/context.rs

+        // Use all enabled caches on the first attempt.
+        maybe_local_cached_runner,
+        // Remove local caching on the second attempt.
+        maybe_remote_cached_runner,
+        // Remove all caching on the third attempt.
+        Some(leaf_runner),


I don't think I fully understand this. The same leaf_runner will be used multiple times right? But because we use Arc, it's the same value every time? If so, maybe highlighting the role of Arc in a comment would help

That's correct.

Okay, I do think this would be helpful to explicitly say in a comment. Totally fine in a follow up PR

src/python/pants/option/global_options.py

tdyas · 2022-06-16T20:34:46Z

Will the user have to set --remote-cache-address now in addition to --remote-store-address to use a remote cache?

src/python/pants/option/global_options.py

stuhood · 2022-06-16T20:45:33Z

Will the user have to set --remote-cache-address now in addition to --remote-store-address to use a remote cache?

No: see the new help string.

stuhood · 2022-06-16T20:52:32Z

For the 4th commit, did the actual behavior change? Will things still work for remote execution users if they don't enable remote caching, per our deprecation policy?

Things will work, but they will be slower: 1. we'll eager fetch to local disk during remote execution, 2. we won't remote cache by default.

The first item seems fine as a deprecation, because whether or not we fetched to local disk during remote execution was undefined before. For the second item I could maybe see an argument that --remote-execution should just imply --remote-cache-* without any new settings? But the challenge with that is that I do think that in the long term we should support enabling/disabling use of the remote cache with remote execution (even though the remote execution server might still try to do it internally).

For the 5th commit, I'm not following why it's necessary. Maybe that would be best in a dedicated PR?

Sure.

asherf · 2022-06-16T20:58:54Z

TBH - I don't think adding those two options is needed. I know it defaults to the store address if it is not set, but still, it is another option which adds to the noise.
Is there an actual use case for it ?
I don't think we should be adding more options unless there is an actual real world use case for it.

stuhood · 2022-06-16T21:00:43Z

TBH - I don't think adding those two options is needed. I know it defaults to the store address if it is not set, but still, it is another option which adds to the noise. Is there an actual use case for it ? I don't think we should be adding more options unless there is an actual real world use case for it.

Yes: it is used in the followup PR, which is posted as a draft: #15850 We have separate stub servers for the CAS and for the cache, and there isn't any particular reason that they need to be hosted on the same host/domain/port in production either.

I'll split that commit out as a separate PR though, as @Eric-Arellano mentioned.

asherf · 2022-06-16T21:13:06Z

TBH - I don't think adding those two options is needed. I know it defaults to the store address if it is not set, but still, it is another option which adds to the noise. Is there an actual use case for it ? I don't think we should be adding more options unless there is an actual real world use case for it.

Yes: it is used in the followup PR, which is posted as a draft: #15850 We have separate stub servers for the CAS and for the cache, and there isn't any particular reason that they need to be hosted on the same host/domain/port in production either.

I'll split that commit out as a separate PR though, as @Eric-Arellano mentioned.

I don't agree that a pants test/integration test is a "real world" scenario, hence I think my comment still stands.
just my 2c. there are tons of options.. especially in the global scope. I don't think it is a good thing.

stuhood · 2022-06-16T21:36:15Z

I've split out the new --remote-cache-address flag over here: #15856. Let's move all discussion of that flag there.

Eric-Arellano · 2022-06-17T00:02:58Z

For the second item I could maybe see an argument that --remote-execution should just imply --remote-cache-* without any new settings? But the challenge with that is that I do think that in the long term we should support enabling/disabling use of the remote cache with remote execution (even though the remote execution server might still try to do it internally).

Note that we already require you to set remote_store_address, so commit 4 feels redundant. What's the motivation?

And if you do want to still land it, I think it would be best as a dedicated PR marked "User API change".

Otherwise this PR looks good.

stuhood · 2022-06-17T03:28:09Z

For the second item I could maybe see an argument that --remote-execution should just imply --remote-cache-* without any new settings? But the challenge with that is that I do think that in the long term we should support enabling/disabling use of the remote cache with remote execution (even though the remote execution server might still try to do it internally).

Note that we already require you to set remote_store_address, so commit 4 feels redundant. What's the motivation?

Setting --remote-store-address doesn't enable remote execution (--remote-execution) or remote caching (--remote-cache-{read,write}), so I'm not sure what you mean here.

Some more details on the deprecations (which I can put in the messages if it makes sense):

Using --remote-execution without --remote-cache-read and --remote-cache-write.
- These options used to be mutually exclusive, because when you enabled --remote-execution you implicitly also enabled remote caching (it was "built in" to the remote::CommandRunner). Commit 3 in this PR removed that built-in-ness (and is the namesake change for this PR, since it is probably the only bit that qualifies as user facing). Now that caching isn't built in to remote::CommandRunner, you need to set --remote-cache-{read,write} to preserve the old behavior.
Setting --remote-execution at the same time as --remote-cache-eager-fetch.
- Because remote caching was previously baked-in to remote::CommandRunner, it ignored the eager_fetch setting. Now that caching is provided by remote_cache::CommandRunner, the eager_fetch setting is in play. But folks using --remote-execution will almost certainly want to disable it: otherwise they will fetch everything from the remote cache (which doesn't match the previous behavior).
- Another option for this one would be to ignore the eager_fetch setting when remote_cache::CommandRunner is wrapped around remote::CommandRunner, but I can also imagine situations in the long run where a user might want to force fetching data (either to work around bugs, or to populate the local store).

So: in both cases, these are deprecating particular permutations of flags in order to be explicit, and leave open the possibility of allowing those combinations of flags in the future... rather than conditionally computing values for them (i.e. conditionally setting eager_fetch based on --remote-execution).

Eric-Arellano · 2022-06-17T14:14:40Z

Okay, thanks for explaining that. Sounds reasonable.

For --remote-execution implying --remote-cache, how about we keep that behavior for this release, but deprecate it not being explicit? That way, we don't violate our deprecation policy. We do take away the ability to use RE w/o RC for this release, but that wasn't possible to begin with!

I also still think it's better for the deprecations to be a dedicated PR marked "User API change". Otherwise, people reading the changelog won't see this change.

stuhood · 2022-06-17T15:41:14Z

I also still think it's better for the deprecations to be a dedicated PR marked "User API change". Otherwise, people reading the changelog won't see this change.

The behavior is changing in this PR: if this were split into two PRs, then between the first and second PR, you would have the behavior change, but no warning of the behavior change.

I can change the title and label to make it clearer that this is a user API change?

For --remote-execution implying --remote-cache, how about we keep that behavior for this release, but deprecate it not being explicit? That way, we don't violate our deprecation policy. We do take away the ability to use RE w/o RC for this release, but that wasn't possible to begin with!

Ok... that's probably reasonable.

Eric-Arellano · 2022-06-17T15:48:41Z

For --remote-execution implying --remote-cache, how about we keep that behavior for this release, but deprecate it not being explicit?

Ok... that's probably reasonable.

If you do that, then there is no behavior change. Right?

My proposal is to land this PR as solely a performance change -- add special casing so RE implies RC, as before. Then, a quick followup w/ the deprecation.

…enable remote caching. [ci skip-build-wheels]

stuhood · 2022-06-17T16:23:31Z

My proposal is to land this PR as solely a performance change -- add special casing so RE implies RC, as before. Then, a quick followup w/ the deprecation.

I've pushed a change to do this. But FWIW, this feels like artificially contorting logical units of work for the purposes of the changelog. I like having an automated changelog as much of the rest of us, but when the result is two PRs which cannot be reverted independent of one another, it's not clear that it's a win.

Eric-Arellano · 2022-06-17T16:25:47Z

I like having an automated changelog as much of the rest of us, but when the result is two PRs which cannot be reverted independent of one another, it's not clear that it's a win.

I haven't put any time into the proposal, but I actually am not a fan of the automated changelog because it indeed contorts changes like this. I would rather have a changelog that we manually update with each PR, like PyO3 does.

…ore resilient (#15850) As described in #11331, in order to avoid having to deal with missing remote content later in the pipeline, `--remote-cache-eager-fetch` currently defaults to true. This means that before calling a cache hit a hit, we fully download the output of the cache entry. In warm-cache situations, this can mean downloading a lot more than is strictly necessary. In theory, you could imagine `eager_fetch=False` downloading only stdio and no file content at all for a 100% cache hit rate run of tests. In practice, high hitrate runs [see about 80% fewer bytes downloaded, and 50% fewer RPCs](#11331 (comment)) than with `eager_fetch=True`. To begin moving toward disabling `eager_fetch` by default (and eventually, ideally, removing the flag entirely), this change begins "backtracking" when missing digests are encountered. Backtracking is implemented by "catching" `MissingDigest` errors (introduced in #15761), and invalidating their source `Node` in the graph. When a `Node` that produced a missing digest re-runs, it does so using progressively fewer caches (as introduced in #15854), in order to cache bust both local and remote partial cache entries. `eager_fetch=False` was already experimental, in that any `MissingDigest` error encountered later in the run would kill the entire run. Backtracking makes `eager_fetch=False` less experimental, in that we are now very likely to recover from a `MissingDigest` error. But it is still the case with `eager_fetch=False` that persistent remote infrastructure errors (those that last longer than our retry budget or timeout) could kill a run. Given that, we will likely want to gain more experience and further tune timeouts and retries before changing the default. Fixes #11331. [ci skip-build-wheels]

…}` with `--remote-execution` (#15900) #15854 moved to using the `remote_cache` runner wrapped around the `remote` runner. That opened the door to using different values of various `--remote-cache` settings, but to preserve existing behavior, we implicitly enabled those settings with remote execution. This change deprecates the implicit settings, asking that users set the previous defaults manually. [ci skip-build-wheels]

…17188) Finished the TODO from #15854 and #15900. Partially addresses #17142.

Render debug output for the CommandRunner stack on startup.

efa8da9

[ci skip-build-wheels]

stuhood added the category:performance label Jun 16, 2022

stuhood marked this pull request as ready for review June 16, 2022 19:32

stuhood requested review from tdyas and Eric-Arellano June 16, 2022 19:32

stuhood added 2 commits June 16, 2022 12:45

Create separate CommandRunner stacks for each backtracking attempt.

ef85bb4

[ci skip-build-wheels]

Remove synchronous cache lookup from remote execution.

eb996ee

[ci skip-build-wheels]

stuhood force-pushed the stuhood/command-stack-refactors branch from bf65f48 to 679a2c8 Compare June 16, 2022 19:45

Eric-Arellano reviewed Jun 16, 2022

View reviewed changes

tdyas reviewed Jun 16, 2022

View reviewed changes

src/python/pants/option/global_options.py Outdated Show resolved Hide resolved

stuhood force-pushed the stuhood/command-stack-refactors branch from 679a2c8 to 8278eca Compare June 16, 2022 21:03

stuhood mentioned this pull request Jun 16, 2022

Allow for configuring a separate remote cache address #15856

Closed

This was referenced Jun 16, 2022

Backtrack execution for missing digests to make eager_fetch=false more resilient #15850

Merged

Backtracking for missing Digests (aka: remove "eager fetch") #11331

Closed

When remote execution is enabled, implicitly disable eager fetch and …

fc9c014

…enable remote caching. [ci skip-build-wheels]

stuhood force-pushed the stuhood/command-stack-refactors branch from 8278eca to fc9c014 Compare June 17, 2022 16:21

Eric-Arellano approved these changes Jun 17, 2022

View reviewed changes

stuhood enabled auto-merge (squash) June 17, 2022 16:29

stuhood merged commit 0616042 into pantsbuild:main Jun 17, 2022

stuhood deleted the stuhood/command-stack-refactors branch June 17, 2022 18:14

stuhood mentioned this pull request Jun 22, 2022

Deprecate implicit setting of --remote-cache-{read,write,eager-fetch} with --remote-execution #15900

Merged

Eric-Arellano mentioned this pull request Oct 10, 2022

--remote-execution no longer implicitly enables --remote-cache #17188

Merged

Eric-Arellano added a commit that referenced this pull request Oct 11, 2022

--remote-execution no longer implicitly enables --remote-cache (#…

fe13e84

…17188) Finished the TODO from #15854 and #15900. Partially addresses #17142.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove synchronous remote cache lookup from remote execution #15854

Remove synchronous remote cache lookup from remote execution #15854

stuhood commented Jun 16, 2022 •

edited

Loading

stuhood commented Jun 16, 2022

Eric-Arellano left a comment

Eric-Arellano Jun 16, 2022

stuhood Jun 16, 2022

Eric-Arellano Jun 17, 2022

tdyas commented Jun 16, 2022

stuhood commented Jun 16, 2022

stuhood commented Jun 16, 2022

asherf commented Jun 16, 2022

stuhood commented Jun 16, 2022 •

edited

Loading

asherf commented Jun 16, 2022

stuhood commented Jun 16, 2022

Eric-Arellano commented Jun 17, 2022

stuhood commented Jun 17, 2022 •

edited

Loading

Eric-Arellano commented Jun 17, 2022

stuhood commented Jun 17, 2022 •

edited

Loading

Eric-Arellano commented Jun 17, 2022

stuhood commented Jun 17, 2022 •

edited

Loading

Eric-Arellano commented Jun 17, 2022

Remove synchronous remote cache lookup from remote execution #15854

Remove synchronous remote cache lookup from remote execution #15854

Conversation

stuhood commented Jun 16, 2022 • edited Loading

stuhood commented Jun 16, 2022

Eric-Arellano left a comment

Choose a reason for hiding this comment

Eric-Arellano Jun 16, 2022

Choose a reason for hiding this comment

stuhood Jun 16, 2022

Choose a reason for hiding this comment

Eric-Arellano Jun 17, 2022

Choose a reason for hiding this comment

tdyas commented Jun 16, 2022

stuhood commented Jun 16, 2022

stuhood commented Jun 16, 2022

asherf commented Jun 16, 2022

stuhood commented Jun 16, 2022 • edited Loading

asherf commented Jun 16, 2022

stuhood commented Jun 16, 2022

Eric-Arellano commented Jun 17, 2022

stuhood commented Jun 17, 2022 • edited Loading

Eric-Arellano commented Jun 17, 2022

stuhood commented Jun 17, 2022 • edited Loading

Eric-Arellano commented Jun 17, 2022

stuhood commented Jun 17, 2022 • edited Loading

Eric-Arellano commented Jun 17, 2022

stuhood commented Jun 16, 2022 •

edited

Loading

stuhood commented Jun 16, 2022 •

edited

Loading

stuhood commented Jun 17, 2022 •

edited

Loading

stuhood commented Jun 17, 2022 •

edited

Loading

stuhood commented Jun 17, 2022 •

edited

Loading