Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove synchronous remote cache lookup from remote execution #15854

Merged
merged 4 commits into from
Jun 17, 2022

Conversation

stuhood
Copy link
Member

@stuhood stuhood commented Jun 16, 2022

To prepare for #11331, it's necessary to clarify the building of the "stacks" of command runners, in order to progressively remove caches for each attempt to backtrack.

Both for performance reasons (to allow an attempt to execute a process to proceed concurrently with checking the cache), and to allow us to more easily conditionally remove remote caching from remote execution, this change removes the synchronous cache lookup from the remote::CommandRunner by wrapping in remote_cache::CommandRunner.

Additionally, refactors the CommandRunner stacking, and produces stacks per backtrack attempt for #11331.

@stuhood stuhood marked this pull request as ready for review June 16, 2022 19:32
@stuhood
Copy link
Member Author

stuhood commented Jun 16, 2022

Commits are useful to review independently.

@stuhood stuhood force-pushed the stuhood/command-stack-refactors branch from bf65f48 to 679a2c8 Compare June 16, 2022 19:45
Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good.

For the 4th commit, did the actual behavior change? Will things still work for remote execution users if they don't enable remote caching, per our deprecation policy?

For the 5th commit, I'm not following why it's necessary. Maybe that would be best in a dedicated PR?

Comment on lines +373 to +378
// Use all enabled caches on the first attempt.
maybe_local_cached_runner,
// Remove local caching on the second attempt.
maybe_remote_cached_runner,
// Remove all caching on the third attempt.
Some(leaf_runner),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I fully understand this. The same leaf_runner will be used multiple times right? But because we use Arc, it's the same value every time? If so, maybe highlighting the role of Arc in a comment would help

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I do think this would be helpful to explicitly say in a comment. Totally fine in a follow up PR

src/python/pants/option/global_options.py Outdated Show resolved Hide resolved
src/python/pants/option/global_options.py Outdated Show resolved Hide resolved
@tdyas
Copy link
Contributor

tdyas commented Jun 16, 2022

Will the user have to set --remote-cache-address now in addition to --remote-store-address to use a remote cache?

@stuhood
Copy link
Member Author

stuhood commented Jun 16, 2022

Will the user have to set --remote-cache-address now in addition to --remote-store-address to use a remote cache?

No: see the new help string.

@stuhood
Copy link
Member Author

stuhood commented Jun 16, 2022

For the 4th commit, did the actual behavior change? Will things still work for remote execution users if they don't enable remote caching, per our deprecation policy?

Things will work, but they will be slower: 1. we'll eager fetch to local disk during remote execution, 2. we won't remote cache by default.

The first item seems fine as a deprecation, because whether or not we fetched to local disk during remote execution was undefined before. For the second item I could maybe see an argument that --remote-execution should just imply --remote-cache-* without any new settings? But the challenge with that is that I do think that in the long term we should support enabling/disabling use of the remote cache with remote execution (even though the remote execution server might still try to do it internally).

For the 5th commit, I'm not following why it's necessary. Maybe that would be best in a dedicated PR?

Sure.

@asherf
Copy link
Member

asherf commented Jun 16, 2022

TBH - I don't think adding those two options is needed. I know it defaults to the store address if it is not set, but still, it is another option which adds to the noise.
Is there an actual use case for it ?
I don't think we should be adding more options unless there is an actual real world use case for it.

@stuhood
Copy link
Member Author

stuhood commented Jun 16, 2022

TBH - I don't think adding those two options is needed. I know it defaults to the store address if it is not set, but still, it is another option which adds to the noise. Is there an actual use case for it ? I don't think we should be adding more options unless there is an actual real world use case for it.

Yes: it is used in the followup PR, which is posted as a draft: #15850 We have separate stub servers for the CAS and for the cache, and there isn't any particular reason that they need to be hosted on the same host/domain/port in production either.

I'll split that commit out as a separate PR though, as @Eric-Arellano mentioned.

@stuhood stuhood force-pushed the stuhood/command-stack-refactors branch from 679a2c8 to 8278eca Compare June 16, 2022 21:03
@asherf
Copy link
Member

asherf commented Jun 16, 2022

TBH - I don't think adding those two options is needed. I know it defaults to the store address if it is not set, but still, it is another option which adds to the noise. Is there an actual use case for it ? I don't think we should be adding more options unless there is an actual real world use case for it.

Yes: it is used in the followup PR, which is posted as a draft: #15850 We have separate stub servers for the CAS and for the cache, and there isn't any particular reason that they need to be hosted on the same host/domain/port in production either.

I'll split that commit out as a separate PR though, as @Eric-Arellano mentioned.

I don't agree that a pants test/integration test is a "real world" scenario, hence I think my comment still stands.
just my 2c. there are tons of options.. especially in the global scope. I don't think it is a good thing.

@stuhood
Copy link
Member Author

stuhood commented Jun 16, 2022

I've split out the new --remote-cache-address flag over here: #15856. Let's move all discussion of that flag there.

@Eric-Arellano
Copy link
Contributor

For the second item I could maybe see an argument that --remote-execution should just imply --remote-cache-* without any new settings? But the challenge with that is that I do think that in the long term we should support enabling/disabling use of the remote cache with remote execution (even though the remote execution server might still try to do it internally).

Note that we already require you to set remote_store_address, so commit 4 feels redundant. What's the motivation?

And if you do want to still land it, I think it would be best as a dedicated PR marked "User API change".

Otherwise this PR looks good.

@stuhood
Copy link
Member Author

stuhood commented Jun 17, 2022

For the second item I could maybe see an argument that --remote-execution should just imply --remote-cache-* without any new settings? But the challenge with that is that I do think that in the long term we should support enabling/disabling use of the remote cache with remote execution (even though the remote execution server might still try to do it internally).

Note that we already require you to set remote_store_address, so commit 4 feels redundant. What's the motivation?

Setting --remote-store-address doesn't enable remote execution (--remote-execution) or remote caching (--remote-cache-{read,write}), so I'm not sure what you mean here.


Some more details on the deprecations (which I can put in the messages if it makes sense):

  • Using --remote-execution without --remote-cache-read and --remote-cache-write.

    • These options used to be mutually exclusive, because when you enabled --remote-execution you implicitly also enabled remote caching (it was "built in" to the remote::CommandRunner). Commit 3 in this PR removed that built-in-ness (and is the namesake change for this PR, since it is probably the only bit that qualifies as user facing). Now that caching isn't built in to remote::CommandRunner, you need to set --remote-cache-{read,write} to preserve the old behavior.
  • Setting --remote-execution at the same time as --remote-cache-eager-fetch.

    • Because remote caching was previously baked-in to remote::CommandRunner, it ignored the eager_fetch setting. Now that caching is provided by remote_cache::CommandRunner, the eager_fetch setting is in play. But folks using --remote-execution will almost certainly want to disable it: otherwise they will fetch everything from the remote cache (which doesn't match the previous behavior).
    • Another option for this one would be to ignore the eager_fetch setting when remote_cache::CommandRunner is wrapped around remote::CommandRunner, but I can also imagine situations in the long run where a user might want to force fetching data (either to work around bugs, or to populate the local store).

So: in both cases, these are deprecating particular permutations of flags in order to be explicit, and leave open the possibility of allowing those combinations of flags in the future... rather than conditionally computing values for them (i.e. conditionally setting eager_fetch based on --remote-execution).

@Eric-Arellano
Copy link
Contributor

Okay, thanks for explaining that. Sounds reasonable.

For --remote-execution implying --remote-cache, how about we keep that behavior for this release, but deprecate it not being explicit? That way, we don't violate our deprecation policy. We do take away the ability to use RE w/o RC for this release, but that wasn't possible to begin with!

I also still think it's better for the deprecations to be a dedicated PR marked "User API change". Otherwise, people reading the changelog won't see this change.

@stuhood
Copy link
Member Author

stuhood commented Jun 17, 2022

I also still think it's better for the deprecations to be a dedicated PR marked "User API change". Otherwise, people reading the changelog won't see this change.

The behavior is changing in this PR: if this were split into two PRs, then between the first and second PR, you would have the behavior change, but no warning of the behavior change.

I can change the title and label to make it clearer that this is a user API change?

For --remote-execution implying --remote-cache, how about we keep that behavior for this release, but deprecate it not being explicit? That way, we don't violate our deprecation policy. We do take away the ability to use RE w/o RC for this release, but that wasn't possible to begin with!

Ok... that's probably reasonable.

@Eric-Arellano
Copy link
Contributor

For --remote-execution implying --remote-cache, how about we keep that behavior for this release, but deprecate it not being explicit?

Ok... that's probably reasonable.

If you do that, then there is no behavior change. Right?

My proposal is to land this PR as solely a performance change -- add special casing so RE implies RC, as before. Then, a quick followup w/ the deprecation.

…enable remote caching.

[ci skip-build-wheels]
@stuhood stuhood force-pushed the stuhood/command-stack-refactors branch from 8278eca to fc9c014 Compare June 17, 2022 16:21
@stuhood
Copy link
Member Author

stuhood commented Jun 17, 2022

My proposal is to land this PR as solely a performance change -- add special casing so RE implies RC, as before. Then, a quick followup w/ the deprecation.

I've pushed a change to do this. But FWIW, this feels like artificially contorting logical units of work for the purposes of the changelog. I like having an automated changelog as much of the rest of us, but when the result is two PRs which cannot be reverted independent of one another, it's not clear that it's a win.

@Eric-Arellano
Copy link
Contributor

I like having an automated changelog as much of the rest of us, but when the result is two PRs which cannot be reverted independent of one another, it's not clear that it's a win.

I haven't put any time into the proposal, but I actually am not a fan of the automated changelog because it indeed contorts changes like this. I would rather have a changelog that we manually update with each PR, like PyO3 does.

@stuhood stuhood enabled auto-merge (squash) June 17, 2022 16:29
@stuhood stuhood merged commit 0616042 into pantsbuild:main Jun 17, 2022
@stuhood stuhood deleted the stuhood/command-stack-refactors branch June 17, 2022 18:14
stuhood added a commit that referenced this pull request Jun 17, 2022
…ore resilient (#15850)

As described in #11331, in order to avoid having to deal with missing remote content later in the pipeline, `--remote-cache-eager-fetch` currently defaults to true. This means that before calling a cache hit a hit, we fully download the output of the cache entry.

In warm-cache situations, this can mean downloading a lot more than is strictly necessary. In theory, you could imagine `eager_fetch=False` downloading only stdio and no file content at all for a 100% cache hit rate run of tests. In practice, high hitrate runs [see about 80% fewer bytes downloaded, and 50% fewer RPCs](#11331 (comment)) than with `eager_fetch=True`.

To begin moving toward disabling `eager_fetch` by default (and eventually, ideally, removing the flag entirely), this change begins "backtracking" when missing digests are encountered. Backtracking is implemented by "catching" `MissingDigest` errors (introduced in #15761), and invalidating their source `Node` in the graph. When a `Node` that produced a missing digest re-runs, it does so using progressively fewer caches (as introduced in #15854), in order to cache bust both local and remote partial cache entries.

`eager_fetch=False` was already experimental, in that any `MissingDigest` error encountered later in the run would kill the entire run. Backtracking makes `eager_fetch=False` less experimental, in that we are now very likely to recover from a `MissingDigest` error. But it is still the case with `eager_fetch=False` that persistent remote infrastructure errors (those that last longer than our retry budget or timeout) could kill a run. Given that, we will likely want to gain more experience and further tune timeouts and retries before changing the default.

Fixes #11331.

[ci skip-build-wheels]
stuhood added a commit that referenced this pull request Jun 22, 2022
…}` with `--remote-execution` (#15900)

#15854 moved to using the `remote_cache` runner wrapped around the `remote` runner. That opened the door to using different values of various `--remote-cache` settings, but to preserve existing behavior, we implicitly enabled those settings with remote execution.

This change deprecates the implicit settings, asking that users set the previous defaults manually.

[ci skip-build-wheels]
Eric-Arellano added a commit that referenced this pull request Oct 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants