Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proof of concept: support remote caching to Github Actions cache #17840

Closed
wants to merge 26 commits into from

Conversation

huonw
Copy link
Contributor

@huonw huonw commented Dec 18, 2022

This PR does a 'spike' to validate using a remote blob store as a (fine-grained) remote cache. It isn't at all ready for landing or detailed review (lots of debugging code, poorly factored, hard-coded config, plus #[allow(warnings)] in a bunch of places), but I think it validates one key thing: it's not too hard to do non-REAPI remote caching.

Specifically, this PR adds support for caching to/from Github Actions' built-in cache, ala https://github.com/actions/cache but fine-grained and integrated into pants. This is inspired by the support for using GHA as a fine-grained cache in both docker buildx and, recently, sccache.

This work relates to #11149, as it isn't specific to Github Actions, beyond the details of actually pushing bytes to it. That is, this PR also proves out the possibility of using other blob stores such as other CI caches or AWS's S3 or GCP's GCS. The core moving parts are two traits, which can be slotted into the lowest level of remote cache processing, and then the rest of the system seems to work with minimal changes:

// in `fs/store/`:
#[async_trait]
pub trait ByteStoreProvider: Sync + Send + 'static {
  async fn store_bytes(&self, digest: Digest, bytes: ByteSource) -> Result<(), String>;
  async fn load_bytes(&self, digest: Digest) -> Result<Option<Bytes>, String>;

  /// List any missing digests.
  ///
  /// None = not supported.
  async fn list_missing_digests(&self, _digests: &mut (dyn Iterator<Item = Digest> + Send)) -> Result<Option<HashSet<Digest>>, String> {
    Ok(None)
  }

  fn chunk_size_bytes(&self) -> usize;
}

// in `process_execution/`:
#[async_trait]
pub trait RemoteCacheProvider: Sync + Send + 'static {
  async fn update_action_result(&self, action_digest: Digest, action_result: ActionResult) -> Result<(), String>;
  async fn get_action_result(&self, action_digest: Digest, context: &Context) -> Result<Option<ActionResult>, String>;
}

I'm imagining, once this style of infrastructure is here, it's then easy enough to add support for other options when people want them, covering the options supported by the various build systems listed in #11149 (comment). (Assuming the table in that PR is still up-to-date, pants would have the easiest/cheapest remote caching via CI-focused caching like this proof-of-concept.)

Example

I've run this as an example in https://github.com/huonw/pants-cache-github-actions-poc, with https://github.com/huonw/pants-cache-github-actions-poc/actions/runs/3717717056/jobs/6305350011 being a specific example (logs saved for reference). I set up 30 Python test files, that each take 10s. When cached, the job runs much faster, and reports:

✓ //test_01.py:tests succeeded in 10.32s (cached remotely).
✓ //test_02.py:tests succeeded in 10.34s (cached remotely).
...
✓ //test_30.py:tests succeeded in 10.34s (cached remotely).

(That step seems to take a while (33s), but is running with MODE=debug for faster pants_from_sources Rust compilation. In release mode, it's more like 10-15s.)

Getting this ready to land

I've thought through a loose plan for next steps, assuming there's high-level agreement about landing this at all:

  1. preparatory refactoring to make this easier to land, particularly a little bit of teasing out of the REAPI types/calls, and adjusting the load_bytes_with function
  2. working out how to do the configuration required for more general caching
  3. actually landing the GHA support, with additional controls like concurrency limits (this final support would likely look quite different to the code in this PR)

Questions

  1. Before I spend more effort on this, is work towards this likely to be accepted?
  2. Any thoughts on how to configure this?
    1. I've been assuming a remote_backend = "..." global option with values like reapi and gha. Maybe the remoting config should move to its own subsystem somehow, but maybe that doesn't have to block this work?
    2. Maybe the URL/token provided by multiple environment variables: for the URL, allow reusing remote_store_address/PANTS_REMOTE_STORE_ADDRESS but also support the GHA name ACTIONS_CACHE_URL? For the token, somehow generalise remote_oauth_bearer_token_path.
    3. What about erroring out for irrelevant options?

Notes

  • This sort of "local" remote caching in CI may be able to augment truly remote caching, by providing a 'closer' cache that saves on network bandwidth. For instance, when reading a blob, first check the CI cache, before going out to the REAPI server (and when writing, write to both).
  • This would make https://www.pantsbuild.org/docs/using-pants-in-ci cleaner and more compelling.
  • The split to be generic over different "providers" seems like it would make testing the remote caching code easier, most tests can use an in-memory cache that implements the right traits.
  • This doesn't put any effort towards making the remote cache backend pluggable, meaning additional backends will need to be upstream in pants' Rust code. Pluggability seems like a much bigger effort.
  • Specifics of caching to GHA:
    • It's... kinda slow, but that's fine (since there's local speculation too). For example, the build/logs linked above have histograms on, and show remote_cache_get_action_result_time_micros ranging from 74ms - 1.1s (p50: 610ms), and remote_store_read_blob_time_micros ranging from 33ms - 561ms (p50: 249ms).
    • It's scoped per-branch, with 'feature' branches being able to read from caches from the default branch, but not write, meaning a PR workflow would still have to build new targets at least twice: once in the PR branch (available for reuse for later builds of that PR), and then once in the default branch after merge (available for reuse by all builds).
    • I'm not sure GHA can effectively handle cache_content_behavior = "validate", as, AFAIK, there's no bulk query available. It'd have to be a bunch of individual queries, but maybe that's okay?
    • This sort of caching only works for CI builds, and doesn't provide any benefit for local dev.
    • It seems like the environment variables required to find the cache are only provided to actions, rather than shell run steps, which forces the use of github-script to reexport them: https://github.com/huonw/pants-cache-github-actions-poc/blob/2e7ccfbfe/.github/workflows/ci.yml#L41-L46. This would be able to be packaged up into a init-pants, and/or automatically provided by if there's "invoke pants" action too.

huonw added a commit that referenced this pull request May 23, 2023
This separates the `store::remote::ByteStore` coordination code from the
gRPC REAPI implementation code, by:

- adding a `store::remote::ByteStoreProvider` trait
- moving the REAPI implementation into `store::remote::reapi` and
implementing that trait

This is, in theory, just a lift-and-shift, with no functionality change,
other than maybe changing some logging and when exactly a work-unit is
created.

This is partial preparation work for supporting more remote stores like
GHA cache or S3, for #11149, and is specifically broken out of #17840.
Additional work required to actually solve #11149:

- generalising the action cache too
- implementing other byte store providers
- dynamically choosing the right provider for `store::remote::ByteStore`

The commits are individually reviewable:

1. preparatory clean-up
2. define the trait
3. move the REAPI code and implement the trait, close to naively as
possible:
  - https://gist.github.com/huonw/475e4f0c281fe39e3d0c2e7e492cf823 has a
white-space-ignoring diff between `remote.rs` after 1, and the
`reapi.rs` file in this commit
  - there are some changes: making functions non-generic, and eliminating
the work-units etc. (all the code that was left in `remote.rs`)
5. minor clean-up
6. more major clean-up: inline the method calls into the trait
definition, so that they're not just `fn method(&self, ...) {
self.real_method(...) }`:
  - as with 3, this is just a move, without changing the details of the
code...
  - ...except for some minor changes like converting `?` into `.map(|e|
e.to_string())?` on the `get_capabilities` call in `store_bytes`
huonw added a commit that referenced this pull request Jun 1, 2023
This separates the `remote::remote_cache` coordination code from the
gRPC REAPI implementation by:

- adding a `remote::remote_cache::ActionCacheProvider` trait
- moving the REAPI implementation into `remote::remote_cache::reapi` and
implementing that trait

This is, in theory, just a lift-and-shift, with no functionality change.

This is preparation work for supporting more remote stores like GHA
cache or S3, for #11149, and
is specifically broken out of
#17840. It continues #19050.
Additional work required to actually solve
#11149:

- implementing other byte store and action cache providers
- dynamically choosing the right providers for
`store::remote::ByteStore` and `remote::remote_cache`

The commits are individually reviewable:
1. preparatory breaking out of gRPC code
2. defining the trait
3. move the REAPI code and implement the trait, close to naively as
possible:
  - https://gist.github.com/huonw/a60ad807b05ecea98387294c22de67cb has a
    white-space-ignoring diff between `remote_cache.rs` after 1, and the
    `reapi.rs` file in this commit (it's less useful than #19050, since most
    of the code is deleted, but, buried in there are chunks of completely
    unchanged code)
4. minor clean-up
@chris-stetter
Copy link

Would love this feature 🤟

@huonw
Copy link
Contributor Author

huonw commented Sep 14, 2023

This has been broken up into a whole bunch of pre-work PRs (see all the mentions above), which have culminated in the rather simpler #19831. Closing favour of that.

@huonw huonw closed this Sep 14, 2023
@huonw huonw deleted the feature/other-remote-caches branch September 14, 2023 23:05
huonw added a commit that referenced this pull request Sep 27, 2023
This adds support for remote caching to the Github Actions Cache, by
setting `remote_store_address =
"experimental:github-actions-cache+https://..."`. This is implemented as
just another OpenDAL (#19827) provider.

As written, this requires the `ACTIONS_CACHE_URL` and
`ACTIONS_RUNTIME_TOKEN` values to be massaged into the appropriate pants
configuration. I've included basic docs for how to do this. This set-up
can/should be packaged into the `init-pants` action in future, but for
now, we're in experimental-land.

There's various ways in which this can be improved, but I think it's
helpful to have a "real world" example of a remote cache beyond the
`file://` one to be able to guide the refactoring, plus good to start
testing this earlier. Thus, I'd like to land this, and do them as
follow-up:

- better docs/help (always!), particularly on the `using-pants-in-ci`
docs page, which I haven't updated here
- now that I've implemented this, I don't like my
`<provider>+<scheme>://...` proposal from #19827. I think it'd be better
to have a separate `remote_store_provider =
"experimental-github-actions-cache"` (or `"reapi"`, etc.) option and
`remote_store_address` is just the raw URL (with the provider validating
it can understand the URL, of course)
- it'd be good to be able to pass the token in in a simpler way, I'm
thinking there could be a `remote_oauth_bearer_token` option literally,
which is expected (and suggested in docs) to be passed as the
`PANTS_REMOTE_OAUTH_BEARER_TOKEN` env var, in addition to
`remote_oauth_bearer_token_path`

These are tracked in #19902 (and all of the others in that issue would
be good too, but weren't as acutely surfaced by this change, I'll start
chipping away at it once this is in).

This is awkward to test. My plan, not implemented here, is to somehow
tee up a dedicated set of tests to run within
https://github.com/nektos/act somehow. For now, we'll have to go with
manual testing, e.g. this specific PR is validated by
https://github.com/huonw/pants-cache-github-actions-poc/actions/runs/6298908094/job/17119702612#step:11:983

This is the first example of something useful user-facing for #11149.
This doesn't close it, I'll wait until we have a blob store like S3 too.

The commits are individually reviewable.

(Replaces #17840.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants