Feature request: external dependency caching using remote gRPC protocol #2557

Closed
mwitkow opened this issue Feb 20, 2017 · 12 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) stale Issues or PRs that are stale (no activity for 30 days) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: feature request

Comments

@mwitkow

mwitkow commented Feb 20, 2017

Description of the feature request:

While implementing a proof of concept of a distributed cache for Bazel builds (see mwitkow/bazel-distcache), I noticed that the CASService from remote_protocol.proto does not seem to be used for caching <output_dir>/external content.

We're interested in building distcache because we run Dockerized Bazel builds (using Jenkins and Concourse), and we'd rather not share the output_dir verbatim.

Not being able to cache external is a massive problem for such a use case. We have quite a few git dependencies for rules_go, since Go has tons of external dependencies, and quite a few Skylark rules are referenced this way. Moreover, we have quite a few Maven deps which would be easy to cache.

There seems to already be a RepositoryCache implementation in Bazel that allows HttpDownloader and MavenDownloader to cache from local disk (using --experimental_repository_cache) as part of #1752. I think it could rely on the CASService, as it even has the appropriate hashing methods.

The proposal:
(a): Extend RepositoryCache to be able to use the CASService, by reusing RemoteActionCache.
(b): Come up with a way to use RepositoryCache for GitRepos, most likely by packaging them as a ZIP that can be cached like http_archive. This probably belongs in a separate ticket.
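
A minimal sketch of the lookup flow behind proposal (a), written in Python rather than Bazel's Java. `InMemoryCAS` and `fetch_archive` are hypothetical names used only for illustration; they are not Bazel APIs, and the in-memory dictionary merely stands in for a remote CASService. The assumption is that the WORKSPACE rule declares a SHA-256, which doubles as the cache key.

```python
import hashlib
from typing import Callable, Dict, Optional


class InMemoryCAS:
    """Hypothetical stand-in for a remote content-addressable store."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content itself.
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest: str) -> Optional[bytes]:
        return self._blobs.get(digest)


def fetch_archive(sha256: str, download: Callable[[], bytes], cas: InMemoryCAS) -> bytes:
    """Return the archive with the declared sha256, downloading only on a cache miss."""
    cached = cas.get(sha256)
    if cached is not None:
        return cached
    data = download()
    actual = hashlib.sha256(data).hexdigest()
    if actual != sha256:
        raise ValueError(f"checksum mismatch: expected {sha256}, got {actual}")
    cas.put(data)  # Later builds, possibly on other machines, can now skip the download.
    return data
```

With a shared store, a second call for the same digest never invokes `download`, which is the behavior a clean Dockerized build would benefit from.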

CCs:
@ola-rozenfeld - for RemoteCache
@jin - for RepositoryCache
@kchodorow - since seems to be main stakeholder in #1752
@dinowernli - since he probably wants to be on this ticket anyway ;)

@mwitkow
Author

mwitkow commented Feb 21, 2017

It's worth noting that we'd be willing to contribute time to implement it, but guidance would be much appreciated before we start :)

@damienmg
Contributor

FWIW, unifying the CAS interface for the repository feature and the remote cache would be good, as with the prototype of the on-disk cache, where I tried to share the on-disk cache between both of those.

Indeed, sharing the RemoteActionCache implementation might be good, in which case we might want to extract an interface RemoteCASCache that RemoteActionCache would extend (containing only the *FileContents and *Blobs functions).
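
The interface extraction suggested here could look roughly like the following sketch. It is in Python for brevity (Bazel itself is Java), and every class and method name below is hypothetical: the thread only names the *FileContents and *Blobs families, so `upload_blob`, `download_blob`, and the action-result methods are illustrative placeholders, not the real signatures.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Dict, Optional


class CasCache(ABC):
    """Blob-only interface (hypothetical stand-in for the suggested RemoteCASCache)."""

    @abstractmethod
    def upload_blob(self, data: bytes) -> str:
        """Store data and return its SHA-256 hex digest."""

    @abstractmethod
    def download_blob(self, digest: str) -> Optional[bytes]:
        """Return the blob for digest, or None if it is absent."""


class ActionCache(CasCache):
    """Adds action-result lookups on top of the blob interface (illustrative names)."""

    @abstractmethod
    def get_action_result(self, action_key: str) -> Optional[str]:
        """Return the digest recorded for action_key, or None."""

    @abstractmethod
    def set_action_result(self, action_key: str, result_digest: str) -> None:
        """Record the digest produced by action_key."""


class InMemoryActionCache(ActionCache):
    """Toy implementation backed by dictionaries."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self._results: Dict[str, str] = {}

    def upload_blob(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def download_blob(self, digest: str) -> Optional[bytes]:
        return self._blobs.get(digest)

    def get_action_result(self, action_key: str) -> Optional[str]:
        return self._results.get(action_key)

    def set_action_result(self, action_key: str, result_digest: str) -> None:
        self._results[action_key] = result_digest


def cache_repository_archive(cas: CasCache, archive: bytes) -> str:
    """Repository caching needs only the blob interface, not action results."""
    return cas.upload_blob(archive)
```

The point of the split is that repository code is typed against `CasCache` alone, so any backend that also serves the action cache can be passed in unchanged.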

my 0.02$

@mwitkow
Author

mwitkow commented Mar 15, 2017

@ola-rozenfeld is there an upper limit on blob sizes in the CAS? I.e., if we stored a large tar-gzipped external dep in there, would it exceed the limit?

@mwitkow
Author

mwitkow commented Mar 31, 2017

@damienmg Would you accept upstream patches for the following if we were to contribute them?

  • optionally fetching GitHub Git URLs as HTTP archives
  • uploading these after the fact to the remote CAS cache

@mwitkow
Author

mwitkow commented Apr 7, 2017

So, having looked at the new_git_repository, it seems that all of it is done in Skylark rules:
https://github.com/bazelbuild/bazel/blob/67f0f4ba16e96e0d678d405197be2e39743fe150/tools/build_defs/repo/git.bzl

We're hitting the Git external checkout issues hard :( We have a rule of our own that downloads github.com git repos using an HTTPS archive link, but it doesn't work for transitive workspace deps (ones that come from external git repos).

@damienmg
Contributor

damienmg commented Apr 9, 2017

@mwitkow: Sorry for the delay; apparently I missed your message.

IIUC, what you propose doesn't seem like something we would want to integrate. You mean that you define a git_repository, and git_repository figures out by magic that it points to a GitHub repository, so we download the tarball instead of cloning the repository? This is easily doable with a simple Skylark macro.
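
The core of such a macro is a mapping from a GitHub remote and commit to archive attributes. The sketch below shows that mapping in plain Python, not Skylark (which has no `re` module, so a real macro would use string methods instead). `github_archive_attrs` is a hypothetical helper; it assumes a full commit SHA and relies on GitHub's convention of serving commits at `/archive/<commit>.tar.gz` with a top-level `<repo>-<commit>` directory.

```python
import re
from typing import Dict, Optional

# Matches https and ssh style GitHub remotes, with or without a trailing ".git".
_GITHUB_REMOTE = re.compile(
    r"^(?:https://github\.com/|git@github\.com:)"
    r"(?P<owner>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?/?$"
)


def github_archive_attrs(remote: str, commit: str) -> Optional[Dict[str, str]]:
    """Return http_archive-style attributes for a GitHub remote, or None otherwise."""
    match = _GITHUB_REMOTE.match(remote)
    if match is None:
        return None  # Not a GitHub remote: the caller should fall back to a git clone.
    owner, repo = match.group("owner"), match.group("repo")
    return {
        "url": f"https://github.com/{owner}/{repo}/archive/{commit}.tar.gz",
        "strip_prefix": f"{repo}-{commit}",
    }
```

A wrapper macro would call this first and only declare a git-based repository when it returns `None`.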

For the second one, we would download an archive that does not provide a shasum, compute the shasum after the fact, and store it in the CAS nonetheless. Since you won't specify the shasum, we won't query the CAS anyway, so why would you want that?

What do you mean by "it doesn't work for transitive workspace deps"? Do you mean you are hitting #2757?

@mwitkow
Author

mwitkow commented Apr 10, 2017

Thanks for getting back :)

I agree that a simple Skylark rule would solve the problem for external deps that we define in our own workspace. However, we do source external rules, such as rules_scala. They usually define a <something>_repositories() function, which "flattens" the WORKSPACE into a single file, circumventing the issue of #2757.

Unfortunately, such <something>_repositories() functions can use new_git_repository themselves, which means we have no way of controlling how they download Git repos.

The proposal to add the GitHub-archive fetch is just a cheeky workaround for the original problem we have:
checking out and clean-building a complicated Bazel workspace spends tons of time on git clone, and there is no way of caching that.

We have a dockerized "blessed, clean build environment" build pipeline, and we currently rely on an rclone hack to get an external directory populated (from an early morning build). You can probably imagine how flaky that is ;) We'd love to go all in on the bazel-buildfarm approach, and once #1413 is fixed, the checkout of workspace external deps would be the only blocker for us.

An ideal solution IMHO (pardon my ignorance if incorrect) would be to treat the results of WORKSPACE-scoped Skylark rules as artifacts, similarly to how the outputs of BUILD rules are treated. This way, the result of a new_git_repository would be cached inside the CASService of the bazel-buildfarm exactly like partial build artifacts are.

@damienmg
Contributor

damienmg commented Apr 10, 2017 via email

@damienmg damienmg added this to the 0.7 milestone Apr 21, 2017
@damienmg damienmg added the P2 We'll consider working on this in future. (Assignee optional) label Apr 21, 2017
@jin jin added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. and removed category: extensibility > external repositories labels Mar 4, 2019
@rockwotj
Contributor

rockwotj commented Aug 6, 2019

Any updates here?

@meisterT meisterT removed this from the 0.7 milestone May 12, 2020
@philwo philwo added the team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website label Jun 15, 2020
@pl-misuw

Hey, I was wondering whether this is still pending, or whether there is a way to get remote cacheability for external?

@rockwotj
Contributor

@pl-misuw see https://github.com/buchgr/bazel-remote and #10622

In short, you should be able to do this today.

@philwo philwo removed the team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website label Nov 29, 2021
@sgowroji sgowroji added the stale Issues or PRs that are stale (no activity for 30 days) label Feb 3, 2023
@sgowroji
Member

Hi there! We're doing a clean up of old issues and will be closing this one. Please reopen if you’d like to discuss anything further. We’ll respond as soon as we have the bandwidth/resources to do so.
