Make maven_jar and friends smarter by re-using previously fetched artifacts across different projects #1752

Closed
davido opened this issue Sep 11, 2016 · 15 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) type: feature request
Milestone

Comments

@davido
Contributor

davido commented Sep 11, 2016

The current maven_jar() implementation only uses a fetched artifact within the specific project that fetched it. It doesn't provide a solution for very basic requirements (that native Maven would provide):

  • previously fetched artifacts should survive a complete clean of the output base for a specific project (bazel clean --expunge)
  • previously fetched artifacts should be reusable by other clones of the same project. Say I have separate clones for the master and stable branches
  • previously fetched artifacts should be reusable by different projects. Say Gerrit Code Review, JGit, Gitiles and 42 Gerrit plugins (standalone build mode) are cloned and built on the same machine. Their prerequisites are almost the same

See the Gerrit Code Review maven_jar Bucklet implementation [1] for how to get it right. The implementation puts all fetched artifacts in a project-independent area and links the artifacts into the project output: [2]. More context is here: [3,4].

Bandwidth is just too valuable a resource to throw away (or ignore) previously downloaded artifacts and re-fetch gigabytes of data again.
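To make the "project-independent area plus link" idea above concrete, here is a minimal Python sketch. It is not the Gerrit Bucklet or any Bazel code; the function name, the directory layout, and the `fetch` callable are all assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path


def link_from_shared_store(store_dir: Path, project_out: Path, name: str, fetch) -> Path:
    """Return project_out/name as a symlink into store_dir, fetching at most once.

    `fetch` is a hypothetical callable that writes the artifact to the path it
    is given; a real build would download from a Maven repository there.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    project_out.mkdir(parents=True, exist_ok=True)

    stored = store_dir / name
    if not stored.exists():
        # Write to a temporary file first so a failed fetch never leaves a
        # partial artifact in the shared store.
        fd, tmp = tempfile.mkstemp(dir=store_dir)
        os.close(fd)
        try:
            fetch(Path(tmp))
            os.replace(tmp, stored)
        except BaseException:
            Path(tmp).unlink(missing_ok=True)
            raise

    # The project output only holds a link, so cleaning it keeps the artifact.
    link = project_out / name
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(stored)
    return link
```

With this shape, wiping a project's output directory removes only the links, and a second clone or a different project pointing at the same store directory reuses the stored file without fetching again.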

@kchodorow
Contributor

+1, it's super annoying to have these re-downloaded on every new workspace. This is a little tricky to implement, because we do need a way to clear the "master cache," whether that's to clear or update individual entries that might be corrupt/out of date or just to avoid taking up all of the user's disk space. We'll also have to be careful about correctness; perhaps using the cache should require a hash.

(There were other requests for this a while ago, but now I can't find them. Will link if I come across them.)

@kchodorow kchodorow added type: feature request P2 We'll consider working on this in future. (Assignee optional) labels Sep 12, 2016
@jin
Member

jin commented Sep 12, 2016

I've thought about these problems as well while re-implementing maven_jar() (#1410), especially while dealing with the location of local repositories. Will need to find a balance between the persistence of local repositories and Bazel's reproducibility and correctness ethos.

@kchodorow
Contributor

Aha, related to #1266.

@davido
Contributor Author

davido commented Sep 14, 2016

Thanks for the link. There is also a similar feature request, mentioned in #1266, on the dev mailing list.

As pointed out by @jin, it would be trivial to teach the maven_jar reimplementation as a Skylark rule to use an alternative (project-independent) directory, say ~/.bazel_external_repository: [1], right? Bonus points for making this location configurable, so that we could just point it to Buck's download_artifact cache, or even teach the Skylark rule to hijack ~/.m2 for that purpose and re-use the artifacts downloaded by Maven itself (for those of us who still have to use Maven for some other projects that haven't been ported to Buck or Bazel yet).

@jin
Member

jin commented Sep 16, 2016

Initial thoughts: a basic design uses a maven_local_repository rule, which is basically a thin wrapper around native local_repository. It only has one attribute, path, which lets you specify the absolute path of the system's local Maven repository.

load("@bazel_tools//tools/build_defs/repo:maven_rules.bzl", "maven_local_repository")
maven_local_repository(
    path = "/home/johndoe/.m2",
)

This folder will then be symlinked to each maven_jar subfolder in //external, hence allowing caching across Bazel builds, and reusability across build tools.

@kchodorow
Contributor

This would need to work for all repository rules, not just Maven. Right now we download/link stuff into output_base/external/reponame; we'd need to come up with a central location to download/link stuff, then add a step to create the symlink output_base/external/reponame -> $CENTRAL_CACHE/reponame.

@damienmg
Contributor

Caching the HttpDownloadValue is probably the easiest way forward. But we might want to expose that caching capability more directly (so execute results can also be cached).

@johnynek
Member

Generally, caching any download with a sha would be great. This is a major pain point for our developers as we have a lot of downloads (many external repos: maven_jar + git_repository).

bazel-io pushed a commit that referenced this issue Oct 5, 2016
…tep to

caching external repositories.

The option is categorized as hidden because it is a no-op.

Re-submit with fix from rollback in commit 9883e22 due to JDK7 build failure.

GitHub issue: #1752

--
MOS_MIGRATED_REVID=135231668
bazel-io pushed a commit that referenced this issue Oct 7, 2016
…ted HttpCache skeleton to implement caching logic of HttpDownloadValues as the first step (more types of caches will come later).

Having RepositoryDelegatorFunction initialize the cache in the respective RepositoryFunction handlers decouples the cache implementation from itself. It delegates the choice of Cache classes to the respective RepositoryFunctions, and let them decide what to do with the PathFragment of the cache location.

Continuation of commit 239d995.

A follow up CL will contain the implementation of HttpCache. For now, it's the empty interface of com.google.common.cache.Cache.

GITHUB: #1752

--
MOS_MIGRATED_REVID=135400724
bazel-io pushed a commit that referenced this issue Oct 19, 2016
To set and use a RepositoryCache instance in HttpDownloader while parsing the command line options, we can pass an AtomicReference<HttpDownloader> instance from BazelRepositoryModule to the HttpArchiveFunctions. However, we'll need to change HttpDownloader download() calls to be non-static in order to initialize an instance of HttpDownloader in BazelRepositoryModule.

Remaining TODOs:

- RepositoryCache implementation and unit testing
- RepositoryCache lockfiles
- RepositoryCache integration testing

GITHUB: #1752

--
MOS_MIGRATED_REVID=136593517
bazel-io pushed a commit that referenced this issue Oct 27, 2016
This is a basic implementation of writing and reading HttpDownloader download artifacts, keyed by the artifact's SHA256 checksum. For an artifact to be cached, its SHA256 value needs to be specified in the rule. Rules supported: http_archive, new_http_archive, http_file, http_jar, 

Remaining TODOs:

- Lockfiles for concurrent operations in the cache.
- Integration testing

GITHUB: #1752

--
MOS_MIGRATED_REVID=137289206
bazel-io pushed a commit that referenced this issue Oct 27, 2016
Remaining TODOs:

- Lockfiles for concurrent operations in the cache.

GITHUB: #1752

--
MOS_MIGRATED_REVID=137296606
@jin
Member

jin commented Oct 27, 2016

0590483 now lets you use --experimental_repository_cache=$HOME/some/path to cache downloaded artifacts that have their SHA256 values specified. This cache will survive bazel clean --expunge. Works with artifacts downloaded with new_http_archive, http_archive, http_file, http_jar, Skylark's download and download_and_extract. Maven support coming up.
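As a rough illustration of what "cached only when the SHA256 is specified, keyed by that checksum" can look like, here is a minimal Python sketch. It is not Bazel's RepositoryCache; the class name, methods, and on-disk layout are assumptions made for this example.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ChecksumCache:
    """Toy content-addressed cache; the directory layout is an assumption."""

    def __init__(self, root: Path):
        self.root = root

    def _entry(self, sha256: str) -> Path:
        return self.root / "sha256" / sha256 / "file"

    def put(self, downloaded: Path, expected_sha256: Optional[str]) -> bool:
        """Store a download. Artifacts without a declared checksum are not cached."""
        if not expected_sha256:
            return False
        if sha256_of(downloaded) != expected_sha256:
            raise ValueError("checksum mismatch, refusing to cache")
        entry = self._entry(expected_sha256)
        entry.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(downloaded, entry)
        return True

    def get(self, expected_sha256: Optional[str], dest: Path) -> bool:
        """Copy a cached artifact to dest; return False on a miss."""
        if not expected_sha256:
            return False
        entry = self._entry(expected_sha256)
        # Re-verify on read so a corrupted entry is treated as a miss.
        if not entry.is_file() or sha256_of(entry) != expected_sha256:
            return False
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(entry, dest)
        return True
```

Because the cache root lives outside any workspace's output base, it is untouched when a workspace is cleaned. The sketch copies on a hit rather than linking, so deleting a cache entry cannot break a workspace that already used it.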

bazel-io pushed a commit that referenced this issue Oct 28, 2016
…download_and_execute().

GITHUB: #1752

--
MOS_MIGRATED_REVID=137535936
bazel-io pushed a commit that referenced this issue Nov 3, 2016
…line instantiation of HttpDownloader and RepositoryCache in BazelRepositoryModule.

There are sufficient similarities between the download flows of HttpDownloader and MavenDownloader such that we can extend HttpDownloader to MavenDownloader, and reuse method headers such as checkCache and download.

GITHUB: #1752

--
MOS_MIGRATED_REVID=137982375
bazel-io pushed a commit that referenced this issue Nov 3, 2016
@jin
Member

jin commented Nov 3, 2016

With 38e54ac, maven_jar artifacts with the SHA1 value specified can now be cached using --experimental_repository_cache.

@davido
Contributor Author

davido commented Nov 5, 2016

Thanks. This is very much appreciated!

I have built from the tip of master and am trying to integrate this great feature into the Gerrit Code Review Bazel build, and I have some questions:

Neither ~/.gerritcodereview/bazel_repository_cache nor $HOME/.gerritcodereview/bazel_repository_cache seems to be supported. What we would like to do is put these lines in .bazelrc in the root of the Gerrit project (this file is under Git control, so we don't have the option to use the resolved user name, only ~ or $HOME):

build --experimental_repository_cache=~/.gerritcodereview/bazel_repository_cache --workspace_status_command=./tools/workspace-status.sh --strategy=Javac=worker
fetch --experimental_repository_cache=~/.gerritcodereview/bazel_repository_cache

Note that Buck supports something similar, one can hijack the directory cache to user home directory (excerpt from .buckconfig):

[cache]
  mode = dir
  dir = ~/.gerritcodereview/buck-cache/locally-built-artifacts

I added this feature to Buck 2 years ago: facebook/buck@a1ba001

For now I tested it with --experimental_repository_cache=/home/davido/.gerritcodereview/bazel_repository_cache and it works as expected. I tried bazel fetch gerrit, then disconnected the wire and rebuilt with bazel build gerrit without a network connection. It also survived bazel clean --expunge ;-). This is great!

Question: Can it be that the cached artifacts are copied to the external artifacts and not linked? I cannot see that symbolic links are used. Is there any particular reason not to use symbolic links for that?

@kchodorow
Contributor

Question: Can it be that the cached artifacts are copied to the external artifacts and not linked? I cannot see that symbolic links are used. Is there any particular reason not to use symbolic links for that?

We may change this in the future, but for now we decided to use copies to simplify cache cleanup.

Bazel options generally don't support ~ nor $HOME, I filed #2054 to gauge interest/have discussion.
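Until something like that is supported, one possible workaround is a small wrapper that expands the path before Bazel sees it. The Python sketch below is only an illustration: the helper name `expand_cache_flag` is hypothetical, and it handles only a leading `~` and literal `$HOME` / `${HOME}` in the `--experimental_repository_cache` value.

```python
import os
from typing import List, Mapping, Optional


def expand_cache_flag(args: List[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Expand a leading ~ and $HOME in --experimental_repository_cache values."""
    env = os.environ if env is None else env
    home = env.get("HOME", os.path.expanduser("~"))
    prefix = "--experimental_repository_cache="
    expanded = []
    for arg in args:
        if arg.startswith(prefix):
            value = arg[len(prefix):]
            if value == "~" or value.startswith("~/"):
                value = home + value[1:]
            # Simplification: plain text substitution, not full shell expansion.
            value = value.replace("${HOME}", home).replace("$HOME", home)
            arg = prefix + value
        expanded.append(arg)
    return expanded
```

A wrapper script could then run `subprocess.run(["bazel", *expand_cache_flag(sys.argv[1:])])`. Note that this only helps for flags passed on the command line; values read from a checked-in .bazelrc never pass through such a wrapper.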

@kchodorow kchodorow added this to the 0.5 milestone Dec 9, 2016
@kchodorow kchodorow removed this from the 0.5 milestone Dec 21, 2016
@kchodorow kchodorow modified the milestones: 0.6, 0.5 Dec 21, 2016
@pwnall

pwnall commented Jan 9, 2017

Aside from all the benefits mentioned above, I think that having a build cache makes it easy to use Bazel repositories with package managers that do not allow (by policy) the build process to download anything on its own.

@dslomov
Contributor

dslomov commented Mar 21, 2019

We are deprecating native maven_jar.

@Vincent-M

Vincent-M commented Mar 24, 2019

What is "native maven_jar"? Is this maven_jar(...) rule being deprecated? Or will we just need to add a load(..., maven_jar at the top of the WORKSPACE file?

Regardless of the answer, the documentation says:

Prefer http_archive to git_repository, new_git_repository, and maven_jar

Could someone give an example of how to transform the following:

maven_jar(
    name = "antlr27",
    artifact = "antlr:antlr:2.7.7",
    attach_source = False,
    sha1 = "83cd2cd674a217ade95a4bb83a8a14f351f48bd0",
)

to the http_archive equivalent? (Say, for instance, that the jar was not available in Maven, but in some other place via an https URL.)

My guess is, in WORKSPACE:

http_archive(
    name = "antlr_lib",
    urls = ["http://central.maven.org/maven2/antlr/antlr/2.7.7/antlr-2.7.7.jar"],
    sha256 = "<...>",
    build_file = "@//:antlr.BUILD",
)

But what should go in the antlr.BUILD file? A java_import?
Or is there a way to just have the http_archive like this:

http_archive(
    name = "antlr27",
    urls = ["http://central.maven.org/maven2/antlr/antlr/2.7.7/antlr-2.7.7.jar"],
    sha256 = "<...>",
)

and directly use the jar in the targets, similar to how "@antlr27//jar" would be used with maven_jar?
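For reference, the URL in the guesses above follows the standard Maven repository layout, so it can be derived mechanically from the artifact coordinates. A small Python sketch is below; the function name is hypothetical, and it only handles plain group:artifact:version coordinates for jars (no classifier or packaging).

```python
def maven_jar_url(artifact: str, repository: str = "http://central.maven.org/maven2/") -> str:
    """Map "group:artifact:version" to the jar URL in a Maven-layout repository."""
    # Raises ValueError if the coordinate does not have exactly three parts.
    group_id, artifact_id, version = artifact.split(":")
    path = "/".join([
        group_id.replace(".", "/"),  # dots in the group id become directories
        artifact_id,
        version,
        f"{artifact_id}-{version}.jar",
    ])
    return repository.rstrip("/") + "/" + path
```

For example, `maven_jar_url("antlr:antlr:2.7.7")` yields the same URL used in the http_archive snippets above. The sha256 still has to be computed from the downloaded file, since the sha1 from the maven_jar rule cannot be converted.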


8 participants