Bazel downloads unnecessary artifacts from remote cache #1922
Comments
It looks intended to me. If you include "big" in src, it should be there.
ping?
Assigned but he's on vacation now.
Bazel requires all files in the action graph to be on local disk (or on an in-memory or network file system), and if a file is missing, it'll re-execute the action (or re-download from the remote cache). In this case, when you build y, it'll see the dependency on the big rule and check that the big output file exists. If not, it'll execute big, i.e., download the file. I'm not sure how what you're proposing would work. In theory, if it's working with a remote action cache, Bazel could avoid downloading files and use the checksum from the remote cache to check whether subsequent actions need to be re-executed. But how would it decide which files to download? Would it download any file at all? Or are you saying that it shouldn't download intermediate files at all?
Yes, I'm saying that it shouldn't download intermediate files at all.
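A rough sketch of the two behaviors contrasted in this exchange, using invented `action` and `remote_cache` objects rather than Bazel's real API:

```python
import os

def prepare_inputs(action, remote_cache):
    """Current behavior (sketch): every input file must exist on local
    disk, so a missing intermediate output forces a download from the
    remote cache before the consuming action can run."""
    for f in action.inputs:
        if not os.path.exists(f.local_path):
            remote_cache.download(f.digest, f.local_path)

def needs_rerun(action, remote_cache):
    """Proposed behavior (sketch): decide whether the action must re-run
    purely from checksums the remote action cache already knows, without
    materializing intermediate files locally."""
    return remote_cache.lookup_action(action.key()) is None
```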
Ok. We first need to make remote caching / execution actually work before we could possibly work on this. I'm not sure if this will work well with Bazel's current design, but it's not completely out of the question either. Within Google, we've solved the problem of downloading large files outside of Bazel. We use a FUSE file system that downloads files on-demand and provides fast access to checksums (without downloading), so we don't actually eagerly download files from the remote cache / remote execution. The advantage of making it a file system is that it works transparently with other tools that may expect intermediate files to be locally accessible.
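To illustrate the shape of that approach, here is a minimal sketch using the fusepy library; `metadata` and `fetch_blob` stand in for whatever remote-cache client a real daemon would use:

```python
import errno
from fuse import FUSE, FuseOSError, Operations  # fusepy

class LazyCasFs(Operations):
    """Sketch of an on-demand file system over a content-addressed store.

    `metadata` maps paths to (digest, size); `fetch_blob` is a stand-in
    for a real remote-cache client. Read-only, single directory.
    """

    def __init__(self, metadata, fetch_blob):
        self.metadata = metadata      # path -> (digest, size)
        self.fetch_blob = fetch_blob  # digest -> bytes (network call)
        self.blobs = {}               # digest -> locally cached bytes

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": 0o040555, "st_nlink": 2}
        if path not in self.metadata:
            raise FuseOSError(errno.ENOENT)
        _, size = self.metadata[path]
        # Size and mode come from metadata alone: no download happens.
        return {"st_mode": 0o100444, "st_nlink": 1, "st_size": size}

    def getxattr(self, path, name, position=0):
        # Expose the checksum cheaply so a build tool can hash inputs
        # without ever reading the file contents.
        if path in self.metadata and name == "user.digest":
            return self.metadata[path][0].encode()
        raise FuseOSError(errno.ENODATA)

    def read(self, path, size, offset, fh):
        digest, _ = self.metadata[path]
        if digest not in self.blobs:
            self.blobs[digest] = self.fetch_blob(digest)  # lazy download
        return self.blobs[digest][offset:offset + size]

# Mounting (foreground) would look like:
#   FUSE(LazyCasFs(metadata, fetch_blob), "/mnt/cas", foreground=True)
```

With this shape, `stat` and checksum queries are metadata-only; bytes cross the network only on first `read`, which matches the point above about other tools expecting locally accessible intermediate files.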
Changing title to clarify that this doesn't just apply to Hazelcast.
I've got a potentially similar issue: several thousand object files that are compiled into a single final binary. Under the current design of Bazel, it downloads all of the object files from the remote cache. I'm wondering if this could be avoided.
@ulfjack I like the idea of the FUSE file system. Looking through Bazel's code, I don't think it currently supports it, right? If I provide a FUSE file system that gives lazy downloads and could expose hashes (not sure how yet), how would we go about adding support for it in Bazel? I'm really interested in this as the downloads have been a little bit painful for us, and it would also solve the same problem with repositories, I think.
It might be possible to tweak the ActionFileSystem for this. I'm super interested in this as well. For some workloads I'm testing against, mean incremental build times with remote caching are longer than with default caching, even with great hardware / network.
@mirandaconrado the way it works internally is that we implement the OutputService interface, and tell the local FUSE daemon about the outputs of each action after executing it remotely. I.e., we tell it "put a file with this remote address in this location". We'd need to come up with a protocol for doing so. We'd also need to hook it up with remote execution, but it shouldn't be difficult (famous last words).

@jerrymarino my understanding of the ActionFileSystem is that it sandboxes actions within Bazel to enforce that we don't write to local disk. If you have a FUSE file system with local write support, then the ActionFileSystem isn't strictly necessary. We're working on it because we're looking at deployments inside Google that will not have FUSE file systems or will not have local write support. As such, I don't think it helps (except by tracking down cases where we accidentally write files locally).
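Such a protocol might carry messages shaped roughly like this; every name below is invented for illustration and none of it is Bazel's actual API:

```python
from dataclasses import dataclass

@dataclass
class DeclareOutput:
    """One 'put a file with this remote address in this location'
    notification from Bazel to the local FUSE daemon."""
    path: str    # where the output should appear under bazel-out/
    digest: str  # content hash addressing the blob in the remote CAS
    size: int    # lets stat() answer without fetching the blob

def report_outputs(daemon, action_result):
    # Sent once per remotely executed action, instead of eagerly
    # downloading the action's outputs.
    for out in action_result.outputs:
        daemon.send(DeclareOutput(out.path, out.digest, out.size))
```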
Interesting, I'll take a deeper look at that interface. Thanks for the pointer!
@ulfjack, if you don't mind me asking, could you also give some details on how you handle the file downloads themselves without the remote execution part? More specifically, I'm curious about repositories. We have a pretty big 3rd party library (>10GB) that we update daily and make available as a repository in Bazel. Currently people have to download it even if they don't use it, because it's required to parse some build files correctly. This seems related to the problem in the original issue, as we get the download from the remote cache. Your answer seems more related to remote execution (I haven't dug deep enough to figure out how much these are different), so I was wondering how we could make this work in this case. Bazel would ask for the hash of the file to build the graph and verify what needs to be invalidated. We could have multiple files remotely on the same path but with different hashes, but I think we can identify them by the hash of how they're created. We would still need to download a file to get the actual hash of its result, if I understand correctly, which doesn't sound great. For repos, I'm not even sure we can identify multiple versions of them in a reasonable fashion.
@mirandaconrado That's right. I was just asked a similar question this week. :-) Unfortunately, I don't know. Bazel will need any BUILD files required to build the repositories, but if there were something to track intermediate artifacts, it could probably also be used to track non-BUILD files. What that would look like, though, I have no idea - internally, we have everything checked into a single monorepo. I hear that @ola-rozenfeld might want to work on this. Also @buchgr.
Yeah, I'm familiar with the legend of Google's monorepo =) In our case, we have a library that people co-develop and it's in a separate repo, so we use a select to either get the nightly or the local version, but people using the local version have to download the nightly anyway. It seems like remote builds actually simplify things in this case: you ask for a remote build, the server checks if it's already built and sends back the location of the result, like you mentioned in the FUSE answer. This seems easier (from the client perspective) because the client can just assume all files can be found locally, like you mentioned. To solve this without remote execution, I imagine we'd have to go for a pull model (recurse on the dependencies of the target we're trying to build until we find the referred files and don't continue further back, which would solve the problem for the original poster) instead of a regular graph traversal (start from sources and make sure everything on the path is available), which seems to be what Bazel currently does. I also think, for us to pull this off, we couldn't depend on the actual hash of the file, just on the build description. I recall seeing a presentation where someone mentioned that the hash of a result file (as in the actual content) was used to create the hash of the next steps. Could you confirm that? I can't find the source right now and maybe it's been handled.
We use the hash of an output file to compute the hash for a subsequent step that uses that output file as input.
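A toy illustration of that chaining, not Bazel's actual key format: because the key of the next step is derived from the digest of the previous step's output, a remote cache that reports digests is enough, and the bytes themselves are never needed for the lookup.

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Digest of the big intermediate output. With a remote cache, this can
# come from cache metadata instead of hashing a local copy of the file.
big_digest = sha256(b"...pretend this is 100MB of generated content...")

# The consuming step's key mixes its command line with its input
# digests, so the cache lookup for "x" needs only big_digest.
x_key = sha256(("echo hello > x.txt|" + big_digest).encode())
print(x_key)
```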
@buchgr do you have an ETA for this work?
@ittaiz yep! good point!
Consider this repository: https://github.com/dfabulich/bazel-hazelcast-unnecessary-download

It includes a build file with three targets. The `big` target generates a 100MB file. The `x` target depends on the `big` target and generates a short text file containing the word "hello". The `y` target depends on `x` and just copies it to the destination file.

This simulates the situation in our build, where we generate a large artifact as a build dependency for a much smaller artifact; downstream dependencies only need the smaller built file, not the large build dependency.
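A minimal sketch of what such a BUILD file might look like; the actual rules in the linked repository may differ in their details:

```python
# Hypothetical reconstruction of the three targets described above.
genrule(
    name = "big",
    outs = ["big.bin"],
    cmd = "dd if=/dev/zero of=$@ bs=1048576 count=100",  # ~100MB output
)

genrule(
    name = "x",
    srcs = [":big"],  # build-time dependency on the large artifact
    outs = ["x.txt"],
    cmd = "echo hello > $@",
)

genrule(
    name = "y",
    srcs = [":x"],  # downstream consumers need only the small file
    outs = ["y.txt"],
    cmd = "cp $(location :x) $@",
)
```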
The `tools/bazel.rc` file in this repository contains settings that enable Hazelcast distributed artifact caching.

To reproduce the issue, clone the repository, run `java -jar hazelcast-3.7.2.jar &` to spawn Hazelcast in the background, and run `bazel build :y`. Then run `bazel clean && bazel build :y` to fetch the built artifacts from the cache.

Actual: The critical path is 2s long; the 100MB `big` dependency is fetched and installed in `bazel-genfiles/big.bin`. (It can take much longer than 2s if your dependency is even bigger than that, or if you're downloading the dependency over the internet.)

Expected: Since `x` only depends on `big` at build time, Bazel should skip downloading `big` and just download `x.txt` and provide it to `y.txt`.

If you comment out the `srcs` attribute of `x`, the build behaves as desired, and the critical path is 0.01s.

`bazel version`