
Bazel downloads unnecessary artifacts from remote cache #1922

Closed
dfabulich opened this issue Oct 10, 2016 · 19 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) type: feature request

Comments

@dfabulich
Contributor

Consider this repository.

https://github.com/dfabulich/bazel-hazelcast-unnecessary-download

It includes a build file with three targets.

genrule(
    name='big',
    outs=['big.bin'],
    cmd='dd if=/dev/urandom of=$@ count=100 bs=1048576',
)

genrule(
    name='x',
    srcs=['big'],
    outs=['x.txt'],
    cmd='echo hello > $@',
)

genrule(
    name='y',
    srcs=['x'],
    outs=['y.txt'],
    cmd='cp $(location x) $@',
)

The big target generates a 100MB file. The x target depends on the big target and generates a short text file containing the word "hello". The y target depends on x and just copies it to the destination file.

This simulates the situation in our build, where we generate a large artifact as a build dependency for a much smaller artifact; downstream dependencies only need the smaller built file, not the large build dependency.

The tools/bazel.rc file in this repository contains settings that enable Hazelcast distributed artifact caching.

build --hazelcast_node=127.0.0.1:5701 --spawn_strategy=remote --genrule_strategy=remote

To reproduce the issue, clone the repository, run java -jar hazelcast-3.7.2.jar & to spawn Hazelcast in the background, and run bazel build :y. Then run bazel clean && bazel build :y to fetch the built artifacts from the remote cache.

Actual: The critical path is 2s long; the 100MB big dependency is fetched and installed in bazel-genfiles/big.bin. (And it can take much longer than 2s if your dependency is even bigger than that, and if you're downloading the dependency over the internet.)

Expected: Since x only depends on big at build time, Bazel should skip downloading big and just download x.txt and provide it to y.txt.

If you comment out the srcs attribute of x, the build behaves as desired, and the critical path is 0.01s.

bazel version

$ bazel version
Build label: 0.3.2
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Oct 7 17:25:52 2016 (1475861152)
Build timestamp: 1475861152
Build timestamp as int: 1475861152
@hermione521
Contributor

This looks intended to me: if you include "big" in srcs, it should be there.
@philwo I remember you worked on Hazelcast?

@dfabulich
Contributor Author

ping?

@hermione521
Contributor

Assigned but he's on vacation now.

@ulfjack
Contributor

ulfjack commented Mar 14, 2017

Bazel requires all files in the action graph to be on local disk (or on an in-memory or network file system), and if a file is missing, it'll re-execute the action (or re-download from the remote cache). In this case, when you build y, it'll see the dependency on the big rule and check that the big output file exists. If not, it'll execute big, i.e., download the file.
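The behavior described above can be modeled in a few lines (a toy sketch in plain Python, not Bazel internals): every declared input must exist locally before the consuming action runs, so a missing big.bin gets re-fetched even though no local action ever reads its bytes.

```python
# Hypothetical sketch, not Bazel code: each declared input must be present
# locally before the consuming action runs, so missing files are re-downloaded
# from the remote cache regardless of whether anything local reads them.
def prepare_action(inputs, local_files, download):
    for f in inputs:
        if f not in local_files:
            download(f)          # re-execute the producer or re-fetch
            local_files.add(f)

fetched = []
local = {"x.txt"}                # x.txt is already on disk
prepare_action(["x.txt", "big.bin"], local, fetched.append)
# big.bin is fetched even if the action only reads x.txt
```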

I'm not sure how what you're proposing would work. In theory, if it's working with a remote action cache, Bazel could avoid downloading files and use the checksum from the remote cache to check whether subsequent actions need to be re-executed. But how would it decide which files to download? Would it download any file at all? Or are you saying that it shouldn't download intermediate files at all?

@dfabulich
Contributor Author

Yes, I'm saying that it shouldn't download intermediate files at all.

@ulfjack
Contributor

ulfjack commented Mar 14, 2017

Ok. We first need to make remote caching / execution actually work before we could possibly work on this. I'm not sure if this will work well with Bazel's current design, but it's not completely out of the question either.

Within Google, we've solved the problem of downloading large files outside of Bazel. We use a FUSE file system that downloads files on-demand and provides fast access to checksums (without downloading), so we don't actually eagerly download files from the remote cache / remote execution. The advantage of making it a file system is that it works transparently with other tools that may expect intermediate files to be locally accessible.
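The FUSE idea above can be sketched as a file object that answers checksum queries immediately and defers the download until something actually reads the content. This is a toy model with hypothetical names, not Google's implementation:

```python
class LazyRemoteFile:
    """Toy stand-in for a FUSE-backed file: digest is known up front,
    content is downloaded only on first read."""

    def __init__(self, digest, fetch):
        self.digest = digest     # cheap to serve, no download needed
        self._fetch = fetch      # callable that downloads the bytes
        self._content = None

    def checksum(self):
        return self.digest       # enough for up-to-date checks

    def read(self):
        if self._content is None:
            self._content = self._fetch(self.digest)  # download on demand
        return self._content

downloads = []
def fake_fetch(digest):
    downloads.append(digest)
    return b"hello\n"

f = LazyRemoteFile("abc123", fake_fetch)
f.checksum()   # no download triggered
f.read()       # first read downloads the content
```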

@damienmg damienmg added the P2 We'll consider working on this in future. (Assignee optional) label Apr 11, 2017
@damienmg damienmg added this to the 1.0 milestone Apr 11, 2017
@ulfjack ulfjack changed the title Bazel downloads unnecessary artifacts from Hazelcast Bazel downloads unnecessary artifacts from remote cache Jun 22, 2017
@ulfjack
Contributor

ulfjack commented Jun 22, 2017

Changing title to clarify that this doesn't just apply to hazelcast.

@jerrymarino
Contributor

I've got a potentially similar issue: several thousand object files that are compiled into .a archives for later linking. Mainly, I'm using objc_library and ios_application rules.

Under the current design of Bazel, it downloads all of the .o files from the cache, even if we just need the resultant .a.

I'm wondering if:

  • The FUSE solution that @ulfjack mentioned is relevant here (since we can just look at checksums of the .o files)
  • I should attempt to add dependency graph "pruning" logic to prevent the downloads

@mirandaconrado

@ulfjack I like the idea of the FUSE file system. Looking through Bazel's code, I don't think it currently supports it, right? If I provide a FUSE file system that gives lazy downloads and could expose hashes (not sure how yet), how would we go about adding support for it in Bazel?

I'm really interested in this, as the downloads have been a bit painful for us, and I think it would also solve the same problem with repositories.

@jerrymarino
Contributor

It might be possible to tweak SkyframeExecutor to skip redundant downloads: #5505. It seems like this is what ActionFileSystem did / does.

I'm super interested in this as well. For some workloads I'm testing against, mean incremental build times with remote caching are longer than with the default local caching, even with great hardware and network.

@ulfjack
Contributor

ulfjack commented Jul 16, 2018

@mirandaconrado the way it works internally is that we implement the OutputService interface, and tell the local FUSE daemon about the outputs of each action after executing it remotely. I.e., we tell it "put a file with this remote address in this location". We'd need to come up with a protocol for doing so. We'd also need to hook it up with remote execution, but it shouldn't be difficult (famous last words).
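The "put a file with this remote address in this location" protocol might look roughly like this (a sketch with entirely hypothetical names; this is not the real OutputService interface):

```python
class FakeFuseDaemon:
    """Toy daemon that records where each output lives remotely; the
    content would be fetched only if the path is actually opened later."""

    def __init__(self):
        self.mapping = {}        # local path -> remote digest

    def stage(self, local_path, remote_digest):
        self.mapping[local_path] = remote_digest

daemon = FakeFuseDaemon()
# After //:big finishes remotely, register its output without downloading it.
daemon.stage("bazel-genfiles/big.bin", "sha256:deadbeef")
```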

@jerrymarino my understanding of the ActionFileSystem is that it sandboxes actions within Bazel to enforce that we don't write to local disk. If you have a FUSE file system with local write support, then the ActionFileSystem isn't strictly necessary. We're working on it because we're looking at deployments inside Google that will not have FUSE file systems or will not have local write support. As such, I don't think it helps (except by tracking down cases where we accidentally write files locally).

@mirandaconrado

Interesting, I'll take a deeper look at that interface. Thanks for the pointer!

@mirandaconrado

@ulfjack, if you don't mind me asking, could you also give some details on how you handle the file downloads themselves without the remote execution part? More specifically, I'm curious about repositories. We have a pretty big third-party library (>10GB) that we update daily and make available as a repository in Bazel. Currently, people have to download it even if they don't use it, because it's required to parse some build files correctly.

This seems related to the problem in the original issue, as we get the download from the remote cache. Your answer seems more related to remote execution (I haven't dug deep enough to figure out how much these are different).

So I was wondering how we could make this work in this case. Bazel would ask for the hash of the file to build the graph and verify what needs to be invalidated. We could have multiple files remotely on the same path but with different hashes, but I think we can identify them by the hash of how they're created. We would still need to download a file to get the actual hash of its result, if I understand correctly, which doesn't sound great. For repositories, I'm not even sure we can identify multiple versions of them in a reasonable fashion.

@ulfjack
Contributor

ulfjack commented Jul 20, 2018

@mirandaconrado That's right. I was just asked a similar question this week. :-)

Unfortunately, I don't know. Bazel will need any BUILD files required to build the repositories, but if there was something to track intermediate artifacts, it could probably also be used to track non-BUILD files. What that would look like, though, I have no idea - internally, we have everything checked into a single monorepo.

I hear that @ola-rozenfeld might want to work on this. Also @buchgr.

@mirandaconrado

Yeah, I'm familiar with the legend of Google's monorepo =) In our case, we have a library that people co-develop in a separate repo, so we use a select to pick either the nightly or the local version, but people using the local one still have to download anyway.

It seems like remote builds actually simplify things in this case. You ask for a remote build, and the server checks whether it's already built and sends back the location of the result, like you mentioned in the FUSE answer. This seems easier (from the client's perspective) because it can just assume all files can be found locally, like you mentioned.

To solve it without remote execution, I imagine we'd have to go for a pull model (recurse on the dependencies of the target we're trying to build until we find the referenced files, and stop there, which would solve the problem for the original poster) instead of the regular graph traversal (start from the sources and make sure everything on the path is available), which seems to be what Bazel currently does.
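The pull model described above could be sketched like this (hypothetical code, not Bazel's): recurse from the requested target, and stop descending as soon as a target's output is already in the remote cache, so its inputs are never fetched.

```python
def needed_downloads(target, deps, cached):
    """Return the set of outputs that must be fetched to deliver `target`,
    assuming outputs in `cached` can be served from the remote cache."""
    if target in cached:
        return {target}          # fetch the output itself, skip its inputs
    needed = set()
    for dep in deps.get(target, []):
        needed |= needed_downloads(dep, deps, cached)
    return needed

deps = {"y": ["x"], "x": ["big"], "big": []}
# Everything cached: only y's output is downloaded; big.bin never is.
print(needed_downloads("y", deps, cached={"y", "x", "big"}))
```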

I also think that, for us to pull this off, we couldn't depend on the actual hash of the file, just on the build description. I recall seeing a presentation where someone mentioned that the hash of a result file (as in the actual content) was used to compute the hash of the next steps. Can you confirm? I can't find the source right now and maybe it's been handled.

@ulfjack
Contributor

ulfjack commented Jul 24, 2018

We use the hash of an output file to compute the hash for a subsequent step that uses that output file as input.
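That chaining can be illustrated in a few lines (a hedged sketch, not Bazel's actual action-key computation): because the downstream step's key incorporates the content digest of its input, changing the input's bytes invalidates the step no matter how the input was produced.

```python
import hashlib

def digest(data):
    # Content digest of an output file.
    return hashlib.sha256(data).hexdigest()

def action_key(cmd, input_digests):
    # The key covers the command line plus the content digests of all inputs.
    h = hashlib.sha256(cmd.encode())
    for d in input_digests:
        h.update(d.encode())
    return h.hexdigest()

key_a = action_key("cp x.txt y.txt", [digest(b"hello\n")])
key_b = action_key("cp x.txt y.txt", [digest(b"other\n")])
# Different input content -> different action key -> the step must re-run.
```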

@buchgr buchgr self-assigned this Jul 31, 2018
@buchgr buchgr removed this from the 1.0 milestone Aug 6, 2018
@buchgr buchgr added this to the Remote Execution milestone Aug 6, 2018
@bergsieker bergsieker added P1 I'll work on this now. (Assignee required) and removed P2 We'll consider working on this in future. (Assignee optional) labels Sep 11, 2018
@helenalt
Contributor

helenalt commented Nov 6, 2018

@buchgr do you have an ETA for this work?

@ittaiz
Member

ittaiz commented Dec 28, 2018

@buchgr should this be closed as duplicate of #6862 ?

@buchgr
Contributor

buchgr commented Jan 16, 2019

@ittaiz yep! good point!
