
Bazel downloads unnecessary artifacts from remote cache #1922

Closed
dfabulich opened this issue Oct 10, 2016 · 19 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) type: feature request

Comments

@dfabulich
Contributor

Consider this repository.

https://github.com/dfabulich/bazel-hazelcast-unnecessary-download

It includes a build file with three targets.

genrule(
    name='big',
    outs=['big.bin'],
    cmd='dd if=/dev/urandom of=$@ count=100 bs=1048576',
)

genrule(
    name='x',
    srcs=['big'],
    outs=['x.txt'],
    cmd='echo hello > $@',
)

genrule(
    name='y',
    srcs=['x'],
    outs=['y.txt'],
    cmd='cp $(location x) $@',
)

The big target generates a 100MB file. The x target depends on the big target and generates a short text file containing the word "hello". The y target depends on x and just copies it to the destination file.

This simulates the situation in our build, where we generate a large artifact as a build dependency for a much smaller artifact; downstream dependencies only need the smaller built file, not the large build dependency.

The tools/bazel.rc file in this repository contains settings that enable Hazelcast distributed artifact caching.

build --hazelcast_node=127.0.0.1:5701 --spawn_strategy=remote --genrule_strategy=remote

To reproduce the issue, clone the repository, run java -jar hazelcast-3.7.2.jar & to spawn Hazelcast in the background, and run bazel build :y. Then run bazel clean && bazel build :y to fetch the built artifacts from the remote cache.

Actual: The critical path is 2s long; the 100MB big dependency is fetched and installed in bazel-genfiles/big.bin. (And it can take much longer than 2s if your dependency is even bigger than that, and if you're downloading the dependency over the internet.)

Expected: Since x only depends on big at build time, Bazel should skip downloading big and just download x.txt and provide it to y.txt.

If you comment out the srcs attribute of x, the build behaves as desired, and the critical path is 0.01s.

bazel version

$ bazel version
Build label: 0.3.2
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Oct 7 17:25:52 2016 (1475861152)
Build timestamp: 1475861152
Build timestamp as int: 1475861152
@hermione521
Contributor

This looks intended to me: if you include "big" in srcs, it should be there.
@philwo I remember you worked on Hazelcast?

@dfabulich
Contributor Author

ping?

@hermione521
Contributor

Assigned but he's on vacation now.

@ulfjack
Contributor

ulfjack commented Mar 14, 2017

Bazel requires all files in the action graph to be on local disk (or on an in-memory or network file system), and if a file is missing, it'll re-execute the action (or re-download from the remote cache). In this case, when you build y, it'll see the dependency on the big rule and check that the big output file exists. If not, it'll execute big, i.e., download the file.
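The behavior described above can be modeled in a few lines (a toy sketch in plain Python, not Bazel internals): every declared input must exist locally before the consuming action runs, so a missing big.bin gets re-fetched even though no local action ever reads its bytes.

```python
# Hypothetical sketch, not Bazel code: each declared input must be present
# locally before the consuming action runs, so missing files are re-downloaded
# from the remote cache regardless of whether anything local reads them.
def prepare_action(inputs, local_files, download):
    for f in inputs:
        if f not in local_files:
            download(f)          # re-execute the producer or re-fetch
            local_files.add(f)

fetched = []
local = {"x.txt"}                # x.txt is already on disk
prepare_action(["x.txt", "big.bin"], local, fetched.append)
# big.bin is fetched even if the action only reads x.txt
```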

I'm not sure how what you're proposing would work. In theory, if it's working with a remote action cache, Bazel could avoid downloading files and use the checksum from the remote cache to check whether subsequent actions need to be re-executed. But how would it decide which files to download? Would it download any file at all? Or are you saying that it shouldn't download intermediate files at all?

@dfabulich
Contributor Author

Yes, I'm saying that it shouldn't download intermediate files at all.

@ulfjack
Contributor

ulfjack commented Mar 14, 2017

Ok. We first need to make remote caching / execution actually work before we could possibly work on this. I'm not sure if this will work well with Bazel's current design, but it's not completely out of the question either.

Within Google, we've solved the problem of downloading large files outside of Bazel. We use a FUSE file system that downloads files on-demand and provides fast access to checksums (without downloading), so we don't actually eagerly download files from the remote cache / remote execution. The advantage of making it a file system is that it works transparently with other tools that may expect intermediate files to be locally accessible.
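The FUSE idea above can be sketched as a file object that answers checksum queries immediately and defers the download until something actually reads the content. This is a toy model with hypothetical names, not Google's implementation:

```python
class LazyRemoteFile:
    """Toy stand-in for a FUSE-backed file: digest is known up front,
    content is downloaded only on first read."""

    def __init__(self, digest, fetch):
        self.digest = digest     # cheap to serve, no download needed
        self._fetch = fetch      # callable that downloads the bytes
        self._content = None

    def checksum(self):
        return self.digest       # enough for up-to-date checks

    def read(self):
        if self._content is None:
            self._content = self._fetch(self.digest)  # download on demand
        return self._content

downloads = []
def fake_fetch(digest):
    downloads.append(digest)
    return b"hello\n"

f = LazyRemoteFile("abc123", fake_fetch)
f.checksum()   # no download triggered
f.read()       # first read downloads the content
```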

@damienmg damienmg added the P2 We'll consider working on this in future. (Assignee optional) label Apr 11, 2017
@damienmg damienmg added this to the 1.0 milestone Apr 11, 2017
@ulfjack ulfjack changed the title Bazel downloads unnecessary artifacts from Hazelcast Bazel downloads unnecessary artifacts from remote cache Jun 22, 2017
@ulfjack
Contributor

ulfjack commented Jun 22, 2017

Changing title to clarify that this doesn't just apply to hazelcast.

@jerrymarino
Contributor

I've got a potentially similar issue: several thousand object files that are compiled into .a archives for later linking. Mainly, I'm using objc_library and ios_application rules.

Under the current design of Bazel, it downloads all of the .o files from the cache, even if we just need the resultant .a.

I'm wondering if:

  • The FUSE solution that @ulfjack mentioned is relevant here (since we can just look at checksums of the .o files)
  • I should attempt to add dependency graph "pruning" logic to prevent the downloads

@mirandaconrado

@ulfjack I like the idea of the FUSE file system. Looking through Bazel's code, I don't think it currently supports it, right? If I provide a FUSE file system that gives lazy downloads and could expose hashes (not sure how yet), how would we go about adding support for it in Bazel?

I'm really interested in this, as the downloads have been a bit painful for us, and I think it would also solve the same problem with repositories.

@jerrymarino
Contributor

It might be possible to tweak SkyframeExecutor to skip redundant downloads: #5505. It seems like this is what ActionFileSystem did / does.

I'm super interested in this as well. For some workloads I'm testing against, mean incremental build times with remote caching are longer than with the default local caching, even with great hardware and network.

@ulfjack
Contributor

ulfjack commented Jul 16, 2018

@mirandaconrado the way it works internally is that we implement the OutputService interface, and tell the local FUSE daemon about the outputs of each action after executing it remotely. I.e., we tell it "put a file with this remote address in this location". We'd need to come up with a protocol for doing so. We'd also need to hook it up with remote execution, but it shouldn't be difficult (famous last words).
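The "put a file with this remote address in this location" protocol might look roughly like this (a sketch with entirely hypothetical names; this is not the real OutputService interface):

```python
class FakeFuseDaemon:
    """Toy daemon that records where each output lives remotely; the
    content would be fetched only if the path is actually opened later."""

    def __init__(self):
        self.mapping = {}        # local path -> remote digest

    def stage(self, local_path, remote_digest):
        self.mapping[local_path] = remote_digest

daemon = FakeFuseDaemon()
# After //:big finishes remotely, register its output without downloading it.
daemon.stage("bazel-genfiles/big.bin", "sha256:deadbeef")
```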

@jerrymarino my understanding of the ActionFileSystem is that it sandboxes actions within Bazel to enforce that we don't write to local disk. If you have a FUSE file system with local write support, then the ActionFileSystem isn't strictly necessary. We're working on it because we're looking at deployments inside Google that will not have FUSE file systems or will not have local write support. As such, I don't think it helps (except by tracking down cases where we accidentally write files locally).

@mirandaconrado

Interesting, I'll take a deeper look at that interface. Thanks for the pointer!

@mirandaconrado

@ulfjack, if you don't mind me asking, could you also give some details on how you handle the file downloads themselves without the remote execution part? More specifically, I'm curious about repositories. We have a pretty big third-party library (>10GB) that we update daily and make available as a repository in Bazel. Currently, people have to download it even if they don't use it, because it's required to parse some build files correctly.

This seems related to the problem in the original issue, as we get the download from the remote cache. Your answer seems more related to remote execution (I haven't dug deep enough to figure out how much these are different).

So I was wondering how we could make this work in this case. Bazel would ask for the hash of the file to build the graph and verify what needs to be invalidated. We could have multiple files remotely on the same path but with different hashes, but I think we can identify them by the hash of how they're created. We would still need to download a file to get the actual hash of its result, if I understand correctly, which doesn't sound great. For repositories, I'm not even sure we can identify multiple versions of them in a reasonable fashion.

@ulfjack
Contributor

ulfjack commented Jul 20, 2018

@mirandaconrado That's right. I was just asked a similar question this week. :-)

Unfortunately, I don't know. Bazel will need any BUILD files required to build the repositories, but if there was something to track intermediate artifacts, it could probably also be used to track non-BUILD files. What that would look like, though, I have no idea - internally, we have everything checked into a single monorepo.

I hear that @ola-rozenfeld might want to work on this. Also @buchgr.

@mirandaconrado

Yeah, I'm familiar with the legend of Google's monorepo =) In our case, we have a library that people co-develop in a separate repo, so we use a select to pick either the nightly or the local version, but people using the local one still have to download anyway.

It seems like remote builds actually simplify things in this case. You ask for a remote build, and the server checks whether it's already built and sends back the location of the result, like you mentioned in the FUSE answer. This seems easier (from the client's perspective) because it can just assume all files can be found locally, like you mentioned.

To solve it without remote execution, I imagine we'd have to go for a pull model (recurse on the dependencies of the target we're trying to build until we find the referenced files, and stop there, which would solve the problem for the original poster) instead of the regular graph traversal (start from the sources and make sure everything on the path is available), which seems to be what Bazel currently does.
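The pull model described above could be sketched like this (hypothetical code, not Bazel's): recurse from the requested target, and stop descending as soon as a target's output is already in the remote cache, so its inputs are never fetched.

```python
def needed_downloads(target, deps, cached):
    """Return the set of outputs that must be fetched to deliver `target`,
    assuming outputs in `cached` can be served from the remote cache."""
    if target in cached:
        return {target}          # fetch the output itself, skip its inputs
    needed = set()
    for dep in deps.get(target, []):
        needed |= needed_downloads(dep, deps, cached)
    return needed

deps = {"y": ["x"], "x": ["big"], "big": []}
# Everything cached: only y's output is downloaded; big.bin never is.
print(needed_downloads("y", deps, cached={"y", "x", "big"}))
```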

I also think that, for us to pull this off, we couldn't depend on the actual hash of the file, just on the build description. I recall seeing a presentation where someone mentioned that the hash of a result file (as in the actual content) was used to compute the hash of the next steps. Can you confirm? I can't find the source right now and maybe it's been handled.

@ulfjack
Contributor

ulfjack commented Jul 24, 2018

We use the hash of an output file to compute the hash for a subsequent step that uses that output file as input.
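That chaining can be illustrated in a few lines (a hedged sketch, not Bazel's actual action-key computation): because the downstream step's key incorporates the content digest of its input, changing the input's bytes invalidates the step no matter how the input was produced.

```python
import hashlib

def digest(data):
    # Content digest of an output file.
    return hashlib.sha256(data).hexdigest()

def action_key(cmd, input_digests):
    # The key covers the command line plus the content digests of all inputs.
    h = hashlib.sha256(cmd.encode())
    for d in input_digests:
        h.update(d.encode())
    return h.hexdigest()

key_a = action_key("cp x.txt y.txt", [digest(b"hello\n")])
key_b = action_key("cp x.txt y.txt", [digest(b"other\n")])
# Different input content -> different action key -> the step must re-run.
```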

@buchgr buchgr self-assigned this Jul 31, 2018
@buchgr buchgr removed this from the 1.0 milestone Aug 6, 2018
@buchgr buchgr added this to the Remote Execution milestone Aug 6, 2018
@bergsieker bergsieker added P1 I'll work on this now. (Assignee required) and removed P2 We'll consider working on this in future. (Assignee optional) labels Sep 11, 2018
@helenalt
Contributor

helenalt commented Nov 6, 2018

@buchgr do you have an ETA for this work?

@ittaiz
Member

ittaiz commented Dec 28, 2018

@buchgr should this be closed as duplicate of #6862 ?

@buchgr
Contributor

buchgr commented Jan 16, 2019

@ittaiz yep! good point!
