
Allow sharing caches across actions when sandboxing is in use #7527

Open
ktf opened this issue Feb 25, 2019 · 35 comments
Labels: not stale · P2 (We'll consider working on this in future) · team-Local-Exec · type: feature request

Comments

@ktf commented Feb 25, 2019

Description of the problem / feature request:

Extremely low performance of Bazel when building a large project on macOS, compared to running CMake + make.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Build a large Bazel project on macOS. I see that a lot of fsevents are triggered for files in /private/var/tmp/_bazel_user, and fseventsd then seems to crawl at 100%-200% CPU usage, slowing down the builds. I do have /private/var/tmp excluded in my Spotlight configuration, but that does not seem to be enough.
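For what it's worth, one way to see the file activity that fseventsd is reacting to while a build runs is something like the following (a rough sketch; the grep pattern is only an illustration):

# In a second terminal while the build is running: trace file-system calls
# machine-wide and keep only paths under Bazel's output root.
sudo fs_usage -w -f filesys | grep _bazel_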

What operating system are you running Bazel on?

Darwin grisu.dyndns.cern.ch 18.2.0 Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 x86_64

What's the output of bazel info release?

release 0.22.0
@irengrig (Contributor) commented Mar 1, 2019

Hello @ktf, thank you for your feedback.
Do you have some particular project in mind that we can experiment with?

@ktf (Author) commented Mar 2, 2019

What I have is:

https://github.com/ktf/AliceO2/tree/bazel-support

where you can, for example, run bazel build @root//:cling. Given that this is a subset of TensorFlow (i.e. the part that builds LLVM), plus some extra stuff to build parts of Clang, I am surprised this is not a common issue. Note that I build on my local laptop (a 2015 MBP) with no external cache.

@irengrig irengrig added the z-team-Apple and untriaged labels, and removed the "more data needed" label, on Mar 2, 2019
@irengrig (Contributor) commented Mar 2, 2019

Thank you for the example project!
Assigning to Apple team.

@keith (Member) commented Mar 2, 2019

It might be worth trying to repro with this commit 0877340

@jmmv (Contributor) commented Mar 2, 2019

Indeed; that commit may help and I'd be very interested to know if it does.

But I have also noticed fseventsd consuming lots of CPU (and, incidentally, this service runs at default priority -- so that commit actually makes Bazel run at a lower priority which is probably good). Haven't had a chance to dig into this. Maybe setting --nowatchfs would prevent the fseventsd activity, though then Bazel will be slower to detect file changes of course... (just thinking aloud). It just feels stupid that we have to waste so much CPU just to watch for file changes...

@jmmv jmmv added the type: bug and P2 labels, and removed the untriaged label, on Mar 2, 2019
@jmmv jmmv self-assigned this Mar 2, 2019
@ktf (Author) commented Mar 3, 2019

Thanks. I will try that when I have a bit of time. Note also that the fsevents seem to be for files under /var/tmp/_bazel_user/, not in ~/src/myproject. Maybe I simply do not understand the Bazel internals, but is that what you'd expect from the file-watching process?

@jmmv (Contributor) commented Mar 3, 2019

Yes. That's where the build artifacts are -- they are not stored in your source tree -- and Bazel has to watch them for out-of-band changes.

@kastiglione (Contributor)

Do you know what the backstory is for watching the build tree? I was surprised to learn this.

@keith (Member) commented Mar 4, 2019

@jmmv it looks like --watchfs defaults to false, so unless @ktf is setting that, this may not apply?

@jmmv (Contributor) commented Mar 4, 2019

@kastiglione All outputs (except for top-level ones) are inputs to other actions. Bazel needs to know if they have been modified out-of-band to invalidate their corresponding actions. (Yes, you could assume that the build tree is not modified by users, but that's not what real-world usage shows.)

@keith That's a good point. If the reporter is indeed not using --nowatchfs, then fseventsd usage comes from elsewhere.

@ktf (Author) commented Mar 4, 2019

@keith AFAICR, I did not explicitly enable --watchfs. Is there any way to check whether this is the case?

@jmmv, does that also happen for files which are not currently inputs to any action? The files being watched seem to be the full contents of a tarball of sources I download with http_archive, regardless of whether they are used in the build or not.
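As an aside, a sketch of how one could check whether --watchfs is being picked up from an rc file (assuming no wrapper scripts inject extra flags; //some:target is just a placeholder):

# --announce_rc makes Bazel print the options it loaded from bazelrc files.
bazel build --announce_rc //some:target 2>&1 | grep -i watchfs
# Or inspect the rc files directly:
grep -n watchfs ~/.bazelrc .bazelrc 2>/dev/null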

@kastiglione (Contributor)

@ktf have you run without the sandbox (--spawn_strategy=standalone) to compare the time difference?

@ktf (Author) commented Mar 5, 2019

So I gave it a quick try with the suggested target and I got:

  • bazel build --spawn_strategy=standalone "@root//:cling" 0.24s user 0.36s system 0% cpu 23:11.70 total

  • bazel build "@root//:cling" 0.25s user 0.36s system 0% cpu 27:17.31 total

I did a bazel clean --expunge in between the two. What I also see is that in the standalone case 90% of the whole machine's CPU time is in user, while in the sandboxed case I see a lot of longish spikes in system usage and fseventsd popping up high in the offenders list.
In a shorter test I see an even more dramatic difference:

INFO: Elapsed time: 190,447s, Critical Path: 29,94s
INFO: 226 processes: 226 darwin-sandbox.
INFO: Build completed successfully, 229 total actions
bazel build "@fairmq//..."  0.06s user 0.08s system 0% cpu 3:10.52 total
INFO: Elapsed time: 122,390s, Critical Path: 21,09s
INFO: 226 processes: 226 local.
INFO: Build completed successfully, 229 total actions
bazel build --spawn_strategy=standalone "@fairmq//..."  0.06s user 0.07s system 0% cpu 2:02.46 total

@jmmv (Contributor) commented Mar 8, 2019

I'm not sure why you focus on fseventsd though... That's probably a symptom more than a problem.

What exactly are you doing? What build times are you seeing? What are you comparing those against?

@ktf (Author) commented Mar 12, 2019

Because I see fseventsd crawling in Activity Monitor, so I suspect it's getting overwhelmed by events and stealing CPU cycles that could otherwise be used for compilation.

bazel-io pushed a commit that referenced this issue Mar 20, 2019
The current implementation of these functions is very inefficient and
degrades overall performance significantly, especially when sandboxing is
enabled. However, that's almost the best we can do with a generic
algorithm.

To make room for optimizations that rely on specific file system features,
move these functions into the FileSystem class. I will supply a custom
implementation for UnixFileSystem later.

Note that this is intended to be a pure code move. I haven't applied any
improvements to the code nor tests yet (with the exception of cleaning up
docstrings).

Addresses #7527.

RELNOTES: None.
PiperOrigin-RevId: 239412965
bazel-io pushed a commit that referenced this issue Mar 20, 2019
Add missing tests for the deleteTreesBelow entry point (expecting to leave
the given path untouched) and for corner cases like handling an unwritable
directory.

Addresses #7527.

RELNOTES: None.
PiperOrigin-RevId: 239489795
bazel-io pushed a commit that referenced this issue Mar 21, 2019
Make deleteTreesBelow faster by assuming that the directories to be deleted
are readable and writable.  We create most of the trees we delete via this
function anyway, so we know that they are accessible for the most part.

The previous code was blindly resetting read/write/execute permissions for
each traversed directory, and was doing so individually, which means we
issued 3 extra syscalls per directory.

And on Unix file systems, go even further by taking advantage of the fact
that readdir returns the type of each entry: there is no need to issue a
separate stat for each entry to determine if it is a subdirectory or not.
Do this from our JNI code because there really is no reason to pay the
cost of going in and out of Java for each file: we are traversing very
large directory trees, so every bit helps.

A fully-local build of a large iOS app on a Mac Pro 2013 shows that this
reduces build times from about 7300s to 5100s.  A build of a similar app
on a MacBook Pro 2015 shows a reduction from 7500s to 5400s.  The impact
on these builds using dynamic execution is much smaller, and there is
no observable improvement in smaller builds.

Addresses #7527.

RELNOTES: None.
PiperOrigin-RevId: 239594433
bazel-io pushed a commit that referenced this issue Apr 15, 2019
During the lifetime of a Bazel server, assign unique identifiers to
each sandboxed action so that their symlink trees are guaranteed to be
disjoint as they are keyed on that identifier.  This was in theory
already happening... but it actually wasn't and there was no test to
validate this assumption.

With that done, there is no need to ensure that the sandbox base is clean
before a build -- unless we are the very first build of a server, in which
case we must ensure we don't clash with possible leftovers from a past
server.

Note that we already clean up each action's tree as soon as the action
completes, so the only thing we are trying to clean up here are stale
files that may be left if those individual deletions didn't work (e.g.
because there were still stray processes writing to the directories)
or if --sandbox_debug was in use.

This is a prerequisite before making deletions asynchronous for two
reasons: first, we don't want to delay build starts if old deletions are
still ongoing; and, second, we don't want to schedule too broad
deletions that could step over subsequent builds (i.e. we only want to
delete the contents of the *-sandbox directories, which contain one
subdirectory per action, and not the whole tree).

Lastly, add a test for this expected behavior (which is what actually
triggered the fix) and for the fact that we expect the identifiers to
be always different.

Partially addresses #7527.

RELNOTES: None.
PiperOrigin-RevId: 243635557
bazel-io pushed a commit that referenced this issue Apr 16, 2019
Each sandbox action runs within a symlink forest that exists in a separate
subtree because we use a unique identifier for those subtrees.  Therefore
it is unnecessary to delete those trees in the critical path.

Tree deletions can be very expensive, especially on macOS, so make them
asynchronous if --experimental_sandbox_async_tree_delete_idle_threads is
given.  When this flag is not zero, Bazel will schedule all deletions on
a separate low-priority thread while the build is running, and will then
use the requested number of threads once the build is done to quickly
catch up with any still-ongoing deletions.

For a large iOS build, this cuts down clean build times with sandboxing
enabled significantly.  It helps more on machines with more cores:

* On a Mac Pro 2013, the improvement is almost 20%:

  standalone:      mean 2746.33, median 2736.00, stddev 33.07
  sandboxed-async: mean 4394.67, median 4393.00, stddev 33.09
  sandboxed-sync:  mean 5284.33, median 5288.00, stddev 20.17

* On a MacBook Pro 2015, we see a more modest 10% improvement:

  standalone:      mean 3418.33, median 3422.00, stddev 7.41
  sandboxed-async: mean 5090.00, median 5086.00, stddev 40.92
  sandboxed-sync:  mean 5694.67, median 5700.00, stddev 37.75

Partially addresses #7527.

RELNOTES: None.
PiperOrigin-RevId: 243805556
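For reference, enabling this from a .bazelrc would look roughly like the following sketch (the thread count is illustrative; the flag also accepts auto, as used in the next comment):

# Delete sandbox trees on a low-priority background thread during the build,
# then use 4 threads to finish any pending deletions once the build is idle.
build --experimental_sandbox_async_tree_delete_idle_threads=4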
@Globegitter

@jmmv I am just trying out the flag --experimental_sandbox_async_tree_delete_idle_threads="auto" that you added, in Bazel 0.26rc4 (it would also be nice to see a description of what auto does; how is it different from HOST_CPUS?). It seems to work well for a build that succeeds, but now I am getting the following NullPointerException for failing builds:

(00:29:13) ERROR: Build failed. Not running target
Internal error thrown during build. Printing stack trace: java.lang.NullPointerException
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:877)
	at com.google.devtools.build.lib.sandbox.SandboxModule.afterCommand(SandboxModule.java:472)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:580)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:506)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:206)
	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:749)
	at com.google.devtools.build.lib.server.GrpcServerImpl.access$1600(GrpcServerImpl.java:103)
	at com.google.devtools.build.lib.server.GrpcServerImpl$2.lambda$run$0(GrpcServerImpl.java:818)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
java.lang.NullPointerException
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:877)
	at com.google.devtools.build.lib.sandbox.SandboxModule.afterCommand(SandboxModule.java:472)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:580)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:506)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:206)
	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:749)
	at com.google.devtools.build.lib.server.GrpcServerImpl.access$1600(GrpcServerImpl.java:103)
	at com.google.devtools.build.lib.server.GrpcServerImpl$2.lambda$run$0(GrpcServerImpl.java:818)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
java.lang.NullPointerException
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:877)
	at com.google.devtools.build.lib.sandbox.SandboxModule.afterCommand(SandboxModule.java:472)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:580)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:506)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:206)
	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:749)
	at com.google.devtools.build.lib.server.GrpcServerImpl.access$1600(GrpcServerImpl.java:103)
	at com.google.devtools.build.lib.server.GrpcServerImpl$2.lambda$run$0(GrpcServerImpl.java:818)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

@meisterT (Member)

Can you try this?

bazel build --experimental_collect_local_sandbox_action_metrics --sandbox_debug //...

Then in the output_base you should have directories called sandbox/<number> with stats.out files in them. The format is https://github.com/bazelbuild/bazel/blob/master/src/main/protobuf/execution_statistics.proto
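A sketch of how one might pull those numbers out afterwards (paths are illustrative; a raw decode works even without the .proto file at hand):

OUTPUT_BASE=$(bazel info output_base)
ls "$OUTPUT_BASE"/sandbox/*/stats.out
# Decode one of the files; field numbers map to execution_statistics.proto.
protoc --decode_raw < "$OUTPUT_BASE"/sandbox/1/stats.out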

@jmmv (Contributor) commented Dec 19, 2019

Also... the only thing left to try is to run with --sandbox_debug to keep the sandbox directory around, run with --subcommands to print what is being run, and then manually copy/paste the command invocation without any of the sandbox-exec stuff to run it within the sandbox directory, outside of Bazel's control.

If that's slow... then your theory of slow symlink traversal has some extra points. But I cannot find any hints online that this may be a problem.

Do you use multiple volumes? Is the sandbox maybe being created in a separate one?
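Roughly, that workflow would look like the following (a sketch; the target, <id>, and <workspace> are placeholders that depend on your build):

# Keep the sandbox directories around and echo every command being run.
bazel build --sandbox_debug --subcommands //your:target
# Copy one of the printed compiler invocations, strip the sandbox-exec
# wrapper, cd into the preserved sandbox execroot, and time it by hand:
cd "$(bazel info output_base)/sandbox/darwin-sandbox/<id>/execroot/<workspace>"
time <copied clang command>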

@ob (Contributor) commented Dec 19, 2019

@meisterT Thanks, I did that and am in the process of writing a text dumper.

@jmmv Interesting, I tried that. I artificially introduced a failure via

#error "fail here"

and ran the bazel command again (sandboxed). Again, it took 18 s. Then I copied the command and ran it manually:

$ time ./buildit.sh
        0.42 real         0.20 user         0.20 sys

So maybe it's something that happens when you run it from Java?

Do you use multiple volumes? Is the sandbox maybe being created in a separate one?

No, it's all in the same SSD volume.

@ob (Contributor) commented Dec 19, 2019

So I ran it again using Instruments.app and the weird thing is that when I run it through Bazel, all the time is spent in clang:

[Screenshot (Screen Shot 2019-12-19 at 10.13.14 AM): Instruments trace of the Bazel-run invocation]

Note that clang is taking 14.36 s in this trace. But if I run the exact same command via bash, I get:

[Screenshot (Screen Shot 2019-12-19 at 10.14.19 AM): Instruments trace of the same command run via bash]

with clang taking 387 ms. This is what made me suspect the symbolic links but now I'm not sure... something in the way that Bazel spawns these commands is changing clang's behavior.

@ob (Contributor) commented Dec 19, 2019

Ok, found it... when clang is being run through Bazel, it gets passed the flag

'-fmodules-cache-path=bazel-out/ios-x86_64-min11.0-applebin_ios-ios_x86_64-dbg/bin/_objc_module_cache'

and that directory is initially empty. This causes clang to recompile all the system and user modules. When I run it without the sandbox (or in my bash script), the *.pcm files are already there, so clang saves a bunch of time. If I modify my bash script to remove that directory before running:

$ time ./buildit.sh
       14.30 real        11.39 user         2.83 sys

which matches what I'm seeing with Bazel.
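Concretely, the only change to the script was to wipe the implicit module cache before invoking clang (a sketch; buildit.sh is the invocation copied from --subcommands):

rm -rf bazel-out/ios-x86_64-min11.0-applebin_ios-ios_x86_64-dbg/bin/_objc_module_cache
time ./buildit.sh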

@jmmv (Contributor) commented Dec 19, 2019

Aha, yeah, that explains it (or at least one problem). I had encountered this earlier because the way the module cache works breaks loudly when sandboxfs is in use but silently "works" in the default case. @allevato

I'm surprised though: from our previous discussions, it seemed that you'd only see the slowdown on some machines and not others, and only after upgrading to Catalina? Those factors shouldn't influence how this works.

@allevato (Member)

The Clang and Swift implicit module caches are inherently incompatible with hermetic builds and sandboxed execution. The only way to achieve the same degree of performance as the implicit module cache in a Bazel-like build is to explicitly precompile the modules as part of the build graph and propagate them as known inputs/outputs instead of dumping them into a single common directory. I'm working on that for Swift, but I don't think there's any progress planned for Clang in the extreme short-term.

@ktf (Author) commented Dec 19, 2019

Is there a way to disable the module cache completely for C++ projects?

As a side note, do you use DYLD_LIBRARY_PATH anywhere? We noticed that since Mojave the lookup is extremely slow, and on my laptop there is ~1 ms of overhead per extra path:

→ time ls
ANALYSIS.md  BUILD  CMakeLists.txt  COOKBOOK.md  README.md  include  scripts  src  test
ls -G  0.00s user 0.00s system 70% cpu 0.006 total
→ time DYLD_LIBRARY_PATH=./a:./b:./c:./d:./e:./f ls
ANALYSIS.md  BUILD  CMakeLists.txt  COOKBOOK.md  README.md  include  scripts  src  test
DYLD_LIBRARY_PATH=./a:./b:./c:./d:./e:./f ls -G  0.00s user 0.01s system 82% cpu 0.012 total

This is particularly bad for the lookup of system libraries.

@ob (Contributor) commented Dec 19, 2019

But I'm surprised though: from our previous discussions, it seemed that you'd only see the slowdown in some machines and not others, and only after upgrading to Catalina? Those shouldn't influence how this works.

I think that was a bug in the version of Xcode that I was using / version of Catalina beta. I had one machine where this was extremely slow, 22 hours for a build. But the 15x difference between non-sandboxed and sandboxed builds has always been there.

@jmmv (Contributor) commented Dec 19, 2019

@allevato We could easily poke a hole in the sandbox to share a cache across actions. Sure, mumble mumble sandboxing, but if the performance hit is so significant, and if it hits people by default, maybe we should do it. This wouldn't actually be that different from using workers.
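For anyone who wants to experiment with that idea in the meantime, a rough sketch (this assumes --sandbox_writable_path behaves as documented, that the directory already exists, and that all actions agree on one absolute cache path; it trades hermeticity for speed):

# .bazelrc sketch: let sandboxed actions write to one shared cache directory
# outside their execroots, and point clang's implicit module cache at it.
build --sandbox_writable_path=/tmp/shared_clang_module_cache
build --copt=-fmodules-cache-path=/tmp/shared_clang_module_cache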

@jmmv jmmv changed the title from "Extremely low performance of bazel when building a large project on macOS" to "Allow sharing caches across actions when sandboxing is in use" on May 13, 2020
@jmmv jmmv added the team-Local-Exec label and removed the z-team-Apple label on May 14, 2020
@sgowroji sgowroji added the stale label on Feb 16, 2023
@sgowroji (Member)

Hi there! We're doing a clean up of old issues and will be closing this one. Please reopen if you’d like to discuss anything further. We’ll respond as soon as we have the bandwidth/resources to do so.

@sgowroji sgowroji closed this as not planned on Feb 16, 2023
@brentleyjones (Contributor)

This is still relevant.

@sgowroji sgowroji reopened this Feb 16, 2023
@sgowroji sgowroji added the "not stale" label and removed the stale label on Feb 16, 2023
@Kila2 commented Jun 24, 2024

I found that when executing Bazel builds, macOS is listening to the Bazel workspace and the bazel-out directory. I don't think the bazel-out directory should be included here.
PS: the watchfs option is false.

@ktf (Author) commented Sep 5, 2024

Any update on this?

@Kila2 commented Sep 5, 2024

We found a workaround: use hdiutil to create a new tmp volume (e.g. a RAM disk), then disable fseventsd for that volume to keep the fseventsd process's CPU usage lower.
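A sketch of that workaround (sizes and paths are illustrative; the no_log marker is the conventional way to ask fseventsd to skip a volume):

# 1. Create and mount a 4 GiB RAM disk (ram:// takes 512-byte sectors).
diskutil erasevolume HFS+ BazelTmp $(hdiutil attach -nomount ram://8388608)
# 2. Ask fseventsd not to log events on this volume.
mkdir -p /Volumes/BazelTmp/.fseventsd
touch /Volumes/BazelTmp/.fseventsd/no_log
# 3. Point Bazel's output base at the new volume (startup option).
bazel --output_base=/Volumes/BazelTmp/bazel_out build //your:target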
