Allow sharing caches across actions when sandboxing is in use #7527
Comments
Hello @ktf, thank you for your feedback.
What I have is: https://github.com/ktf/AliceO2/tree/bazel-support where you can, for example, do …
Thank you for the example project!
It might be worth trying to repro with this commit 0877340
Indeed; that commit may help and I'd be very interested to know if it does. But I have also noticed fseventsd consuming lots of CPU (and, incidentally, this service runs at default priority -- so that commit actually makes Bazel run at a lower priority, which is probably good). I haven't had a chance to dig into this. Maybe setting …
Thanks. I will try that when I have a bit of time. Notice also that the fsevents seem to be relative to stuff in …
Yes. That's where the build artifacts are -- they are not stored in your source tree -- and Bazel has to watch them for out-of-band changes.
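A minimal sketch of what such an out-of-band change looks like (the target label and output path below are hypothetical):

```sh
# Hypothetical target and output path, purely for illustration.
bazel build //pkg:gen
# Modify the action's output directly under bazel-out, bypassing Bazel...
echo "tampered" >> bazel-out/darwin-fastbuild/bin/pkg/gen.txt
# ...so the next build should detect the change and re-run the affected action.
bazel build //pkg:gen
```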
Do you know what the backstory is for watching the build tree? I was surprised to learn this.
@kastiglione All outputs (except for top-level ones) are inputs to other actions. Bazel needs to know if they have been modified out-of-band to invalidate their corresponding actions. (Yes, you could assume that the build tree is not modified by users, but that's not what real-world usage shows.) @keith That's a good point. If the reporter is indeed not using …
@keith AFAICR, I did not explicitly enable … @jmmv, does that also happen for files which are not currently in any given action? The files being watched seem to be the full contents of a tarball of sources I download with …
@ktf have you run without the sandbox ( |
So I gave it a quick try with the suggested target, and I got:
I did a …
I'm not sure why you focus on fseventsd though... That's probably a symptom more than a problem. What exactly are you doing? What build times are you seeing? What are you comparing those against?
Because I see …
The current implementation of these functions is very inefficient and degrades overall performance significantly, especially when sandboxing is enabled. However, that's almost the best we can do with a generic algorithm. To make room for optimizations that rely on specific file system features, move these functions into the FileSystem class. I will supply a custom implementation for UnixFileSystem later. Note that this is intended to be a pure code move. I haven't applied any improvements to the code nor tests yet (with the exception of cleaning up docstrings). Addresses #7527. RELNOTES: None. PiperOrigin-RevId: 239412965
Add missing tests for the deleteTreesBelow entry point (expecting to leave the given path untouched) and for corner cases like handling an unwritable directory. Addresses #7527. RELNOTES: None. PiperOrigin-RevId: 239489795
Make deleteTreesBelow faster by assuming that the directories to be deleted are readable and writable. We create most of the trees we delete via this function anyway, so we know that they are accessible for the most part. The previous code was blindly resetting read/write/execute permissions for each traversed directory, and was doing so individually, which means we issued 3 extra syscalls per directory. And on Unix file systems, go even further by taking advantage of the fact that readdir returns the type of each entry: there is no need to issue a separate stat for each entry to determine if it is a subdirectory or not. Do this from our JNI code because there really is no reason to pay the cost of going in and out of Java for each file: we are traversing very large directory trees, so every bit helps. A fully-local build of a large iOS app on a Mac Pro 2013 shows that this reduces build times from about 7300s to 5100s. A build of a similar app on a MacBook Pro 2015 shows a reduction from 7500s to 5400s. The impact on these builds using dynamic execution is much smaller, and there is no observable improvement in smaller builds. Addresses #7527. RELNOTES: None. PiperOrigin-RevId: 239594433
During the lifetime of a Bazel server, assign unique identifiers to each sandboxed action so that their symlink trees are guaranteed to be disjoint as they are keyed on that identifier. This was in theory already happening... but it actually wasn't and there was no test to validate this assumption. With that done, there is no need to ensure that the sandbox base is clean before a build -- unless we are the very first build of a server, in which case we must ensure we don't clash with possible leftovers from a past server. Note that we already clean up each action's tree as soon as the action completes, so the only thing we are trying to clean up here are stale files that may be left if those individual deletions didn't work (e.g. because there were still stray processes writing to the directories) or if --sandbox_debug was in use. This is a prerequisite before making deletions asynchronous for two reasons: first, we don't want to delay build starts if old deletions are still ongoing; and, second, we don't want to schedule too broad deletions that could step over subsequent builds (i.e. we only want to delete the contents of the *-sandbox directories, which contain one subdirectory per action, and not the whole tree). Lastly, add a test for this expected behavior (which is what actually triggered the fix) and for the fact that we expect the identifiers to be always different. Partially addresses #7527. RELNOTES: None. PiperOrigin-RevId: 243635557
Each sandbox action runs within a symlink forest that exists in a separate subtree because we use a unique identifier for those subtrees. Therefore it is unnecessary to delete those trees in the critical path. Tree deletions can be very expensive, especially on macOS, so make them asynchronous if --experimental_sandbox_async_tree_delete_idle_threads is given. When this flag is not zero, Bazel will schedule all deletions on a separate low-priority thread while the build is running, and will then use the requested number of threads once the build is done to quickly catch up with any still-ongoing deletions. For a large iOS build, this cuts down clean build times with sandboxing enabled significantly. It helps more on machines with more cores:
* On a Mac Pro 2013, the improvement is almost 20%: standalone: mean 2746.33, median 2736.00, stddev 33.07; sandboxed-async: mean 4394.67, median 4393.00, stddev 33.09; sandboxed-sync: mean 5284.33, median 5288.00, stddev 20.17
* On a MacBook Pro 2015, we see a more modest 10% improvement: standalone: mean 3418.33, median 3422.00, stddev 7.41; sandboxed-async: mean 5090.00, median 5086.00, stddev 40.92; sandboxed-sync: mean 5694.67, median 5700.00, stddev 37.75
Partially addresses #7527. RELNOTES: None. PiperOrigin-RevId: 243805556
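For reference, the flag described above would be passed on the command line roughly like this (the target label is a placeholder):

```sh
# Delete sandbox trees asynchronously on a low-priority thread during the
# build, then use 4 threads to finish any pending deletions once the build is idle.
bazel build //your:target --experimental_sandbox_async_tree_delete_idle_threads=4
```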
@jmmv I am just trying out the flag …
Can you try this?
Then you should have in the output_base directories called …
Also... the only thing left to try is to run with … If that's slow... then your theory of slow symlink traversal has some extra points. But I cannot find any hints online that this may be a problem. Do you use multiple volumes? Is the sandbox maybe being created in a separate one?
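One quick way to check the multiple-volumes question is to compare where the source tree and the output base live (a sketch; run from the workspace root):

```sh
# Shows which volume each path is mounted on; if the "Mounted on" column
# differs, the sandbox trees (under the output base) are on a separate volume.
df -h . "$(bazel info output_base)"
```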
@meisterT Thanks, I did that and am in the process of writing a text dumper. @jmmv Interesting, I tried that. I artificially introduced a failure via …
and ran the bazel command again (sandboxed). Again, it took 18 s. Then I copied the command and ran it manually:
So maybe it's something that only happens when you run it from Java?
No, it's all in the same SSD volume.
So I ran it again using Instruments.app, and the weird thing is that when I run it through Bazel, all the time is spent in … Note that with …
Ok, found it... when …
and that directory is initially empty. This causes …
which matches what I'm seeing with Bazel.
Aha, yeah, that explains it (or at least one problem). I had encountered this earlier because the way the module cache works breaks loudly when sandboxfs is in use but silently "works" in the default case. @allevato I'm surprised, though: from our previous discussions, it seemed that you'd only see the slowdown on some machines and not others, and only after upgrading to Catalina? Those shouldn't influence how this works.
The Clang and Swift implicit module caches are inherently incompatible with hermetic builds and sandboxed execution. The only way to achieve the same degree of performance as the implicit module cache in a Bazel-like build is to explicitly precompile the modules as part of the build graph and propagate them as known inputs/outputs instead of dumping them into a single common directory. I'm working on that for Swift, but I don't think there's any progress planned for Clang in the very short term.
Is there a way to disable the module cache completely for C++ projects? As a side note, do you use DYLD_LIBRARY_PATH anywhere? We noticed that since Mojave the lookup is extremely slow, and on my laptop there is ~1ms of overhead per extra path:
This is particularly bad for the lookup of system libraries.
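A rough way to see that per-entry cost is to time the same binary with and without extra search paths (the binary and library paths below are placeholders; absolute numbers will vary by machine and macOS version):

```sh
# Placeholder binary and library paths, for illustration only.
time DYLD_LIBRARY_PATH= ./my_tool --version
time DYLD_LIBRARY_PATH=/opt/a/lib:/opt/b/lib:/opt/c/lib:/opt/d/lib ./my_tool --version
# Per the observation above, each extra path entry adds roughly ~1ms to dyld's lookups.
```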
I think that was a bug in the version of Xcode that I was using / version of Catalina beta. I had one machine where this was extremely slow, 22 hours for a build. But the 15x difference between non-sandboxed and sandboxed builds has always been there.
@allevato We could easily poke a hole in the sandbox to share a cache across actions. Sure, mumble mumble sandboxing, but if the performance hit is so significant, and if it hits people by default, maybe we should do it. This wouldn't be that different from using workers, actually.
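As a sketch of what such a hole might look like from the command line, assuming Bazel's --sandbox_writable_path option and an arbitrarily chosen shared cache directory:

```sh
# Hypothetical shared module-cache location; making it writable inside every
# action's sandbox would let actions reuse each other's cached modules.
bazel build //your:target \
  --sandbox_writable_path=/private/var/tmp/shared_module_cache
```

Whether that trade-off is acceptable depends on how much hermeticity one is willing to give up for the speedup.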
Hi there! We're doing a cleanup of old issues and will be closing this one. Please reopen if you'd like to discuss anything further. We'll respond as soon as we have the bandwidth/resources to do so.
This is still relevant.
I found that when executing Bazel builds, macOS is watching both the Bazel workspace and the bazel-out directory. I don't think the bazel-out directory should be included here.
Any update on this?
We found a workaround: use a ramdisk or …
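For reference, a sketch of the ramdisk half of that workaround on macOS (the size, volume name, and sandbox-base flag are assumptions about the setup and the Bazel version in use):

```sh
# Create and mount a ~2 GiB ramdisk (4194304 512-byte sectors) named bazel-sandbox.
diskutil erasevolume HFS+ bazel-sandbox $(hdiutil attach -nomount ram://4194304)
# Point the sandbox at it; flag availability depends on the Bazel version (assumption).
bazel build //your:target --experimental_sandbox_base=/Volumes/bazel-sandbox
```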
Description of the problem / feature request:
Extremely low performance of Bazel when building a large project on macOS, compared to running CMake + make.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Build a large Bazel project on macOS. I see that a lot of `fsevents` for files in `/private/var/tmp/_bazel_user` are triggered, and `fseventsd` seems to crawl at 100% - 200% CPU usage after them, slowing down the builds. I do have `/private/var/tmp` masked out in my Spotlight configuration, but it does not seem to be enough.

What operating system are you running Bazel on?

What's the output of `bazel info release`?