Builds without the Bytes causes Bazel OOM #18145
Comments
Can you share the full list of flags you're building with? (Feel free to redact out sensitive data like URLs, etc) One thing that could help debug this would be to capture a heap histogram ( |
Hi @tjgq , yes, sure. Here are the flags, I redacted target names and URLs but I suppose that should be OK:
EDIT: Here are the histograms. Without BWOB:
And for BWOB ( a bazel server that stopped responding, sometimes it crashes sometimes it stays in some "zombie state" ):
The difference seems quite dramatic: it's more than 3x. EDIT: To make things clearer for debugging, here are the flags that were added to the BWOB box compared to the non-BWOB box ( they're a subset of the flags posted above ):
|
The plot thickens. I took three heap dumps:
The heap dump from the BWOB server after a build is roughly 3x the size of the heap dump taken from a resting Bazel server after a non-BWOB build. To me, that looks like a memory leak that eventually leads to memory pressure the Bazel server is unable to handle, but if you have any other ideas, let me know! |
For BwoB specifically, it might be #17604 where Bazel sends download progress to the UI. It keeps some state for each active download. In your case, I guess there are many active downloads which causes OOM. Can you try with a patched Bazel that comments out function |
@coeuvre I tracked this even further. You're pointing to the right thing, but I still believe Bazel is defective here - in fact, I believe there is at least one, and there might be two, bugs at play. Brace yourself, there's a lot of text :)

Bug #1: I looked more into the heap dumps - it seems the UIStateTracker that is owned by UIEventHandler does have a non-zero-size "runningDownloads" deque. Fortunately, this class also hosts a "downloads" TreeMap which points to the latest event for the given URL. The instances of this event are of type DownloadProgress, which matches perfectly the fact that these are dispatched to be observed only if the download is done with Priority.LOW. And guess what - this is done by ToplevelArtifactsDownloader, which is created only when we use BWOB! See:

Now, the interesting bit: remember the tree I mentioned above? The last progress message is empty. And there's only one moment when the progress message can be empty - when a download is started. See this code:

A quick read will tell you that it only returns an empty progress message if includeBytes = false and finished = false. And this combination is used only in started(), here:

The logical conclusion is - we got the "started" event but never got the finished() event, which would remove entries from the UI's tracked downloads and ultimately allow the UI thread to shut down. Now, where all this machinery is called - see here:

It calls started(), and then attaches a listener that calls finished() when the blob is downloaded. My guess is that either this listener does not fire, or

Closing thought - we should look into why this listener potentially wouldn't run. On my end, I'll compile Bazel tomorrow with some logging and see if it indeed does not run ( or if it runs but out.close() throws an exception ).

Bug #2: This relates to how UIEventHandler eventually stops the updating thread. UIEventThread has HasActivities, which looks like this:
Now, when those things change, the corresponding events are listened for and the check is redone in UIEventHandler, so that if the new state allows it, we can stop the UI update thread. For example:

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/runtime/UiEventHandler.java?L583 ( BuildCompletedEvent )
https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/runtime/UiEventHandler.java?L760 ( ActionUploadFinishedEvent )

However, for the event that updates the downloads: We don't seem to call

This also matches my observations - one of the leaked cli-update-threads indeed had an empty

On my end, I'll compile Bazel tomorrow with the downloadProgress commented out ( just like you said ), which should help with the leaking. I'll report tomorrow whether it works. |
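To make the two suspected failure modes easier to follow, here is a minimal, self-contained sketch of the bookkeeping described in this comment. It is not Bazel's code: the class `UiLeakSketch`, its methods, and the `recheckOnDownloadEvents` switch are all hypothetical names invented for illustration, and the update thread is modelled as a plain boolean. The only details taken from the discussion are the "runningDownloads" deque, the "downloads" TreeMap keyed by URL, and the empty progress message recorded when a download starts.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of the UI download bookkeeping; not Bazel's actual classes. */
final class UiLeakSketch {
  // Active download URLs, plus the latest progress message seen per URL.
  private final Deque<String> runningDownloads = new ArrayDeque<>();
  private final Map<String, String> downloads = new TreeMap<>();

  // Assumption for illustration: toggles whether download events re-run the check.
  private final boolean recheckOnDownloadEvents;
  private boolean buildCompleted = false;
  // Stands in for the "cli-update-thread"; true means it is still alive.
  private boolean updateThreadRunning = true;

  UiLeakSketch(boolean recheckOnDownloadEvents) {
    this.recheckOnDownloadEvents = recheckOnDownloadEvents;
  }

  void downloadStarted(String url) {
    runningDownloads.addLast(url);
    downloads.put(url, ""); // a freshly started download has an empty progress message
  }

  void downloadFinished(String url) {
    runningDownloads.remove(url);
    downloads.remove(url);
    if (recheckOnDownloadEvents) {
      checkActivities();
    }
  }

  void onBuildCompleted() {
    buildCompleted = true;
    checkActivities();
  }

  boolean hasActivities() {
    return !buildCompleted || !runningDownloads.isEmpty();
  }

  // Stops the (modelled) update thread once nothing is left to report.
  private void checkActivities() {
    if (!hasActivities()) {
      updateThreadRunning = false;
    }
  }

  boolean isUpdateThreadRunning() {
    return updateThreadRunning;
  }

  String lastMessage(String url) {
    return downloads.get(url);
  }
}
```

In this model, a download whose `downloadFinished` call never arrives keeps `hasActivities()` true forever (the first suspected bug), and with `recheckOnDownloadEvents` set to false the thread flag stays true even after the last download finishes following build completion (the second suspected bug).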
Hi there, Sorry, busy days & didn't have time to work on this. I'd like to report that patching out the event handler in UI tracker indeed helped with the memory leak - we don't see our bazel server crashing anymore. |
Thanks for confirming! I will work on the fix soon. |
@coeuvre Do we have a formal fix for the issue so far or still just comment out https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/runtime/UiEventHandler.java?L672? |
I believe this is fixed in Bazel HEAD due to other changes. Let me try to work out a dedicated fix for 6.3.0. |
We report download progress to the UI when downloading outputs from the remote cache. The UI thread keeps track of active downloads. There are two cases in which the UI thread could leak memory:

1. If we fail to close the output stream, `reporter.finished()` is never called, which prevents the UI thread from releasing the active download. This is fixed by calling `reporter.finished()` in a `finally` block.
2. Normally, the UI thread stops after the `BuildCompleted` event. However, if background downloads are still running after the build is completed, the UI thread is kept alive to continue printing download progress. After all downloads were done, we forgot to stop the UI thread, so all objects it referenced leaked. This is fixed by calling `checkActivities()` for every download progress.

Fixes bazelbuild#18145.
Closes bazelbuild#18593.
PiperOrigin-RevId: 539923685
Change-Id: I7e2887035e540b39e382ab5fcbc06bad03b10427
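The first fix described in this commit message can be illustrated with a small sketch. This is not the actual Bazel change: `FinallyReportSketch`, its nested `Reporter`, and `FailingStream` are hypothetical names made up for the example, and the reporter merely counts active downloads. It only shows the general pattern of reporting completion from a `finally` block so that a failing `close()` cannot leave a download marked as active.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of reporting completion in a finally block. */
final class FinallyReportSketch {
  /** Stand-in reporter that only counts downloads still considered active. */
  static final class Reporter {
    final AtomicInteger active = new AtomicInteger();

    void started() {
      active.incrementAndGet();
    }

    void finished() {
      active.decrementAndGet();
    }
  }

  /** Closes the stream and always reports completion, even if close() throws. */
  static void completeDownload(OutputStream out, Reporter reporter) throws IOException {
    try {
      out.close();
    } finally {
      reporter.finished();
    }
  }

  /** Stream whose close() always fails, to simulate the problematic case. */
  static final class FailingStream extends OutputStream {
    @Override
    public void write(int b) {
      // Discards data; only close() matters for this sketch.
    }

    @Override
    public void close() throws IOException {
      throw new IOException("simulated close failure");
    }
  }
}
```

With this shape, the `IOException` from `close()` still propagates to the caller, but the reporter's active count returns to zero first, so nothing tracking active downloads keeps a stale entry.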
Thanks a ton for working on this @coeuvre! Once Bazel 6.3.0 drops, I'll report on whether the issue has indeed been mitigated in our setup. |
We report download progress to the UI when downloading outputs from the remote cache. The UI thread keeps track of active downloads. There are two cases in which the UI thread could leak memory:

1. If we fail to close the output stream, `reporter.finished()` is never called, which prevents the UI thread from releasing the active download. This is fixed by calling `reporter.finished()` in a `finally` block.
2. Normally, the UI thread stops after the `BuildCompleted` event. However, if background downloads are still running after the build is completed, the UI thread is kept alive to continue printing download progress. After all downloads were done, we forgot to stop the UI thread, so all objects it referenced leaked. This is fixed by calling `checkActivities()` for every download progress.

Fixes #18145.
Closes #18593.
PiperOrigin-RevId: 539923685
Change-Id: I7e2887035e540b39e382ab5fcbc06bad03b10427
Co-authored-by: Chi Wang <chiwang@google.com>
@alexofortune @coeuvre, I am still experiencing occasional Bazel OOMs on incremental RBE BwoB builds using 6.3.0rc1. I also see many large "cli-update-threads" in the dump created during the OOM, with a cumulative retained size accounting for ~40% of my total heap. I think this issue should be reopened. I'm happy to provide more information from my dumps, but I'm unable to share them directly. |
Can you try whether it can be reproduced with Bazel@HEAD? |
Please reopen with more information if the issue persists in 7.x. |
Description of the bug:
We're building code around a fairly large repo and we noticed Bazel server consistently OOMing when --remote_download_toplevel is used.
A heap dump was made ( unfortunately I don't think I can share it :( ), which points to four threads, all named "cli-update-thread", whose local variables take up roughly 80% of the memory. The variable stack ( from the thread GC root inwards ):
( Each of the threads seems to have their own InMemoryMemoizingEvaluator )
It should be noted that this happens very consistently using BWOB and does not happen basically at all using non-BWOB builds.
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Make a large build with BWOB running over a rather large repository. It won't crash every time, but it will crash eventually.
Which operating system are you running Bazel on?
Linux
What is the output of `bazel info release`?
release 6.1.0
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD`?
No response
Have you found anything relevant by searching the web?
Nothing
Any other information, logs, or outputs that you want to share?
Bazel before OOM threw a lot of logs like this: