Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Builds without the Bytes causes Bazel OOM #18145

Closed
alexofortune opened this issue Apr 19, 2023 · 13 comments
Closed

Builds without the Bytes causes Bazel OOM #18145

alexofortune opened this issue Apr 19, 2023 · 13 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@alexofortune
Copy link

alexofortune commented Apr 19, 2023

Description of the bug:

We're building code around a fairly large repo and we noticed Bazel server consistently OOMing when --remote_download_toplevel is used.

A heap dump was made ( unfortunately i don't think I can share it :( ), which points to four threads, all named "cli-update-thread" local variables taking roughly 80% of the memory. The variables stack ( from thread gc root inwards ):

UIEventHandler.stateTracker.executionProgressReceiver.eventBus.subscribers.table[0] ( of type InMemoryMemoizingEvaluator )
evaluator.graph.nodeMap -> contains many, many, many entries.

( Each of the threads seems to have their own InMemoryMemoizingEvaluator )

It should be noted that this happens very consistently using BWOB and does not happen basically at all using non-BWOB builds.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Make a large build with BWOB running over a rather large repository. It won't crash every time, but it will crash eventually.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 6.1.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

Nothing

Any other information, logs, or outputs that you want to share?

Bazel before OOM threw a lot of logs like this:

230419 11:46:00.012:I 19 [com.google.devtools.build.lib.skyframe.HighWaterMarkLimiter.handle] Dropping unnecessary temporary state in response to memory pressure. actual=99 threshold=85 [CONTEXT ratelimit_period="10 SECONDS [skipped: 1]" ]
230419 11:46:02.591:I 19 [com.google.devtools.build.lib.metrics.PostGCMemoryUseRecorder.handleNotification] Memory use after full GC: 5546637224
@meisterT meisterT added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Apr 19, 2023
@tjgq
Copy link
Contributor

tjgq commented Apr 19, 2023

Can you share the full list of flags you're building with? (Feel free to redact out sensitive data like URLs, etc)

One thing that could help debug this would be to capture a heap histogram (jmap -histo) for the "with bytes" and "without bytes" scenarios and diff them to see where the extra allocations are coming from. I think this should be ok to share, since it would only contain the names of classes in the Bazel codebase, and nothing specific to your project?

@alexofortune
Copy link
Author

alexofortune commented Apr 19, 2023

Hi @tjgq , yes, sure. Here are the flags, I redacted target names and URLs but I suppose that should be OK:

build --auto_cpu_environment_group=[redacted]
build --crosstool_top=[redacted]
build --platforms [redacted]
build --host_platform [redacted]
build --noincompatible_remove_legacy_whole_archive
build --sandbox_base=/dev/shm
build --experimental_strict_action_env
build --jobs 32
build --local_test_jobs 2
build --noremote_allow_symlink_upload
build --noremote_verify_downloads
build --experimental_remote_merkle_tree_cache=true
build --experimental_remote_merkle_tree_cache_size=10000
build --disk_cache=[redacted]
build --test_tmpdir=[redacted]
build --incompatible_remote_results_ignore_disk
build --remote_download_toplevel
build --remote_upload_local_results=false
build --remote_cache=[redacted, remote-cache-grpc-upstream url]
build --test_tmpdir=/mnt/cache/bazel/test_tmpdir

EDIT:

Here are the histograms.

Without BWOB:

  1:       5046563      443718888  [B (java.base@11.0.6)
   2:       4748176      284030144  [Ljava.lang.Object; (java.base@11.0.6)
   3:       3990750       95778000  java.lang.String (java.base@11.0.6)
   4:       2692717       86166944  java.util.concurrent.ConcurrentHashMap$Node (java.base@11.0.6)
   5:       2078356       83134240  com.google.common.collect.MapMakerInternalMap$WeakKeyDummyValueEntry
   6:       1839395       73575800  com.google.devtools.build.skyframe.InMemoryNodeEntry
   7:       2407414       38518624  com.google.devtools.build.lib.vfs.PathFragment$RelativePathFragment
   8:       1479944       35518656  com.google.devtools.build.lib.collect.nestedset.NestedSet
   9:       1143955       27454920  java.util.ArrayList (java.base@11.0.6)
  10:           169       24973880  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@11.0.6)
  11:        575593       23023720  java.util.LinkedHashMap$Entry (java.base@11.0.6)
  12:        542849       21713960  com.google.devtools.build.lib.actions.FileContentsProxy
  13:        880121       21122904  com.google.devtools.build.lib.vfs.RootedPath
  14:        593551       18993632  com.google.devtools.build.lib.actions.Artifact$DerivedArtifact
  15:       1091842       17469472  com.google.common.collect.RegularImmutableList
  16:        527896       16892672  com.google.devtools.build.lib.actions.FileArtifactValue$RegularFileArtifactValue
  17:        274841       15391096  com.google.devtools.build.lib.packages.Rule
  18:        268540       15038240  java.util.LinkedHashMap (java.base@11.0.6)
[...]
Total      45904234     1794130736

And for BWOB ( a bazel server that stopped responding, sometimes it crashes sometimes it stays in some "zombie state" ):

 1:      12763072      990879536  [B (java.base@11.0.6)
   2:      16448840      895386536  [Ljava.lang.Object; (java.base@11.0.6)
   3:       7860660      314426400  com.google.devtools.build.skyframe.InMemoryNodeEntry
   4:       9688347      310027104  java.util.concurrent.ConcurrentHashMap$Node (java.base@11.0.6)
   5:      10308844      247412256  java.lang.String (java.base@11.0.6)
   6:       3173465      126938600  java.util.LinkedHashMap$Entry (java.base@11.0.6)
   7:       4789560      114949440  java.util.ArrayList (java.base@11.0.6)
   8:       4254406      102105744  com.google.devtools.build.lib.collect.nestedset.NestedSet
   9:       6061622       96985952  com.google.devtools.build.lib.vfs.PathFragment$RelativePathFragment
  10:       1707246       95605776  com.google.devtools.build.lib.packages.Rule
  11:           488       91158512  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@11.0.6)
  12:       2223180       88927200  com.google.common.collect.MapMakerInternalMap$WeakKeyDummyValueEntry
  13:       1386372       77636832  java.util.LinkedHashMap (java.base@11.0.6)
  14:       1441043       69969088  [Ljava.util.HashMap$Node; (java.base@11.0.6)
  15:       2151420       68845440  java.util.HashMap$Node (java.base@11.0.6)
  16:       1238280       59437440  com.google.common.collect.SingletonImmutableBiMap
  17:       3643862       58301792  com.google.common.collect.RegularImmutableList
  18:       1720587       55058784  com.google.devtools.build.lib.packages.CallStack$Node
[...]
Total     143838308     5324807032

Difference seems quite dramatic, it's more than 3x.

EDIT:

To make things clearer for debugging, here are the flags that were added to the BWOB box comparing to Non-BWOB box ( they're a subset of the flags posted above ):

build --disk_cache=[redacted]
build --test_tmpdir=[redacted]
build --incompatible_remote_results_ignore_disk
build --remote_download_toplevel
build --remote_upload_local_results=false
build --remote_cache=[redacted, remote-cache-grpc-upstream url]
build --test_tmpdir=/mnt/cache/bazel/test_tmpdir

@alexofortune
Copy link
Author

The plot thickens.

I took three heap dumps:

  • Heap Dump from non-BWOB server, after build ( resting state )
  • Heap Dump from BWOB server after build ( resting state ). I forgot to mention BWOB doesn't crash consistently 100% of the time
  • Heap Dump after OOM on BWOB server.

Heap Dump from BWOB server after build vs Heap dump taken off a resting Bazel server post non-BWOB build is like 3x in difference.
And what's more - dominator tree in Memory Analyzer ( shows biggest objects and their connections to GC roots ) looks basically the same in the resting bazel server post BWOB build and the one that crashed with OOM during BWOB build. The tree looks completely different on non-BWOB resting bazel heapdump ( no mention of "cli-update-thread" at all, the majority is a single memoizing evaluator that is pointed to by SequencedSkyframeExecutor which is rooted in main Thread )

For me, that looks like a memory leak that eventually leads to memory pressure that bazel server is unable to handle, but if there's any other ideas you have, lemme know!

@coeuvre
Copy link
Member

coeuvre commented Apr 20, 2023

cli-update-threads, created by UiEventHandler, are used to handle UI related events.

For BwoB specifically, it might be #17604 where Bazel sends download progress to the UI. It keeps some state for each active download.

In your case, I guess there are many active downloads which causes OOM. Can you try with a patched Bazel that comments out function downloadProgress to see whether the OOM still happens?

@alexofortune
Copy link
Author

alexofortune commented Apr 20, 2023

@coeuvre I tracked this even further, you're pointing to the right thing, but I still believe Bazel is defective here - In fact, I believe there is at least one , and might be two bugs at play here. Brace yourself, there's a lot of text :)

Bug #1:

I looked more into the heap dumps - it seems UIStateTracker that is owned by UIEventHandler does have a non-zero-size "runningDownloads" deque. Fortunately, this class also hosts a "downloads" TreeMap which points to the latest event for the given URL. And the instances of this event are of instance type DownloadProgress and it matches perfectly to the fact that these are dispatched to be observed only if download is done with Priority.LOW. And guess what - this is done by ToplevelArtifactsDownloader , which is created only when we use BWOB!

See:
https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/remote/RemoteModule.java?L975 ( Toplevel downloader created only if we do not download all outputs )
https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/remote/RemoteActionInputFetcher.java?L99 ( Only Priority.LOW downloads are dispatching events eventually listened to by the UIEventHandler )

Now, the interesting bit: Remember the tree i mentioned above? The last progress message is empty. And there's only one moment when progress message can be empty - when a download is started. See this code:

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/remote/RemoteCache.java?L286

Quick read will tell you that it only returns an empty progress message if includeBytes = false and finished = false. And this combination is called only in started(), here:

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/remote/RemoteCache.java?L273

The logical conclusion is - we got the "started" event but never got to listen on finished() event which would remove entries from UI tracking downloads and ultimately allow shutting down UI thread.

Now, where all this machinery is called - see here:

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/remote/RemoteCache.java?L374

It calls started(), and then attaches a listener that calls finished() when blob is downloaded. My guess is that either this listener does not fire, or out.close(); throws an IOException and finished() never runs. I believe in my case it's "listener never fires" situation, as i didn't see the exception logged in my log ( but this probably should be fixed regardless ).

Closing thought - we should look into why potentially this listener wouldn't run. On my end, I'll compile Bazel tomorrow with some logging and see if it indeed does not run ( or if it runs but out.close() throws an exception ).

Bug nr #2:

This is in relation to how UIEventHandler eventually stops the updating thread. UIEventThread has stopUpdateThread() which will stop the thread. For this to happen, several conditions have to happen - during various events, stateTracker.hasActivities() is called ( either directly or via a helper method checkActivities() on the class ) to re-check for these conditions when some of this state is changed..

HasActivities looks like this:

  synchronized boolean hasActivities() {
    return !(buildCompleted()
        && bepOpenTransports.isEmpty()
        && activeActionUploads.get() == 0
        && activeActionDownloads.get() == 0 // this variable is dead and always zero btw 
        && runningDownloads.isEmpty());
  }

Now, when those things change, events are being listened on and checking is re-done in UIEventHandler to ensure that in case of a new state, we can stop the UI update thread.

For example:

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/runtime/UiEventHandler.java?L583 ( BuildCompletedEvent )

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/runtime/UiEventHandler.java?L760 ( ActionUploadFinishedEvent )

However, for the event that updates the downloads:

https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/runtime/UiEventHandler.java?L672

We don't seem to call checkActivities() or stateTracker.hasActivities() anywhere in this method, so, even if we dropped downloads back to zero, it seems that if these events came last and none of above wuould come after it, we'd never re-check the state and never shut down the UI thread.

This also matches my observations - one of the leaked cli-update-threads indeed had an empty runningDownloads deque, yet it was still leaking, so my guess is this is what happened.

On my end, I'll compile Bazel tomorrow with the downloadProgress commented out ( just like you said ) , which should help with the leaking. I'll report tomorrow whether it works.

@coeuvre coeuvre self-assigned this Apr 25, 2023
@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Apr 25, 2023
@alexofortune
Copy link
Author

Hi there,

Sorry, busy days & didn't have time to work on this. I'd like to report that patching out the event handler in UI tracker indeed helped with the memory leak - we don't see our bazel server crashing anymore.

@coeuvre
Copy link
Member

coeuvre commented Apr 27, 2023

Thanks for confirming! I will work on the fix soon.

@jacobmou
Copy link

jacobmou commented May 9, 2023

@coeuvre Do we have a formal fix for the issue so far or still just comment out https://sourcegraph.com/github.com/bazelbuild/bazel/-/blob/src/main/java/com/google/devtools/build/lib/runtime/UiEventHandler.java?L672?
i was trying to upgrade to bazel 6.1.1 and would hope to leverage builds without the bytes.

@coeuvre
Copy link
Member

coeuvre commented Jun 6, 2023

I believe this is fixed in Bazel HEAD due to other changes. Let me try to work out a dedicated fix for 6.3.0.

iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Jun 13, 2023
We report download progress to UI when downloading outputs from remote cache. UI thread keeps track of active downloads. There are two cases the UI thread could leak memory:

1. If we failed to close the output stream, the `reporter.finished()` will never be called, prevent UI thread from releasing the active download. This is fixed by calling `reporter.finished()` in `finally` block.
2. Normally, UI thread stops after `BuildCompleted` event. However, if we have background download after build is completed, UI thread is not stopped to continue printing out download progress. But after all downloads are done, we forgot to stop the UI thread, resulting all referenced objects leaked. This is fixed by calling `checkActivities()` for every download progress.

Fixes bazelbuild#18145.

Closes bazelbuild#18593.

PiperOrigin-RevId: 539923685
Change-Id: I7e2887035e540b39e382ab5fcbc06bad03b10427
@alexofortune
Copy link
Author

Thanks a tons for working on this @coeuvre! Once Bazel 6.3.0 drops, I'll report on whether the issue indeed has been mitigated in our setup.

iancha1992 added a commit that referenced this issue Jun 14, 2023
We report download progress to UI when downloading outputs from remote cache. UI thread keeps track of active downloads. There are two cases the UI thread could leak memory:

1. If we failed to close the output stream, the `reporter.finished()` will never be called, prevent UI thread from releasing the active download. This is fixed by calling `reporter.finished()` in `finally` block.
2. Normally, UI thread stops after `BuildCompleted` event. However, if we have background download after build is completed, UI thread is not stopped to continue printing out download progress. But after all downloads are done, we forgot to stop the UI thread, resulting all referenced objects leaked. This is fixed by calling `checkActivities()` for every download progress.

Fixes #18145.

Closes #18593.

PiperOrigin-RevId: 539923685
Change-Id: I7e2887035e540b39e382ab5fcbc06bad03b10427

Co-authored-by: Chi Wang <chiwang@google.com>
traversaro pushed a commit to traversaro/bazel that referenced this issue Jun 24, 2023
We report download progress to UI when downloading outputs from remote cache. UI thread keeps track of active downloads. There are two cases the UI thread could leak memory:

1. If we failed to close the output stream, the `reporter.finished()` will never be called, prevent UI thread from releasing the active download. This is fixed by calling `reporter.finished()` in `finally` block.
2. Normally, UI thread stops after `BuildCompleted` event. However, if we have background download after build is completed, UI thread is not stopped to continue printing out download progress. But after all downloads are done, we forgot to stop the UI thread, resulting all referenced objects leaked. This is fixed by calling `checkActivities()` for every download progress.

Fixes bazelbuild#18145.

Closes bazelbuild#18593.

PiperOrigin-RevId: 539923685
Change-Id: I7e2887035e540b39e382ab5fcbc06bad03b10427
@joeljeske
Copy link
Contributor

joeljeske commented Jul 21, 2023

@alexofortune @coeuvre, I am still experiencing occasional Bazel OOMs on incremental RBE BwoB builds using 6.3.0rc1. I also see many large "cli-update-threads" in dump created during the OOM, with accumulative retained size accounting for ~40% of my total heap.

I think this issue should be reopened. Happy to provide more information from my dumps, but unable to share them directly.

@coeuvre coeuvre reopened this Jul 24, 2023
@coeuvre
Copy link
Member

coeuvre commented Jul 24, 2023

Can you try whether it can be reproduced with Bazel@HEAD?

@tjgq
Copy link
Contributor

tjgq commented Aug 29, 2024

Please reopen with more information if the issue persists in 7.x.

@tjgq tjgq closed this as completed Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants