
Bazel hangs indefinitely when building with remote cache #11782

Closed
keith opened this issue Jul 15, 2020 · 74 comments
Labels
P1: I'll work on this now (assignee required) · team-Remote-Exec: Issues and PRs for the Execution (Remote) team · type: bug

Comments

@keith
Member

keith commented Jul 15, 2020

We intermittently see an issue where Bazel hangs indefinitely while building a target:

[5,322 / 5,621] Compiling Swift module SomeTarget; 11111s remote-cache

This seems to happen while trying to fetch it from the remote cache. In these cases, once we kill the Bazel process, no execution logs or Chrome trace logs are produced, so I don't have that info, but when I sample the process on macOS, or send it kill -QUIT PID, I get these logs:

sample log:

Call graph:
    2796 Thread_77950543   DispatchQueue_1: com.apple.main-thread  (serial)
    + 2796 start  (in libdyld.dylib) + 1  [0x7fff736f0cc9]
    +   2796 main  (in bazel) + 143  [0x106f197ff]
    +     2796 blaze::Main(int, char const* const*, blaze::WorkspaceLayout*, blaze::OptionProcessor*, unsigned long long)  (in bazel) + 5317  [0x106f012a5]
    +       2796 blaze::RunLauncher(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::StartupOptions const&, blaze::OptionProcessor const&, blaze::WorkspaceLayout const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::LoggingInfo*)  (in bazel) + 11470  [0x106f0477e]
    +         2796 blaze::RunClientServerMode(blaze_util::Path const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, blaze_util::Path const&, blaze::WorkspaceLayout const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::OptionProcessor const&, blaze::StartupOptions const&, blaze::LoggingInfo*, blaze::DurationMillis, blaze::DurationMillis, blaze::BlazeServer*)  (in bazel) + 4966  [0x106f113d6]
    +           2796 blaze::BlazeServer::Communicate(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<blaze::RcStartupFlag, std::__1::allocator<blaze::RcStartupFlag> > const&, blaze::LoggingInfo const&, blaze::DurationMillis, blaze::DurationMillis, blaze::DurationMillis)  (in bazel) + 1632  [0x106f09350]
    +             2796 grpc::ClientReader<command_server::RunResponse>::Read(command_server::RunResponse*)  (in bazel) + 496  [0x106f086f0]
    +               2796 cq_pluck(grpc_completion_queue*, void*, gpr_timespec, void*)  (in bazel) + 548  [0x106fda5e4]
    +                 2796 pollset_work(grpc_pollset*, grpc_pollset_worker**, long long)  (in bazel) + 1303  [0x106fb8fc7]
    +                   2796 poll  (in libsystem_kernel.dylib) + 10  [0x7fff738383d6]
    2796 Thread_77950545
    + 2796 start_wqthread  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b77]
    +   2796 _pthread_wqthread  (in libsystem_pthread.dylib) + 390  [0x7fff738f1aa1]
    +     2796 __workq_kernreturn  (in libsystem_kernel.dylib) + 10  [0x7fff738334ce]
    2796 Thread_77950552: default-executor
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 GrpcExecutor::ThreadMain(void*)  (in bazel) + 550  [0x106fbce36]
    +         2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950553: resolver-executor
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 GrpcExecutor::ThreadMain(void*)  (in bazel) + 550  [0x106fbce36]
    +         2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950554: grpc_global_timer
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 timer_thread(void*)  (in bazel) + 895  [0x106fc9dff]
    +         2796 gpr_cv_wait  (in bazel) + 102  [0x106fe8756]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950566
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (blaze::BlazeServer::*)(), blaze::BlazeServer*> >(void*)  (in bazel) + 62  [0x106f1950e]
    +       2796 blaze::BlazeServer::CancelThread()  (in bazel) + 152  [0x106f073d8]
    +         2796 blaze_util::PosixPipe::Receive(void*, int, int*)  (in bazel) + 25  [0x1070ff1f9]
    +           2796 read  (in libsystem_kernel.dylib) + 10  [0x7fff7383281e]
    2796 Thread_77950583: grpc_global_timer
      2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
        2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
          2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
            2796 timer_thread(void*)  (in bazel) + 895  [0x106fc9dff]
              2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
                2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
                  2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]

Total number in stack (recursive counted multiple, when >=5):
        5       _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
        5       thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]

Sort by top of stack, same collapsed (when >= 5):
        __psynch_cvwait  (in libsystem_kernel.dylib)        11184
        __workq_kernreturn  (in libsystem_kernel.dylib)        2796
        poll  (in libsystem_kernel.dylib)        2796
        read  (in libsystem_kernel.dylib)        2796

jvm.out produced by SIGQUIT: https://gist.github.com/keith/3c77f7e49c108964596440a251c05929

It's worth noting that during this time Bazel is consuming 0% CPU, so it seems likely that it's blocked waiting on a response that has already timed out. Currently we're using a gRPC remote cache, with the default remote timeout of 60 seconds and the default value for --cpus.

What operating system are you running Bazel on?

macOS

What's the output of bazel info release?

3.3.0. We've heard reports from others that this reproduces on 3.2.0 as well. We have since updated to 3.4.1, but we don't hit this often enough to know yet whether it's an issue there too.

Have you found anything relevant by searching the web?

There are various related-sounding issues, but AFAICT they're all closed.

@segiddins
Contributor

segiddins commented Jul 15, 2020

We see this issue as well using an HTTP remote cache on Bazel 3.2.0.

@meisterT changed the title from "Bazel hangs indefinitely when building with cache" to "Bazel hangs indefinitely when building with remote cache" on Jul 16, 2020
@meisterT
Member

Cc @coeuvre

@meisterT
Member

I assume this is the stack trace of the blocking thread:

"skyframe-evaluator 12": awaiting notification on [0x0000000645000ca0]

	at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.6/Unknown Source)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:497)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
	at com.google.devtools.build.lib.remote.util.Utils.getFromFuture(Utils.java:56)
	at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:219)
	at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:331)
	at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:179)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:123)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:96)
	at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:39)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:62)
	at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:331)
	at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:779)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:925)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:896)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:128)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:80)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:419)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:897)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:300)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.6/Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.6/Unknown Source)
	at java.lang.Thread.run(java.base@11.0.6/Unknown Source)

@coeuvre
Member

coeuvre commented Jul 17, 2020

Currently, Bazel only triggers a timeout error when no bytes are downloaded/uploaded within --remote_timeout seconds.

In this case, I think it is still downloading bytes from the remote cache, but perhaps at a very slow network speed, so the timeout is never triggered and the build hangs for a long time.

Should we change this to trigger a timeout error if the download/upload hasn't finished within --remote_timeout seconds?
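
A minimal sketch of the two timeout semantics being discussed (illustrative only; TransferTimeoutSketch and its methods are made-up names, not Bazel's actual implementation). With the current behavior the clock resets whenever a chunk arrives, so a transfer that keeps trickling bytes never times out; the proposed behavior bounds the whole transfer:

import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: contrasts the "idle" timeout (current behavior) with an
// overall deadline (proposed behavior) for a single download/upload.
final class TransferTimeoutSketch {
  private final long timeoutMillis;
  private final boolean resetOnProgress;  // true = current behavior, false = proposed
  private final long startMillis = System.currentTimeMillis();
  private final AtomicLong lastProgressMillis = new AtomicLong(System.currentTimeMillis());

  TransferTimeoutSketch(long timeoutMillis, boolean resetOnProgress) {
    this.timeoutMillis = timeoutMillis;
    this.resetOnProgress = resetOnProgress;
  }

  void onChunkReceived() {
    // Only matters for the idle variant: every chunk pushes the deadline out again.
    lastProgressMillis.set(System.currentTimeMillis());
  }

  boolean isTimedOut() {
    long reference = resetOnProgress ? lastProgressMillis.get() : startMillis;
    return System.currentTimeMillis() - reference > timeoutMillis;
  }
}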

@keith
Member Author

keith commented Jul 20, 2020

FWIW, in our case the targets we've seen this happen with are definitely <10 MB, but I suppose there could be an intermittent issue with network performance. I'm definitely a +1 for changing --remote_timeout to time out when the request hasn't completed within that time, but even if we don't want to change that behavior, I think there should be some way to do this, since otherwise these builds seem to hang "forever".

@oquenchil added the team-Remote-Exec (Issues and PRs for the Execution (Remote) team), untriaged, and type: bug labels on Jul 22, 2020
@coeuvre
Member

coeuvre commented Aug 10, 2020

Bazel only triggers timeout error when there is no byte downloaded/uploaded within --remote_timeout seconds.

This is only true for the HTTP remote cache.

For the gRPC remote cache, Bazel does set a per-call timeout with withDeadlineAfter and will retry up to --remote_retries times, which defaults to 5.

In theory, the maximum waiting time for downloading a file/blob from the cache should be 60s * (1 + 5) = 360s by default. There must be something wrong in this case.
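
Roughly, the pattern looks like the following hedged sketch (grpc-java API; the helper and its names are invented for illustration and are not the actual GrpcCacheClient or Retrier code):

import io.grpc.CallOptions;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Illustrative only: each attempt carries its own deadline, and timed-out attempts
// are retried, so the total wait is bounded by remoteTimeout * (1 + remoteRetries).
final class DeadlineWithRetriesSketch {
  static <T> T callWithRetries(
      int remoteTimeoutSeconds, int remoteRetries, Function<CallOptions, T> blockingCall) {
    StatusRuntimeException lastTimeout = null;
    for (int attempt = 0; attempt <= remoteRetries; attempt++) {
      CallOptions perAttempt =
          CallOptions.DEFAULT.withDeadlineAfter(remoteTimeoutSeconds, TimeUnit.SECONDS);
      try {
        return blockingCall.apply(perAttempt);
      } catch (StatusRuntimeException e) {
        if (e.getStatus().getCode() != Status.Code.DEADLINE_EXCEEDED) {
          throw e;  // this sketch only retries timeouts
        }
        lastTimeout = e;
      }
    }
    // With the defaults (60s timeout, 5 retries) this is reached after ~360s at most.
    throw lastTimeout;
  }
}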

@keith
Member Author

keith commented Aug 17, 2020

Related #11957

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

Interesting. Is your remote cache behind a load balancer? If so, what kind?

@keith
Member Author

keith commented Aug 17, 2020

We're using Google RBE as a cache.

@segiddins
Contributor

I've seen it happen with an HTTP cache behind an AWS ELB, and with a gRPC cache not sitting behind anything. I can try with an unbalanced HTTP cache if that would be helpful?

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

Let me repost part of my PR description that may be relevant here:

We have seen cases where GCP's external TCP load balancer silently drops connections without telling the client. If this happens when the client is waiting for replies from the server (e.g., when all worker threads are in blocking 'execute' calls), then the client does not notice the connection loss and hangs.

In our testing, the client unblocked after ~2 hours without this change, although we are not sure why (maybe a default Linux Kernel timeout on TCP connections?). With this flag set to 10 minutes, and running builds over night, we see repeated 10-minute gaps in remote service utilization, which seems pretty clear evidence that this is the underlying problem.

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

While we think it's due to GCP's TCP load balancer, we technically can't rule out any other system on the network between my machine and the cluster. What is surprising to us is that we have seen it happen in the middle of a build. I let some builds run through last night, and we have clear evidence of ~10 hangs during that period, all of which recovered after exactly 10 minutes, with 10 minutes being exactly the keep-alive time I set on Bazel.

(Screenshot: used-executors-keepalive)

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

We think that some of the other spikes could also be caused by the same issue, just that Bazel happened to recover faster because it tried to make a remote call rather than just wait for replies.

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

I think the best case would be to get my PR merged, and then test with specific settings for the timeout and, ideally, monitor the length of the gaps and see if there's a correlation. I think that most clearly shows whether it's a similar problem or a different one.

What I did to debug this was to take heap dumps from Bazel and the server and analyze them with YourKit Profiler. YourKit supports searching for and inspecting individual objects, so I could clearly see that Bazel had an open connection and the server did not have an open connection. This asymmetry tipped me off that it was something in the network.

@EricBurnett

Ulf, please see https://docs.google.com/document/d/13vXJTFHJX_hnYUxKdGQ2puxbNM3uJdZcEO5oNGP-Nnc/edit#heading=h.z3810n36by5c - Jakob was investigating keepalives last year and concluded that gRPC keepalives were unworkable because there's also a limit on the number of consecutive pings, where that limit is "2" in practice.

I'm generally inclined to call this a bug - we (mostly Dave and Chi) have been chasing this steadily and have observed that the outstanding RPC can hang for longer than the RPC timeout, which is fairly clearly either a gRPC or a Bazel bug, since gRPC should have surfaced a timeout by then if nothing else. But the fact that it's also affecting HTTP caches could mean it's a Bazel bug? (Or a Netty bug, if both use Netty?) Either way, there's some concurrency problem where expected signals (RPC failures, timeouts) are not getting handled properly. And if we believe that timeouts are being mishandled and that it may be in the gRPC/Netty/Bazel stack somewhere, it may also be the case that connection terminations should be noticed but are being mishandled by the same bug.

@ulfjack
Contributor

ulfjack commented Aug 17, 2020 via email

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

Based on my professional opinion that that would be a completely absurd way to specify keep-alive ping behavior given what the purpose of a keep-alive ping is, I tried it with the gRPC Java implementation, and I can confirm that it sends PING packets at regular intervals (not just 2) and the server keeps replying to them, as I would expect. I also observe that gRPC seems to enforce an undocumented 10s minimum ping interval regardless of what's specified on the server-side (why???).

Screenshot of Wireshark which clearly shows 10s ping intervals:
(Screenshot: wireshark-grpc-ping)

I also tried to ping the server more aggressively than the server allows, and that does seem to result in a GOAWAY, as expected (based on my assumption that the implementation is dumb). Ideally, the server would tell the client about the minimum allowed keep-alive time rather than having to do that manually.

There's also @ejona86's comment on grpc/grpc-java#7237 which seems to indicate that the GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA setting may be misunderstood. I was not able to find more useful documentation about the setting - note that Java does not have such a setting.

My conclusion is that @buchgr's conclusion in the doc is incorrect, at least for the Java gRPC library. I have not tried it with other language implementations, but I would be surprised if they broke keep-alive pings in this way.
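
For reference, a rough sketch of this kind of client-side keepalive configuration with grpc-java (the endpoint and the exact values are placeholders, not the actual setup used here):

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

// Illustrative only: client-side HTTP/2 keepalive pings in grpc-java.
final class KeepaliveClientSketch {
  static ManagedChannel buildChannel() {
    return NettyChannelBuilder.forAddress("remote-cache.example.com", 443)
        .useTransportSecurity()
        .keepAliveTime(10, TimeUnit.SECONDS)     // send a PING after 10s without reads/writes
        .keepAliveTimeout(20, TimeUnit.SECONDS)  // treat the connection as dead if no PING ack
        .keepAliveWithoutCalls(true)             // keep pinging even when no RPC is in flight
        .build();
  }
}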

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

It is unclear from the doc whether @buchgr tested the actual behavior or came to the conclusion purely based on reading the docs. Also, after reading the comments more carefully, it looks like both @EricBurnett and @bergsieker expressed incredulity at gRPC keep-alive being defined that way, and @buchgr said he sent an email to the mailing list but never reported back.

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

I can't find the email @buchgr mentioned, but all discussions about keep-alive pings on the main gRPC mailing list seem to indicate that pings work exactly the way I would expect rather than having some random limit on how many pings can be sent by the client.

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

Apologies for the many posts, but this is really important for us, because, as mentioned, this makes remote execution unusable in some scenarios, and I'd like to have a workaround / fix in sooner rather than later.

@ejona86

ejona86 commented Aug 18, 2020

As part of investigating grpc/grpc#17667 (not very useful of a reference) I discovered that, yes, C did apply the GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA setting to keepalive pings. And yes, that is absurd. It was "working as expected" by the C core team though, so any discussions with them probably did not lead anywhere. Bringing it to my attention at that point though would have helped. There are legitimate reasons for GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA related to BDP monitoring, but it caused problems to expose the option and the option should not have impacted keepalive. The very existence of the option is what was causing problems here, in a "tail wagging the dog" way. They have carved out some time to fix the implementation; it's not trivial for a variety of reasons.

Java should behave as you expect. The 10 second minimum is defined in https://github.com/grpc/proposal/blob/master/A8-client-side-keepalive.md . The Java documentation states that low values may be increased, but it does not define 10 seconds specifically. Very recently Java did receive something similar to GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA internally, but it isn't exposed as an option and it does not impact keepalive.

When gRPC clients receive a GOAWAY with too_many_pings they should increase their keepalive time by 2x. This is to reduce the impact when things are misconfigured, while not removing errors entirely so the user can be informed of the issue, as they wouldn't be getting their expected keepalive setting.

@ejona86

ejona86 commented Aug 18, 2020

It sounds like you are saying a TCP LB is being used (not L7).

You should be aware there is a 10-minute idle timeout: https://cloud.google.com/compute/docs/troubleshooting/general-tips#communicatewithinternet. But it doesn't seem you are hitting it. In order to avoid it you need a keepalive time of less than 10 minutes (say, 9 minutes). I'd suggest configuring this keepalive on the gRPC server.

You could also be hitting a default 30 second idle timeout. This value can be increased, but probably not above 10 minutes. Again, it would be best for the server to manage the keepalive here, as it is controlling the TCP proxy settings and can update them when things change.
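
A rough sketch of what that server-managed keepalive could look like with grpc-java (the port and the exact durations are illustrative; the 9-minute value just follows the "stay under the 10-minute idle timeout" advice above):

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Illustrative only: server-managed keepalive so idle connections are pinged before
// an intermediate load balancer's idle timeout can fire.
final class KeepaliveServerSketch {
  static Server startServer() throws IOException {
    return NettyServerBuilder.forPort(8980)            // port is a placeholder
        .keepAliveTime(9, TimeUnit.MINUTES)            // ping idle connections from the server side
        .keepAliveTimeout(20, TimeUnit.SECONDS)
        .permitKeepAliveTime(10, TimeUnit.SECONDS)     // smallest client ping interval we tolerate
        .permitKeepAliveWithoutCalls(true)             // don't GOAWAY clients that ping while idle
        // .addService(...) omitted in this sketch
        .build()
        .start();
  }
}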

@EricBurnett

@ejona86 thanks for those details! Unfortunate to hear it does in fact work this way, but good to have confirmation.

@ulfjack :

AIUI, Bazel currently doesn’t set a timeout on execute calls.

Interesting! We should also get that fixed. Bazel should add a timeout to all outgoing RPCs. Retry logic here is a little tricky due to the Execute/WaitExecution dance and the need for supporting arbitrarily long actions, but I think the result should be something like:

  • Bazel sets a timeout on all Execute and WaitExecution calls. Probably just the value of the remote RPC timeout flag, like everything else, for now? But it could go reasonably low (few minutes) if we also:
  • Correct the retrier to not count DEADLINE_EXCEEDED on Execute or WaitExecution against the maximum retries. That's an expected, normal case, meaning "please retry waiting". Right now it does, which requires raising max retries to support really long actions :(. (A rough sketch of this retry shape follows below.)
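
A hedged sketch of that retry shape (the class and helper names are invented; this is not Bazel's actual Retrier):

import io.grpc.Status;
import io.grpc.StatusRuntimeException;

// Illustrative only: DEADLINE_EXCEEDED on Execute/WaitExecution means "keep waiting"
// and does not consume the retry budget; other errors still count against it.
final class ExecuteRetrySketch {
  interface Attempt<T> {
    T call() throws StatusRuntimeException;
  }

  static <T> T waitForAction(Attempt<T> waitExecution, int maxRetries) {
    int retriesUsed = 0;
    while (true) {
      try {
        return waitExecution.call();
      } catch (StatusRuntimeException e) {
        if (e.getStatus().getCode() == Status.Code.DEADLINE_EXCEEDED) {
          continue;  // expected for long actions: re-issue WaitExecution with a fresh deadline
        }
        if (++retriesUsed > maxRetries) {
          throw e;  // genuine failures still respect the retry limit
        }
      }
    }
  }
}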

We could send app-level keep-alives from the server but I don’t see how that’s better than client-side pings.

I'd definitely recommend it. Client-side pings happen at the connection level, but different HTTP2 proxies may also apply idle timeouts to the carried streams IIUC. For Execute requests RBE intentionally sends a no-op streaming response once per minute, which has proven sufficient to keep streams alive in all networking environments our clients run in.

I'm not fully understanding the scenario you're running into with dropped connections, but this may also suffice to avoid your issue, with server-side changes only. That should be quick and easy for you to land.
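
A rough sketch of that server-side stream keepalive idea (illustrative only; the class name and scheduling details are invented, not RBE's implementation):

import com.google.longrunning.Operation;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative only: periodically re-send the current (unchanged) Operation so the
// Execute stream never looks idle to proxies. A real server must synchronize this
// with its normal response path, since StreamObserver is not thread-safe.
final class ExecuteStreamKeepaliveSketch {
  private static final ScheduledExecutorService SCHEDULER =
      Executors.newSingleThreadScheduledExecutor();

  static ScheduledFuture<?> start(StreamObserver<Operation> responses, Operation inProgress) {
    return SCHEDULER.scheduleAtFixedRate(() -> responses.onNext(inProgress), 1, 1, TimeUnit.MINUTES);
  }
}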

@buchgr
Contributor

buchgr commented Aug 18, 2020

I can't say that I still remember the details of this, but here are the contents of the e-mail:

My e-mail

Hey,

I'd like to ask a clarifying question regarding the gRPC keepalive spec for the following scenario: I have an active channel with one or more open server-streaming calls that do not send any messages for hours. The gRPC keepalive time is set to 30s. All other knobs are left at their defaults.

My understanding of the spec is that in this scenario HTTP/2 PING frames will be sent at the following times after the channel was opened: 30s, 60s, 6 minutes, 11 minutes, ...

I am thinking this because GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA defaults to 2 and GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS defaults to 5 minutes.

Is this correct?

Response from the gRPC team

Hi,

After 2 pings without data, we will NOT send any more pings. So if there is no data, two pings will be sent 5 minutes apart after which no pings will be sent.

@ulfjack
Contributor

ulfjack commented Aug 18, 2020

I am using a TCP load balancer rather than an HTTPS load balancer in this cluster because the HTTPS load balancer is significantly more expensive. We previously ran into the 30s idle timeout on the HTTPS load balancer but not the TCP load balancer. As far as I can tell, there is no way to set an idle timeout on an external TCP load balancer in GCP, and it's not clear from the GCP documentation whether there is an idle timeout in this case (it doesn't mention external TCP load balancers, and I also couldn't find a setting for this).

I'm also fairly sure that the connection was active at the time of the timeout, although I can't say this with complete certainty. If I'm right, then a server-side keep-alive ping isn't going to do diddly.

Why should this block merging PR #11957? The Java gRPC library doesn't have this bug, and it's an opt-in setting.

What's the purpose of setting a deadline on an execute call given that the client is just going to retry? This seems like a blatant abuse of functionality that is intended for something else. If we want to detect that the connection is dead, the right way to do that is to use keep-alives.

@coeuvre
Member

coeuvre commented Nov 6, 2020

All of these places can, in theory, throw an unchecked exception. By inserting code there that randomly throws RuntimeException, we can reproduce the hangs, and the stack traces are the same as what we get from users.

At this point, I believe the root cause of the hanging issue is that an unchecked exception was thrown during the RPC, so a Future never completes. It's hard to reproduce locally because the conditions are hard to meet, e.g.

In theory, the issue is fixed by the PRs. Can you help verify?
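
A minimal sketch of that failure mode and the defensive fix, using Guava futures (the class and method here are invented for illustration; the real code paths are the ones in the stack trace above, e.g. RemoteCache.waitForBulkTransfer):

import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.SettableFuture;

// Illustrative only: if the code that is supposed to complete a future throws an
// unchecked exception before set()/setException(), anyone blocking on the future
// (here, the thread seen in waitForBulkTransfer in the stack trace) waits forever.
final class IncompleteFutureSketch {
  static ListenableFuture<Void> transfer(Runnable chunkHandler) {
    SettableFuture<Void> result = SettableFuture.create();
    try {
      chunkHandler.run();      // stands in for the RPC / chunk-handling callback
      result.set(null);
    } catch (RuntimeException e) {
      result.setException(e);  // without this, `result` never completes and the caller hangs
    }
    return result;
  }
}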

bazel-io pushed a commit that referenced this issue Nov 10, 2020
Failed doing so, will cause Bazel hanging forever. This could be one of causes for #11782.

Closes #12426.

PiperOrigin-RevId: 341554621
@ulfjack
Contributor

ulfjack commented Nov 11, 2020

I built a Bazel binary from head including both #12422 and #12426, and I still see hangs.

@coeuvre
Member

coeuvre commented Nov 12, 2020

Where were the hangs? If it's stuck in remote execution, can you try with --experimental_remote_execution_keepalive?

@ulfjack
Contributor

ulfjack commented Nov 12, 2020

Maybe unrelated, but I see this error when I enable --experimental_remote_execution_keepalive:

ERROR: /home/ulfjack/Google/tensorflow/tensorflow/core/util/BUILD:364:24: Action tensorflow/core/util/version_info.cc failed: Exec failed due to IOException: /dev/null.remote (Permission denied)

I have had it running for 4 hours without any major hangs.

@ulfjack
Contributor

ulfjack commented Nov 12, 2020

Well, I spoke too soon. It hung itself up just after my previous reply. Output looks like this:

[8,281 / 11,167] 75 actions, 48 running
    Compiling tensorflow/core/kernels/training_ops.cc; 2945s remote
    Compiling tensorflow/core/kernels/list_kernels.cc; 2885s remote
    Compiling tensorflow/lite/toco/tflite/operator.cc; 2860s remote
    Compiling tensorflow/lite/toco/tflite/op_version.cc; 2860s remote
    Compiling tensorflow/lite/toco/tflite/import.cc; 2860s remote
    Compiling tensorflow/lite/toco/tflite/export.cc; 2860s remote
    Compiling .../compiler/mlir/tensorflow/ir/tf_ops_n_z.cc; 2860s remote
    Compiling .../lite/toco/logging/conversion_log_util.cc; 2860s remote ...

@philwo
Member

philwo commented Nov 12, 2020

Well, I spoke too soon. It hung itself up just after my previous reply. Output looks like this:

Oh no 😱

What does the stack trace say and is there anything interesting in the jvm.out? I think Chi added some more logging that might be useful for analysis.

@ulfjack
Contributor

ulfjack commented Nov 12, 2020

jvm.out is empty, but I didn't capture a stack trace. Will rerun and see if I can get one.

@bergsieker

Note that I was able to reproduce a hang somewhat regularly with illegal GO_AWAY behavior on the server side. I haven't tried a build with Chi's fixes yet, but if you're working with a server that might send GO_AWAY frames that's a possibility.

@ulfjack
Contributor

ulfjack commented Nov 12, 2020

The cluster I'm currently working with is not configured to send GO_AWAY frames.

@ulfjack
Contributor

ulfjack commented Nov 13, 2020

bazel-hanging-threads.txt

Attached a stack trace. It's possible I removed jvm.out before I looked at it.

I have been running Bazel in a loop since last night, and it got stuck 4 times.

@ulfjack
Contributor

ulfjack commented Nov 13, 2020

Nope, no output in jvm.out.

@coeuvre
Member

coeuvre commented Nov 16, 2020

Nope, no output in jvm.out.

The output from the server is stored at $output_base/java.log.

bazel-hanging-threads.txt

Attached a stack trace.

This stack trace shows that Bazel was stuck in remote execution calls (without --experimental_remote_execution_keepalive). This is different from the hangs reported by Keith and other users, whose stack traces showed that Bazel was stuck in remote cache calls. Both #12422 and #12426 fix hangs in remote cache calls.

Hangs in remote execution calls are explained by this design doc and should be fixed with --experimental_remote_execution_keepalive. Does the error from #11782 (comment) stop you from using this flag? If so, could you please share more details (e.g. jvm.log, java.log) so I can look into it?

@ulfjack
Contributor

ulfjack commented Nov 16, 2020

Hmm, OK. The stack trace is from stuck builds with --experimental_remote_execution_keepalive. I think this is a log file from one of those runs.

java.log.txt

@coeuvre
Member

coeuvre commented Nov 16, 2020

Interesting. Remote execution with --experimental_remote_execution_keepalive should use ExperimentalGrpcRemoteExecutor, but the stack traces show that Bazel is stuck in GrpcRemoteExecutor, which is not created when the flag is enabled. Could you please double-check that the hangs are from a build with --experimental_remote_execution_keepalive?

@ulfjack
Contributor

ulfjack commented Nov 16, 2020

Ok, new stack trace.

bazel-hangs-stacks.txt

@ulfjack
Contributor

ulfjack commented Nov 16, 2020

And a matching log file.

java.log.txt

@coeuvre
Member

coeuvre commented Nov 17, 2020

I can't find any hints in java.log.

Since both GrpcRemoteExecutor and ExperimentalGrpcRemoteExecutor use blocking gRPC calls, we don't have the risk we had for remote cache calls, where a future never completes and causes a hang (unless there is a bug inside gRPC). The new stack trace shows that Bazel is waiting for network data. What are the values of --remote_timeout and --remote_retries here? Is it possible we haven't yet reached the maximum time (--remote_timeout x (--remote_retries + 1))?

@coeuvre
Member

coeuvre commented Dec 9, 2020

Closing as the issue seems to be fixed. Feel free to re-open if it happens again.

@coeuvre closed this as completed on Dec 9, 2020
coeuvre added a commit to coeuvre/bazel that referenced this issue Jan 28, 2021
Failed doing so will cause gRPC hanging forever. This could be one of causes that leads to bazelbuild#11782.

Closes bazelbuild#12422.

PiperOrigin-RevId: 340995977
coeuvre added a commit to coeuvre/bazel that referenced this issue Jan 28, 2021
Failed doing so, will cause Bazel hanging forever. This could be one of causes for bazelbuild#11782.

Closes bazelbuild#12426.

PiperOrigin-RevId: 341554621

# Conflicts:
#	src/main/java/com/google/devtools/build/lib/remote/GrpcCacheClient.java
#	src/main/java/com/google/devtools/build/lib/remote/http/HttpCacheClient.java
@ulfjack
Contributor

ulfjack commented Apr 25, 2022

We're still seeing the issue with 6.0.0-pre.20220405.2 (we forgot to explicitly set --grpc_keepalive_time and it's still disabled by default).

@coeuvre
Member

coeuvre commented Apr 25, 2022

(we forgot to explicitly set --grpc_keepalive_time and it's still disabled by default).

Does that mean you don't have the issue if --grpc_keepalive_time is set? Can you try with --experimental_remote_execution_keepalive?

@wolfd
Contributor

wolfd commented Apr 27, 2022

Also seeing this happen with a gRPC-based remote cache. We also have a TCP load balancer in the loop, although it's on AWS rather than GCP. Adding --grpc_keepalive_time seems to help a lot: with the flag set to 10s, a clean build went from 1h42m to 8m.
