
Bazel hangs indefinitely when building with remote cache #11782

Closed
keith opened this issue Jul 15, 2020 · 74 comments
Labels
P1: I'll work on this now (assignee required) · team-Remote-Exec: Issues and PRs for the Execution (Remote) team · type: bug

Comments

@keith
Member

keith commented Jul 15, 2020

We intermittently see an issue where Bazel hangs indefinitely while building a target:

[5,322 / 5,621] Compiling Swift module SomeTarget; 11111s remote-cache

This seems to happen while trying to fetch it from the remote cache. In these cases, once we kill the Bazel process, no execution logs or Chrome trace logs are produced, so I don't have that info, but when I sample the process on macOS, or send it kill -QUIT PID, I get these logs:

sample log:

Call graph:
    2796 Thread_77950543   DispatchQueue_1: com.apple.main-thread  (serial)
    + 2796 start  (in libdyld.dylib) + 1  [0x7fff736f0cc9]
    +   2796 main  (in bazel) + 143  [0x106f197ff]
    +     2796 blaze::Main(int, char const* const*, blaze::WorkspaceLayout*, blaze::OptionProcessor*, unsigned long long)  (in bazel) + 5317  [0x106f012a5]
    +       2796 blaze::RunLauncher(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::StartupOptions const&, blaze::OptionProcessor const&, blaze::WorkspaceLayout const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::LoggingInfo*)  (in bazel) + 11470  [0x106f0477e]
    +         2796 blaze::RunClientServerMode(blaze_util::Path const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, blaze_util::Path const&, blaze::WorkspaceLayout const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::OptionProcessor const&, blaze::StartupOptions const&, blaze::LoggingInfo*, blaze::DurationMillis, blaze::DurationMillis, blaze::BlazeServer*)  (in bazel) + 4966  [0x106f113d6]
    +           2796 blaze::BlazeServer::Communicate(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<blaze::RcStartupFlag, std::__1::allocator<blaze::RcStartupFlag> > const&, blaze::LoggingInfo const&, blaze::DurationMillis, blaze::DurationMillis, blaze::DurationMillis)  (in bazel) + 1632  [0x106f09350]
    +             2796 grpc::ClientReader<command_server::RunResponse>::Read(command_server::RunResponse*)  (in bazel) + 496  [0x106f086f0]
    +               2796 cq_pluck(grpc_completion_queue*, void*, gpr_timespec, void*)  (in bazel) + 548  [0x106fda5e4]
    +                 2796 pollset_work(grpc_pollset*, grpc_pollset_worker**, long long)  (in bazel) + 1303  [0x106fb8fc7]
    +                   2796 poll  (in libsystem_kernel.dylib) + 10  [0x7fff738383d6]
    2796 Thread_77950545
    + 2796 start_wqthread  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b77]
    +   2796 _pthread_wqthread  (in libsystem_pthread.dylib) + 390  [0x7fff738f1aa1]
    +     2796 __workq_kernreturn  (in libsystem_kernel.dylib) + 10  [0x7fff738334ce]
    2796 Thread_77950552: default-executor
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 GrpcExecutor::ThreadMain(void*)  (in bazel) + 550  [0x106fbce36]
    +         2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950553: resolver-executor
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 GrpcExecutor::ThreadMain(void*)  (in bazel) + 550  [0x106fbce36]
    +         2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950554: grpc_global_timer
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 timer_thread(void*)  (in bazel) + 895  [0x106fc9dff]
    +         2796 gpr_cv_wait  (in bazel) + 102  [0x106fe8756]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950566
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (blaze::BlazeServer::*)(), blaze::BlazeServer*> >(void*)  (in bazel) + 62  [0x106f1950e]
    +       2796 blaze::BlazeServer::CancelThread()  (in bazel) + 152  [0x106f073d8]
    +         2796 blaze_util::PosixPipe::Receive(void*, int, int*)  (in bazel) + 25  [0x1070ff1f9]
    +           2796 read  (in libsystem_kernel.dylib) + 10  [0x7fff7383281e]
    2796 Thread_77950583: grpc_global_timer
      2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
        2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
          2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
            2796 timer_thread(void*)  (in bazel) + 895  [0x106fc9dff]
              2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
                2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
                  2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]

Total number in stack (recursive counted multiple, when >=5):
        5       _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
        5       thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]

Sort by top of stack, same collapsed (when >= 5):
        __psynch_cvwait  (in libsystem_kernel.dylib)        11184
        __workq_kernreturn  (in libsystem_kernel.dylib)        2796
        poll  (in libsystem_kernel.dylib)        2796
        read  (in libsystem_kernel.dylib)        2796

jvm.out produced by SIGQUIT: https://gist.github.com/keith/3c77f7e49c108964596440a251c05929

It's worth noting that during this time Bazel is consuming 0% CPU, so it seems likely that it's blocked waiting on a response that has already timed out. Currently we're using a gRPC remote cache, with the default remote timeout of 60 seconds and the default value for --cpus.

What operating system are you running Bazel on?

macOS

What's the output of bazel info release?

3.3.0. We've heard reports from others that this reproduces on 3.2.0 as well. We have since updated to 3.4.1, but we don't hit this often enough to know yet whether it's an issue there too.

Have you found anything relevant by searching the web?

There are various related-sounding issues, but AFAICT they're all closed.

@segiddins
Contributor

segiddins commented Jul 15, 2020

We see this issue as well using an HTTP remote cache on Bazel 3.2.0.

@meisterT changed the title from "Bazel hangs indefinitely when building with cache" to "Bazel hangs indefinitely when building with remote cache" on Jul 16, 2020
@meisterT
Member

Cc @coeuvre

@meisterT
Member

I assume this is the stack trace of the blocking thread:

"skyframe-evaluator 12": awaiting notification on [0x0000000645000ca0]

	at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.6/Unknown Source)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:497)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
	at com.google.devtools.build.lib.remote.util.Utils.getFromFuture(Utils.java:56)
	at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:219)
	at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:331)
	at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:179)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:123)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:96)
	at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:39)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:62)
	at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:331)
	at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:779)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:925)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:896)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:128)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:80)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:419)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:897)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:300)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.6/Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.6/Unknown Source)
	at java.lang.Thread.run(java.base@11.0.6/Unknown Source)

@coeuvre
Member

coeuvre commented Jul 17, 2020

Currently, Bazel only triggers a timeout error when no bytes are downloaded/uploaded within --remote_timeout seconds.

In this case, I think it is still downloading bytes from the remote cache, but perhaps at a very slow network speed, so the timeout is never triggered and the build hangs for a long time.

Should we change this to trigger a timeout error if the download/upload hasn't finished within --remote_timeout seconds?
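
A minimal sketch of the two timeout semantics being discussed (illustrative only; TransferTimeoutSketch and its methods are made-up names, not Bazel's actual implementation). With the current behavior the clock resets whenever a chunk arrives, so a transfer that keeps trickling bytes never times out; the proposed behavior bounds the whole transfer:

import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: contrasts the "idle" timeout (current behavior) with an
// overall deadline (proposed behavior) for a single download/upload.
final class TransferTimeoutSketch {
  private final long timeoutMillis;
  private final boolean resetOnProgress;  // true = current behavior, false = proposed
  private final long startMillis = System.currentTimeMillis();
  private final AtomicLong lastProgressMillis = new AtomicLong(System.currentTimeMillis());

  TransferTimeoutSketch(long timeoutMillis, boolean resetOnProgress) {
    this.timeoutMillis = timeoutMillis;
    this.resetOnProgress = resetOnProgress;
  }

  void onChunkReceived() {
    // Only matters for the idle variant: every chunk pushes the deadline out again.
    lastProgressMillis.set(System.currentTimeMillis());
  }

  boolean isTimedOut() {
    long reference = resetOnProgress ? lastProgressMillis.get() : startMillis;
    return System.currentTimeMillis() - reference > timeoutMillis;
  }
}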

@keith
Member Author

keith commented Jul 20, 2020

FWIW, in our case the targets we've seen this happen with are definitely <10 MB, but I suppose there could be an intermittent issue with network performance. I'm definitely a +1 for changing --remote_timeout to time out when the request hasn't completed within that time, but even if we don't want to change that behavior, I think there should be some way to do this, since otherwise these builds seem to hang "forever".

@oquenchil added the team-Remote-Exec (Issues and PRs for the Execution (Remote) team), untriaged, and type: bug labels on Jul 22, 2020
@coeuvre
Member

coeuvre commented Aug 10, 2020

Bazel only triggers timeout error when there is no byte downloaded/uploaded within --remote_timeout seconds.

This is only true for the HTTP remote cache.

For the gRPC remote cache, Bazel does set a per-call timeout with withDeadlineAfter and will retry up to --remote_retries times, which defaults to 5.

In theory, the maximum waiting time for downloading a file/blob from the cache should be 60s * (1 + 5) = 360s by default. There must be something wrong in this case.
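
Roughly, the pattern looks like the following hedged sketch (grpc-java API; the helper and its names are invented for illustration and are not the actual GrpcCacheClient or Retrier code):

import io.grpc.CallOptions;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Illustrative only: each attempt carries its own deadline, and timed-out attempts
// are retried, so the total wait is bounded by remoteTimeout * (1 + remoteRetries).
final class DeadlineWithRetriesSketch {
  static <T> T callWithRetries(
      int remoteTimeoutSeconds, int remoteRetries, Function<CallOptions, T> blockingCall) {
    StatusRuntimeException lastTimeout = null;
    for (int attempt = 0; attempt <= remoteRetries; attempt++) {
      CallOptions perAttempt =
          CallOptions.DEFAULT.withDeadlineAfter(remoteTimeoutSeconds, TimeUnit.SECONDS);
      try {
        return blockingCall.apply(perAttempt);
      } catch (StatusRuntimeException e) {
        if (e.getStatus().getCode() != Status.Code.DEADLINE_EXCEEDED) {
          throw e;  // this sketch only retries timeouts
        }
        lastTimeout = e;
      }
    }
    // With the defaults (60s timeout, 5 retries) this is reached after ~360s at most.
    throw lastTimeout;
  }
}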

@keith
Member Author

keith commented Aug 17, 2020

Related #11957

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

Interesting. Is your remote cache behind a load balancer? If so, what kind?

@keith
Member Author

keith commented Aug 17, 2020

We're using Google RBE as a cache.

@segiddins
Contributor

I've seen it happen with an HTTP cache behind an AWS ELB, and with a gRPC cache not sitting behind anything. I can try with an unbalanced HTTP cache if that would be helpful?

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

Let me repost part of my PR description that may be relevant here:

We have seen cases where GCP's external TCP load balancer silently drops connections without telling the client. If this happens when the client is waiting for replies from the server (e.g., when all worker threads are in blocking 'execute' calls), then the client does not notice the connection loss and hangs.

In our testing, the client unblocked after ~2 hours without this change, although we are not sure why (maybe a default Linux Kernel timeout on TCP connections?). With this flag set to 10 minutes, and running builds over night, we see repeated 10-minute gaps in remote service utilization, which seems pretty clear evidence that this is the underlying problem.

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

While we think it's due to GCP's TCP load balancer, we technically can't rule out any other system on the network between my machine and the cluster. What is surprising to us is that we have seen it happen in the middle of a build. I let some builds run through last night, and we have clear evidence of ~10 hangs during that period, all of which recovered after exactly 10 minutes, with 10 minutes being exactly the keep-alive time I set on Bazel.

(Screenshot: used-executors-keepalive)

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

We think that some of the other spikes could also be caused by the same issue, just that Bazel happened to recover faster because it tried to make a remote call rather than just wait for replies.

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

I think the best case would be to get my PR merged, and then test with specific settings for the timeout and, ideally, monitor the length of the gaps and see if there's a correlation. I think that most clearly shows whether it's a similar problem or a different one.

What I did to debug this was to take heap dumps from Bazel and the server and analyze them with YourKit Profiler. YourKit supports searching for and inspecting individual objects, so I could clearly see that Bazel had an open connection and the server did not have an open connection. This asymmetry tipped me off that it was something in the network.

@EricBurnett

Ulf, please see https://docs.google.com/document/d/13vXJTFHJX_hnYUxKdGQ2puxbNM3uJdZcEO5oNGP-Nnc/edit#heading=h.z3810n36by5c - Jakob was investigating keepalives last year and concluded that gRPC keepalives were unworkable because there's also a limit on the number of consecutive pings, where that limit is "2" in practice.

I'm generally inclined to call this a bug - we (mostly Dave and Chi) have been chasing this steadily and have observed that the outstanding RPC can hang for longer than the RPC timeout, which is fairly clearly either a gRPC or a Bazel bug, since gRPC should have surfaced a timeout by then if nothing else. But the fact that it's also affecting HTTP caches could mean it's a Bazel bug? (Or a Netty bug, if both use Netty?) Either way, there's some concurrency problem where expected signals (RPC failures, timeouts) are not getting handled properly. And if we believe that timeouts are being mishandled and that it may be in the gRPC/Netty/Bazel stack somewhere, it may also be the case that connection terminations should be noticed but are being mishandled by the same bug.

@ulfjack
Contributor

ulfjack commented Aug 17, 2020 via email

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

Based on my professional opinion that that would be a completely absurd way to specify keep-alive ping behavior given what the purpose of a keep-alive ping is, I tried it with the gRPC Java implementation, and I can confirm that it sends PING packets at regular intervals (not just 2) and the server keeps replying to them, as I would expect. I also observe that gRPC seems to enforce an undocumented 10s minimum ping interval regardless of what's specified on the server-side (why???).

Screenshot of Wireshark which clearly shows 10s ping intervals:
(Screenshot: wireshark-grpc-ping)

I also tried to ping the server more aggressively than the server allows, and that does seem to result in a GOAWAY, as expected (based on my assumption that the implementation is dumb). Ideally, the server would tell the client about the minimum allowed keep-alive time rather than having to do that manually.

There's also @ejona86's comment on grpc/grpc-java#7237 which seems to indicate that the GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA setting may be misunderstood. I was not able to find more useful documentation about the setting - note that Java does not have such a setting.

My conclusion is that @buchgr's conclusion in the doc is incorrect, at least for the Java gRPC library. I have not tried it with other language implementations, but I would be surprised if they broke keep-alive pings in this way.
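
For reference, a rough sketch of this kind of client-side keepalive configuration with grpc-java (the endpoint and the exact values are placeholders, not the actual setup used here):

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

// Illustrative only: client-side HTTP/2 keepalive pings in grpc-java.
final class KeepaliveClientSketch {
  static ManagedChannel buildChannel() {
    return NettyChannelBuilder.forAddress("remote-cache.example.com", 443)
        .useTransportSecurity()
        .keepAliveTime(10, TimeUnit.SECONDS)     // send a PING after 10s without reads/writes
        .keepAliveTimeout(20, TimeUnit.SECONDS)  // treat the connection as dead if no PING ack
        .keepAliveWithoutCalls(true)             // keep pinging even when no RPC is in flight
        .build();
  }
}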

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

It is unclear from the doc whether @buchgr tested the actual behavior or came to the conclusion purely based on reading the docs. Also, after reading the comments more carefully, it looks like both @EricBurnett and @bergsieker expressed incredulity at gRPC keep-alive being defined that way, and @buchgr said he sent an email to the mailing list but never reported back.

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

I can't find the email @buchgr mentioned, but all discussions about keep-alive pings on the main gRPC mailing list seem to indicate that pings work exactly the way I would expect rather than having some random limit on how many pings can be sent by the client.

@ulfjack
Contributor

ulfjack commented Aug 17, 2020

Apologies for the many posts, but this is really important for us, because, as mentioned, this makes remote execution unusable in some scenarios, and I'd like to have a workaround / fix in sooner rather than later.

@ejona86

ejona86 commented Aug 18, 2020

As part of investigating grpc/grpc#17667 (not very useful of a reference) I discovered that, yes, C did apply the GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA setting to keepalive pings. And yes, that is absurd. It was "working as expected" by the C core team though, so any discussions with them probably did not lead anywhere. Bringing it to my attention at that point though would have helped. There are legitimate reasons for GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA related to BDP monitoring, but it caused problems to expose the option and the option should not have impacted keepalive. The very existence of the option is what was causing problems here, in a "tail wagging the dog" way. They have carved out some time to fix the implementation; it's not trivial for a variety of reasons.

Java should behave as you expect. The 10 second minimum is defined in https://github.com/grpc/proposal/blob/master/A8-client-side-keepalive.md . The Java documentation states that low values may be increased, but it does not define 10 seconds specifically. Very recently Java did receive something similar to GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA internally, but it isn't exposed as an option and it does not impact keepalive.

When gRPC clients receive a GOAWAY with too_many_pings they should increase their keepalive time by 2x. This is to reduce the impact when things are misconfigured, while not removing errors entirely so the user can be informed of the issue, as they wouldn't be getting their expected keepalive setting.

@ejona86

ejona86 commented Aug 18, 2020

It sounds like you are saying a TCP LB is being used (not L7).

You should be aware there is a 10-minute idle timeout: https://cloud.google.com/compute/docs/troubleshooting/general-tips#communicatewithinternet. But it doesn't seem you are hitting it. In order to avoid it you need a keepalive time of less than 10 minutes (say, 9 minutes). I'd suggest configuring this keepalive on the gRPC server.

You could also be hitting a default 30 second idle timeout. This value can be increased, but probably not above 10 minutes. Again, it would be best for the server to manage the keepalive here, as it is controlling the TCP proxy settings and can update them when things change.
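
A rough sketch of what that server-managed keepalive could look like with grpc-java (the port and the exact durations are illustrative; the 9-minute value just follows the "stay under the 10-minute idle timeout" advice above):

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Illustrative only: server-managed keepalive so idle connections are pinged before
// an intermediate load balancer's idle timeout can fire.
final class KeepaliveServerSketch {
  static Server startServer() throws IOException {
    return NettyServerBuilder.forPort(8980)            // port is a placeholder
        .keepAliveTime(9, TimeUnit.MINUTES)            // ping idle connections from the server side
        .keepAliveTimeout(20, TimeUnit.SECONDS)
        .permitKeepAliveTime(10, TimeUnit.SECONDS)     // smallest client ping interval we tolerate
        .permitKeepAliveWithoutCalls(true)             // don't GOAWAY clients that ping while idle
        // .addService(...) omitted in this sketch
        .build()
        .start();
  }
}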

@EricBurnett

@ejona86 thanks for those details! Unfortunate to hear it does in fact work this way, but good to have confirmation.

@ulfjack :

AIUI, Bazel currently doesn’t set a timeout on execute calls.

Interesting! We should also get that fixed. Bazel should add a timeout to all outgoing RPCs. Retry logic here is a little tricky due to the Execute/WaitExecution dance and the need for supporting arbitrarily long actions, but I think the result should be something like:

  • Bazel sets a timeout on all Execute and WaitExecution calls. Probably just the value of the remote RPC timeout flag, like everything else, for now? But it could go reasonably low (few minutes) if we also:
  • Correct the retrier to not count DEADLINE_EXCEEDED on Execute or WaitExecution against the maximum retries. That's an expected, normal case, meaning "please retry waiting". Right now it does, which requires raising max retries to support really long actions :(. (A rough sketch of this retry shape follows below.)
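
A hedged sketch of that retry shape (the class and helper names are invented; this is not Bazel's actual Retrier):

import io.grpc.Status;
import io.grpc.StatusRuntimeException;

// Illustrative only: DEADLINE_EXCEEDED on Execute/WaitExecution means "keep waiting"
// and does not consume the retry budget; other errors still count against it.
final class ExecuteRetrySketch {
  interface Attempt<T> {
    T call() throws StatusRuntimeException;
  }

  static <T> T waitForAction(Attempt<T> waitExecution, int maxRetries) {
    int retriesUsed = 0;
    while (true) {
      try {
        return waitExecution.call();
      } catch (StatusRuntimeException e) {
        if (e.getStatus().getCode() == Status.Code.DEADLINE_EXCEEDED) {
          continue;  // expected for long actions: re-issue WaitExecution with a fresh deadline
        }
        if (++retriesUsed > maxRetries) {
          throw e;  // genuine failures still respect the retry limit
        }
      }
    }
  }
}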

We could send app-level keep-alives from the server but I don’t see how that’s better than client-side pings.

I'd definitely recommend it. Client-side pings happen at the connection level, but different HTTP2 proxies may also apply idle timeouts to the carried streams IIUC. For Execute requests RBE intentionally sends a no-op streaming response once per minute, which has proven sufficient to keep streams alive in all networking environments our clients run in.

I'm not fully understanding the scenario you're running into with dropped connections, but this may also suffice to avoid your issue, with server-side changes only. That should be quick and easy for you to land.
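
A rough sketch of that server-side stream keepalive idea (illustrative only; the class name and scheduling details are invented, not RBE's implementation):

import com.google.longrunning.Operation;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative only: periodically re-send the current (unchanged) Operation so the
// Execute stream never looks idle to proxies. A real server must synchronize this
// with its normal response path, since StreamObserver is not thread-safe.
final class ExecuteStreamKeepaliveSketch {
  private static final ScheduledExecutorService SCHEDULER =
      Executors.newSingleThreadScheduledExecutor();

  static ScheduledFuture<?> start(StreamObserver<Operation> responses, Operation inProgress) {
    return SCHEDULER.scheduleAtFixedRate(() -> responses.onNext(inProgress), 1, 1, TimeUnit.MINUTES);
  }
}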

@buchgr
Contributor

buchgr commented Aug 18, 2020

I can't say that I still remember the details of this, but here are the contents of the e-mail:

My e-mail

Hey,

I'd like to ask a clarifying question regarding the gRPC keepalive spec for the following scenario: I have an active channel with one or more open server-streaming calls that do not send any messages for hours. The gRPC keepalive time is set to 30s. All other knobs are left at their defaults.

My understanding of the spec is that in this scenario HTTP/2 PING frames will be sent at the following times after the channel was opened: 30s, 60s, 6 minutes, 11 minutes, ...

I am thinking this because GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA defaults to 2 and GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS defaults to 5 minutes.

Is this correct?

Response from the gRPC team

Hi,

After 2 pings without data, we will NOT send any more pings. So if there is no data, two pings will be sent 5 minutes apart after which no pings will be sent.

@ulfjack
Contributor

ulfjack commented Aug 18, 2020

I am using a TCP load balancer rather than an HTTPS load balancer in this cluster because the HTTPS load balancer is significantly more expensive. We previously ran into the 30s idle timeout on the HTTPS load balancer but not the TCP load balancer. As far as I can tell, there is no way to set an idle timeout on an external TCP load balancer in GCP, and it's not clear from the GCP documentation whether there is an idle timeout in this case (it doesn't mention external TCP load balancers, and I also couldn't find a setting for this).

I'm also fairly sure that the connection was active at the time of the timeout, although I can't say this with complete certainty. If I'm right, then a server-side keep-alive ping isn't going to do diddly.

Why should this block merging PR #11957? The Java gRPC library doesn't have this bug, and it's an opt-in setting.

What's the purpose of setting a deadline on an execute call given that the client is just going to retry? This seems like a blatant abuse of functionality that is intended for something else. If we want to detect that the connection is dead, the right way to do that is to use keep-alives.

@coeuvre
Member

coeuvre commented Nov 6, 2020

All of these places can, in theory, throw an unchecked exception. By inserting code there that randomly throws RuntimeException, we can reproduce the hangs, and the stack traces are the same as what we get from users.

At this point, I believe the root cause of the hanging issue is that an unchecked exception was thrown during the RPC, so a Future never completes. It's hard to reproduce locally because the conditions are hard to meet, e.g.

In theory, the issue is fixed by the PRs. Can you help verify?
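
A minimal sketch of that failure mode and the defensive fix, using Guava futures (the class and method here are invented for illustration; the real code paths are the ones in the stack trace above, e.g. RemoteCache.waitForBulkTransfer):

import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.SettableFuture;

// Illustrative only: if the code that is supposed to complete a future throws an
// unchecked exception before set()/setException(), anyone blocking on the future
// (here, the thread seen in waitForBulkTransfer in the stack trace) waits forever.
final class IncompleteFutureSketch {
  static ListenableFuture<Void> transfer(Runnable chunkHandler) {
    SettableFuture<Void> result = SettableFuture.create();
    try {
      chunkHandler.run();      // stands in for the RPC / chunk-handling callback
      result.set(null);
    } catch (RuntimeException e) {
      result.setException(e);  // without this, `result` never completes and the caller hangs
    }
    return result;
  }
}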

bazel-io pushed a commit that referenced this issue Nov 10, 2020
Failed doing so, will cause Bazel hanging forever. This could be one of causes for #11782.

Closes #12426.

PiperOrigin-RevId: 341554621
@ulfjack
Contributor

ulfjack commented Nov 11, 2020

I built a Bazel binary from head including both #12422 and #12426, and I still see hangs.

@coeuvre
Member

coeuvre commented Nov 12, 2020

Where were the hangs? If it's stuck in remote execution, can you try with --experimental_remote_execution_keepalive?

@ulfjack
Contributor

ulfjack commented Nov 12, 2020

Maybe unrelated, but I see this error when I enable --experimental_remote_execution_keepalive:

ERROR: /home/ulfjack/Google/tensorflow/tensorflow/core/util/BUILD:364:24: Action tensorflow/core/util/version_info.cc failed: Exec failed due to IOException: /dev/null.remote (Permission denied)

I have had it running for 4 hours without any major hangs.

@ulfjack
Contributor

ulfjack commented Nov 12, 2020

Well, I spoke too soon. It hung itself up just after my previous reply. Output looks like this:

[8,281 / 11,167] 75 actions, 48 running
    Compiling tensorflow/core/kernels/training_ops.cc; 2945s remote
    Compiling tensorflow/core/kernels/list_kernels.cc; 2885s remote
    Compiling tensorflow/lite/toco/tflite/operator.cc; 2860s remote
    Compiling tensorflow/lite/toco/tflite/op_version.cc; 2860s remote
    Compiling tensorflow/lite/toco/tflite/import.cc; 2860s remote
    Compiling tensorflow/lite/toco/tflite/export.cc; 2860s remote
    Compiling .../compiler/mlir/tensorflow/ir/tf_ops_n_z.cc; 2860s remote
    Compiling .../lite/toco/logging/conversion_log_util.cc; 2860s remote ...

@philwo
Member

philwo commented Nov 12, 2020

Well, I spoke too soon. It hung itself up just after my previous reply. Output looks like this:

Oh no 😱

What does the stack trace say and is there anything interesting in the jvm.out? I think Chi added some more logging that might be useful for analysis.

@ulfjack
Contributor

ulfjack commented Nov 12, 2020

jvm.out is empty, but I didn't capture a stack trace. Will rerun and see if I can get one.

@bergsieker

Note that I was able to reproduce a hang somewhat regularly with illegal GO_AWAY behavior on the server side. I haven't tried a build with Chi's fixes yet, but if you're working with a server that might send GO_AWAY frames that's a possibility.

@ulfjack
Contributor

ulfjack commented Nov 12, 2020

The cluster I'm currently working with is not configured to send GO_AWAY frames.

@ulfjack
Contributor

ulfjack commented Nov 13, 2020

bazel-hanging-threads.txt

Attached a stack trace. It's possible I removed jvm.out before I looked at it.

I have been running Bazel in a loop since last night, and it got stuck 4 times.

@ulfjack
Contributor

ulfjack commented Nov 13, 2020

Nope, no output in jvm.out.

@coeuvre
Member

coeuvre commented Nov 16, 2020

Nope, no output in jvm.out.

The output from the server is stored at $output_base/java.log.

bazel-hanging-threads.txt

Attached a stack trace.

This stack trace shows that Bazel was stuck in remote execution calls (without --experimental_remote_execution_keepalive). This is different from the hangs reported by Keith and other users, whose stack traces showed that Bazel was stuck in remote cache calls. Both #12422 and #12426 fix hangs in remote cache calls.

Hangs in remote execution calls are explained by this design doc and should be fixed with --experimental_remote_execution_keepalive. Does the error from #11782 (comment) stop you from using this flag? If so, could you please share more details (e.g. jvm.log, java.log) so I can look into it?

@ulfjack
Contributor

ulfjack commented Nov 16, 2020

Hmm, OK. The stack trace is from stuck builds with --experimental_remote_execution_keepalive. I think this is a log file from one of those runs.

java.log.txt

@coeuvre
Member

coeuvre commented Nov 16, 2020

Interesting. Remote execution with --experimental_remote_execution_keepalive should use ExperimentalGrpcRemoteExecutor, but the stack traces show that Bazel is stuck in GrpcRemoteExecutor, which is not created when the flag is enabled. Could you please double-check that the hangs are from a build with --experimental_remote_execution_keepalive?

@ulfjack
Contributor

ulfjack commented Nov 16, 2020

Ok, new stack trace.

bazel-hangs-stacks.txt

@ulfjack
Contributor

ulfjack commented Nov 16, 2020

And a matching log file.

java.log.txt

@coeuvre
Member

coeuvre commented Nov 17, 2020

I can't find any hints in java.log.

Since both GrpcRemoteExecutor and ExperimentalGrpcRemoteExecutor use blocking gRPC calls, we don't have the risk we had for remote cache calls, where a future never completes and causes a hang (unless there is a bug inside gRPC). The new stack trace shows that Bazel is waiting for network data. What are the values of --remote_timeout and --remote_retries here? Is it possible we haven't yet reached the maximum time (--remote_timeout x (--remote_retries + 1))?

@coeuvre
Member

coeuvre commented Dec 9, 2020

Closing as the issue seems to be fixed. Feel free to re-open if it happens again.

@coeuvre closed this as completed on Dec 9, 2020
coeuvre added a commit to coeuvre/bazel that referenced this issue Jan 28, 2021
Failed doing so will cause gRPC hanging forever. This could be one of causes that leads to bazelbuild#11782.

Closes bazelbuild#12422.

PiperOrigin-RevId: 340995977
coeuvre added a commit to coeuvre/bazel that referenced this issue Jan 28, 2021
Failed doing so, will cause Bazel hanging forever. This could be one of causes for bazelbuild#11782.

Closes bazelbuild#12426.

PiperOrigin-RevId: 341554621

# Conflicts:
#	src/main/java/com/google/devtools/build/lib/remote/GrpcCacheClient.java
#	src/main/java/com/google/devtools/build/lib/remote/http/HttpCacheClient.java
@ulfjack
Contributor

ulfjack commented Apr 25, 2022

We're still seeing the issue with 6.0.0-pre.20220405.2 (we forgot to explicitly set --grpc_keepalive_time and it's still disabled by default).

@coeuvre
Member

coeuvre commented Apr 25, 2022

(we forgot to explicitly set --grpc_keepalive_time and it's still disabled by default).

Does that mean you don't have the issue if --grpc_keepalive_time is set? Can you try with --experimental_remote_execution_keepalive?

@wolfd
Contributor

wolfd commented Apr 27, 2022

Also seeing this happen with a gRPC-based remote cache. We also have a TCP load balancer in the loop, although it's on AWS rather than GCP. Adding --grpc_keepalive_time seems to help a lot: with the flag set to 10s, a clean build went from 1h42m to 8m.
