Bazel hangs indefinitely when building with remote cache #11782
We see this issue as well using an HTTP remote cache on bazel 3.2.0 |
Cc @coeuvre |
I assume this is the stack trace of the blocking thread:
|
Currently, Bazel only triggers a timeout error when no bytes have been downloaded/uploaded within `--remote_timeout`. In this case, I think it is still downloading bytes from the remote cache, but perhaps at a very slow network speed, so it never triggers the timeout and hence hangs for a long time. Should we change this to trigger a timeout error if the download/upload isn't finished within `--remote_timeout`? |
FWIW in our case the targets we've seen this happen with are definitely <10 MB, but I suppose there could be an intermittent issue with network performance. I'm definitely a +1 for changing `--remote_timeout` to apply to the whole transfer. |
This is only true for the HTTP remote cache. For the gRPC remote cache, Bazel did trigger the timeout with `--remote_timeout`. In theory, the maximum waiting time for downloading a file/blob from the cache should therefore be bounded. |
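To make the difference concrete, here is a minimal, illustrative Java sketch (not Bazel's actual `HttpCacheClient` code; all names and values are made up) contrasting the current progress-based timeout, which is re-armed whenever a chunk arrives, with the proposed deadline over the whole transfer. A slow-but-alive connection never trips the former, which matches the hang described above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class TimeoutPolicies {
  private static final long TIMEOUT_SECONDS = 60; // stand-in for --remote_timeout

  public static void main(String[] args) {
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    // Policy A (progress-based, as described above): an "idle" timer that is
    // re-armed every time a chunk arrives. A connection that trickles a few
    // bytes per minute never trips it, so the download can hang for a long time.
    AtomicReference<ScheduledFuture<?>> idleTimer = new AtomicReference<>(
        timer.schedule(() -> abort("no progress"), TIMEOUT_SECONDS, TimeUnit.SECONDS));
    Runnable onChunkReceived = () ->
        idleTimer.getAndSet(
                timer.schedule(() -> abort("no progress"), TIMEOUT_SECONDS, TimeUnit.SECONDS))
            .cancel(false);
    onChunkReceived.run(); // invoked for every received chunk

    // Policy B (the proposed change): a deadline on the whole transfer,
    // which fires even if bytes keep trickling in.
    CompletableFuture<byte[]> download = new CompletableFuture<>();
    download.orTimeout(TIMEOUT_SECONDS, TimeUnit.SECONDS);

    timer.shutdownNow();
  }

  private static void abort(String reason) {
    System.err.println("aborting download: " + reason);
  }
}
```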
Related #11957 |
Interesting. Is your remote cache behind a load balancer? If so, what kind? |
We're using google RBE as a cache |
I've seen it happen with an HTTP cache behind AWS ELB, and a gRPC cache not sitting behind anything. I can try with an unbalanced HTTP cache if that would be helpful? |
Let me repost part of my PR description that may be relevant here: We have seen cases where GCP's external TCP load balancer silently drops connections without telling the client. If this happens when the client is waiting for replies from the server (e.g., when all worker threads are in blocking 'execute' calls), then the client does not notice the connection loss and hangs. In our testing, the client unblocked after ~2 hours without this change, although we are not sure why (maybe a default Linux kernel timeout on TCP connections?). With this flag set to 10 minutes, and running builds overnight, we see repeated 10-minute gaps in remote service utilization, which seems like pretty clear evidence that this is the underlying problem. |
While we think it's due to GCP's TCP load balancer, we technically can't rule out any other system on the network between my machine and the cluster. What is surprising to us is that we have seen it happen in the middle of a build. I let some builds run through last night, and we have clear evidence of ~10 hangs during that period, all of which recovered after exactly 10 minutes, 10 minutes being exactly the keep-alive time I set in Bazel. |
We think that some of the other spikes could also be caused by the same issue, just that Bazel happened to recover faster because it tried to make a remote call rather than just wait for replies. |
I think the best case would be to get my PR merged, and then test with specific settings for the timeout and, ideally, monitor the length of the gaps and see if there's a correlation. I think that most clearly shows whether it's a similar problem or a different one. What I did to debug this was to take heap dumps from Bazel and the server and analyze them with YourKit Profiler. YourKit supports searching for and inspecting individual objects, so I could clearly see that Bazel had an open connection and the server did not have an open connection. This asymmetry tipped me off that it was something in the network. |
Ulf, please see https://docs.google.com/document/d/13vXJTFHJX_hnYUxKdGQ2puxbNM3uJdZcEO5oNGP-Nnc/edit#heading=h.z3810n36by5c - Jakob was investigating keepalives last year and concluded that gRPC keepalives were unworkable because there's also a limit on the number of consecutive pings, where that limit is "2" in practice. I'm generally inclined to call this a bug - we (mostly Dave and Chi) have been chasing this steadily and have observed that the outstanding RPC can hang for longer than the RPC timeout, which is fairly clearly either a gRPC or a bazel bug since gRPC should have surfaced a timeout by then if nothing else. But the fact that it's also affecting HTTP caches could mean it's a bazel bug? (Or a Netty bug, if both use netty?) Either way, there's some concurrency problem where expected signals (RPC failures, timeouts) are not getting handled properly. And if we believe that timeouts are being mishandled and that it may be in the gRPC/netty/bazel stack somewhere, it may also be the case that connection terminations should be noticed but are being mishandled by the same bug. |
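For reference, this is roughly what client-side keep-alives look like in grpc-java; a hedged sketch only, with an illustrative endpoint and intervals rather than Bazel's actual flags or defaults:

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

public class KeepAliveClientSketch {
  public static void main(String[] args) {
    ManagedChannel channel =
        NettyChannelBuilder.forAddress("remote-cache.example.com", 443) // illustrative endpoint
            .useTransportSecurity()
            // Send an HTTP/2 PING if the transport has seen no activity for this long.
            .keepAliveTime(10, TimeUnit.MINUTES)
            // Treat the connection as dead if the PING ack does not arrive in time.
            .keepAliveTimeout(20, TimeUnit.SECONDS)
            // Keep pinging even when no RPCs are currently active on the connection.
            .keepAliveWithoutCalls(true)
            .build();

    // ... create remote-execution / remote-cache stubs on `channel` here ...

    channel.shutdownNow();
  }
}
```

Note that, as discussed further down, grpc-java bumps very small `keepAliveTime` values up to a 10-second minimum, so aggressive intervals are silently raised.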
AIUI, Bazel currently doesn’t set a timeout on execute calls. The proposal
to cancel the call and retry is not feasible IMO as it doesn’t allow
distinguishing between actual cancellation and keep-alive cancellation. If
an action is really expensive, letting it run to completion if nobody is
going to use the result seems unacceptable.
We could send app-level keep-alives from the server but I don’t see how
that’s better than client-side pings. It’s actually worse because it’ll
require more network bandwidth (client-side pings are per connection rather
than per call). Did anyone discuss this with gRPC team? Because my research
seemed to indicate that they added the client-side keep-alives for this
purpose.
Also, we need a workaround now-ish because the current state of affairs is
that remote execution is literally unusable in some not uncommon scenarios.
Ideally, someone would fix the load balancers to send a TCP reset instead
of us working around the issue, but I won’t hold my breath.
|
Based on my professional opinion that this would be a completely absurd way to specify keep-alive ping behavior, given what the purpose of a keep-alive ping is, I tried it with the gRPC Java implementation, and I can confirm that it sends PING packets at regular intervals (not just 2) and the server keeps replying to them, as I would expect. I also observe that gRPC seems to enforce an undocumented 10s minimum ping interval regardless of what's specified on the server side (why???). Screenshot from Wireshark, which clearly shows 10s ping intervals: I also tried to ping the server more aggressively than the server allows, and that does seem to result in a GOAWAY, as expected (based on my assumption that the implementation is dumb). Ideally, the server would tell the client about the minimum allowed keep-alive time rather than the client having to configure it manually. There's also @ejona86's comment on grpc/grpc-java#7237, which seems to point in the same direction. My conclusion is that @buchgr's conclusion in the doc is incorrect, at least for the Java gRPC library. I have not tried it with other language implementations, but I would be surprised if they broke keep-alive pings in this way. |
It is unclear from the doc whether @buchgr tested the actual behavior or came to the conclusion purely based on reading the docs. Also, after reading the comments more carefully, it looks like both @EricBurnett and @bergsieker expressed incredulity at gRPC keep-alive being defined that way, and @buchgr said he sent an email to the mailing list but never reported back. |
I can't find the email @buchgr mentioned, but all discussions about keep-alive pings on the main gRPC mailing list seem to indicate that pings work exactly the way I would expect rather than having some random limit on how many pings can be sent by the client. |
Apologies for the many posts, but this is really important for us, because, as mentioned, this makes remote execution unusable in some scenarios, and I'd like to have a workaround / fix in sooner rather than later. |
As part of investigating grpc/grpc#17667 (not a very useful reference) I discovered that, yes, the C core did apply the GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA setting to keepalive pings. And yes, that is absurd. It was "working as expected" according to the C core team, though, so any discussions with them probably did not lead anywhere. Bringing it to my attention at that point would have helped, though. There are legitimate reasons for GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA related to BDP monitoring, but exposing the option caused problems, and the option should not have impacted keepalive. The very existence of the option is what was causing problems here, in a "tail wagging the dog" way. They have carved out some time to fix the implementation; it's not trivial for a variety of reasons. Java should behave as you expect. The 10-second minimum is defined in https://github.com/grpc/proposal/blob/master/A8-client-side-keepalive.md. The Java documentation states that low values may be increased, but it does not define 10 seconds specifically. Very recently, Java did receive something similar to GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA internally, but it isn't exposed as an option and it does not impact keepalive. When gRPC clients receive a GOAWAY with |
It sounds like you are saying a TCP LB is being used (not L7). You should be aware there is a 10-minute idle timeout: https://cloud.google.com/compute/docs/troubleshooting/general-tips#communicatewithinternet. But it doesn't seem you are hitting it. To avoid it, you need a keepalive time of less than 10 minutes (say, 9 minutes). I'd suggest configuring this keepalive on the gRPC server. You could also be hitting a default 30-second idle timeout. This value can be increased, but probably not above 10 minutes. Again, it would be best for the server to manage the keepalive here, as it controls the TCP proxy settings and can update them when things change. |
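Following the suggestion that the server should own the keep-alive settings, a hedged grpc-java sketch might look like the following (the port and intervals are illustrative assumptions, not values prescribed in this thread):

```java
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

public class KeepAliveServerSketch {
  public static void main(String[] args) throws Exception {
    Server server =
        NettyServerBuilder.forPort(8980) // illustrative; a real server would also addService(...)
            // Ping quiet connections before a ~10-minute TCP idle timeout can drop them.
            .keepAliveTime(9, TimeUnit.MINUTES)
            .keepAliveTimeout(20, TimeUnit.SECONDS)
            // Tolerate clients that ping at least this often, instead of answering
            // with GOAWAY ("too_many_pings").
            .permitKeepAliveTime(1, TimeUnit.MINUTES)
            .permitKeepAliveWithoutCalls(true)
            .build()
            .start();
    server.awaitTermination();
  }
}
```

Here the 9-minute `keepAliveTime` keeps quiet connections alive ahead of the ~10-minute idle timeout mentioned above, and the `permitKeepAlive*` settings decide when the server tolerates client pings rather than sending GOAWAY.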
@ejona86 thanks for those details! Unfortunate to hear it does in fact work this way, but good to have confirmation. @ulfjack :
Interesting! We should also get that fixed. Bazel should add a timeout to all outgoing RPCs. Retry logic here is a little tricky due to the Execute/WaitExecution dance and the need for supporting arbitrarily long actions, but I think the result should be something like:
I'd definitely recommend it. Client-side pings happen at the connection level, but different HTTP2 proxies may also apply idle timeouts to the carried streams IIUC. For Execute requests RBE intentionally sends a no-op streaming response once per minute, which has proven sufficient to keep streams alive in all networking environments our clients run in. I'm not fully understanding the scenario you're running into with dropped connections, but this may also suffice to avoid your issue, with server-side changes only. That should be quick and easy for you to land. |
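A rough illustration of that "no-op streaming response" idea in grpc-java (hypothetical class and method names, not RBE's implementation): while an Execute stream is open, a scheduler periodically emits a minimal `Operation` update so that proxies applying stream-level idle timeouts keep seeing traffic.

```java
import com.google.longrunning.Operation;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class ExecuteStreamKeepAlive {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // Start emitting a no-op update once per minute while the execution runs.
  ScheduledFuture<?> start(StreamObserver<Operation> responses, String operationName) {
    return scheduler.scheduleAtFixedRate(
        () -> {
          // Real code must serialize these writes with the normal response path;
          // the synchronized block here is a simplification for the sketch.
          synchronized (responses) {
            responses.onNext(Operation.newBuilder().setName(operationName).build());
          }
        },
        1, 1, TimeUnit.MINUTES);
  }

  void stop(ScheduledFuture<?> keepAlive) {
    keepAlive.cancel(false); // call when the execution finishes or the stream closes
  }
}
```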
I can't say that I still remember the details of this, but here are the contents of the e-mail: My e-mail
Response from the gRPC team
|
I am using a TCP load balancer rather than an HTTPS load balancer in this cluster because the HTTPS load balancer is significantly more expensive. We previously ran into the 30s idle timeout on the HTTPS load balancer but not the TCP load balancer. As far as I can tell, there is no way to set an idle timeout on an external TCP load balancer in GCP, and it's not clear from the GCP documentation whether there is an idle timeout in this case (it doesn't mention external TCP load balancers, and I also couldn't find a setting for this). I'm also fairly sure that the connection was active at the time of the timeout, although I can't say this with complete certainty. If I'm right, then a server-side keep-alive ping isn't going to do diddly. Why should this block merging PR #11957? The Java gRPC library doesn't have this bug, and it's an opt-in setting. What's the purpose of setting a deadline on an execute call given that the client is just going to retry? This seems like a blatant abuse of functionality that is intended for something else. If we want to detect that the connection is dead, the right way to do that is to use keep-alives. |
All these places can, in theory, throw an unchecked exception. By inserting code there that randomly throws an unchecked exception, I can reproduce the hang. At this point, I believe the root cause of the hanging issue is that an unchecked exception was thrown during the RPC, so a Future never completes (see the sketch after this comment). It's hard to reproduce locally because the conditions are hard to meet, e.g.
The issue is fixed by the PRs in theory. Can you help to verify? |
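To illustrate the suspected failure mode in isolation, here is a small self-contained sketch using Guava's `SettableFuture` (hypothetical names, not the actual Bazel code paths): if the code that is supposed to complete the future throws an unchecked exception before calling `set()`/`setException()`, whoever blocks on the future waits forever at 0% CPU, matching the hang reported above.

```java
import com.google.common.util.concurrent.SettableFuture;

public class HangingFutureDemo {
  public static void main(String[] args) throws Exception {
    SettableFuture<String> result = SettableFuture.create();

    Thread rpcHandler = new Thread(() -> {
      try {
        result.set(handleRpcResponse()); // never reached if handleRpcResponse() throws
      } catch (RuntimeException e) {
        result.setException(e);          // without this catch, `result` never completes
      }
    });
    rpcHandler.start();

    System.out.println(result.get());    // blocks forever if the future is never completed
  }

  // Stand-in for response handling that can fail unexpectedly.
  private static String handleRpcResponse() {
    if (Math.random() < 0.5) {
      throw new IllegalStateException("unexpected error while handling the RPC");
    }
    return "cache hit";
  }
}
```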
Where were the hangs? If it is stuck at remote execution, can you try with |
Maybe unrelated, but I see this error when I enable
I have had it running for 4 hours without any major hangs. |
Well, I spoke too soon. It hung itself up just after my previous reply. Output looks like this:
|
Oh no 😱 What does the stack trace say and is there anything interesting in the jvm.out? I think Chi added some more logging that might be useful for analysis. |
|
Note that I was able to reproduce a hang somewhat regularly with illegal GOAWAY behavior on the server side. I haven't tried a build with Chi's fixes yet, but if you're working with a server that might send GOAWAY frames, that's a possibility. |
The cluster I'm currently working with is not configured to send GOAWAY frames. |
Attached a stack trace. It's possible I removed the flag at some point. I ran Bazel in a loop since last night, and it got stuck 4 times. |
Nope, no output in |
output from server is stored at
This stack trace shows that Bazel was stuck at remote execution calls (without the flag). Hangs at remote execution calls are explained by this design doc and should be fixed with |
Hmm, ok. The stacktrace is from stuck builds with |
Interesting. Remote execution with |
Ok, new stack trace. |
And a matching log file. |
Can't find hints in the log. Since both |
Closing as the issue seems to be fixed. Feel free to re-open if it happens again. |
Failing to do so will cause gRPC to hang forever. This could be one of the causes leading to bazelbuild#11782. Closes bazelbuild#12422. PiperOrigin-RevId: 340995977
Failing to do so will cause Bazel to hang forever. This could be one of the causes of bazelbuild#11782. Closes bazelbuild#12426. PiperOrigin-RevId: 341554621 # Conflicts: # src/main/java/com/google/devtools/build/lib/remote/GrpcCacheClient.java # src/main/java/com/google/devtools/build/lib/remote/http/HttpCacheClient.java
We're still seeing the issue with |
Does that mean you don't have the issue if |
Also seeing this happen with a gRPC-based remote-cache. We also have a TCP load balancer in the loop, although it's on AWS rather than GCP. Adding |
We intermittently see an issue where bazel will hang indefinitely while building a target:
Seemingly this happens while trying to fetch it from the remote cache. In these cases, once we kill the Bazel process, no execution logs or Chrome trace logs are produced, so I don't have that info, but when I `sample` the process on macOS, or `kill -QUIT PID` it, I get these logs:
sample log:
jvm.out produced by SIGQUIT: https://gist.github.com/keith/3c77f7e49c108964596440a251c05929
It's worth noting that during this time Bazel is consuming 0% CPU, so it seems likely that it's locked waiting on a response that has already timed out. Currently we're using a gRPC remote cache, with the default remote timeout of 60 seconds, and the default value for `--cpus`.
What operating system are you running Bazel on?
macOS
What's the output of `bazel info release`?
3.3.0. We've heard reports from others that this repros on 3.2.0 as well. We have since updated to 3.4.1, but we don't see this often enough to know whether it's an issue there yet as well.
Have you found anything relevant by searching the web?
There are various related-sounding issues, but AFAICT they're all closed.