
Add --grpc_keepalive_time/grpc_keepalive_timeout #11957

Closed · wants to merge 1 commit

Conversation

ulfjack (Contributor) commented Aug 17, 2020

This allows controlling keep-alive pings in the gRPC library. We have
seen cases where GCP's external TCP load balancer silently drops
connections without telling the client. If this happens when the client
is waiting for replies from the server (e.g., when all worker threads
are in blocking 'execute' calls), then the client does not notice the
connection loss and hangs.

In our testing, the client unblocked after ~2 hours without this change,
although we are not sure why (perhaps a default Linux kernel timeout on
TCP connections?). With this flag set to 10 minutes, and running builds
overnight, we saw repeated 10-minute gaps in remote service utilization,
which is fairly clear evidence that this is the underlying problem.

The gRPC library has two settings: keep-alive time and keep-alive
timeout. The keep-alive time is the time to wait after the most recent
received packet before sending a keep-alive ping, and the keep-alive
timeout is the time to wait for a reply before concluding that the
connection is dead. The default keep-alive timeout is 20s, but the
default keep-alive time is infinity, i.e., keep-alive pings are
disabled.

The gRPC documentation also says to ask the service owner before enabling
keep-alive pings, out of concern that they could generate a lot of
hidden traffic. I don't immediately see how this concern applies to the
REAPI, and the REAPI also does not have a single service owner. For now,
I'm making this opt-in.
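Since the flags are opt-in, enabling them might look like the following. The flag names are from this PR; the endpoint and the flag values (including the duration syntax) are illustrative assumptions, not prescribed defaults:

```shell
# Hypothetical invocation: send a keep-alive ping after 10 minutes of
# inactivity, and declare the connection dead if no reply arrives
# within 20 seconds. Endpoint and values are examples only.
bazel build \
  --remote_executor=grpcs://remote.example.com \
  --grpc_keepalive_time=10m \
  --grpc_keepalive_timeout=20s \
  //...
```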

This is difficult to test automatically because there is no easy way to
drop a TCP connection without telling the other endpoint (for good
reason!).

Change-Id: I5d59737a21515b5d70c13cbdd5037f0a434ec74f

ulfjack (Contributor, Author) commented Aug 17, 2020

@coeuvre @buchgr @meisterT

keith (Member) commented Aug 17, 2020

Seems like this might help with #11782 (although that issue has reportedly also been seen with HTTP)

ulfjack added a commit to ulfjack/bazel that referenced this pull request Aug 17, 2020
Bazel currently performs a full findMissingBlob call for every input of every
action that is not an immediate cache hit. This causes a significant amount
of network traffic as well as server load (unless the server caches blob
existence as well).

In particular, for a build of TensorFlow w/ remote execution over a slow-ish
network, we see multi-minute delays on actions due to these findMissingBlob
calls, and the client is not able to saturate a 50-machine cluster.

This change carefully de-duplicates findMissingBlob calls as well as file
uploads to the remote execution cluster. It is based on a previous simpler
change that did not de-duplicate *concurrent* calls. However, we have found
that concurrent calls are quite common - i.e., an action completes that
unblocks a large number of actions that have significant overlaps in their
input sets (e.g., protoc link completion unblocks all protoc compile
actions), and we were still seeing multi-minute delays on action execution.

With this change, a TensorFlow build w/ remote execution is down to ~20
minutes on a 50-machine cluster, and is able to fully saturate the cluster
(apart from an issue with connection loss, see PR bazelbuild#11957).

Change-Id: Ic39347a7a7a8dc7cfd463d78f0a80e0d26a970bc
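The concurrent de-duplication described in the commit message above can be sketched as follows. This is an illustrative Python sketch, not Bazel's actual (Java) implementation; the `MissingBlobChecker` class and `check_fn` callback are hypothetical names. The idea: concurrent callers asking about the same digest share one in-flight lookup instead of each issuing their own `findMissingBlob` call.

```python
import threading
from concurrent.futures import Future

class MissingBlobChecker:
    """Shares one in-flight existence check per digest among concurrent callers."""

    def __init__(self, check_fn):
        self._check_fn = check_fn   # backend call: digest -> bool ("is missing?")
        self._lock = threading.Lock()
        self._inflight = {}         # digest -> Future being resolved by one caller

    def is_missing(self, digest):
        with self._lock:
            future = self._inflight.get(digest)
            if future is None:
                # First caller for this digest: register a future and do the call.
                future = Future()
                self._inflight[digest] = future
                owner = True
            else:
                # Another caller is already checking this digest; wait on its result.
                owner = False
        if owner:
            try:
                result = self._check_fn(digest)
            except BaseException as e:
                future.set_exception(e)
            else:
                future.set_result(result)
            finally:
                with self._lock:
                    del self._inflight[digest]
        return future.result()
```

The actual change also de-duplicates the file uploads themselves, and remembers completed results; this sketch shows only the concurrent-call sharing that the earlier, simpler change lacked.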
coeuvre (Member) commented Aug 20, 2020

Merging.

ulfjack (Contributor, Author) commented Aug 24, 2020

Ping?

coeuvre (Member) commented Aug 25, 2020

I have imported it, and it's still under review.

bazel-io closed this in 761c162 on Aug 27, 2020
ulfjack added a commit to ulfjack/bazel that referenced this pull request Sep 2, 2020
@ulfjack ulfjack deleted the keep-alive-pings branch September 10, 2020 19:05
ulfjack added a commit to ulfjack/bazel that referenced this pull request Feb 18, 2021
ulfjack added a commit to EngFlow/bazel that referenced this pull request Mar 5, 2021
ulfjack added a commit to EngFlow/bazel that referenced this pull request Mar 5, 2021
ulfjack added a commit to EngFlow/bazel that referenced this pull request May 26, 2021