
Add --grpc_keepalive_time/grpc_keepalive_timeout #11957

Closed · wants to merge 1 commit

Conversation

ulfjack (Contributor) commented Aug 17, 2020

This allows controlling keep-alive pings in the gRPC library. We have
seen cases where GCP's external TCP load balancer silently drops
connections without telling the client. If this happens when the client
is waiting for replies from the server (e.g., when all worker threads
are in blocking 'execute' calls), then the client does not notice the
connection loss and hangs.

In our testing, the client unblocked after ~2 hours without this change,
although we are not sure why (perhaps a default Linux kernel timeout on
TCP connections?). With this flag set to 10 minutes, and running builds
overnight, we saw repeated 10-minute gaps in remote service utilization,
which is fairly clear evidence that this is the underlying problem.

The gRPC library has two settings: keep-alive time and keep-alive
timeout. The keep-alive time is the time to wait after the most recent
received packet before sending a keep-alive ping, and the keep-alive
timeout is the time to wait for a reply before concluding that the
connection is dead. The default keep-alive timeout is 20s, but the
default keep-alive time is infinity, i.e., keep-alive pings are
disabled.

The gRPC documentation also says to ask the service owner before enabling
keep-alive pings, out of concern that they could generate a lot of
hidden traffic. I don't immediately see how this concern applies to the
REAPI, and the REAPI also does not have a single service owner. For now,
I'm making this opt-in.
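Since the flags are opt-in, enabling them might look like the following. The flag names are from this PR; the endpoint and the flag values (including the duration syntax) are illustrative assumptions, not prescribed defaults:

```shell
# Hypothetical invocation: send a keep-alive ping after 10 minutes of
# inactivity, and declare the connection dead if no reply arrives
# within 20 seconds. Endpoint and values are examples only.
bazel build \
  --remote_executor=grpcs://remote.example.com \
  --grpc_keepalive_time=10m \
  --grpc_keepalive_timeout=20s \
  //...
```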

This is difficult to test automatically because there is no easy way to
drop a TCP connection without telling the other endpoint (for good
reason!).

Change-Id: I5d59737a21515b5d70c13cbdd5037f0a434ec74f

ulfjack (Contributor, Author) commented Aug 17, 2020

@coeuvre @buchgr @meisterT

keith (Member) commented Aug 17, 2020

Seems like this might help with #11782 (although that issue has reportedly also been seen with HTTP)

ulfjack added a commit to ulfjack/bazel that referenced this pull request Aug 17, 2020
Bazel currently performs a full findMissingBlob call for every input of every
action that is not an immediate cache hit. This causes a significant amount
of network traffic as well as server load (unless the server caches blob
existence as well).

In particular, for a build of TensorFlow w/ remote execution over a slow-ish
network, we see multi-minute delays on actions due to these findMissingBlob
calls, and the client is not able to saturate a 50-machine cluster.

This change carefully de-duplicates findMissingBlob calls as well as file
uploads to the remote execution cluster. It is based on a previous simpler
change that did not de-duplicate *concurrent* calls. However, we have found
that concurrent calls are quite common - i.e., an action completes that
unblocks a large number of actions that have significant overlaps in their
input sets (e.g., protoc link completion unblocks all protoc compile
actions), and we were still seeing multi-minute delays on action execution.

With this change, a TensorFlow build w/ remote execution is down to ~20
minutes on a 50-machine cluster, and is able to fully saturate the cluster
(apart from an issue with connection loss, see PR bazelbuild#11957).

Change-Id: Ic39347a7a7a8dc7cfd463d78f0a80e0d26a970bc
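The concurrent de-duplication described in the commit message above can be sketched as follows. This is an illustrative Python sketch, not Bazel's actual (Java) implementation; the `MissingBlobChecker` class and `check_fn` callback are hypothetical names. The idea: concurrent callers asking about the same digest share one in-flight lookup instead of each issuing their own `findMissingBlob` call.

```python
import threading
from concurrent.futures import Future

class MissingBlobChecker:
    """Shares one in-flight existence check per digest among concurrent callers."""

    def __init__(self, check_fn):
        self._check_fn = check_fn   # backend call: digest -> bool ("is missing?")
        self._lock = threading.Lock()
        self._inflight = {}         # digest -> Future being resolved by one caller

    def is_missing(self, digest):
        with self._lock:
            future = self._inflight.get(digest)
            if future is None:
                # First caller for this digest: register a future and do the call.
                future = Future()
                self._inflight[digest] = future
                owner = True
            else:
                # Another caller is already checking this digest; wait on its result.
                owner = False
        if owner:
            try:
                result = self._check_fn(digest)
            except BaseException as e:
                future.set_exception(e)
            else:
                future.set_result(result)
            finally:
                with self._lock:
                    del self._inflight[digest]
        return future.result()
```

The actual change also de-duplicates the file uploads themselves, and remembers completed results; this sketch shows only the concurrent-call sharing that the earlier, simpler change lacked.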
coeuvre (Member) commented Aug 20, 2020

Merging.

ulfjack (Contributor, Author) commented Aug 24, 2020

Ping?

coeuvre (Member) commented Aug 25, 2020

I have imported it, and it's still under review.

bazel-io closed this in 761c162 on Aug 27, 2020
ulfjack added a commit to ulfjack/bazel that referenced this pull request Sep 2, 2020
@ulfjack ulfjack deleted the keep-alive-pings branch September 10, 2020 19:05
ulfjack added a commit to ulfjack/bazel that referenced this pull request Feb 18, 2021
ulfjack added a commit to EngFlow/bazel that referenced this pull request Mar 5, 2021
ulfjack added a commit to EngFlow/bazel that referenced this pull request Mar 5, 2021
ulfjack added a commit to EngFlow/bazel that referenced this pull request May 26, 2021