Add --grpc_keepalive_time/grpc_keepalive_timeout #11957
Closed
Conversation
This allows controlling keep-alive pings in the gRPC library. We have seen cases where GCP's external TCP load balancer silently drops connections without telling the client. If this happens when the client is waiting for replies from the server (e.g., when all worker threads are in blocking 'execute' calls), then the client does not notice the connection loss and hangs.

In our testing, the client unblocked after ~2 hours without this change, although we are not sure why (maybe a default Linux kernel timeout on TCP connections?). With this flag set to 10 minutes, and running builds overnight, we see repeated 10-minute gaps in remote service utilization, which is fairly clear evidence that this is the underlying problem.

The gRPC library has two settings: keep-alive time and keep-alive timeout. The keep-alive time is how long to wait after the most recently received packet before sending a keep-alive ping, and the keep-alive timeout is how long to wait for a reply before concluding that the connection is dead. The default keep-alive timeout is 20s, but the default keep-alive time is infinity (i.e., keep-alive pings are disabled).

The gRPC documentation also says to ask the service owner before enabling keep-alive pings, out of concern that they could generate a lot of hidden traffic. I don't immediately see how this concern applies to the REAPI, and the REAPI also does not have a single service owner. For now, I'm making this opt-in.

This is difficult to test automatically because there is no easy way to drop a TCP connection without telling the other endpoint (for good reason!).

Change-Id: I5d59737a21515b5d70c13cbdd5037f0a434ec74f
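As a sketch of how the two flags might be used together (the flag names come from this PR's title; the values are illustrative, not recommendations):

```
# .bazelrc (illustrative values): send a keep-alive ping after 10 minutes
# of silence, and declare the connection dead if no reply arrives in 20s.
build --grpc_keepalive_time=10m
build --grpc_keepalive_timeout=20s
```

Since keep-alive pings are opt-in, omitting --grpc_keepalive_time leaves the default behavior (no pings) unchanged.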
Seems like this might help with #11782 (although there's a mention of the same issue being seen with HTTP as well).
ulfjack added a commit to ulfjack/bazel that referenced this pull request on Aug 17, 2020:
Bazel currently performs a full findMissingBlob call for every input of every action that is not an immediate cache hit. This causes a significant amount of network traffic as well as server load (unless the server caches blob existence as well). In particular, for a build of TensorFlow w/ remote execution over a slow-ish network, we see multi-minute delays on actions due to these findMissingBlob calls, and the client is not able to saturate a 50-machine cluster.

This change carefully de-duplicates findMissingBlob calls as well as file uploads to the remote execution cluster. It is based on a previous simpler change that did not de-duplicate *concurrent* calls. However, we have found that concurrent calls are quite common - i.e., an action completes that unblocks a large number of actions that have significant overlaps in their input sets (e.g., protoc link completion unblocks all protoc compile actions), and we were still seeing multi-minute delays on action execution.

With this change, a TensorFlow build w/ remote execution is down to ~20 minutes on a 50-machine cluster, and is able to fully saturate the cluster (apart from an issue with connection loss, see PR bazelbuild#11957).

Change-Id: Ic39347a7a7a8dc7cfd463d78f0a80e0d26a970bc
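A minimal sketch (hypothetical; not Bazel's actual implementation) of the concurrent de-duplication idea described in that commit message: callers asking about the same blob digest while a lookup is already in flight share a single future instead of each issuing their own RPC. The class and method names here are invented for illustration.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of de-duplicating concurrent existence checks.
// `backend` stands in for the real findMissingBlob-style remote call.
final class DedupingExistenceCheck {
    private final Map<String, CompletableFuture<Boolean>> inFlight = new ConcurrentHashMap<>();
    private final Function<String, CompletableFuture<Boolean>> backend;

    DedupingExistenceCheck(Function<String, CompletableFuture<Boolean>> backend) {
        this.backend = backend;
    }

    CompletableFuture<Boolean> exists(String digest) {
        // All concurrent callers for the same digest get the same future,
        // so only one remote call is issued while it is in flight.
        CompletableFuture<Boolean> f = inFlight.computeIfAbsent(digest, backend);
        // Once the shared call resolves, forget it so a later miss re-queries.
        // remove(key, value) only evicts if the entry is still this future.
        f.whenComplete((result, err) -> inFlight.remove(digest, f));
        return f;
    }
}
```

The two-argument `remove(key, value)` guards against accidentally evicting a newer in-flight call for the same digest that started after this one completed.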
Merging.
coeuvre approved these changes on Aug 20, 2020.
Ping?
I have imported and it's still under review.
ulfjack added a commit to ulfjack/bazel that referenced this pull request on Sep 2, 2020.
ulfjack added a commit to ulfjack/bazel that referenced this pull request on Feb 18, 2021.
ulfjack added a commit to EngFlow/bazel that referenced this pull request on Mar 5, 2021.
ulfjack added a commit to EngFlow/bazel that referenced this pull request on Mar 5, 2021.
ulfjack added a commit to EngFlow/bazel that referenced this pull request on May 26, 2021.