add --experimental_remote_execute_timeout #12231
@jmmv for visibility
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info.
47a41f4 to 5dd1fb5
I posted about my personal preference here: bazelbuild/remote-apis#57 (comment)
(Unfortunately, I have not had time to implement something like that yet, and it also requires server-side support.)
I've also documented some of my previous explorations here: #11782
I added support for client-side keep-alive here: #11957
It might be worth checking whether enabling keep-alive helps with the problems you're seeing. Is that something you could try? If it does help, then you might be running into the same problem I've seen before. I still suspect that GCP's load balancer is silently dropping connections, but I can't confirm either way.
@coeuvre Is this useful / relevant to your recent work on the remote execution timeouts? 🤔
Hi Philipp, thanks for the mention. This PR tries to solve two problems:
My recent work is primarily focused on problem 2, and my initial idea was the same as this PR's: add a hard timeout to remote execution calls. However, after discussing it with Jakob internally, I believe this is not a good idea, since execution times for actions vary and it's hard to pick an effective value. I am working on a design doc for another way to fix this and will publish it once it's finished.
We were trying to address operation hangs by enforcing gRPC timeouts on both sides (bazel & buildfarm). The motivation now:
Thanks for this advice. I was having trouble getting it to work predictably with buildfarm. I might have to change buildfarm's keepalive settings, or see if it's related to netty or something. This might be a better way to prevent connections from staying open. It may give less granularity on particular calls, and I'm not sure whether the same problem would occur when a server always keeps connections alive. Will reply as I learn more.
Makes sense. Our intention was to set the timeout to the longest-running test execution plus some buffer time.
@luxe the BuildFarm source code on GitHub says that keep-alive isn't allowed by the server. It's a bit odd that gRPC's ServerBuilder class (https://grpc.github.io/grpc-java/javadoc/io/grpc/ServerBuilder.html) doesn't have a method for this. I think BuildFarm is actually using NettyServerBuilder, which has a
@ulfjack Can you provide a link to the comment/code you're referring to here? I'd like to restart consideration of this change. Since it is experimental and disabled by default, I see no reason why it shouldn't be available. @luxe Please rebase and resolve conflicts to enable further consideration.
Looks like there is already an
Hello @luxe, could you please have a look at the Buildkite presubmit check failures? Is there any update on this PR? Thanks!
There are also Closing.
Problem:
Bazel remote executions can experience indefinite wait times due to an unfulfilled remote execution request. This situation can happen for two reasons. First, the remote server may not honor the action timeout provided (this is more often a bug in the server's implementation as it violates the assumptions of REAPI). Second, an action timeout may not be specified and the remote server does not impose one (though it should).
Here is what you would see in these situations:
Solution:
In both cases, we can blame the server for this, and claim that bazel is operating as it should. However, we want to guarantee to our bazel users that their builds will not hang indefinitely. This is particularly important across CI jobs, where the specific build invocation information is not being monitored and a hung build can delay other jobs due to a much longer CI timeout. If we could provide a timeout for remote executions on the client side, we would detect these issues sooner, and even allow bazel to retry using the existing retry mechanism (--remote_retries).
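To illustrate the idea (this is not the code in this PR), here is a minimal sketch of a client-side deadline combined with a bounded number of retries, using only the JDK. The class and method names (`ExecuteWithDeadline`, `executeWithDeadline`) are hypothetical, and the "remote call" is simulated with a `Callable`. In grpc-java, a per-call deadline would typically be set on the stub with `withDeadlineAfter(timeout, unit)` rather than with a `Future`.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class ExecuteWithDeadline {

  /**
   * Runs {@code call} with a client-side deadline. If an attempt exceeds the
   * deadline it is cancelled and retried, up to {@code maxRetries} extra times.
   * Hypothetical helper for illustration; not part of Bazel.
   */
  public static <T> T executeWithDeadline(Callable<T> call, Duration timeout, int maxRetries)
      throws Exception {
    if (maxRetries < 0) {
      throw new IllegalArgumentException("maxRetries must be >= 0");
    }
    // Daemon threads so a hung attempt can never keep the JVM alive.
    ExecutorService pool = Executors.newCachedThreadPool(r -> {
      Thread t = new Thread(r);
      t.setDaemon(true);
      return t;
    });
    try {
      TimeoutException last = null;
      for (int attempt = 0; attempt <= maxRetries; attempt++) {
        Future<T> future = pool.submit(call);
        try {
          // The deadline: stop waiting once the timeout elapses.
          return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
          future.cancel(true); // interrupt the hung attempt, then retry
          last = e;
        } catch (ExecutionException e) {
          // Non-timeout failures are surfaced to the caller unchanged.
          Throwable cause = e.getCause();
          if (cause instanceof Exception) {
            throw (Exception) cause;
          }
          throw e;
        }
      }
      throw last; // every attempt timed out
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulated server: hangs on the first request, answers the second.
    AtomicInteger attempts = new AtomicInteger();
    String result = executeWithDeadline(() -> {
      if (attempts.incrementAndGet() == 1) {
        Thread.sleep(60_000);
      }
      return "ok";
    }, Duration.ofMillis(200), 2);
    System.out.println(result + " after " + attempts.get() + " attempts");
  }
}
```

Without the deadline, the first simulated request would block the caller for the full duration of the hang; with it, the hang is detected after the timeout and the retry gets a chance to succeed.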
Code:
We've added a new remote flag called --experimental_remote_execute_timeout.
A deadline is imposed on the gRPC execute call only when this flag is used.
This is the result: