x/build/cmd/coordinator: sometimes TryBot builds fail to schedule a machine and get stuck even after quota becomes available #55947
Comments
Nice detective work. Infinite loop in checkDep (or whatever it is) looks more actionable than general slowness...
Change https://go.dev/cl/435496 mentions this issue:
The previous logic in checkDep had a small possibility of looping forever if the third call to maintner was okay, but all successive ones fail. I'm not sure if that's what happened in go.dev/issue/55947, but rewrite the code to avoid that possibility anyway.

For golang/go#55947.

Change-Id: I28cd14cf8aa82b80d446ec9dbc3b118d4ef8b0fc
Reviewed-on: https://go-review.googlesource.com/c/build/+/435496
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
We should wait until CL 435496 is deployed and see if this happens again. It's possible that CL is all that's needed to fix this problem.
We have learned that the overall problem of quota exhaustion still happens after coordinator restarts during a busy time; @prattmic and @heschi have made more progress on understanding that, but it's not the problem that this issue was tracking. I suspect this particular instance was really a rare case of
Consider the TryBot run on CL 431956. Most of its builds have finished, with the exception of linux-amd64-boringcrypto, linux-amd64-nounified, and linux-amd64-unified, which have been in the waiting_for_machine state for upwards of 20 hours.
Those TryBot runs may have been started at a busy time, but they aren't being scheduled even after the GCE pool capacity is available.
This is a problem for TryBot speed (meta tracking issue #17104).
CC @golang/release, @thanm, @cherrymui.