Make webhook-based scale race-free #1477
This prevents a race condition in the webhook-based autoscaler that occurred when it received a webhook event while still processing another one, and both ended up scaling up the same horizontal runner autoscaler. Ref #1321
74e9a06 to 6e89a35
@toast-gear Thanks for your review! I've addressed everything in the recent commits.
I've deployed this to our staging environment. So far it's looking good 👍
We're having an intermittent issue where there is a large delay in scaling up, though it does seem to happen eventually. It seems to occur when some other jobs are running; for some reason the controller appears to wait for those jobs/runners to finish before actioning the scale-up from the webhook.
edit: I'll turn on debug logging and see if I can get anything useful.
edit2: Here it decides to unstick itself and start 31 runners at once. Particularly this line.
edit3: This may have been fixed by me setting
edit4: It was just the scale-down delay blocking scale-up 🤷 All is good for me now.
@cablespaghetti Thanks for your detailed report!
You seem to have discovered a new and unrelated bug while testing this! Thank you so much. I'll try to fix it asap and include it in the upcoming 0.25.0.
I dug deeper and I'm still unsure what was going on.
@cablespaghetti Can you confirm that your story was like this: you first maxed out the number of runners at ARC ensures that the completed runner pods are recreated only after RunnerDeployment and RunnerReplicaSet's
Yes, I think setting maxReplicas might be the root cause of our problem then. Thanks for investigating 👍
@cablespaghetti Thanks a lot for reporting & confirming! I believe #1568 fixes the issue. I'd appreciate it if you could give it a try 🙏 Both this PR and #1568 have been merged, so you may want to build a canary version of ARC from our current main branch if you get the chance.
@mumoshu Sorry, just saw this. I'll run 0.25.0 and report back if we have a repeat of the problem. Thanks for being so responsive in fixing bugs!
This prevents a race condition in the webhook-based autoscaler when it receives a webhook event while processing another webhook event and both end up scaling up the same horizontal runner autoscaler at the same time.
Ref #1321