Make webhook-based scale race-free #1477

Merged (9 commits into master, Jun 27, 2022)

Conversation

@mumoshu (Collaborator) commented May 24, 2022

This prevents a race condition in the webhook-based autoscaler where it could receive a webhook event while still processing another one, and both would end up scaling up the same HorizontalRunnerAutoscaler at the same time.

Ref #1321
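
To illustrate the general idea, here is a minimal sketch of one way to make concurrent webhook-driven scale-ups race-free: each webhook handler only enqueues a scale request, and a single worker applies the accumulated requests per target sequentially. This is not the actual implementation in this PR; all names below (scaleRequest, batchScaler, the apply callback) are hypothetical.

```go
// Minimal sketch (hypothetical names, not ARC's real code): webhook handlers
// enqueue deltas, and one worker flushes them so that two events for the same
// HorizontalRunnerAutoscaler can never race against each other.
package main

import (
	"fmt"
	"sync"
	"time"
)

type scaleRequest struct {
	target string // e.g. namespace/name of the HorizontalRunnerAutoscaler
	delta  int    // +1 for a queued job, -1 for a completed job
}

type batchScaler struct {
	mu      sync.Mutex
	pending map[string]int                 // accumulated deltas per target
	apply   func(target string, delta int) // whatever actually patches the HRA
}

func newBatchScaler(apply func(string, int), interval time.Duration) *batchScaler {
	b := &batchScaler{pending: map[string]int{}, apply: apply}
	go func() {
		for range time.Tick(interval) {
			b.flush()
		}
	}()
	return b
}

// Add is called from each webhook handler; it only records the delta.
func (b *batchScaler) Add(req scaleRequest) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[req.target] += req.delta
}

// flush applies all accumulated deltas, one target at a time.
func (b *batchScaler) flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending = map[string]int{}
	b.mu.Unlock()
	for target, delta := range batch {
		if delta != 0 {
			b.apply(target, delta)
		}
	}
}

func main() {
	bs := newBatchScaler(func(target string, delta int) {
		fmt.Printf("patching %s by %+d\n", target, delta)
	}, 100*time.Millisecond)
	// Two near-simultaneous webhook events for the same HRA are coalesced.
	bs.Add(scaleRequest{target: "default/my-hra", delta: 1})
	bs.Add(scaleRequest{target: "default/my-hra", delta: 1})
	time.Sleep(300 * time.Millisecond)
}
```

Serializing the updates this way means two near-simultaneous workflow_job events for the same autoscaler result in one combined update instead of two conflicting ones.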

@mumoshu mumoshu added this to the v0.25.0 milestone May 24, 2022
@mumoshu mumoshu requested a review from toast-gear as a code owner May 24, 2022 02:27
@mumoshu mumoshu force-pushed the async-webhook-based-scale branch from 74e9a06 to 6e89a35 Compare May 24, 2022 11:46
@mumoshu mumoshu changed the title Make webhook-based scale operation asynchronous Make webhook-based scale race-free Jun 19, 2022
@mumoshu (Collaborator, Author) commented Jun 21, 2022

@toast-gear Thanks for your review! I've addressed everything in the recent commits.

toast-gear previously approved these changes Jun 21, 2022
@cablespaghetti commented

I've deployed this to our staging environment. So far it's looking good 👍

@cablespaghetti commented Jun 23, 2022

We're seeing an intermittent issue where there is a large delay in scaling up, though it does seem to happen eventually. It seems to occur when some other jobs are already running; for some reason the controller appears to wait for those jobs/runners to finish before actioning the scale-up from the webhook.

edit: I'll turn on debug logging and see if I can get anything useful

edit2:
Here's some log output from when it's running 3 or 4 runners with a whole lot of jobs waiting. It seems to know it needs something like 32 runners, but it doesn't scale up until, I think, some job finishes a few minutes later.

stucknotscaling.txt

Here it decides to unstick itself and starts 31 runners at once; this line in particular:

2022-06-23T13:37:08Z    DEBUG   actions-runner-controller.runnerreplicaset      Created replica(s)      {"runnerreplicaset": "actions-runner-controller/runner-amd64-5lktq", "lastSyncTime": "2022-06-23T13:28:20Z", "effectiveTime": "2022-06-23 13:28:53 +0000 UTC", "templateHashDesired": "89f86cdf5", "replicasDesired": 32, "replicasPending": 0, "replicasRunning": 1, "replicasMaybeRunning": 1, "templateHashObserved": ["89f86cdf5"], "created": 31}

decides-to-start-runners.txt

edit3: This may have been fixed by my setting scaleDownDelaySecondsAfterScaleOut: 30. I'll let you know...

edit4: It was just the scale-down delay blocking scale-up 🤷 All is good for me now.

@mumoshu (Collaborator, Author) commented Jun 27, 2022

@cablespaghetti Thanks for your detailed report!

> the scale-down delay blocking scale-up

You seem to have discovered a new, unrelated bug while testing this! Thank you so much. I'll try to fix it ASAP and include the fix in the upcoming 0.25.0.

@mumoshu mumoshu merged commit e2c8163 into master Jun 27, 2022
@mumoshu mumoshu deleted the async-webhook-based-scale branch June 27, 2022 09:31
@mumoshu (Collaborator, Author) commented Jun 27, 2022

I dug into it more deeply and I'm still unsure what was going on. The calculated desired replicas of 32 in the first log indicates that it had already passed through the scaleDownDelayAfterScaleOut check, so theoretically it shouldn't be due to scaleDownDelayAfterScaleOut 🤔

@mumoshu (Collaborator, Author) commented Jun 27, 2022

@cablespaghetti Can you confirm that your story went like this: you first maxed out the number of runners at 32, since you set maxReplicas: 32 in your HRA spec and received roughly 30 webhook events that triggered scale-ups? Then you had to wait until the scaleDownDelayAfterScaleOut passed before ARC finally "recreated" the completed pods to bring the number of active runners back to 32?

ARC ensures that completed runner pods are recreated only after the RunnerDeployment's and RunnerReplicaSet's effectiveTime or replicas field is updated by the HRA, so that it won't end up flapping (i.e. ARC recreates a pod, but then receives a workflow_job webhook event with status=completed, which decreases the desired replicas and therefore terminates the just-recreated pod). I think I found in ARC's codebase that RunnerDeployment doesn't propagate EffectiveTime to RunnerReplicaSet when replicas isn't updated, which might explain your issue...
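
For readers following along, here is a simplified, hypothetical sketch of the propagation behavior described above; it is not ARC's actual reconciler code, and the type and function names are made up for illustration. The point is the condition: if the RunnerReplicaSet is only updated when replicas differs, a newer effectiveTime alone never reaches it, so completed pods are not recreated until something else changes replicas.

```go
// Hypothetical sketch of the suspected propagation gap, not ARC's real code.
package main

import (
	"fmt"
	"time"
)

type spec struct {
	Replicas      int
	EffectiveTime *time.Time
}

// syncReplicaSet mimics a RunnerDeployment-style reconciler deciding whether
// to update its replica set. With propagateEffectiveTime=false it reproduces
// the suspected bug: an updated EffectiveTime is dropped whenever Replicas is
// unchanged. With true, EffectiveTime is propagated on its own.
func syncReplicaSet(rd, rrs *spec, propagateEffectiveTime bool) bool {
	if rd.Replicas != rrs.Replicas {
		rrs.Replicas = rd.Replicas
		rrs.EffectiveTime = rd.EffectiveTime
		return true
	}
	if propagateEffectiveTime && rd.EffectiveTime != nil &&
		(rrs.EffectiveTime == nil || rd.EffectiveTime.After(*rrs.EffectiveTime)) {
		rrs.EffectiveTime = rd.EffectiveTime
		return true
	}
	return false
}

func main() {
	now := time.Now()
	rd := &spec{Replicas: 32, EffectiveTime: &now} // HRA bumped EffectiveTime; replicas already at max
	rrs := &spec{Replicas: 32}                     // replica set never sees the new EffectiveTime

	fmt.Println("only-on-replicas-change, updated:", syncReplicaSet(rd, rrs, false)) // false: pods stay un-recreated
	fmt.Println("also-on-effectiveTime-change, updated:", syncReplicaSet(rd, rrs, true))
}
```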

@cablespaghetti commented

> @cablespaghetti Can you confirm that your story went like this: you first maxed out the number of runners at 32, since you set maxReplicas: 32 in your HRA spec and received roughly 30 webhook events that triggered scale-ups? Then you had to wait until the scaleDownDelayAfterScaleOut passed before ARC finally "recreated" the completed pods to bring the number of active runners back to 32?
>
> ARC ensures that completed runner pods are recreated only after the RunnerDeployment's and RunnerReplicaSet's effectiveTime or replicas field is updated by the HRA, so that it won't end up flapping (i.e. ARC recreates a pod, but then receives a workflow_job webhook event with status=completed, which decreases the desired replicas and therefore terminates the just-recreated pod). I think I found in ARC's codebase that RunnerDeployment doesn't propagate EffectiveTime to RunnerReplicaSet when replicas isn't updated, which might explain your issue...

Yes, I think setting maxReplicas might be the root cause of our problem then. Thanks for investigating 👍

@mumoshu (Collaborator, Author) commented Jun 28, 2022

@cablespaghetti Thanks a lot for reporting and confirming! I believe #1568 fixes the issue. I'd appreciate it if you could give it a try 🙏 Both this PR and #1568 have been merged, so you might want to build a canary version of ARC from our current main branch if you have a chance.

@cablespaghetti commented

@mumoshu Sorry, I just saw this. I'll run 0.25.0 and report back if we have a repeat of the problem. Thanks for being so responsive in fixing bugs!
