Make webhook-based scale race-free #1477

Merged (9 commits into master, Jun 27, 2022)

Conversation

@mumoshu (Collaborator) commented May 24, 2022

This prevents a race condition in the webhook-based autoscaler where it could receive a webhook event while still processing another one, and both would end up scaling up the same HorizontalRunnerAutoscaler at the same time.

Ref #1321
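
To illustrate the general idea, here is a minimal sketch of one way to make concurrent webhook-driven scale-ups race-free: each webhook handler only enqueues a scale request, and a single worker applies the accumulated requests per target sequentially. This is not the actual implementation in this PR; all names below (scaleRequest, batchScaler, the apply callback) are hypothetical.

```go
// Minimal sketch (hypothetical names, not ARC's real code): webhook handlers
// enqueue deltas, and one worker flushes them so that two events for the same
// HorizontalRunnerAutoscaler can never race against each other.
package main

import (
	"fmt"
	"sync"
	"time"
)

type scaleRequest struct {
	target string // e.g. namespace/name of the HorizontalRunnerAutoscaler
	delta  int    // +1 for a queued job, -1 for a completed job
}

type batchScaler struct {
	mu      sync.Mutex
	pending map[string]int                 // accumulated deltas per target
	apply   func(target string, delta int) // whatever actually patches the HRA
}

func newBatchScaler(apply func(string, int), interval time.Duration) *batchScaler {
	b := &batchScaler{pending: map[string]int{}, apply: apply}
	go func() {
		for range time.Tick(interval) {
			b.flush()
		}
	}()
	return b
}

// Add is called from each webhook handler; it only records the delta.
func (b *batchScaler) Add(req scaleRequest) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[req.target] += req.delta
}

// flush applies all accumulated deltas, one target at a time.
func (b *batchScaler) flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending = map[string]int{}
	b.mu.Unlock()
	for target, delta := range batch {
		if delta != 0 {
			b.apply(target, delta)
		}
	}
}

func main() {
	bs := newBatchScaler(func(target string, delta int) {
		fmt.Printf("patching %s by %+d\n", target, delta)
	}, 100*time.Millisecond)
	// Two near-simultaneous webhook events for the same HRA are coalesced.
	bs.Add(scaleRequest{target: "default/my-hra", delta: 1})
	bs.Add(scaleRequest{target: "default/my-hra", delta: 1})
	time.Sleep(300 * time.Millisecond)
}
```

Serializing the updates this way means two near-simultaneous workflow_job events for the same autoscaler result in one combined update instead of two conflicting ones.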

@mumoshu mumoshu added this to the v0.25.0 milestone May 24, 2022
@mumoshu mumoshu requested a review from toast-gear as a code owner May 24, 2022 02:27
@mumoshu mumoshu force-pushed the async-webhook-based-scale branch from 74e9a06 to 6e89a35 Compare May 24, 2022 11:46
@mumoshu mumoshu changed the title Make webhook-based scale operation asynchronous Make webhook-based scale race-free Jun 19, 2022
@mumoshu (Collaborator, Author) commented Jun 21, 2022

@toast-gear Thanks for your review! I've addressed everything in the recent commits.

toast-gear previously approved these changes Jun 21, 2022
@cablespaghetti commented

I've deployed this to our staging environment. So far it's looking good 👍

@cablespaghetti commented Jun 23, 2022

We're seeing an intermittent issue where there is a large delay in scaling up, though it does seem to happen eventually. It seems to occur when some other jobs are already running; for some reason the controller appears to wait for those jobs/runners to finish before actioning the scale-up from the webhook.

edit: I'll turn on debug logging and see if I can get anything useful

edit2:
Here's some log output from when it's running 3 or 4 runners with a whole lot of jobs waiting. It seems to know it needs something like 32 runners, but it doesn't scale up until, I think, some job finishes a few minutes later.

stucknotscaling.txt

Here it decides to unstick itself and starts 31 runners at once; this line in particular:

2022-06-23T13:37:08Z    DEBUG   actions-runner-controller.runnerreplicaset      Created replica(s)      {"runnerreplicaset": "actions-runner-controller/runner-amd64-5lktq", "lastSyncTime": "2022-06-23T13:28:20Z", "effectiveTime": "2022-06-23 13:28:53 +0000 UTC", "templateHashDesired": "89f86cdf5", "replicasDesired": 32, "replicasPending": 0, "replicasRunning": 1, "replicasMaybeRunning": 1, "templateHashObserved": ["89f86cdf5"], "created": 31}

decides-to-start-runners.txt

edit3: This may have been fixed by my setting scaleDownDelaySecondsAfterScaleOut: 30. I'll let you know...

edit4: It was just the scale-down delay blocking scale-up 🤷 All is good for me now.

@mumoshu (Collaborator, Author) commented Jun 27, 2022

@cablespaghetti Thanks for your detailed report!

> the scale-down delay blocking scale-up

You seem to have discovered a new, unrelated bug while testing this! Thank you so much. I'll try to fix it ASAP and include the fix in the upcoming 0.25.0.

@mumoshu mumoshu merged commit e2c8163 into master Jun 27, 2022
@mumoshu mumoshu deleted the async-webhook-based-scale branch June 27, 2022 09:31
@mumoshu (Collaborator, Author) commented Jun 27, 2022

I dug into it more deeply and I'm still unsure what was going on. The calculated desired replicas of 32 in the first log indicates that it had already passed through the scaleDownDelayAfterScaleOut check, so theoretically it shouldn't be due to scaleDownDelayAfterScaleOut 🤔

@mumoshu (Collaborator, Author) commented Jun 27, 2022

@cablespaghetti Can you confirm that your story went like this: you first maxed out the number of runners at 32, since you set maxReplicas: 32 in your HRA spec and received roughly 30 webhook events that triggered scale-ups? Then you had to wait until the scaleDownDelayAfterScaleOut passed before ARC finally "recreated" the completed pods to bring the number of active runners back to 32?

ARC ensures that completed runner pods are recreated only after the RunnerDeployment's and RunnerReplicaSet's effectiveTime or replicas field is updated by the HRA, so that it won't end up flapping (i.e. ARC recreates a pod, but then receives a workflow_job webhook event with status=completed, which decreases the desired replicas and therefore terminates the just-recreated pod). I think I found in ARC's codebase that RunnerDeployment doesn't propagate EffectiveTime to RunnerReplicaSet when replicas isn't updated, which might explain your issue...
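
For readers following along, here is a simplified, hypothetical sketch of the propagation behavior described above; it is not ARC's actual reconciler code, and the type and function names are made up for illustration. The point is the condition: if the RunnerReplicaSet is only updated when replicas differs, a newer effectiveTime alone never reaches it, so completed pods are not recreated until something else changes replicas.

```go
// Hypothetical sketch of the suspected propagation gap, not ARC's real code.
package main

import (
	"fmt"
	"time"
)

type spec struct {
	Replicas      int
	EffectiveTime *time.Time
}

// syncReplicaSet mimics a RunnerDeployment-style reconciler deciding whether
// to update its replica set. With propagateEffectiveTime=false it reproduces
// the suspected bug: an updated EffectiveTime is dropped whenever Replicas is
// unchanged. With true, EffectiveTime is propagated on its own.
func syncReplicaSet(rd, rrs *spec, propagateEffectiveTime bool) bool {
	if rd.Replicas != rrs.Replicas {
		rrs.Replicas = rd.Replicas
		rrs.EffectiveTime = rd.EffectiveTime
		return true
	}
	if propagateEffectiveTime && rd.EffectiveTime != nil &&
		(rrs.EffectiveTime == nil || rd.EffectiveTime.After(*rrs.EffectiveTime)) {
		rrs.EffectiveTime = rd.EffectiveTime
		return true
	}
	return false
}

func main() {
	now := time.Now()
	rd := &spec{Replicas: 32, EffectiveTime: &now} // HRA bumped EffectiveTime; replicas already at max
	rrs := &spec{Replicas: 32}                     // replica set never sees the new EffectiveTime

	fmt.Println("only-on-replicas-change, updated:", syncReplicaSet(rd, rrs, false)) // false: pods stay un-recreated
	fmt.Println("also-on-effectiveTime-change, updated:", syncReplicaSet(rd, rrs, true))
}
```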

@cablespaghetti commented

> @cablespaghetti Can you confirm that your story went like this: you first maxed out the number of runners at 32, since you set maxReplicas: 32 in your HRA spec and received roughly 30 webhook events that triggered scale-ups? Then you had to wait until the scaleDownDelayAfterScaleOut passed before ARC finally "recreated" the completed pods to bring the number of active runners back to 32?
>
> ARC ensures that completed runner pods are recreated only after the RunnerDeployment's and RunnerReplicaSet's effectiveTime or replicas field is updated by the HRA, so that it won't end up flapping (i.e. ARC recreates a pod, but then receives a workflow_job webhook event with status=completed, which decreases the desired replicas and therefore terminates the just-recreated pod). I think I found in ARC's codebase that RunnerDeployment doesn't propagate EffectiveTime to RunnerReplicaSet when replicas isn't updated, which might explain your issue...

Yes, I think setting maxReplicas might be the root cause of our problem then. Thanks for investigating 👍

@mumoshu (Collaborator, Author) commented Jun 28, 2022

@cablespaghetti Thanks a lot for reporting and confirming! I believe #1568 fixes the issue. I'd appreciate it if you could give it a try 🙏 Both this PR and #1568 have been merged, so you might want to build a canary version of ARC from our current main branch if you have a chance.

@cablespaghetti commented

@mumoshu Sorry, I just saw this. I'll run 0.25.0 and report back if we have a repeat of the problem. Thanks for being so responsive in fixing bugs!
