
Race condition in upscale webhook #1321

Closed
genisd opened this issue Apr 8, 2022 · 28 comments
Labels
bug Something isn't working
Milestone

Comments

@genisd
Contributor

genisd commented Apr 8, 2022

There is a race condition in the webhook-based upscaling when 2 events occur at the same time.
Essentially, two events should each scale +1, yielding 2 extra runners, but due to concurrency only 1 extra runner is booted.

Here is a log from the webhook server: 4 events are handled, but only 3 runners get reserved/booted.
I've been suspecting this for a while and I'm very happy I finally caught it, because it means we can fix it.

This could explain many of the reported issues (we ourselves are hitting this multiple times daily).
#1279
#1238

2022-04-08T13:55:31Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra runner-16-ramdisk for capacityReservations update	{"before": 0, "expired": 0, "amount": 1, "after": 1}
2022-04-08T13:55:31Z	INFO	controllers.webhookbasedautoscaler	scaled runner-16-ramdisk by 1
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	Found some runner groups are managed by ARC	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	groups	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	job scale up target found	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f26cfa0-b743-11ec-93e7-d12e236a53a0", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra runner-16-ramdisk for capacityReservations update	{"before": 1, "expired": 0, "amount": 1, "after": 2}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	Found some runner groups are managed by ARC	{"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	groups	{"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	job scale up target found	{"event": "workflow_job", "hookID": "332460040", "delivery": "8eb1c6b0-b743-11ec-8966-73bcb8c0ea68", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:33Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra runner-16-ramdisk for capacityReservations update	{"before": 1, "expired": 0, "amount": 1, "after": 2}
2022-04-08T13:55:33Z	INFO	controllers.webhookbasedautoscaler	scaled runner-16-ramdisk by 1
2022-04-08T13:55:33Z	INFO	controllers.webhookbasedautoscaler	scaled runner-16-ramdisk by 1
2022-04-08T13:55:34Z	DEBUG	controllers.webhookbasedautoscaler	Found some runner groups are managed by ARC	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:34Z	DEBUG	controllers.webhookbasedautoscaler	groups	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}"}
2022-04-08T13:55:34Z	DEBUG	controllers.webhookbasedautoscaler	job scale up target found	{"event": "workflow_job", "hookID": "332460040", "delivery": "8f55cee0-b743-11ec-84dd-7ac8ac6c725b", "workflowJob.status": "queued", "workflowJob.labels": ["runner-16-ramdisk", "self-hosted"], "repository.name": "companyname-frontend", "repository.owner.login": "companyname", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "enterprise": "", "organization": "companyname", "repository": "companyname-frontend", "key": "companyname"}
2022-04-08T13:55:34Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra runner-16-ramdisk for capacityReservations update	{"before": 2, "expired": 0, "amount": 1, "after": 3}
2022-04-08T13:55:34Z	INFO	controllers.webhookbasedautoscaler	scaled runner-16-ramdisk by 1

This is on k8s 1.22 with actions-runner-controller 0.22.2.

I propose adding a mutex around the patch of the k8s resource to avoid the stale write seen above.

@genisd genisd changed the title Race condition in webhook can yield in the less than the required amount of runners Race condition in upscale webhook Apr 8, 2022
@toast-gear toast-gear added the bug Something isn't working label Apr 8, 2022
@mumoshu
Collaborator

mumoshu commented Apr 10, 2022

@genisd Hey! Thanks for reporting.

I believe this is caused by the fact that we use MergePatch for updating capacityReservations, and MergePatch doesn't report a conflict on a data race. Apparently this isn't as easy to fix as I initially thought.

To be clear, adding a mutex around the patch operation alone doesn't resolve the issue: conceptually you need "the latest capacityReservations at the update time" to avoid the data race, and there's no way to ensure that, as K8s doesn't have transactions.

Wrapping the entire get-and-patch-HRA logic in a mutex can fix it, but I'm unsure how well it scales to a high number of concurrent jobs. Wrapping it in a mutex means we serialize the whole get-and-patch-HRA logic, so if the logic takes 2 seconds to complete, for example, we can process only one workflow_job webhook event per 2 seconds per HRA, which sounds quite slow.

Another strategy would be to use Update instead of Patch to detect conflicts, retry the whole get-and-update-HRA logic on conflict, and cross our fingers that it will eventually succeed. Note, though, that depending on how many concurrent workflow_jobs are queued, the max retry count can be very high. If you had 100 jobs triggered concurrently, in the worst case the webhook autoscaler could conflict on updating capacityReservations 99 times. Even worse things can happen if you keep triggering a lot of jobs before all the previous capacityReservation updates are completed.

That said, the best way might be to introduce another dedicated CRD (like WorkflowJob) so that we can create one unique WorkflowJob custom resource per new queued workflow_job event. We'd be able to ensure the uniqueness of the WorkflowJob resource by including the id from the webhook event in the name of the resource. This way we'd never be writing to the same WorkflowJob concurrently, so there would never be a data race or conflict.
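
For reference, a minimal sketch of the Update-and-retry-on-conflict strategy described above, assuming a controller-runtime client and the HorizontalRunnerAutoscaler / CapacityReservation types from our v1alpha1 API; the import path, field names, and function name here are assumptions for illustration, not the actual implementation:

// Sketch only: optimistic-locking update with retry on conflict.
package autoscaler

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/actions-runner-controller/actions-runner-controller/api/v1alpha1"
)

func addCapacityReservation(ctx context.Context, c client.Client, key types.NamespacedName, r v1alpha1.CapacityReservation) error {
	// RetryOnConflict re-runs the whole get-and-update on a 409 Conflict,
	// so each attempt sees the latest capacityReservations.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var hra v1alpha1.HorizontalRunnerAutoscaler
		if err := c.Get(ctx, key, &hra); err != nil {
			return err
		}
		hra.Spec.CapacityReservations = append(hra.Spec.CapacityReservations, r)
		// Unlike MergePatch, Update carries the resourceVersion read above, so a
		// concurrent writer causes a conflict instead of a silent lost update.
		return c.Update(ctx, &hra)
	})
}

As noted above, the downside is that with many concurrent webhook events a single update can retry many times before it wins.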

@genisd
Contributor Author

genisd commented Apr 10, 2022

Wrapping the entire get-and-patch-HRA logic in a mutex can fix it, but I'm unsure how well it scales to a high number of concurrent jobs. Wrapping it in a mutex means we serialize the whole get-and-patch-HRA logic, so if the logic takes 2 seconds to complete, for example, we can process only one workflow_job webhook event per 2 seconds per HRA, which sounds quite slow.

I don't know how fast the patching really is (I lack sub-second data in my logs).
Looking at the load balancer, it seems to be about 200 to 300 ms.
At a 5 s cutoff/timeout with ~300 ms per patch, 20 concurrent jobs would already be problematic (the last call would land at ~6 s).

Otherwise, perhaps an opt-out would be feasible too.

Another strategy would be to use Update instead of Patch to detect conflicts, retry the whole get-and-update-HRA logic on conflict, and cross our fingers that it will eventually succeed.

That sounds similar to the mutex approach, with the same issues at least.

That said, the best way might be to introduce another dedicated CRD (like WorkflowJob) so that we can create one unique WorkflowJob custom resource per new queued workflow_job event. We'd be able to ensure the uniqueness of the WorkflowJob resource by including the id from the webhook event in the name of the resource. This way we'd never be writing to the same WorkflowJob concurrently, so there would never be a data race or conflict.

Of course, if we can solve this issue with a non-blocking approach, that would be best. But it does sound like a fair bit of refactoring work.
I'm not that experienced here, but I'm highly motivated to get this issue fixed 😇

I have a proof-of-concept PR with a mutex which might work (with all the issues discussed).
It was just 4 lines of code (if it works, haha).
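
Something in the spirit of that PoC could look like the following (a hypothetical sketch, not the actual PR; getAndPatchHRA stands in for whatever function does the get-and-patch): serialize the HRA get-and-patch path behind a single sync.Mutex.

package autoscaler

import (
	"context"
	"sync"
)

// hraPatchMu serializes every HRA get-and-patch so two webhook events
// can never interleave between the read and the write.
var hraPatchMu sync.Mutex

// scaleUp wraps a caller-supplied get-and-patch function (hypothetical here)
// in the mutex; this is the essence of the 4-line PoC.
func scaleUp(ctx context.Context, getAndPatchHRA func(context.Context) error) error {
	hraPatchMu.Lock()
	defer hraPatchMu.Unlock()
	return getAndPatchHRA(ctx)
}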

@genisd
Contributor Author

genisd commented Apr 11, 2022

So I'm running a mutex in our environment.
The codebase is not structured to gate only the read/write critical path, so in my test I'm currently gating all HTTP requests 🙈

The latencies are not terrible, which means that for us this is feasible for now.
The 95th percentile sits mostly at about 100 ms. So this approach would probably work for medium-sized environments, but not for really large ones. (Edit: I really have to see how this goes when hooks start rolling in.)
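
For concreteness, gating the whole HTTP path can be as small as a handler wrapper like this (a hedged sketch, not the exact change I'm running):

package autoscaler

import (
	"net/http"
	"sync"
)

// serialize gates every webhook request behind a single mutex, trading
// throughput for a guarantee that no two events race on the HRA update.
func serialize(next http.Handler) http.Handler {
	var mu sync.Mutex
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		next.ServeHTTP(w, r)
	})
}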

@cablespaghetti

Hi @genisd. Are you able to share your code? I was trying to set up a mixture of webhook and PercentageRunnersBusy scaling in our environment, but I have an issue where the webhook-scaled pods all need to have completed before PercentageRunnersBusy can do anything, and then it scales to our maximum count immediately 🤷

We're running at quite a small scale right now so your mutex option could do the job. Thanks!

@genisd
Contributor Author

genisd commented May 23, 2022

I can give you the code, but running the webhook mixed with PercentageRunnersBusy hasn't really worked for me (I have tried).
So far the best setup has been to set a decent scale-down delay.

I'll share the code tomorrow.

mumoshu added a commit that referenced this issue May 24, 2022
This prevents a race condition in the webhook-based autoscaler where it receives a new webhook event while still processing another one and both end up
scaling up the same horizontal runner autoscaler.

Ref #1321
@mumoshu
Collaborator

mumoshu commented May 24, 2022

@genisd @cablespaghetti Hey! I finally had some time to actually work on this, and here's my attempt to fix it without affecting the overall throughput of the webhook server: #1477
I would greatly appreciate it if you could give it a shot and report any issues. You should be able to build and test it in a local cluster following our contribution guide.

@cablespaghetti

From my limited testing, this branch does seem to work reliably. Thanks! 🎉

I am getting a LOT of log messages about unrelated PVCs/PVs even though I'm not using persistent volumes with my runners, but I imagine that's unrelated.

2022-05-25T15:20:40Z	INFO	actions-runner-controller.runnerpersistentvolumeclaim	Reconciling runner pvc	{"pvc": "observability/prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0"}
2022-05-25T15:20:40Z	INFO	actions-runner-controller.runnerpersistentvolumeclaim	Reconciling runner pvc	{"pvc": "monolith-planning-page-capacity-test/data-pet-db-postgresql-0"}
2022-05-25T15:20:42Z	DEBUG	actions-runner-controller.runnerpersistentvolume	Retrying sync until pvc gets removed	{"pv": "/pvc-c91139d3-e44d-44cc-bf94-45b1d8e478e6", "requeueAfter": "10s"}
2022-05-25T15:20:42Z	DEBUG	actions-runner-controller.runnerpersistentvolume	Retrying sync until pvc gets removed	{"pv": "/pvc-e6e8810f-8933-42ff-a122-a70ed466be26", "requeueAfter": "10s"}

@mumoshu
Collaborator

mumoshu commented May 26, 2022

@cablespaghetti Thanks for testing! Glad to hear it worked for you.

Regarding excessive logs for PVCs and PVs, that's indeed a bug we introduced via another unrelated change! It's going to be fixed via #1487. Thanks again for reporting 🙏

@cablespaghetti

Unfortunately, I am getting quite a few queued jobs which aren't getting pods today :( So it might be better than before, but it's not 100% fixed. Could this be due to me running two instances of the webhook pods and two of the controller, or should that work OK?

@rauny-brandao

rauny-brandao commented Jun 6, 2022

@mumoshu

I've updated to the latest version (0.24.0) using Helm and I'm seeing messages related to PVCs:

actions-runner-controller-84b49d96d9-wkfh2 manager 2022-06-06T08:56:29Z	DEBUG	actions-runner-controller.runnerpersistentvolume	Retrying sync until pvc gets removed	{"pv": "/pvc-6d62ea20-5c9d-11ea-8eb6-022486ecee8b", "requeueAfter": "10s"}
actions-runner-controller-84b49d96d9-wkfh2 manager 2022-06-06T08:56:29Z	DEBUG	actions-runner-controller.runnerpersistentvolume	Retrying sync until pvc gets removed	{"pv": "/pvc-9e240b56-2588-11ea-ad33-022486ecee8b", "requeueAfter": "10s"}
actions-runner-controller-84b49d96d9-wkfh2 manager 2022-06-06T08:56:29Z	DEBUG	actions-runner-controller.runnerpersistentvolume	Retrying sync until pvc gets removed	{"pv": "/pvc-c8a7e0c4-3f5f-40f8-b4a7-4bd33edf3687", "requeueAfter": "10s"}
actions-runner-controller-84b49d96d9-wkfh2 manager 2022-06-06T08:56:29Z	DEBUG	actions-runner-controller.runnerpersistentvolume	Retrying sync until pvc gets removed	{"pv": "/pvc-f3c47fa7-bc48-11e9-b794-0adb1bf7ee2a", "requeueAfter": "10s"}

Those PVCs are from an application unrelated to ARC; is this expected?

Thank you!

@mumoshu
Collaborator

mumoshu commented Jun 6, 2022

Could this be due to me running two instances of the webhook pods

@cablespaghetti Ah, good catch! That must be the thing that isn't handled by my quick fix. Let me add one more fix so that it uses optimistic locking on the HRA spec update to fully fix the race, and perhaps batches updates that happen within a few seconds so that it doesn't stress your K8s API server too much.
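
The batching idea, roughly (a sketch with illustrative names and an arbitrary interval, not the final implementation): coalesce scale triggers per HRA on a channel and flush them with a single conflict-retried update every few seconds.

package autoscaler

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/types"
)

type scaleOp struct {
	hraKey types.NamespacedName
	amount int
}

// runBatcher coalesces scale triggers per HRA and flushes them on a fixed interval,
// so the apiserver sees one update per HRA per tick instead of one per webhook event.
func runBatcher(ctx context.Context, ops <-chan scaleOp, flush func(types.NamespacedName, int) error) {
	ticker := time.NewTicker(3 * time.Second) // illustrative flush interval
	defer ticker.Stop()
	pending := map[types.NamespacedName]int{}
	for {
		select {
		case <-ctx.Done():
			return
		case op := <-ops:
			pending[op.hraKey] += op.amount
		case <-ticker.C:
			for key, amount := range pending {
				_ = flush(key, amount) // e.g. the conflict-retried HRA update sketched earlier
			}
			pending = map[types.NamespacedName]int{}
		}
	}
}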

@mumoshu
Collaborator

mumoshu commented Jun 6, 2022

@raunyoliveira-hotmart That's one of the expected debug log messages, so you shouldn't worry about it! (Note that our Helm chart enables debug logging by default.)

// See #1511

@shinji62

Hi, not sure if this is related, but my HRA spec is full of capacityReservations; I don't see any of them being cleaned up even after the expiration time passes. I am using k8s 1.21 and ARC 0.24.0.

capacityReservations:
  - effectiveTime: "2022-06-15T07:36:27Z"
    expirationTime: "2022-06-15T07:41:27Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:28Z"
    expirationTime: "2022-06-15T07:41:28Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:28Z"
    expirationTime: "2022-06-15T07:41:28Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:28Z"
    expirationTime: "2022-06-15T07:41:28Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:29Z"
    expirationTime: "2022-06-15T07:41:29Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:29Z"
    expirationTime: "2022-06-15T07:41:29Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:29Z"
    expirationTime: "2022-06-15T07:41:29Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:30Z"
    expirationTime: "2022-06-15T07:41:30Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:30Z"
    expirationTime: "2022-06-15T07:41:30Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:31Z"
    expirationTime: "2022-06-15T07:41:31Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:32Z"
    expirationTime: "2022-06-15T07:41:32Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:32Z"
    expirationTime: "2022-06-15T07:41:32Z"
    replicas: 1
  - effectiveTime: "2022-06-15T07:36:32Z"
    expirationTime: "2022-06-15T07:41:32Z"
    replicas: 1

@mumoshu
Collaborator

mumoshu commented Jun 15, 2022

@shinji62 We do remove expired capacity reservations in https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/horizontal_runner_autoscaler_webhook.go#L786, so the only way an issue like yours could happen would be if you have some clock skew! (but probably not)
It would be great if you could read ARC's log and see what's happening, as it should contain when it reconciled the HRA, how many desired replicas ARC thought it needed, etc.
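
The expiry logic is essentially a filter on expirationTime against the controller's clock; here is an illustrative sketch of what "expired" means (not the exact code at that line), assuming the v1alpha1.CapacityReservation type with an ExpirationTime of kind metav1.Time:

package autoscaler

import (
	"time"

	"github.com/actions-runner-controller/actions-runner-controller/api/v1alpha1"
)

// activeReservations keeps only reservations whose expirationTime is still in the future.
func activeReservations(now time.Time, rs []v1alpha1.CapacityReservation) []v1alpha1.CapacityReservation {
	var kept []v1alpha1.CapacityReservation
	for _, r := range rs {
		if r.ExpirationTime.Time.After(now) {
			kept = append(kept, r)
		}
	}
	return kept
}

Since the comparison is against the controller's local clock, skew between that clock and the timestamps written at reservation time is what could make reservations look unexpired.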

@mumoshu
Collaborator

mumoshu commented Jun 15, 2022

// Almost certainly it isn't related to this issue though

@Uriende

Uriende commented Jun 15, 2022

We have a similar problem to the one mentioned in this issue, but I'm not sure it's due to the same cause.
I did see the race condition happen once or twice in the logs, but I have another theory.

Assume 10 jobs start and call the webhook 10 times, so it would slowly start 10 new runners.
If 2 different jobs were to finish at about the same time, that would reduce the HRA by 2.
So we'd probably end up with only 8 new runners.

Am I wrong in assuming this could also be a possible issue?
I might test it later and see if I can verify that this happens to us, but I wanted to share my thoughts here anyhow.

@mumoshu
Collaborator

mumoshu commented Jun 17, 2022

@Uriende Hey! Thanks for reporting. It might be the same issue. I'm going to complete #1477 shortly (as described in #1321 (comment)). After that, I'd appreciate it if you could give it a shot by building ARC from that branch and trying to reproduce the issue.

@mumoshu
Collaborator

mumoshu commented Jun 22, 2022

@cablespaghetti @genisd @Uriende #1477 is fully ready to be tested! Would you mind giving it a shot and reporting back any issues? Thanks.

@Uriende

Uriende commented Jun 22, 2022

I'll try to give it a go today or tomorrow. Thanks! @mumoshu

cablespaghetti pushed a commit to cablespaghetti/actions-runner-controller that referenced this issue Jun 22, 2022
@genisd
Contributor Author

genisd commented Jun 23, 2022

I'm currently extremely busy with other stuff, and with maternity leave around the corner I cannot promise anything, sorry 😞
I did look at the code and it looked better than my hack 😄

@mumoshu
Collaborator

mumoshu commented Jun 24, 2022

@genisd Congrats 🎉 Take your time!

mumoshu added a commit that referenced this issue Jun 27, 2022
* Make webhook-based scale operation asynchronous

This prevents a race condition in the webhook-based autoscaler where it receives a new webhook event while still processing another one and both end up
scaling up the same horizontal runner autoscaler.

Ref #1321

* Fix typos

* Update rather than Patch HRA to avoid race among webhook-based autoscaler servers

* Batch capacity reservation updates for efficient use of apiserver

* Fix potential never-ending HRA update conflicts in batch update

* Extract batchScaler out of webhook-based autoscaler for testability

* Fix log levels and batch scaler hang on start

* Correlate webhook event with scale trigger amount in logs

* Fix log message
@Uriende

Uriende commented Jun 27, 2022

Took longer than I thought to get to it, but we're now running the branch. I'll report back on how things are once it has run for a bit.

@toast-gear
Collaborator

@Uriende any updates for us?

@genisd
Contributor Author

genisd commented Jul 11, 2022

I'm running the tagged 0.25.1 release as of right now. So far so good; I'll keep an eye on it for the next few days.

@Uriende

Uriende commented Jul 11, 2022

Forgot to update: no problems so far, and it seems to have fixed our issues.

@genisd
Contributor Author

genisd commented Jul 12, 2022

I think we can close this; 0.25.1 seems to be running great.

@genisd
Contributor Author

genisd commented Jul 12, 2022

So I'm going to close this.
Fixed in the 0.25 release.

@genisd genisd closed this as completed Jul 12, 2022
@mumoshu
Collaborator

mumoshu commented Jul 12, 2022

@genisd Thanks for all your help and support!
