Race condition in upscale webhook #1321
Comments
@genisd Hey! Thanks for reporting. I believe this is caused by the fact that we use MergePatch for updating capacityReservations, and MergePatch doesn't produce a conflict when there's a data race. Apparently this isn't as easy to fix as I initially thought. To be clear, adding a mutex around the patch operation alone doesn't resolve the issue, because conceptually you need "the latest capacityReservations at update time" to avoid the data race, and there's no way to ensure that, as K8s doesn't have transactions.

Wrapping the entire get-and-patch-HRA logic in a mutex can fix it, but I'm unsure how well that scales to a high number of concurrent jobs. Wrapping it in a mutex means we serialize the whole get-and-patch-HRA logic, so if the logic takes 2 seconds to complete, for example, we can process only one workflow_job webhook event per 2 seconds per HRA, which sounds quite slow.

Another strategy would be to use Update instead of Patch to detect conflicts and retry the whole get-and-update-HRA logic on conflict, and cross our fingers that it will eventually succeed. Note, though, that depending on how many concurrent workflow_jobs are queued, the max retry count can be very high. If you had 100 jobs triggered concurrently, in the worst case the webhook autoscaler can conflict on updating capacityReservations 99 times. Even worse things can happen if you keep triggering a lot of jobs before all the previous capacityReservation updates have completed.

That said, the best way might be to introduce another dedicated CRD (like …)
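For readers following along, here is a minimal Go sketch of the "use Update instead of Patch and retry on conflict" strategy described above, built on client-go's `retry.RetryOnConflict`. The helper name and the HRA field access (`Spec.CapacityReservations`) are assumptions made for the example, not ARC's actual code:

```go
// Sketch only: illustrates "Update + retry on conflict", not ARC's implementation.
// The v1alpha1 types and field names used here are assumptions for illustration.
package webhookscale

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "github.com/actions-runner-controller/actions-runner-controller/api/v1alpha1"
)

// addCapacityReservation re-reads the HRA and retries the whole
// get-and-update cycle whenever the apiserver reports a resourceVersion conflict.
func addCapacityReservation(ctx context.Context, c client.Client, key types.NamespacedName, r v1alpha1.CapacityReservation) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var hra v1alpha1.HorizontalRunnerAutoscaler
		if err := c.Get(ctx, key, &hra); err != nil {
			return err
		}
		// Work on the freshest copy so we never write back stale reservations.
		hra.Spec.CapacityReservations = append(hra.Spec.CapacityReservations, r)
		// Unlike MergePatch, Update carries the resourceVersion, so concurrent
		// writers get a conflict error instead of silently overwriting each other.
		return c.Update(ctx, &hra)
	})
}
```

As discussed above, the downside is that a burst of concurrent workflow_job events can force many retries of the whole get-and-update cycle.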
I don't know how fast the patching really is (I lack sub-second data in my logs). Otherwise perhaps an opt-out would be feasible too.
Sounds like the mutex approach, with the same issues at least.
Of course, if we can solve this issue with a non-blocking approach, that would be best. But it does sound like a bit of refactoring work. I have a proof-of-concept PR with a mutex which might work (with all the issues discussed).
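(For illustration only: a rough sketch of what serializing the get-and-patch section with a per-HRA mutex might look like. This is not the actual proof-of-concept PR; the type and method names are made up for the example.)

```go
// Hypothetical sketch of the mutex approach discussed above.
// A per-HRA mutex serializes the get-and-patch section so two webhook events
// can no longer read the same capacityReservations and overwrite each other.
package webhookscale

import "sync"

type hraLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // keyed by namespace/name of the HRA
}

func (l *hraLocks) lockFor(key string) *sync.Mutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.locks == nil {
		l.locks = map[string]*sync.Mutex{}
	}
	if _, ok := l.locks[key]; !ok {
		l.locks[key] = &sync.Mutex{}
	}
	return l.locks[key]
}

// scaleUp wraps the whole get-and-patch logic; the trade-off, as noted above,
// is that events targeting the same HRA are processed strictly one at a time.
func (l *hraLocks) scaleUp(key string, getAndPatch func() error) error {
	m := l.lockFor(key)
	m.Lock()
	defer m.Unlock()
	return getAndPatch()
}
```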
So I'm running a mutex in our environment. The latencies are not super horrible, which means that for us this is feasible for now.
Hi @genisd. Are you able to share your code? I was trying to set up a mixture of webhook and PercentageRunnersBusy scaling in our environment, but have an issue where the webhook-scaled pods all need to have completed before PercentageRunnersBusy can do anything, and then it scales to our maximum count immediately 🤷 We're running at quite a small scale right now, so your mutex option could do the job. Thanks!
I can give you the code, but running webhook mixed with PercentageRunnersBusy hasn't really worked for me (I have tried). I'll share the code tomorrow.
This prevents a race condition in the webhook-based autoscaler that occurred when it received a webhook event while already processing another one, and both ended up scaling up the same horizontal runner autoscaler. Ref #1321
@genisd @cablespaghetti Hey! I finally had some time to actually work on this, and here's my attempt to fix it without affecting the overall throughput of the webhook server: #1477
From my limited testing, this branch does seem to work reliably. Thanks! 🎉 I am getting a LOT of log messages about unrelated PVCs/PVs even though I'm not using persistent volumes with my runners, but that's unrelated, I imagine.
@cablespaghetti Thanks for testing! Glad to hear it worked for you. Regarding the excessive logs for PVCs and PVs, that's indeed a bug we introduced via another unrelated change! It's going to be fixed via #1487. Thanks again for reporting 🙏
Unfortunately, I am getting quite a few queued jobs which aren't getting pods today :( So it might be better than before, but not 100% fixed. Could this be due to me running two instances of the webhook pods and two of the controller, or should that work OK?
I've updated to the latest version (0.24.0) using Helm and I'm seeing messages related to PVCs:
Those PVCs are from an application unrelated to ARC; is this expected? Thank you!
@cablespaghetti Ah, good catch! That must be the thing that isn't handled by my quick fix. Let me add one more fix so that it uses optimistic locking on the HRA spec update to fully fix the race, and perhaps batches updates that happen within a few seconds so that it won't stress your K8s API server too much.
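As a rough illustration of the batching idea mentioned above: collect scale triggers for a short window and apply their sum as a single HRA update, so a burst of workflow_job events causes one apiserver write instead of one per event. The names and the batching interval below are assumptions, not the actual implementation:

```go
// Illustrative sketch of batching several capacity reservation updates into one write.
package webhookscale

import (
	"context"
	"time"
)

type scaleOp struct {
	hraKey string // namespace/name of the HRA
	amount int    // replicas requested by one webhook event
}

// batchScale accumulates ops arriving within a short window and applies the sum
// for each HRA as a single update.
func batchScale(ctx context.Context, ops <-chan scaleOp, apply func(key string, total int) error) {
	const window = 3 * time.Second // assumed batching interval
	pending := map[string]int{}
	ticker := time.NewTicker(window)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case op := <-ops:
			pending[op.hraKey] += op.amount
		case <-ticker.C:
			for key, total := range pending {
				_ = apply(key, total) // real code would log and retry failed updates
			}
			pending = map[string]int{}
		}
	}
}
```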
@raunyoliveira-hotmart That's one of the expected debug log messages, so you shouldn't worry about it! (Note that our Helm chart enables debug logging by default; see #1511.)
Hi, not sure if it's related, but my HRA spec is full of capacityReservations; I don't see any of them being cleaned up even after the expiration time passes. I am using K8s 1.21 and ARC 0.24.0.
@shinji62 We do remove expired capacity reservations in https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/horizontal_runner_autoscaler_webhook.go#L786 so the only reason an issue like yours can happen would be that you have some clock skew! (but probably not)
Almost certainly it isn't related to this issue, though.
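For context, pruning expired reservations is typically just a comparison of each reservation's expiration time against "now", which is why clock skew was suggested as a possible cause above. A minimal sketch, with field names assumed for illustration (the real logic lives in horizontal_runner_autoscaler_webhook.go as linked):

```go
// Sketch only: drops reservations whose expiration is in the past.
package webhookscale

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type reservation struct {
	ExpirationTime metav1.Time // assumed field name
	Replicas       int
}

// activeReservations keeps only reservations that have not yet expired.
// If the controller's clock is skewed, "now" is wrong and stale entries linger.
func activeReservations(all []reservation, now time.Time) []reservation {
	var kept []reservation
	for _, r := range all {
		if r.ExpirationTime.Time.After(now) {
			kept = append(kept, r)
		}
	}
	return kept
}
```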
We have a similar problem to the one mentioned in this issue, but I'm not sure it's due to the same cause. Assuming 10 jobs start and call the webhook 10 times, it would only slowly start the 10 new runners. Am I wrong in assuming this could also be a possible issue?
@Uriende Hey! Thanks for reporting. It might be the same issue. I'm going to complete #1477 shortly (as described in #1321 (comment)). After that, I'd appreciate it if you could give it a shot by building ARC from that branch and trying to reproduce the issue.
@cablespaghetti @genisd @Uriende #1477 is fully ready to be tested! Would you mind giving it a shot and reporting back any issues? Thanks.
I'll try to give it a go today or tomorrow. Thanks! @mumoshu
I'm currently extremely busy with other stuff, and with maternity leave around the corner I cannot promise anything, sorry 😞
@genisd Congrats 🎉 Take your time!
* Make the webhook-based scale operation asynchronous. This prevents a race condition in the webhook-based autoscaler that occurred when it received a webhook event while already processing another one, and both ended up scaling up the same horizontal runner autoscaler. Ref #1321
* Fix typos
* Update rather than Patch the HRA to avoid races among webhook-based autoscaler servers
* Batch capacity reservation updates for efficient use of the apiserver
* Fix potential never-ending HRA update conflicts in batch update
* Extract batchScaler out of the webhook-based autoscaler for testability
* Fix log levels and batch scaler hang on start
* Correlate webhook event with scale trigger amount in logs
* Fix log message
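To illustrate the first bullet above ("make the scale operation asynchronous"): the webhook HTTP handler only enqueues a scale request and returns immediately, while a background worker performs the actual get-and-update, so slow updates don't block incoming webhook events. This is a hedged sketch under assumed names and queue sizing, not the PR's exact design:

```go
// Sketch of an asynchronous scale pipeline: the handler enqueues, a worker applies.
package webhookscale

import "context"

type scaleRequest struct {
	hraKey string
	amount int
}

type asyncScaler struct {
	queue chan scaleRequest
}

func newAsyncScaler(ctx context.Context, apply func(key string, amount int) error) *asyncScaler {
	s := &asyncScaler{queue: make(chan scaleRequest, 1024)} // assumed buffer size
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case req := <-s.queue:
				_ = apply(req.hraKey, req.amount) // real code would log and retry on conflict
			}
		}
	}()
	return s
}

// enqueue is what the webhook HTTP handler would call; it returns immediately.
func (s *asyncScaler) enqueue(req scaleRequest) {
	select {
	case s.queue <- req:
	default:
		// queue full: real code would surface this via a metric or log rather than drop silently
	}
}
```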
Took longer than I thought to get to it. But we're now running the branch. I'll report back how things are once it has run for a bit.
@Uriende any updates for us?
I'm running tagged …
Forgot to update. So far no issues, and it seems to have fixed our problems.
I think we can close this; 0.25.1 seems to be running great.
So I'm going to close this.
@genisd Thanks for all your help and support!
There is a race condition in the webhook-based upscale, where 2 events occur at the same time.
Essentially, two events should each scale +1, yielding 2 extra runners. But due to concurrency, only 1 extra runner is booted.
Here is a log from the webhook server; 4 events are handled, but only 3 runners got reserved/booted.
I've been suspecting this, and I'm super happy I found it, because it means we can fix it.
It could explain many issues reported (and we ourselves are hitting this multiple times daily):
#1279
#1238
This is on K8s 1.22 with actions-runner-controller release 0.22.2.
I propose a mutex surrounding the patch of the K8s resource to avoid the stale write seen above.