Pipeline gets stuck randomly with "Job is waiting for a runner from XXX to come online" #3420
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
Not sure if this is the same issue, but I'm seeing similar behavior on 0.8.2 and 0.9.0. When I went to take a look at the cluster, I saw the runner pods try to start but immediately die with the following error event:
This began last week when we were forced off of our previous runner version. @wasd171, if you have the chance, try to start the runner and watch the k8s event log for your runner namespace. If this isn't the case for you, I'll file a similar issue.
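For anyone wanting to run the same check, a minimal sketch (namespace and pod names are examples; substitute your own runner namespace):

```bash
# Watch events in the runner namespace to catch runner pods that start and
# immediately die (the namespace here is only an example).
kubectl get events -n arc-runners --sort-by=.lastTimestamp
kubectl get events -n arc-runners -w

# Dig into a specific short-lived runner pod while it still exists:
kubectl describe pod -n arc-runners <runner-pod-name>
```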
@rteeling-evernorth In our case, the pods don't even get created.
Gotcha, you have a different problem from me. My issue was caused by the fact that I had erroneously mirrored the ARM actions-runner image for a bunch of Intel runners; that shim task error was a bit of a red herring. Anyway, good luck with your issue.
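As a side note for anyone who may have mirrored the wrong architecture, a quick way to check (the image reference is just an example; use whatever image your runners pull):

```bash
# A proper multi-arch runner image lists one platform entry per architecture;
# a single-arch mirror will not include linux/amd64.
docker manifest inspect ghcr.io/actions/actions-runner:latest

# Compare against the architecture of the nodes the runner pods land on:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}'
```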
We are seeing this as well with 0.9.0 after upgrading from 0.8.2. It seems to occur for us when there is one job running (a job that takes around 1 h) and then another repo requests a runner. From what I've quickly found in the logs, it does scale out, but then for some reason that next runner pod starts up and then disappears again without running the job. The runner set in the GH GUI then shows 1 pending, 1 running, 2 assigned, but the autoscaling runner set on ARC just shows 1 running, 0 pending.

Only when this long-running job finishes does it seem to notice again that there is still one pending job and fire up a pod. This does not always happen, though; it can happily scale up to more than one pod on other occasions. Due to an issue on our AKS I cannot access historical pod logs, however, so it's hard to check whether there are any errors in that runner pod that seems to have been created and then disappeared again. From the changelog for 0.9.0 I noticed that #3371 changed something with respect to scaling. Perhaps it had unintended side effects? @nikola-jokic
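For reference, a sketch of how to cross-check ARC's view against the GH GUI (the resource kinds are the scale-set CRDs; the namespace is an assumption, adjust to your install):

```bash
# Desired vs. current runner counts as ARC sees them:
kubectl get autoscalingrunnersets -n arc-runners

# Individual ephemeral runners and their phases, to spot one that was
# created and then disappeared without ever running the job:
kubectl get ephemeralrunners -n arc-runners
```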
Great summary @mhuijgen, thanks! I think I might know what is going on here. I'll try to confirm my assumption and get back to you all with my findings. Thank you all for helping me better understand the problem! If my assumption is right, the scale will self-correct on the job-completed event, as @mhuijgen pointed out, but that may take a long time.
We're seeing the same since upgrading to 0.9.0. Just to add that restarting the listener will cause a new (proper) runner pod to be created, and the build will go through just fine.
Killing the listener pod indeed corrects the issue as well. I just had a case where the autoscaling runner set reported 0 pending/current/running while the GH GUI showed 1 job assigned, and it never got picked up until I killed the listener and it restarted. Downgrading to 0.8.2 now until this is fixed...
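For anyone needing the same workaround, a minimal sketch (namespace and pod name are assumptions; the listener pod usually runs alongside the controller):

```bash
# Find the listener pod for the affected runner scale set:
kubectl get pods -n arc-systems

# Delete it; the controller recreates the listener and the pending job
# is picked up again.
kubectl delete pod -n arc-systems <scale-set-listener-pod-name>
```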
Hey @mhuijgen, the issue is with the controller, so you can use 0.8.3 in the meantime. Again, sorry this has happened; I personally tested the new scaling process many times and couldn't bump into this problem... 😞 I'm working on the fix right now.
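For reference, a sketch of pinning the charts to 0.8.3 with Helm (release names, namespaces, and config values are placeholders; the chart references follow the documented OCI layout, so double-check them against your install):

```bash
# Pin the controller chart to 0.8.3
helm upgrade --install arc \
  --namespace arc-systems \
  --version 0.8.3 \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

# Pin the runner scale set chart to the matching version
helm upgrade --install arc-runner-set \
  --namespace arc-runners \
  --version 0.8.3 \
  --set githubConfigUrl="https://github.com/<org>/<repo>" \
  --set githubConfigSecret=<github-config-secret-name> \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
```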
Not sure if this is related, but I have the same issue on 0.9.0. The runner pod is created, but it's immediately deleted, and then the controller never recreates the runner pod, resulting in the pipeline getting stuck. Most of the time, the error happens on runner initialization; one of them is:
Here's the full log from Loki: https://bin.hzmi.xyz/sadoneponi.log
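In case it helps others gather the same data, a sketch for catching the short-lived pod and the controller's view of it (namespaces and names are assumptions):

```bash
# Watch runner pods appear and disappear in real time:
kubectl get pods -n arc-runners -w

# If the pod still exists with a restarted container, grab its previous logs:
kubectl logs -n arc-runners <runner-pod-name> --previous

# Controller logs for the same time window:
kubectl logs -n arc-systems deploy/<controller-deployment-name> --since=1h
```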
Same symptoms with GKE
I made the same observations as @mhuijgen and Hazmi35, and collected the runner and controller logs as well. Going a step further, here's what I can say. On the runner side, it fails when creating the session and quits immediately, indicating that the runner registration has been deleted from the server. Runner:
On the controller side, the ephemeral runner registration is created once as expected, but then a second time immediately afterwards (which is not expected), resulting in a detected conflict (a runner with the same name exists in GitHub) and causing the controller to delete the existing ephemeral runner registration from GitHub (the one created in the first attempt) and requeue reconciliation. First successful attempt to create an ephemeral runner registration:
Second unsuccessful attempt to create an ephemeral runner registration:
More interestingly, since the controller subsequently does not find the ephemeral runner (registration) in GitHub, it immediately concludes that the runner has finished, whereas it has actually failed.
Controller:
In short, if I'm not mistaken:
Let's consider this an assumption for now, based on the logs and source code. I have struggled a bit to reproduce it on a local setup, since the code base is relatively new to me and I lack time (I'll be on vacation next week).
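To follow the same trail in another cluster, a sketch of narrowing the controller logs down to a single stuck runner (deployment name and namespace are assumptions; the exact log messages vary by version):

```bash
# Keep only the controller log lines for the affected ephemeral runner, to
# see its registration being created, conflicting, and then deleted.
kubectl logs -n arc-systems deploy/<controller-deployment-name> \
  | grep '<ephemeral-runner-name>'
```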
Hey @clementnero, thank you for this great investigation! The PR I created should take care of this problem as well (the long poll restarts the sequence, but during that 50 s period the controller will re-create a session and start the pod successfully), but I'm wondering why the controller didn't pick up the status update. I will dig into it as soon as I can.
I'll keep an eye on this issue after the next 0.9.1 release. Thank you all!
@nikola-jokic We're experiencing the same issue as described here (randomly, jobs get stuck waiting for a runner; killing the listener pod unblocks it). For us it started yesterday on 0.9.0, so today we upgraded to 0.9.1. Unfortunately, we hit the same issue several times today with 0.9.1. Do I understand correctly that the above issue should be fixed in 0.9.1? If so, is there any info I can share to support troubleshooting?
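In case it is useful, a sketch of what I could collect (resource names and namespaces are assumptions; adjust to your install):

```bash
# Controller and listener logs covering a stuck job:
kubectl logs -n arc-systems deploy/<controller-deployment-name> > controller.log
kubectl logs -n arc-systems <scale-set-listener-pod-name> > listener.log

# State of the scale set and its ephemeral runners while the job is stuck:
kubectl get autoscalingrunnersets,ephemeralrunners -n arc-runners -o yaml > state.yaml
```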
Checks
Controller Version
0.9.0
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
Part of the pipeline works well, but at some point it gets stuck with "Job is waiting for a runner from XXX to come online.". Restarting the pipeline fixes it. We have started experiencing it roughly since last week while still being on 0.6.1, but the issue has persisted on 0.9.0.

[Screenshot SCR-20240409-mzbh: the job stuck waiting for a runner to come online]
Describe the expected behavior
Pipeline does not get stuck
Additional Context
Controller Logs
Runner Pod Logs