0.9.1 does not work with on-demand provisioned nodes #3450
Comments
Hey @wasd171, Thank you for reporting this! The root cause of the problem is that in 0.9.1, the assumption was that if there is an empty batch, we self-correct the count. My assumption was that 50s is enough time for the cluster to be ready, but in this case, it was obviously wrong. An important thing to point out is that it should self-correct on the next run of the scale set. However, this should be fixed, so again, thank you for bringing this to our attention! Please use the older version of ARC in the meantime. You can also use the older version of the listener that doesn't propagate the patch ID, which may occasionally create more pods than necessary, but it would handle this case appropriately.
@nikola-jokic, I want to check if this is the same issue I am having. I have a listener that scheduled 7 jobs to 7 arc-runner pods, but those pods were pending while waiting for a node to come online. After about a minute, all of those pods went away, and then 1-2 minutes later the node was ready for them to run on. The pods didn't come back up, though. Meanwhile, in GitHub Actions, the jobs were stuck waiting for a runner group. The listener logs in arc-systems show that those jobs were assigned, but the pods they were assigned to had been terminated (probably by the controller), and so the jobs were lost, never to be scheduled. Does this sound like the same thing, and would rolling back to 0.9.0 address the problem?
Hey @koreyGambill, I think so. Rolling back to 0.9.0 should fix the problem. I have to mention that in a very unlucky case, 0.9.0 can increase the latency of starting a job. It is a rare case, and if you have a busy cluster, it is very unlikely to happen. However, if you want to be 100% sure everything is executed as quickly as possible in every situation, please roll back to 0.8.3. That version of the controller may create and delete more ephemeral runners than 0.9.0, but it ensures that the runner is created as soon as possible. In the next release (0.9.2), the controller should be able to pick up everything right away and decrease the number of created pods.
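For anyone following this advice, a minimal sketch of pinning the controller chart to an older release with Helm could look like the following (the release name `arc` and namespace `arc-systems` follow the quickstart docs and are assumptions; adjust them to your installation):

```bash
# Roll the controller chart back to 0.9.0 (or 0.8.3 if you prefer that trade-off).
helm upgrade --install arc \
  --namespace arc-systems \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.9.0
```

Keeping the gha-runner-scale-set chart(s) for your runner scale sets on the same version as the controller is generally recommended.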
If we could set the waiting time for pods in the gha-runner-scale-set-controller values.yaml, that would be great :)
Hey @ywang-psee, We don't have a waiting time; it was a bug introduced in 0.9.1 😞
Think we are having the same problem. Also using Karpenter here, and we actually saw the listener create the appropriate EphemeralRunnerSet, then EphemeralRunners started being created, then the actual pods, but they all got killed off very quickly and it was back to 0 before the queued jobs could be picked up. We have reverted to 0.9.0 and things seem to be working again.
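A rough way to observe this behaviour in your own cluster (a sketch, assuming the quickstart `arc-runners` namespace) is to watch the ARC custom resources and the runner pods while a job is queued:

```bash
# Watch the EphemeralRunnerSet replica count and the EphemeralRunners being
# created and then removed while the runner pods are still Pending.
kubectl get ephemeralrunnersets,ephemeralrunners -n arc-runners -w

# In another terminal, watch the runner pods themselves.
kubectl get pods -n arc-runners -w
```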
Yep, same here, we reverted to 0.9.0 as well!
@nikola-jokic Hi, do you know when release 0.9.2 will happen?
Hey @chenghuang-mdsol, I cannot promise anything, but hopefully later this week or next week.
I need to dig into it, but "rolling back" to 0.8.3 results in chart diff errors with regard to the clusterrole and the namespace it's in. Chart behavior must have changed with regard to where accounts are created (which namespace). I set the namespaces aligned with the quick start docs (arc-runners and arc-systems) - do most people just use kube-system? We are still testing, so I can blow it all away, but just wondering. Also for removal - will
Hey everyone 👋 Would anyone be interested in testing out this fix before we release it, please? 🙏 To do it, you can follow these steps:
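(The exact steps from the original comment are not reproduced here. Purely as a hypothetical illustration, testing a pre-release controller build usually comes down to installing the controller chart with the image tag overridden to the canary build the maintainers point to; the placeholder `<canary-tag>` below stands in for whatever tag was given in the original instructions:)

```bash
# Hypothetical sketch only -- substitute the chart reference and image tag
# from the maintainers' actual instructions.
helm upgrade --install arc \
  --namespace arc-systems \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --set image.tag=<canary-tag>
```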
Thank you in advance! ❤️
Thanks @nikola-jokic
It took roughly 70-80 seconds for the node to be ready and for the runner pod to start running, and the controller tolerated the start time.
However, one thing I want to call out is that when I first tried to install this canary image and chart, the listener failed to boot up with an error about a missing role.
Maybe it was just a timing issue where the role/rolebinding got created after the controller, or just a one-off blip; I couldn't reproduce it after all.
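As a quick sanity check for this kind of failure (a sketch, assuming the quickstart `arc-systems` namespace), you can confirm that the listener's RBAC objects exist and inspect the listener pod for related errors:

```bash
# Verify that the Role/RoleBinding for the listener were created.
kubectl get role,rolebinding -n arc-systems

# Find the listener pod and check its events and logs for RBAC errors.
kubectl get pods -n arc-systems
kubectl describe pod <listener-pod-name> -n arc-systems
kubectl logs <listener-pod-name> -n arc-systems
```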
Hi @nikola-jokic, I have the same issue in 0.9.3.
EKS 1.27 ... the setup is dind.
Checks
Controller Version
0.9.1
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
We are using Karpenter to dynamically provision worker nodes for our pipelines. It takes around 2 minutes to provision a node and to start the runner pod on it. However, it seems that the controller attempts to scale the EphemeralRunnerSet down after 1 minute, while the pod is still in the Pending status. This leads to the pipeline being stuck.
Might be related to #3420 and #3426
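For context, a minimal sketch of the kind of runner scale set values file involved here, with runner pods pinned to dynamically provisioned nodes, might look like this (the `node-pool: gha-runners` label is a hypothetical example of a label applied by a Karpenter NodePool; the container section mirrors the chart's default runner template):

```yaml
# values.yaml for the gha-runner-scale-set chart (illustrative sketch)
githubConfigUrl: https://github.com/<org>/<repo>
githubConfigSecret: <pre-defined-secret>
template:
  spec:
    nodeSelector:
      node-pool: gha-runners   # hypothetical label; matching nodes are provisioned on demand by Karpenter
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
```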
Describe the expected behavior
The controller does not scale the EphemeralRunnerSet down after 1 minute of the runner pod being in the Pending state.
Additional Context
Controller Logs
Runner Pod Logs