Pod stuck on CrashLoopBackOff state if runner registration happens after token is expired #1295
Comments
I'm seeing something similar, too. Set up a new cluster using Controller version 0.22.1. When demand is idle, I see many (or all) pods go into CrashLoopBackOff state until there is sufficient demand to scale up beyond the crashy pods (which don't seem to recover?)

State at idle:
After loading up a bunch of jobs:
Logs from a
To be clear @nehalkpatel, are you talking about the same situation of runner registration happening after the runner registration token has expired? This issue isn't for general "I have runners getting into a CrashLoopBackOff state" cases.
I'm not entirely sure what the cause is. Perhaps it is an issue with the token expiring and the runner no longer being able to authenticate. I've reverted to controller version 0.17.0 and things seem a bit more stable (though k8s did complain about deprecated APIs).
I think this might be a bug in the runner-replicaset controller, which should be responsible for recreating the runner and runner pod as the token approaches its expiration date. In 0.21.x, this was the responsibility of the runner controller: once it detected that a runner token was close enough to expiration, it recreated the runner pod with the same name and the updated registration token. This also triggered a race condition that sometimes resulted in a workflow job being stuck pending forever. As part of the fix made in 0.22.0, I moved most of the runner pod management logic to the runner-replicaset controller (and its library code). Almost certainly I missed moving the runner token update logic to the new place.
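To make the pre-0.22.0 behavior concrete, here is a minimal sketch of that kind of reconcile-time check; the names (`runner`, `needsRecreation`, `tokenExpirationGracePeriod`) and the grace period are illustrative assumptions, not ARC's actual types or values:

```go
// Hypothetical sketch of the pre-0.22.0 behavior described above: a
// controller recreates a runner pod before its registration token expires.
// Names, types, and the grace period are illustrative, not ARC's real code.
package main

import (
	"fmt"
	"time"
)

// runner models only the fields this sketch needs.
type runner struct {
	Name           string
	TokenExpiresAt time.Time
}

// tokenExpirationGracePeriod is how far ahead of expiry we act (assumed value).
const tokenExpirationGracePeriod = 15 * time.Minute

// needsRecreation reports whether the runner pod should be recreated
// with a fresh registration token.
func needsRecreation(r runner, now time.Time) bool {
	return now.Add(tokenExpirationGracePeriod).After(r.TokenExpiresAt)
}

func main() {
	r := runner{Name: "example-runner", TokenExpiresAt: time.Now().Add(10 * time.Minute)}
	if needsRecreation(r, time.Now()) {
		// In the controller this is where the pod would be deleted and
		// recreated under the same name with an updated registration token.
		fmt.Printf("recreating pod for %s with a fresh registration token\n", r.Name)
	}
}
```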
@nehalkpatel Just to be extra clear, did downgrading to 0.17.0 completely resolve this specific issue?
My current theory is that this has been broken since the start, but the runner controller's ability to restart runner pods on token expiration was silently fixing it, so we never noticed. Since 0.22.0 it doesn't automatically restart a runner pod. Assuming a runner can keep running forever without issues once it's successfully registered, we don't need to change ARC so that the controller recreates the runner pod on token expiration. Instead, we should just change the hard-coded startup timeout to something more practical. I thought 3 minutes should be enough, but apparently not.
That suggests the runner software itself doesn't handle this scenario and should be fixed on their end too, so the pod can be gracefully terminated by the process exiting rather than crashing. Would you mind raising an issue in actions/runner and referencing this issue?
@s4nji To be extra sure, can you try to reproduce the issue with older versions of ARC (like 0.21.x)?
@s4nji Also, what does the pod log look like for the pod in CrashLoopBackOff in your case?
@mumoshu CrashLoopBackOff pod logs:
Yeah, so it looks like they aren't handling the scenario in the runner software, so at least part of the fix involves GitHub making some changes on their end to handle the exception so that, at an absolute minimum, a helpful message is printed. Please raise an issue in actions/runner for that, and feel free to reference this issue in it @s4nji 🙏
@mumoshu @toast-gear
@mumoshu - yes, downgrading to 0.17.0 does seem to have addressed the issue (no
@s4nji could you try v0.22.2 please, as this release contains @mumoshu's startup timeout fix.
@nehalkpatel could you raise a new issue for the other failing-to-scale-down problems with full details (versions, yaml snippets, logs, kubectl describes, etc.)? As a first step, we ask that you upgrade to the latest version before raising the issue.
@toast-gear yes, we are planning to update to
I think this was due to a mismatch across our clusters with the ARC version. Once I rebuilt the new cluster, deleted the old one, and used a consistent version of ARC, I'm no longer seeing scale-down issues.
Downgrading to 0.17.0 and setting
@s4nji any updates? v0.22.3 is out now with other fixes, could you upgrade and let us know if this issue is now resolved?
@toast-gear we have upgraded to

I think the current fix is sufficient (if your pod takes longer than 30 minutes to start, you have other problems), but it would perhaps be better if the controller could assign a new valid registration token to runners/pods with expired tokens that are still in
That would be nice on paper, however it's not really doable with the new architecture, tbh. We now rely solely on the mutating webhook to inject registration tokens. The mutating webhook isn't a regular K8s controller that works like "check if the pod spec contains an expired token in an env var and update it". It works more like "the pod is being updated/created for whatever reason; I'm going to inject the token, but I don't care about other fields or the lifecycle of the pod".
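To illustrate the webhook model described here, a rough sketch of the kind of JSON patch a token-injecting mutating webhook returns; the function name, env var name, and patch path are assumptions for illustration, not ARC's actual implementation:

```go
// Rough sketch of what a token-injecting mutating webhook does conceptually:
// it only sees the pod being created/updated and returns a patch adding a
// registration token env var. It does not watch pods afterwards, so an
// already-running pod with an expired token is never updated.
// Names and the patch path are illustrative, not ARC's real code.
package main

import (
	"encoding/json"
	"fmt"
)

type envVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// jsonPatchOp is a single RFC 6902 JSON patch operation.
type jsonPatchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// registrationTokenPatch builds the patch a webhook would return to inject
// the token into the first container of the incoming pod.
func registrationTokenPatch(token string) ([]byte, error) {
	patch := []jsonPatchOp{{
		Op:    "add",
		Path:  "/spec/containers/0/env/-",
		Value: envVar{Name: "RUNNER_TOKEN", Value: token}, // env var name assumed
	}}
	return json.Marshal(patch)
}

func main() {
	p, err := registrationTokenPatch("example-registration-token")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(p))
}
```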
I'm going to close this off seeing as we've resolved the core problem.
Hi @mumoshu @toast-gear, I still see the same "token expired" and pod CrashLoopBackOff error in ARC version 0.23.0. I installed ARC version 0.20 in Dec 2021. Everything was working fine till May 13, 2022, when pods got stuck in the CrashLoopBackOff state. I saw the following error messages in the pod log.
Then I uninstalled the actions-runner-controller helm release, deleted all the relevant summerwind CRDs, and installed ARC version 0.23.0. However, the same error still exists. I know this solution requires GHES version >= 3.3.0; in our case, it was working fine on GHES 3.2 until last week. Any help would be very much appreciated.

Environment
@bl02 Hey! If the issue still persists after reinstalling ARC and all the runner pods have been recreated, it's more likely that something has gone wrong in your GHES instance.
@bl02 Also, you'd better ask GitHub support as well. If it reproduces after recreating runner pods, it's more likely it's not specific to ARC or K8s.
Just curious, but how did you confirm you actually installed ARC 0.23.0? Can you share the relevant part of your values.yaml? |
Thanks for your fast response. Yes, all the runner pods were removed successfully before I reinstalled ARC. I also installed ARC 0.23 in a fresh new K8s cluster and got the same error. I just wonder why GHES 3.2 was working fine with ARC before. Our GHES instance hasn't been updated recently.
@bl02 Thanks. ARC works by calling some GitHub API to obtain a registration token that is then passed to each runner pod so that the runner can register itself. I don't know much about how a real GHES instance is deployed. But anyway... do you manage a VM or a baremetal machine to run your GHES? Are you sure the system clock of your GHES machine is not skewed a lot?
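For reference, a minimal sketch of the kind of API call involved, using the google/go-github client; the owner/repo values are placeholders, and authentication and GHES base URL handling are omitted for brevity:

```go
// Minimal sketch of obtaining a runner registration token via the GitHub API,
// which is the token ARC passes to each runner pod. Owner/repo are
// placeholders; a real setup needs authentication (PAT or GitHub App) and,
// for GHES, a client pointed at the instance's API base URL.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/go-github/v50/github" // version pinned for illustration
)

func main() {
	ctx := context.Background()
	client := github.NewClient(nil) // unauthenticated; for illustration only

	tok, _, err := client.Actions.CreateRegistrationToken(ctx, "my-org", "my-repo")
	if err != nil {
		log.Fatal(err)
	}

	// The token is only valid for a limited time; if the runner does not
	// register before ExpiresAt, registration fails, which is the failure
	// mode discussed in this issue.
	fmt.Printf("token expires at %s\n", tok.GetExpiresAt())
}
```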
I executed "helm repo update" before reinstalling. Now in helm list, I can see the APP VERSION is 0.23.0
If I describe the deployment actions-runner-controller, I can see the following information:
@bl02 Thanks. Could you also make sure that you don't have
@mumoshu I don't have
@bl02 Thanks. Fine! Then it's even more likely that something went wrong in your GHES instance.
@MichaelSp Hey! Unfortunately, none from my end. Have you already asked GitHub support about that? The only possible reason I can come up with is that your GHES instance is returning an outdated registration token in the first place, which shouldn't happen and can't be handled by ARC.
@MichaelSp we're removing support for the
@MichaelSp Ah thanks! That makes sense. We recently made
// BTW, unfortunately, this turned out to be an umbrella issue of 3 different issues. This happens so often, and that's why I enabled the lock app on this repo (https://github.com/actions-runner-controller/actions-runner-controller/blob/master/.github/lock.yml) so that people are encouraged to open dedicated issues (adding links to "similar" issues is very helpful, though). But apparently, the lock app isn't working as expected? 🤔
Describe the bug
If runner registration happens after the runner registration token is expired, it fails repeatedly, enters `CrashLoopBackOff` state for an indefinite period, and never gets removed or updated by the controller.

To Reproduce
Currently we see this happening due to the time between pod creation and runner registration exceeding the 3-minute window defined in the controller: a pod is created with a token that expires in just slightly over 3 minutes, and the token is used for registration only after it has expired.
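To make the timing race explicit, a small sketch under the assumption of the durations described above (the concrete values are illustrative):

```go
// Sketch of the timing race: a pod is created with a token that is still
// valid, but the delay before the runner registers (simulated here with
// STARTUP_DELAY_IN_SECONDS) outlives the token. Durations are illustrative.
package main

import (
	"fmt"
	"time"
)

func main() {
	podCreated := time.Now()
	tokenExpiresAt := podCreated.Add(3*time.Minute + 10*time.Second) // just over 3 minutes of validity left
	startupDelay := 6 * time.Minute                                  // STARTUP_DELAY_IN_SECONDS=360

	registrationAt := podCreated.Add(startupDelay)
	if registrationAt.After(tokenExpiresAt) {
		fmt.Println("registration happens after token expiry -> runner fails to register and the pod crash-loops")
	}
}
```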
One way to simulate the delay between `RegistrationTokenUpdated` and runner registration is to set `STARTUP_DELAY_IN_SECONDS` to above 3 minutes.

Steps to reproduce the behavior:
With the `STARTUP_DELAY_IN_SECONDS` value set to `360` (6 minutes), let the controller create a runner resource and pod of it.

Expected behavior
Runner / Pods with expired registration token should be assigned a new token or be removed.
Environment
- Controller version: 0.22.0
- Deployment method: Helm
- Chart version: 0.17.0
Additional info
This also seems to affect `HorizontalRunnerAutoscaler` with the `PercentageRunnersBusy` strategy; the crashing pods seem to be counted as running, non-busy pods.

When enough pods enter `CrashLoopBackOff` state and accumulate (enough to go below `scaleDownThreshold`), it triggers scale-down repeatedly, removing the (finished) healthy pods and keeping the crashing pods until the minimum number of runners is reached, making scale-up impossible until the failing pods are manually removed.
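As a rough illustration of why this skews autoscaling, a toy model where crash-looping runners are counted as registered but idle; the thresholds and scaling rule here are simplified assumptions, not ARC's exact `PercentageRunnersBusy` algorithm:

```go
// Toy model of a PercentageRunnersBusy-style decision. Crash-looping runners
// that still count as "running but not busy" drag the busy ratio down, which
// keeps triggering scale-down even though every healthy runner is busy.
// The threshold and rule are simplified, not ARC's exact algorithm.
package main

import "fmt"

func decide(busy, healthy, crashLooping int, scaleDownThreshold float64) string {
	total := healthy + crashLooping
	ratio := float64(busy) / float64(total)
	if ratio < scaleDownThreshold {
		return fmt.Sprintf("busy ratio %.2f < %.2f -> scale down", ratio, scaleDownThreshold)
	}
	return fmt.Sprintf("busy ratio %.2f -> keep or scale up", ratio)
}

func main() {
	// 4 healthy runners, all busy, plus 12 crash-looping runners counted as idle.
	fmt.Println(decide(4, 4, 12, 0.3))
}
```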