Pods stuck in terminating state #1369
Comments
ARC version?
v0.22.3 on EKS 1.21 with Bottlerocket 1.7.1; runner v2.290.1.
Hey @larhauga, just wondering: are you using Vault for managing controller secrets in EKS?
@larhauga Hey. Do you use spot instances? Also, what kind of node groups do you use, unmanaged or managed?
@mumoshu Yes we do, and cluster-autoscaler. Managed node group with Bottlerocket. We are still working on fine-tuning this, but there may be other causes for nodes to disappear. So I think there is a need for a bugfix here to ensure that terminated pods are removed from GHA and that the finalizer is removed.
I have almost the same issue on a k8s cluster in AWS. However, the pod is not stuck completely; it is removed after several errors (after 4 attempts) because the "HorizontalRunnerAutoscaler" has an updated value.
We have runners that end up stuck in termination as well, but the underlying nodes have disappeared. There are no errors in the pod description, and logs aren't available as the node is gone. The controller logs just show this:
It's continually trying to check a pod on a node that's gone. Perhaps we could get a default timeout to just reap it if it's been stuck for more than X minutes?
Unfortunately, this seems impossible, because the GitHub Actions API seems to block unregistration requests against runners that have disappeared.
This sounds like a good idea! But I thought we already fixed it in v0.22.3 to remove the finalizer when the pod disappeared.
Yeah, maybe. A timeout since the pod disappeared would work.
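For illustration, a minimal sketch of what such a timeout-based reaping could look like in a controller-runtime reconciler. The finalizer name and the 10-minute threshold are placeholders, not values from ARC's actual code:

```go
package controllers

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical values for illustration only; not taken from ARC.
const (
	runnerPodFinalizer      = "example.com/runner-pod"
	stuckTerminationTimeout = 10 * time.Minute
)

// reapIfStuck clears our finalizer from a runner pod that has been stuck
// terminating for longer than the threshold, letting Kubernetes finish
// deleting it even though the node (and runner) is already gone.
func reapIfStuck(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	if pod.DeletionTimestamp == nil {
		return nil // pod is not being deleted
	}
	if time.Since(pod.DeletionTimestamp.Time) < stuckTerminationTimeout {
		return nil // still within the grace window; keep waiting
	}
	var kept []string
	for _, f := range pod.Finalizers {
		if f != runnerPodFinalizer {
			kept = append(kept, f)
		}
	}
	if len(kept) == len(pod.Finalizers) {
		return nil // our finalizer is not present; nothing to do
	}
	pod.Finalizers = kept
	return c.Update(ctx, pod)
}
```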
@navi86 How long does each completed runner stay stuck terminating? If it successfully terminates after HRA scales it down, it seems to be working correctly. What other people are reporting seems to be more about the pod never getting removed even after HRA has scaled it down.
@larhauga I thought we had already fixed this in v0.22.3 via the change b09c540, but apparently not? I'd appreciate it if you could share the output from
To be extra clear, you mentioned L115 of runner_graceful_stop.go in your report:
But since v0.22.3 the earlier if block should handle it via the updated
@larhauga I still appreciate your verification, but I've come to think that your pod had disappeared while Waiting, even before it was created and started running. The relevant part from
And
That explains it for me, because the fix made in v0.22.3 only addresses the case when it has already Terminated in terms of
I wonder why
Perhaps K8s automatically retries scheduling regardless of
It turned out ARC is setting
And apparently, kubelet keeps the pod
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1383
That explains why it didn't short-circuit as coded in:
Because the
That said, probably the simplest fix to this issue would be to let ARC create runner pods with
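The comment above cuts off before naming the exact field, so the following is only one reading of it: a sketch of a runner pod spec with restartPolicy set to Never, under which the kubelet does not restart an exited runner container (e.g. exit code 127), so the pod reaches a terminal phase instead of flapping. The field choice, container name, and image here are assumptions, not confirmed by this thread:

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// runnerPodSketch is illustrative only: the restartPolicy value is an
// assumption about the fix being discussed, not ARC's confirmed behavior.
func runnerPodSketch() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "example-runner"}, // hypothetical name
		Spec: corev1.PodSpec{
			// With Never, the kubelet does not restart an exited runner container,
			// so the pod reaches Succeeded/Failed and the controller can observe
			// a terminal state instead of a pod that keeps coming back.
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "runner",
				Image: "summerwind/actions-runner:latest", // illustrative image
			}},
		},
	}
}
```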
FWIW, linking jupyterhub/zero-to-jupyterhub-k8s#1962 (comment)
@larhauga @bkrugerp99 Let me confirm to be extra sure: do all those pods stuck in Terminating lack corresponding Node objects in your cluster? Or is it that the Node objects are there, but the underlying machines (like EC2 instances) are gone, and hence the Node objects are stuck in some erroneous status (like Unknown) too?
You can
If the node still exists, the best we can do would be #1395.
FWIW, this could also be considered a K8s bug, as the pod is stuck in
@mumoshu In our case here, the underlying node is already gone. When we null out the finalizers, cleanup happens fine.
@mumoshu After some analysis, I think the problem occurs because the finalizer can't exit cleanly for some internal reason (the runner is marked offline, or any other reason), so a GitHub API call fails and the finalizer fails too. I also saw that you declare a finalizer for the Runner custom resources; it's probably better to use a watcher to delete associated resources when the custom resource is deleted. When the controller dies for an external reason, nothing is left to process the finalizer in K8s, and after that the resources stay stuck in Terminating state when you delete them. Probably a bug in k8s.
@Fred78290 Thanks. We already leverage the "watch" facility to react to any changes in the underlying resources, so that the controller can handle and remove the finalizer ASAP if needed. Also, we already have logic to not block on our custom finalizer in certain cases, which results in "clearing the finalizer" as you said. It's probably missing some edge cases, though. In your case, was the pod still present and stuck in Terminating, or was it nonexistent altogether? That's the important point, as far as I can tell.
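As a generic sketch of the pattern described above (not ARC's actual code; the finalizer name and the unregister callback are hypothetical): when the pod is being deleted, attempt to unregister the runner, requeue on transient errors, and clear the finalizer once the runner is known to be gone rather than blocking forever:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const podFinalizer = "example.com/runner-pod" // hypothetical finalizer name

// reconcileDeletion sketches the usual non-blocking finalizer pattern:
// unregister the runner if possible, but clear the finalizer once the
// runner is known to be gone instead of blocking deletion forever.
func reconcileDeletion(ctx context.Context, c client.Client, pod *corev1.Pod,
	unregister func(context.Context) (gone bool, err error)) (ctrl.Result, error) {

	if pod.DeletionTimestamp == nil {
		return ctrl.Result{}, nil // not being deleted; nothing to clean up
	}
	gone, err := unregister(ctx)
	if err != nil && !gone {
		// Transient failure and the runner may still exist: retry later
		// instead of clearing the finalizer prematurely.
		return ctrl.Result{Requeue: true}, nil
	}
	// Runner unregistered (or already absent): stop blocking pod deletion.
	if controllerutil.RemoveFinalizer(pod, podFinalizer) {
		return ctrl.Result{}, c.Update(ctx, pod)
	}
	return ctrl.Result{}, nil
}
```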
We are experiencing some runner pods stuck in the Terminating state.
The pod is still registered as a runner in GitHub. Nothing changes if the runner is deleted forcefully via the GHA API.
The actions-runner-controller is continuously logging:
actions-runner-controller.runnerpod Runner pod is annotated to wait for completion
This seems to happen if a node is deleted while a runner is on it.
We are running spot nodes and cluster-autoscaler, which seems to make this issue a bit more apparent.
The last log event for the runner pod is: https://github.com/actions-runner-controller/actions-runner-controller/blob/e7200f274d592729b46848218c1d9c54214065c9/controllers/runner_graceful_stop.go#L115
As the pod has an error exit code (127), we think that a forceful delete of the runner is needed. https://github.com/actions-runner-controller/actions-runner-controller/blob/af8d8f7e1da4b32d837428f013b7b68510347343/controllers/runner_graceful_stop.go#L115 will only requeue the failed reconcile, and in this case never delete the pod.
unregisterRunner should probably be run if the exit code is != 0.
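To make that suggestion concrete, here is a sketch of what such a check might look like. The container name "runner", the helper name, and the unregister callback signature are hypothetical, not ARC's actual code:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// forceUnregisterIfFailed sketches the suggestion from this report: if the
// runner container terminated with a non-zero exit code (e.g. 127 when the
// node disappeared), unregister the runner from GitHub instead of only
// requeueing the reconcile and waiting for a completion that never comes.
// unregister stands in for ARC's runner-unregistration call; its shape here
// is an assumption.
func forceUnregisterIfFailed(ctx context.Context, pod *corev1.Pod,
	unregister func(context.Context) error) error {

	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != "runner" {
			continue
		}
		if t := cs.State.Terminated; t != nil && t.ExitCode != 0 {
			return unregister(ctx)
		}
	}
	return nil // runner container not terminated, or it exited cleanly
}
```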