Pods stuck in terminating state #1369

Closed
larhauga opened this issue Apr 21, 2022 · 19 comments · Fixed by #1420
Labels
bug Something isn't working

Comments

@larhauga
Contributor

We are experiencing that some runner pods get stuck in a terminating state.
The pod is still registered as a runner in GitHub, and nothing changes if the runner is deleted forcefully via the GHA API.
The actions-runner-controller is continuously logging: actions-runner-controller.runnerpod Runner pod is annotated to wait for completion.

This seems to happen when a node hosting a runner is deleted.

We are running spot nodes and cluster-autoscaler, which seems to make this issue a bit more apparent.

The last log event for the runner pod is: https://github.com/actions-runner-controller/actions-runner-controller/blob/e7200f274d592729b46848218c1d9c54214065c9/controllers/runner_graceful_stop.go#L115

As the pod has an error exit code (127), we think that a forceful delete of the runner is needed. https://github.com/actions-runner-controller/actions-runner-controller/blob/af8d8f7e1da4b32d837428f013b7b68510347343/controllers/runner_graceful_stop.go#L115 will only requeue the failed reconcile and, in this case, never delete the pod.

unregisterRunner should probably be run if the exit code is != 0; see the sketch after the pod description below.

Name:                      runner
Namespace:                 monorepo
Start Time:                Wed, 20 Apr 2022 14:17:24 +0200
Labels:                    actions-runner-controller/inject-registration-token=true
                           pod-template-hash=7cb656579c
                           runner-deployment-name=runner
                           runner-template-hash=54d74c6f69
                           runnerset-name=runner-f87wk-wr4qk
Annotations:               actions-runner-controller/token-expires-at: 2022-04-20T12:43:57Z
                           actions-runner/id: 115340
                           actions-runner/runner-completion-wait-start-timestamp: 2022-04-20T12:17:56Z
                           actions-runner/unregistration-request-timestamp: 2022-04-20T12:14:13Z
                           actions-runner/unregistration-start-timestamp: 2022-04-20T12:17:55Z
                           sync-time: 2022-04-20T11:57:34Z
Status:                    Terminating (lasts 138m)
Termination Grace Period:  0s
Controlled By:  Runner/runner-f87wk-wr4qk
Containers:
  runner:
    State:          Waiting
      Reason:       ContainerCreating
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
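
For illustration, here is a minimal sketch of that suggestion (not the actual ARC code); the exit-code lookup is written out against the core/v1 types, and the unregister callback is a stand-in for ARC's internal unregistration helper:

```go
package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// runnerContainerExitCode returns the exit code of the "runner" container,
// checking both the current and the last-known terminated state.
func runnerContainerExitCode(pod *corev1.Pod) *int32 {
	for _, s := range pod.Status.ContainerStatuses {
		if s.Name != "runner" {
			continue
		}
		if t := s.State.Terminated; t != nil {
			return &t.ExitCode
		}
		if t := s.LastTerminationState.Terminated; t != nil {
			return &t.ExitCode
		}
	}
	return nil
}

// handleStoppedRunner sketches the suggestion above: on a non-zero exit code,
// unregister the runner from GitHub instead of only requeueing the reconcile.
// The unregister callback is a placeholder for ARC's internal helper.
func handleStoppedRunner(ctx context.Context, pod *corev1.Pod, unregister func(context.Context, *corev1.Pod) error) error {
	if code := runnerContainerExitCode(pod); code != nil && *code != 0 {
		fmt.Printf("runner container exited with code %d; unregistering\n", *code)
		return unregister(ctx, pod)
	}
	return nil
}
```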

@toast-gear
Collaborator

ARC version?

@larhauga
Contributor Author

larhauga commented Apr 21, 2022

ARC version?

v0.22.3 on EKS 1.21 with Bottlerocket 1.7.1, runner v2.290.1

@rahul-FLS

Hey @larhauga, just wondering: are you using Vault for managing controller secrets in EKS?

@mumoshu
Collaborator

mumoshu commented Apr 22, 2022

@larhauga Hey. Do you use spot instances? Also, what kind of node groups do you use, unmanaged or managed?

@larhauga
Contributor Author

@larhauga Hey. Do you use spot instances? Also, what kind of node groups do you use, unmanaged or managed?

@mumoshu yes we do, and cluster-autoscaler. Managed node group with Bottlerocket. We are still working on fine-tuning this, and there may be other causes for nodes to disappear. So I think there is a need for a bugfix here to ensure that terminated pods are removed from GHA and that the finalizer is removed.

@navi86

navi86 commented Apr 25, 2022

I have almost the same issue on a k8s cluster in AWS. However, the pod is not stuck completely; it is removed after several errors (after 4 attempts) because "HorizontalRunnerAutoscaler" has an updated value.

@bkrugerp99

We have runners that end up stuck in Terminating as well, but the underlying nodes have disappeared. There are no errors in the pod description, and logs aren't available as the node is gone.

Logs just show this:
2022-04-25T20:23:29Z INFO actions-runner-controller.runnerpod Runner pod is annotated to wait for completion {"runnerpod": "gha-runners/cpu-small-jljfk-bgd28"}

Otherwise, it's continually trying to check a pod on a node that's gone.

Perhaps we could get a default timeout to just reap it if it's been stuck for more than X minutes?
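
A rough sketch of what such a timeout could look like, assuming it keys off the pod's deletionTimestamp; the 10-minute threshold is only an example value, not an ARC setting:

```go
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// stuckTerminationTimeout is an arbitrary example threshold, not an ARC setting.
const stuckTerminationTimeout = 10 * time.Minute

// terminationTimedOut reports whether a pod has been Terminating for longer
// than the threshold, i.e. it is probably safe to stop waiting for completion.
func terminationTimedOut(pod *corev1.Pod, now time.Time) bool {
	if pod.DeletionTimestamp == nil {
		return false // the pod is not being deleted at all
	}
	return now.Sub(pod.DeletionTimestamp.Time) > stuckTerminationTimeout
}
```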

toast-gear added the bug ("Something isn't working") label Apr 25, 2022
@mumoshu
Collaborator

mumoshu commented Apr 26, 2022

ensure that terminated pods are removed from GHA

Unfortunately, this seems impossible, because the GitHub Actions API seems to block unregistration requests against disappeared runners.

the finalizer is removed.

This sounds like a good idea! But I thought we had already fixed this in v0.22.3 so that the finalizer is removed when the pod disappears.

Perhaps we can get a default timeout to just reap it if it's been stuck for more than X minutes ?

Yeah, maybe. A timeout since the pod disappeared would work.
But again, we already have another mechanism to handle disappeared pods, so I'm now wondering why it isn't working for you and @larhauga.

@mumoshu
Collaborator

mumoshu commented Apr 26, 2022

I have almost the same issue on a k8s cluster in AWS. However, the pod is not stuck completely; it is removed after several errors (after 4 attempts) because "HorizontalRunnerAutoscaler" has an updated value.

@navi86 How long does each completed runner stay stuck in Terminating? If it successfully terminates after HRA scales it down, it seems to be working correctly. What other people are reporting seems to be more about the pod never getting removed, even after HRA scaled it down.

@mumoshu
Collaborator

mumoshu commented Apr 26, 2022

@larhauga I thought we had already fixed this in v0.22.3 via the change b09c540, but apparently not?

I'd appreciate it if you could share the output of kubectl get po -o yaml $RUNNER_POD for the stuck pod the next time you see it, so that I can perhaps isolate the cause.

To be extra clear, you mentioned L115 of runner_graceful_stop.go in your report:

https://github.com/actions-runner-controller/actions-runner-controller/blob/e7200f274d592729b46848218c1d9c54214065c9/controllers/runner_graceful_stop.go#L115-L118

But since v0.22.3, the earlier if block should handle it via the updated runnerPodContainerIsStopped func:

https://github.com/actions-runner-controller/actions-runner-controller/blob/e7200f274d592729b46848218c1d9c54214065c9/controllers/runner_graceful_stop.go#L107-L114

@mumoshu
Collaborator

mumoshu commented Apr 26, 2022

@larhauga I'd still appreciate your verification, but I've come to think that your pod disappeared while Waiting, even before it was created and started running.

The relevant part of the kubectl describe po output you've shared is below, which seems to indicate that pod.Status.State.Waiting is set:

Containers:
  runner:
    State:          Waiting
      Reason:       ContainerCreating

And pod.Status.LastTerminationState.Terminated:

    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

That explains it for me, because the fix made in v0.22.3 only addresses the case where the pod has already Terminated in terms of pod.Status.State.Terminated, not pod.Status.LastTerminationState.Terminated.

I wonder why pod.Status.State.Waiting is populated, because the runner pod should have restartPolicy: Never; it should never have rescheduled a failed pod anywhere and should instead have kept pod.Status.State.Terminated as is.

Perhaps K8s automatically retries scheduling regardless of restartPolicy, treating it as not being a restart, when the pod failed while Waiting? 🤔
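
For reference, this is roughly how that status combination looks when read via the core/v1 types (a sketch, not ARC code): the runner container is Waiting while LastTerminationState still carries the old Terminated record.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// containerWaitingAfterTermination reports whether the "runner" container is
// currently Waiting (e.g. ContainerCreating) while its LastTerminationState
// still carries a Terminated record from the vanished node.
func containerWaitingAfterTermination(pod *corev1.Pod) bool {
	for _, s := range pod.Status.ContainerStatuses {
		if s.Name == "runner" {
			return s.State.Waiting != nil && s.LastTerminationState.Terminated != nil
		}
	}
	return false
}
```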

@mumoshu
Collaborator

mumoshu commented Apr 28, 2022

It turned out ARC is setting restartPolicy=OnFailure rather than Never:

https://github.com/actions-runner-controller/actions-runner-controller/blob/4053ab3e113e2cbc06d98319cee8a9ea797e2d76/controllers/runner_controller.go#L632-L634

And apparently, kubelet keeps the pod Running in this case:

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1383

https://github.com/kubernetes/kubernetes/blob/537941765fe1304dd096c1a2d4d4e70f10768218/pkg/kubelet/kubelet_pods.go#L1407-L1425

That explains why it didn't short-circuit as coded in:

https://github.com/actions-runner-controller/actions-runner-controller/blob/e7200f274d592729b46848218c1d9c54214065c9/controllers/runner_graceful_stop.go#L107-L114

Because the runnerPodOrContainerIsStopped function considers the pod to have stopped only when the pod is Succeeded or Failed (not Running, as it is in this case), or the runner container has state.terminated != nil (not state.waiting != nil, as it is in this case).

That said, probably the simplest fix for this issue would be to let ARC create runner pods with RestartPolicy=Never, so that newly created runner pods won't encounter this problem even when the node disappears suddenly (e.g. because it was hosted on an AWS EC2 spot instance).
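
A minimal sketch of that direction, assuming the pod spec is built in Go as in runner_controller.go; the image name is a placeholder, not ARC's actual default:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// newRunnerPodSpec sketches the proposed fix: create runner pods with
// RestartPolicy "Never" so a container lost together with its node stays
// Terminated instead of re-entering Waiting/ContainerCreating.
func newRunnerPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		RestartPolicy: corev1.RestartPolicyNever,
		Containers: []corev1.Container{
			{
				Name:  "runner",
				Image: "example.com/actions-runner:latest", // placeholder image
			},
		},
	}
}
```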

@mumoshu
Collaborator

mumoshu commented Apr 29, 2022

@larhauga @bkrugerp99 Let me confirm to be extra sure: all those pods stuck in Terminating lack corresponding Node objects in your cluster, right?

Or is it that the Node objects are there, but the underlying machines (like EC2 instances) are gone, hence the Node objects are stuck in some erroneous status (like Unknown) too?

@mumoshu
Collaborator

mumoshu commented Apr 29, 2022

You can kubectl get po -o yaml $RUNNER_POD_NAME and see what spec.nodeName points to. If it points to a nonexistent node, we could enhance ARC to check whether the pod is tied to a disappeared (hence nonexistent) node, so that it can safely treat the pod as already stopped and never coming back.
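
A sketch of that node-existence check using the controller-runtime client (illustrative only, not ARC's implementation):

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// podNodeIsGone reports whether the pod was scheduled to a node whose Node
// object no longer exists, i.e. the pod can never start again.
func podNodeIsGone(ctx context.Context, c client.Client, pod *corev1.Pod) (bool, error) {
	if pod.Spec.NodeName == "" {
		return false, nil // never scheduled, so there is no node to be gone
	}
	var node corev1.Node
	err := c.Get(ctx, types.NamespacedName{Name: pod.Spec.NodeName}, &node)
	if apierrors.IsNotFound(err) {
		return true, nil // the Node object has disappeared
	}
	return false, err
}
```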

If the node still exists, the best we can do would be #1395.

@mumoshu
Collaborator

mumoshu commented Apr 29, 2022

FWIW, this could also be considered a K8s bug, as the pod is stuck in ContainerCreating even though it's never going to be successfully created.

@bkrugerp99

@mumoshu In our case, the underlying node is already gone. When we null out the finalizers, cleanup happens fine.

mumoshu added a commit that referenced this issue May 9, 2022
This is intended to fix #1369, mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might also work for RunnerDeployment in cases where #1395 does not work, such as when the user explicitly sets the runner pod restart policy to anything other than "Never".

Ref #1369
@Fred78290

@mumoshu
Hello, I've already seen this problem, but in a different context.

After some analysis, I think the problem occurs because the finalizer logic can't complete successfully (exit 0) for some internal reason (the runner is marked offline, or some other reason).

So a GitHub API call fails and the finalizer fails too.

I also saw that you declare a finalizer for the Runner custom resource; it's probably better to use a watcher to delete associated resources when the custom resource is deleted.

So when the controller dies for an external reason, the finalizer handling is gone from K8s; after that, resources stay stuck in a terminating state when you delete them. Probably a bug in k8s.
The solution is to clear the finalizer in the metadata field.
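
For anyone needing that workaround, "clear the finalizer" can be sketched with controller-runtime's helper as below; the finalizer name is a placeholder, not the one ARC actually uses:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// clearRunnerFinalizer removes a (placeholder) finalizer from a stuck pod so
// that Kubernetes can finish deleting it. Equivalent to editing the pod's
// metadata.finalizers field by hand.
func clearRunnerFinalizer(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	controllerutil.RemoveFinalizer(pod, "example.com/runner-finalizer") // placeholder name
	return c.Update(ctx, pod)
}
```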

@mumoshu
Collaborator

mumoshu commented May 9, 2022

@Fred78290 Thanks. We already leverage the "watch" facility to react to any changes in the underlying resources, so that the controller can handle them and remove the finalizer ASAP if needed. Also, we already have logic to not block on our custom finalizer in certain cases, which results in "clearing the finalizer" as you said. It's probably missing some edge cases though.

In your case, was the pod still present and stuck in Terminating, or was it gone entirely? That's the important point, as far as I can tell.

toast-gear pushed a commit that referenced this issue May 12, 2022

This is intended to fix #1369, mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might also work for RunnerDeployment in cases where #1395 does not work, such as when the user explicitly sets the runner pod restart policy to anything other than "Never".

Ref #1369
mumoshu added a commit that referenced this issue May 12, 2022
mumoshu added a commit that referenced this issue May 12, 2022
mumoshu added a commit that referenced this issue May 12, 2022
toast-gear pushed a commit that referenced this issue May 14, 2022
… in Terminating when the container disappeared (#1395)

Ref #1369