Pods stuck in terminating state #1369

Closed
larhauga opened this issue Apr 21, 2022 · 19 comments · Fixed by #1420
Labels
bug Something isn't working

Comments

@larhauga
Contributor

We are experiencing that some runner pods get stuck in a terminating state.
The pod is still registered as a runner in GitHub, and nothing changes if the runner is deleted forcefully via the GHA API.
The actions-runner-controller is continuously logging: actions-runner-controller.runnerpod Runner pod is annotated to wait for completion.

This seems to happen when a node hosting a runner is deleted.

We are running spot nodes and cluster-autoscaler, which seems to make this issue a bit more apparent.

The last log event for the runner pod is: https://github.com/actions-runner-controller/actions-runner-controller/blob/e7200f274d592729b46848218c1d9c54214065c9/controllers/runner_graceful_stop.go#L115

As the pod has an error exit code (127), we think that a forceful delete of the runner is needed. https://github.com/actions-runner-controller/actions-runner-controller/blob/af8d8f7e1da4b32d837428f013b7b68510347343/controllers/runner_graceful_stop.go#L115 will only requeue the failed reconcile and, in this case, never delete the pod.

unregisterRunner should probably be run if the exit code is != 0; see the sketch after the pod description below.

Name:                      runner
Namespace:                 monorepo
Start Time:                Wed, 20 Apr 2022 14:17:24 +0200
Labels:                    actions-runner-controller/inject-registration-token=true
                           pod-template-hash=7cb656579c
                           runner-deployment-name=runner
                           runner-template-hash=54d74c6f69
                           runnerset-name=runner-f87wk-wr4qk
Annotations:               actions-runner-controller/token-expires-at: 2022-04-20T12:43:57Z
                           actions-runner/id: 115340
                           actions-runner/runner-completion-wait-start-timestamp: 2022-04-20T12:17:56Z
                           actions-runner/unregistration-request-timestamp: 2022-04-20T12:14:13Z
                           actions-runner/unregistration-start-timestamp: 2022-04-20T12:17:55Z
                           sync-time: 2022-04-20T11:57:34Z
Status:                    Terminating (lasts 138m)
Termination Grace Period:  0s
Controlled By:  Runner/runner-f87wk-wr4qk
Containers:
  runner:
    State:          Waiting
      Reason:       ContainerCreating
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
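
For illustration, here is a minimal sketch of that suggestion (not the actual ARC code); the exit-code lookup is written out against the core/v1 types, and the unregister callback is a stand-in for ARC's internal unregistration helper:

```go
package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// runnerContainerExitCode returns the exit code of the "runner" container,
// checking both the current and the last-known terminated state.
func runnerContainerExitCode(pod *corev1.Pod) *int32 {
	for _, s := range pod.Status.ContainerStatuses {
		if s.Name != "runner" {
			continue
		}
		if t := s.State.Terminated; t != nil {
			return &t.ExitCode
		}
		if t := s.LastTerminationState.Terminated; t != nil {
			return &t.ExitCode
		}
	}
	return nil
}

// handleStoppedRunner sketches the suggestion above: on a non-zero exit code,
// unregister the runner from GitHub instead of only requeueing the reconcile.
// The unregister callback is a placeholder for ARC's internal helper.
func handleStoppedRunner(ctx context.Context, pod *corev1.Pod, unregister func(context.Context, *corev1.Pod) error) error {
	if code := runnerContainerExitCode(pod); code != nil && *code != 0 {
		fmt.Printf("runner container exited with code %d; unregistering\n", *code)
		return unregister(ctx, pod)
	}
	return nil
}
```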

@toast-gear
Collaborator

ARC version?

@larhauga
Contributor Author

larhauga commented Apr 21, 2022

ARC version?

v0.22.3 on EKS 1.21 with Bottlerocket 1.7.1, runner v2.290.1

@rahul-FLS

Hey @larhauga, just wondering: are you using Vault for managing controller secrets in EKS?

@mumoshu
Collaborator

mumoshu commented Apr 22, 2022

@larhauga Hey. Do you use spot instances? Also, what kind of node groups do you use, unmanaged or managed?

@larhauga
Contributor Author

@larhauga Hey. Do you use spot instances? Also, what kind of node groups do you use, unmanaged or managed?

@mumoshu yes we do, and cluster-autoscaler. Managed node group with Bottlerocket. We are still working on fine-tuning this, and there may be other causes for nodes to disappear. So I think there is a need for a bugfix here to ensure that terminated pods are removed from GHA and that the finalizer is removed.

@navi86

navi86 commented Apr 25, 2022

I have almost the same issue on a k8s cluster in AWS. However, the pod is not stuck completely; it is removed after several errors (after 4 attempts) because "HorizontalRunnerAutoscaler" has an updated value.

@bkrugerp99

We have runners that end up stuck in Terminating as well, but the underlying nodes have disappeared. There are no errors in the pod description, and logs aren't available as the node is gone.

Logs just show this:
2022-04-25T20:23:29Z INFO actions-runner-controller.runnerpod Runner pod is annotated to wait for completion {"runnerpod": "gha-runners/cpu-small-jljfk-bgd28"}

Otherwise, it's continually trying to check a pod on a node that's gone.

Perhaps we could get a default timeout to just reap it if it's been stuck for more than X minutes?
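
A rough sketch of what such a timeout could look like, assuming it keys off the pod's deletionTimestamp; the 10-minute threshold is only an example value, not an ARC setting:

```go
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// stuckTerminationTimeout is an arbitrary example threshold, not an ARC setting.
const stuckTerminationTimeout = 10 * time.Minute

// terminationTimedOut reports whether a pod has been Terminating for longer
// than the threshold, i.e. it is probably safe to stop waiting for completion.
func terminationTimedOut(pod *corev1.Pod, now time.Time) bool {
	if pod.DeletionTimestamp == nil {
		return false // the pod is not being deleted at all
	}
	return now.Sub(pod.DeletionTimestamp.Time) > stuckTerminationTimeout
}
```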

toast-gear added the bug ("Something isn't working") label Apr 25, 2022
@mumoshu
Collaborator

mumoshu commented Apr 26, 2022

ensure that terminated pods are removed from GHA

Unfortunately, this seems impossible, because the GitHub Actions API seems to block unregistration requests against disappeared runners.

the finalizer is removed.

This sounds like a good idea! But I thought we had already fixed this in v0.22.3 so that the finalizer is removed when the pod disappears.

Perhaps we can get a default timeout to just reap it if it's been stuck for more than X minutes ?

Yeah, maybe. A timeout since the pod disappeared would work.
But again, we already have another mechanism to handle disappeared pods, so I'm now wondering why it isn't working for you and @larhauga.

@mumoshu
Collaborator

mumoshu commented Apr 26, 2022

I have almost the same issue on a k8s cluster in AWS. However, the pod is not stuck completely; it is removed after several errors (after 4 attempts) because "HorizontalRunnerAutoscaler" has an updated value.

@navi86 How long does each completed runner stay stuck in Terminating? If it successfully terminates after HRA scales it down, it seems to be working correctly. What other people are reporting seems to be more about the pod never getting removed, even after HRA scaled it down.

@mumoshu
Collaborator

mumoshu commented Apr 26, 2022

@larhauga I thought we had already fixed this in v0.22.3 via the change b09c540, but apparently not?

I'd appreciate it if you could share the output of kubectl get po -o yaml $RUNNER_POD for the stuck pod the next time you see it, so that I can perhaps isolate the cause.

To be extra clear, you mentioned L115 of runner_graceful_stop.go in your report:

https://github.com/actions-runner-controller/actions-runner-controller/blob/e7200f274d592729b46848218c1d9c54214065c9/controllers/runner_graceful_stop.go#L115-L118

But since v0.22.3, the earlier if block should handle it via the updated runnerPodContainerIsStopped func:

https://github.com/actions-runner-controller/actions-runner-controller/blob/e7200f274d592729b46848218c1d9c54214065c9/controllers/runner_graceful_stop.go#L107-L114

@mumoshu
Collaborator

mumoshu commented Apr 26, 2022

@larhauga I'd still appreciate your verification, but I've come to think that your pod disappeared while Waiting, even before it was created and started running.

The relevant part of the kubectl describe po output you've shared is below, which seems to indicate that pod.Status.State.Waiting is set:

Containers:
  runner:
    State:          Waiting
      Reason:       ContainerCreating

And pod.Status.LastTerminationState.Terminated:

    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

That explains it for me, because the fix made in v0.22.3 only addresses the case where the pod has already Terminated in terms of pod.Status.State.Terminated, not pod.Status.LastTerminationState.Terminated.

I wonder why pod.Status.State.Waiting is populated, because the runner pod should have restartPolicy: Never; it should never have rescheduled a failed pod anywhere and should instead have kept pod.Status.State.Terminated as is.

Perhaps K8s automatically retries scheduling regardless of restartPolicy, treating it as not being a restart, when the pod failed while Waiting? 🤔
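
For reference, this is roughly how that status combination looks when read via the core/v1 types (a sketch, not ARC code): the runner container is Waiting while LastTerminationState still carries the old Terminated record.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// containerWaitingAfterTermination reports whether the "runner" container is
// currently Waiting (e.g. ContainerCreating) while its LastTerminationState
// still carries a Terminated record from the vanished node.
func containerWaitingAfterTermination(pod *corev1.Pod) bool {
	for _, s := range pod.Status.ContainerStatuses {
		if s.Name == "runner" {
			return s.State.Waiting != nil && s.LastTerminationState.Terminated != nil
		}
	}
	return false
}
```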

@mumoshu
Collaborator

mumoshu commented Apr 28, 2022

It turned out ARC is setting restartPolicy=OnFailure rather than Never:

https://github.com/actions-runner-controller/actions-runner-controller/blob/4053ab3e113e2cbc06d98319cee8a9ea797e2d76/controllers/runner_controller.go#L632-L634

And apparently, kubelet keeps the pod Running in this case:

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1383

https://github.com/kubernetes/kubernetes/blob/537941765fe1304dd096c1a2d4d4e70f10768218/pkg/kubelet/kubelet_pods.go#L1407-L1425

That explains why it didn't short-circuit as coded in:

https://github.com/actions-runner-controller/actions-runner-controller/blob/e7200f274d592729b46848218c1d9c54214065c9/controllers/runner_graceful_stop.go#L107-L114

Because the runnerPodOrContainerIsStopped function considers the pod to have stopped only when the pod is Succeeded or Failed (not Running, as it is in this case), or the runner container has state.terminated != nil (not state.waiting != nil, as it is in this case).

That said, probably the simplest fix for this issue would be to let ARC create runner pods with RestartPolicy=Never, so that newly created runner pods won't encounter this problem even when the node disappears suddenly (e.g. because it was hosted on an AWS EC2 spot instance).
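
A minimal sketch of that direction, assuming the pod spec is built in Go as in runner_controller.go; the image name is a placeholder, not ARC's actual default:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// newRunnerPodSpec sketches the proposed fix: create runner pods with
// RestartPolicy "Never" so a container lost together with its node stays
// Terminated instead of re-entering Waiting/ContainerCreating.
func newRunnerPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		RestartPolicy: corev1.RestartPolicyNever,
		Containers: []corev1.Container{
			{
				Name:  "runner",
				Image: "example.com/actions-runner:latest", // placeholder image
			},
		},
	}
}
```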

@mumoshu
Collaborator

mumoshu commented Apr 29, 2022

@larhauga @bkrugerp99 Let me confirm to be extra sure: all those pods stuck in Terminating lack corresponding Node objects in your cluster, right?

Or is it that the Node objects are there, but the underlying machines (like EC2 instances) are gone, hence the Node objects are stuck in some erroneous status (like Unknown) too?

@mumoshu
Collaborator

mumoshu commented Apr 29, 2022

You can kubectl get po -o yaml $RUNNER_POD_NAME and see what spec.nodeName points to. If it points to a nonexistent node, we could enhance ARC to check whether the pod is tied to a disappeared (hence nonexistent) node, so that it can safely treat the pod as already stopped and never coming back.
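
A sketch of that node-existence check using the controller-runtime client (illustrative only, not ARC's implementation):

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// podNodeIsGone reports whether the pod was scheduled to a node whose Node
// object no longer exists, i.e. the pod can never start again.
func podNodeIsGone(ctx context.Context, c client.Client, pod *corev1.Pod) (bool, error) {
	if pod.Spec.NodeName == "" {
		return false, nil // never scheduled, so there is no node to be gone
	}
	var node corev1.Node
	err := c.Get(ctx, types.NamespacedName{Name: pod.Spec.NodeName}, &node)
	if apierrors.IsNotFound(err) {
		return true, nil // the Node object has disappeared
	}
	return false, err
}
```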

If the node still exists, the best we can do would be #1395.

@mumoshu
Collaborator

mumoshu commented Apr 29, 2022

FWIW, this could also be considered a K8s bug, as the pod is stuck in ContainerCreating even though it's never going to be successfully created.

@bkrugerp99

@mumoshu In our case, the underlying node is already gone. When we null out the finalizers, cleanup happens fine.

mumoshu added a commit that referenced this issue May 9, 2022
This is intended to fix #1369, mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might also work for RunnerDeployment in cases where #1395 does not work, such as when the user explicitly sets the runner pod restart policy to anything other than "Never".

Ref #1369
@Fred78290

@mumoshu
Hello, I've already seen this problem, but in a different context.

After some analysis, I think the problem occurs because the finalizer logic can't complete successfully (exit 0) for some internal reason (the runner is marked offline, or some other reason).

So a GitHub API call fails and the finalizer fails too.

I also saw that you declare a finalizer for the Runner custom resource; it's probably better to use a watcher to delete associated resources when the custom resource is deleted.

So when the controller dies for an external reason, the finalizer handling is gone from K8s; after that, resources stay stuck in a terminating state when you delete them. Probably a bug in k8s.
The solution is to clear the finalizer in the metadata field.
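
For anyone needing that workaround, "clear the finalizer" can be sketched with controller-runtime's helper as below; the finalizer name is a placeholder, not the one ARC actually uses:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// clearRunnerFinalizer removes a (placeholder) finalizer from a stuck pod so
// that Kubernetes can finish deleting it. Equivalent to editing the pod's
// metadata.finalizers field by hand.
func clearRunnerFinalizer(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	controllerutil.RemoveFinalizer(pod, "example.com/runner-finalizer") // placeholder name
	return c.Update(ctx, pod)
}
```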

@mumoshu
Collaborator

mumoshu commented May 9, 2022

@Fred78290 Thanks. We already leverage the "watch" facility to react to any changes in the underlying resources, so that the controller can handle them and remove the finalizer ASAP if needed. Also, we already have logic to not block on our custom finalizer in certain cases, which results in "clearing the finalizer" as you said. It's probably missing some edge cases though.

In your case, was the pod still present and stuck in Terminating, or was it gone entirely? That's the important point, as far as I can tell.

toast-gear pushed a commit that referenced this issue May 12, 2022

This is intended to fix #1369, mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might also work for RunnerDeployment in cases where #1395 does not work, such as when the user explicitly sets the runner pod restart policy to anything other than "Never".

Ref #1369
mumoshu added a commit that referenced this issue May 12, 2022
mumoshu added a commit that referenced this issue May 12, 2022
mumoshu added a commit that referenced this issue May 12, 2022
toast-gear pushed a commit that referenced this issue May 14, 2022
… in Terminating when the container disappeared (#1395)

Ref #1369