Pod stuck on CrashLoopBackOff state if runner registration happens after token is expired #1295
Comments
I'm seeing something similar, too. Set up a new cluster using Controller version 0.22.1. When demand is idle, I see many (or all) pods go into CrashLoopBackOff state until there is sufficient demand to scale up beyond the crashy pods (which don't seem to recover?)

State at idle:
After loading up a bunch of jobs:
Logs from a
To be clear @nehalkpatel, are you talking about the same situation of runner registration happening after the runner registration token has expired? This issue isn't for general "I have runners getting into a CrashLoopBackOff state" cases.
I'm not entirely sure what the cause is. Perhaps it is an issue with the token expiring and the runner no longer being able to authenticate. I've reverted to controller version 0.17.0 and things seem a bit more stable (though k8s did complain about deprecated APIs).
I think this might be a bug in the runner-replicaset controller, which should be responsible for recreating the runner and runner pod as the token approaches its expiration date. In 0.21.x, this was the responsibility of the runner controller: once it detected that a runner token was close enough to expiration, it recreated the runner pod with the same name and the updated registration token. This also triggered a race condition that sometimes resulted in a workflow job being stuck pending forever. As part of the fix made in 0.22.0, I moved most of the runner pod management logic to the runner-replicaset controller (and its library code). Almost certainly I missed moving the runner token update logic to the new place.
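To make the pre-0.22.0 behavior concrete, here is a minimal sketch of that kind of reconcile-time check; the names (`runner`, `needsRecreation`, `tokenExpirationGracePeriod`) and the grace period are illustrative assumptions, not ARC's actual types or values:

```go
// Hypothetical sketch of the pre-0.22.0 behavior described above: a
// controller recreates a runner pod before its registration token expires.
// Names, types, and the grace period are illustrative, not ARC's real code.
package main

import (
	"fmt"
	"time"
)

// runner models only the fields this sketch needs.
type runner struct {
	Name           string
	TokenExpiresAt time.Time
}

// tokenExpirationGracePeriod is how far ahead of expiry we act (assumed value).
const tokenExpirationGracePeriod = 15 * time.Minute

// needsRecreation reports whether the runner pod should be recreated
// with a fresh registration token.
func needsRecreation(r runner, now time.Time) bool {
	return now.Add(tokenExpirationGracePeriod).After(r.TokenExpiresAt)
}

func main() {
	r := runner{Name: "example-runner", TokenExpiresAt: time.Now().Add(10 * time.Minute)}
	if needsRecreation(r, time.Now()) {
		// In the controller this is where the pod would be deleted and
		// recreated under the same name with an updated registration token.
		fmt.Printf("recreating pod for %s with a fresh registration token\n", r.Name)
	}
}
```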
@nehalkpatel Just to be extra clear, did downgrading to 0.17.0 completely resolve this specific issue?
My current theory is that this has been broken since the start, but the runner controller's ability to restart runner pods on token expiration was silently fixing it, so we never noticed. Since 0.22.0 it doesn't automatically restart a runner pod. Assuming a runner can keep running forever without issues once it's successfully registered, we don't need to change ARC so that the controller recreates the runner pod on token expiration. Instead, we should just change the hard-coded startup timeout to something more practical. I thought 3 minutes should be enough, but apparently not.
That suggests the runner software itself doesn't handle this scenario and should be fixed on their end too, so the pod can be gracefully terminated by the process exiting rather than crashing. Would you mind raising an issue in actions/runner and referencing this issue?
@s4nji To be extra sure, can you try to reproduce the issue with older versions of ARC (like 0.21.x)?
@s4nji Also, what does the pod log look like for the pod in CrashLoopBackOff in your case?
@mumoshu CrashLoopBackOff pod logs:
Yeah, so it looks like they aren't handling the scenario in the runner software, so at least part of the fix involves GitHub making some changes on their end to handle the exception so that, at an absolute minimum, a helpful message is printed. Please raise an issue in actions/runner for that, and feel free to reference this issue in it @s4nji 🙏
@mumoshu @toast-gear
@mumoshu - yes, downgrading to 0.17.0 does seem to have addressed the issue (no
@s4nji could you try v0.22.2 please, as this release contains @mumoshu's startup timeout fix.
@nehalkpatel could you raise a new issue for the other failing-to-scale-down problems with full details (versions, yaml snippets, logs, kubectl describes, etc.)? As a first step, we ask that you upgrade to the latest version before raising the issue.
@toast-gear yes, we are planning to update to
I think this was due to a mismatch across our clusters with the ARC version. Once I rebuilt the new cluster, deleted the old one, and used a consistent version of ARC, I'm no longer seeing scale-down issues.
Downgrading to 0.17.0 and setting
@s4nji any updates? v0.22.3 is out now with other fixes, could you upgrade and let us know if this issue is now resolved?
@toast-gear we have upgraded to

I think the current fix is sufficient (if your pod takes longer than 30 minutes to start, you have other problems), but it would perhaps be better if the controller could assign a new valid registration token to runners/pods with expired tokens that are still in
That would be nice on paper, however it's not really doable with the new architecture, tbh. We now rely solely on the mutating webhook to inject registration tokens. The mutating webhook isn't a regular K8s controller that works like "check if the pod spec contains an expired token in an env var and update it". It works more like "the pod is being updated/created for whatever reason; I'm going to inject the token, but I don't care about other fields or the lifecycle of the pod".
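To illustrate the webhook model described here, a rough sketch of the kind of JSON patch a token-injecting mutating webhook returns; the function name, env var name, and patch path are assumptions for illustration, not ARC's actual implementation:

```go
// Rough sketch of what a token-injecting mutating webhook does conceptually:
// it only sees the pod being created/updated and returns a patch adding a
// registration token env var. It does not watch pods afterwards, so an
// already-running pod with an expired token is never updated.
// Names and the patch path are illustrative, not ARC's real code.
package main

import (
	"encoding/json"
	"fmt"
)

type envVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// jsonPatchOp is a single RFC 6902 JSON patch operation.
type jsonPatchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// registrationTokenPatch builds the patch a webhook would return to inject
// the token into the first container of the incoming pod.
func registrationTokenPatch(token string) ([]byte, error) {
	patch := []jsonPatchOp{{
		Op:    "add",
		Path:  "/spec/containers/0/env/-",
		Value: envVar{Name: "RUNNER_TOKEN", Value: token}, // env var name assumed
	}}
	return json.Marshal(patch)
}

func main() {
	p, err := registrationTokenPatch("example-registration-token")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(p))
}
```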
I'm going to close this off seeing as we've resolved the core problem.
Hi @mumoshu @toast-gear, I still see the same "token expired" and pod CrashLoopBackOff error in ARC version 0.23.0. I installed ARC version 0.20 in Dec 2021. Everything was working fine till May 13, 2022, when pods got stuck in the CrashLoopBackOff state. I saw the following error messages in the pod log.
Then I uninstalled the actions-runner-controller helm release, deleted all the relevant summerwind CRDs, and installed ARC version 0.23.0. However, the same error still exists. I know this solution requires GHES version >= 3.3.0; in our case, it was working fine on GHES 3.2 until last week. Any help would be very much appreciated.

Environment
@bl02 Hey! If the issue still persists after reinstalling ARC and all the runner pods have been recreated, it's more likely that something has gone wrong in your GHES instance.
@bl02 Also, you'd better ask GitHub support as well. If it reproduces after recreating runner pods, it's more likely it's not specific to ARC or K8s.
Just curious, but how did you confirm you actually installed ARC 0.23.0? Can you share the relevant part of your values.yaml? |
Thanks for your fast response. Yes, all the runner pods were removed successfully before I reinstalled ARC. I also installed ARC 0.23 in a fresh new K8s cluster and got the same error. I just wonder why GHES 3.2 was working fine with ARC before. Our GHES instance hasn't been updated recently.
@bl02 Thanks. ARC works by calling some GitHub API to obtain a registration token that is then passed to each runner pod so that the runner can register itself. I don't know much about how a real GHES instance is deployed. But anyway... do you manage a VM or a baremetal machine to run your GHES? Are you sure the system clock of your GHES machine is not skewed a lot?
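For reference, a minimal sketch of the kind of API call involved, using the google/go-github client; the owner/repo values are placeholders, and authentication and GHES base URL handling are omitted for brevity:

```go
// Minimal sketch of obtaining a runner registration token via the GitHub API,
// which is the token ARC passes to each runner pod. Owner/repo are
// placeholders; a real setup needs authentication (PAT or GitHub App) and,
// for GHES, a client pointed at the instance's API base URL.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/go-github/v50/github" // version pinned for illustration
)

func main() {
	ctx := context.Background()
	client := github.NewClient(nil) // unauthenticated; for illustration only

	tok, _, err := client.Actions.CreateRegistrationToken(ctx, "my-org", "my-repo")
	if err != nil {
		log.Fatal(err)
	}

	// The token is only valid for a limited time; if the runner does not
	// register before ExpiresAt, registration fails, which is the failure
	// mode discussed in this issue.
	fmt.Printf("token expires at %s\n", tok.GetExpiresAt())
}
```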
I executed "helm repo update" before reinstalling. Now in helm list, I can see the APP VERSION is 0.23.0
If I describe the deployment actions-runner-controller, I can see the following information:
@bl02 Thanks. Could you also make sure that you don't have
@mumoshu I don't have
@bl02 Thanks. Fine! Then it's even more likely that something went wrong in your GHES instance.
@MichaelSp Hey! Unfortunately, none from my end. Have you already asked GitHub support about that? The only possible reason I can come up with is that your GHES instance is returning an outdated registration token in the first place, which shouldn't happen and can't be handled by ARC.
@MichaelSp we're removing support for the
@MichaelSp Ah thanks! That makes sense. We recently made
// BTW, unfortunately, this turned out to be an umbrella issue of 3 different issues. This happens so often, and that's why I enabled the lock app on this repo (https://github.com/actions-runner-controller/actions-runner-controller/blob/master/.github/lock.yml) so that people are encouraged to open dedicated issues (adding links to "similar" issues is very helpful, though). But apparently, the lock app isn't working as expected? 🤔
Describe the bug
If runner registration happens after the runner registration token is expired, it fails repeatedly, enters `CrashLoopBackOff` state for an indefinite period, and never gets removed or updated by the controller.

To Reproduce
Currently we see this happening due to the time between pod creation and runner registration exceeding the 3-minute window defined in the controller: a pod is created with a token that expires in just slightly over 3 minutes, and the token is used for registration only after it has expired.
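To make the timing race explicit, a small sketch under the assumption of the durations described above (the concrete values are illustrative):

```go
// Sketch of the timing race: a pod is created with a token that is still
// valid, but the delay before the runner registers (simulated here with
// STARTUP_DELAY_IN_SECONDS) outlives the token. Durations are illustrative.
package main

import (
	"fmt"
	"time"
)

func main() {
	podCreated := time.Now()
	tokenExpiresAt := podCreated.Add(3*time.Minute + 10*time.Second) // just over 3 minutes of validity left
	startupDelay := 6 * time.Minute                                  // STARTUP_DELAY_IN_SECONDS=360

	registrationAt := podCreated.Add(startupDelay)
	if registrationAt.After(tokenExpiresAt) {
		fmt.Println("registration happens after token expiry -> runner fails to register and the pod crash-loops")
	}
}
```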
One way to simulate the delay between `RegistrationTokenUpdated` and runner registration is to set `STARTUP_DELAY_IN_SECONDS` to above 3 minutes.

Steps to reproduce the behavior:
With the `STARTUP_DELAY_IN_SECONDS` value set to `360` (6 minutes), let the controller create a runner resource and pod of it.

Expected behavior
Runner / Pods with expired registration token should be assigned a new token or be removed.
Environment
- Controller version: 0.22.0
- Deployment method: Helm
- Chart version: 0.17.0
Additional info
This also seems to affect `HorizontalRunnerAutoscaler` with the `PercentageRunnersBusy` strategy; the crashing pods seem to be counted as running, non-busy pods.

When enough pods enter `CrashLoopBackOff` state and accumulate (enough to go below `scaleDownThreshold`), it triggers scale-down repeatedly, removing the (finished) healthy pods and keeping the crashing pods until the minimum number of runners is reached, making scale-up impossible until the failing pods are manually removed.
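As a rough illustration of why this skews autoscaling, a toy model where crash-looping runners are counted as registered but idle; the thresholds and scaling rule here are simplified assumptions, not ARC's exact `PercentageRunnersBusy` algorithm:

```go
// Toy model of a PercentageRunnersBusy-style decision. Crash-looping runners
// that still count as "running but not busy" drag the busy ratio down, which
// keeps triggering scale-down even though every healthy runner is busy.
// The threshold and rule are simplified, not ARC's exact algorithm.
package main

import "fmt"

func decide(busy, healthy, crashLooping int, scaleDownThreshold float64) string {
	total := healthy + crashLooping
	ratio := float64(busy) / float64(total)
	if ratio < scaleDownThreshold {
		return fmt.Sprintf("busy ratio %.2f < %.2f -> scale down", ratio, scaleDownThreshold)
	}
	return fmt.Sprintf("busy ratio %.2f -> keep or scale up", ratio)
}

func main() {
	// 4 healthy runners, all busy, plus 12 crash-looping runners counted as idle.
	fmt.Println(decide(4, 4, 12, 0.3))
}
```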