
Pod stuck on CrashLoopBackOff state if runner registration happens after token is expired #1295

Closed
s4nji opened this issue Mar 31, 2022 · 36 comments
Labels
bug Something isn't working
Comments

@s4nji

s4nji commented Mar 31, 2022

Describe the bug
If runner registration happens after the runner registration token has expired, registration fails repeatedly, the pod enters the CrashLoopBackOff state for an indefinite period, and it never gets removed or updated by the controller.

To Reproduce
Currently we see this happening because the time between pod creation and runner registration exceeds the 3-minute window defined in the controller: a pod is created with a token that expires in just over 3 minutes, and the token is used for registration only after it has expired.

One way to simulate the delay between RegistrationTokenUpdated and runner registration is to set STARTUP_DELAY_IN_SECONDS to more than 3 minutes.

Steps to reproduce the behavior:

  1. Create a new RunnerDeployment resource with STARTUP_DELAY_IN_SECONDS set to 360 (6 minutes) and let the controller create a runner resource and pod from it (see the example manifest below)
  2. Observe the registration token expiration of the newly created runner
  3. 4 minutes before its expiration, trigger creation of a new runner and pod (by deleting the pod)
  4. A new runner (and pod) spawns with a token that expires within about 3 minutes and registers itself only after the token has expired
  5. Registration fails repeatedly, the pod eventually enters the CrashLoopBackOff state, and it remains in that state indefinitely
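
A minimal manifest for step 1 might look like this (a sketch; the name and repository are placeholders):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeployment
spec:
  replicas: 1
  template:
    spec:
      repository: example-org/example-repo
      env:
        # Delay the entrypoint past the controller's 3-minute registration window
        - name: STARTUP_DELAY_IN_SECONDS
          value: "360"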

Expected behavior
Runners/pods with an expired registration token should be assigned a new token or be removed.

Environment

  • Controller Version: 0.22.0
  • Deployment Method: Helm
  • Helm Chart Version: 0.17.0

Additional info
This also seems to affect HorizontalRunnerAutoscaler with the PercentageRunnersBusy strategy; the crashing pods seem to be counted as running, non-busy pods.

When enough CrashLoopBackOff pods accumulate (enough to drop below scaleDownThreshold), this triggers repeated scale-downs that remove the (finished) healthy pods and keep the crashing ones until the minimum number of runners is reached, making scale-up impossible until the failing pods are manually removed.
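
For context, the autoscaler we use is of roughly this shape (a sketch; the name and thresholds are illustrative, not our exact values):

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler
spec:
  scaleTargetRef:
    name: example-runnerdeployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Crashing pods are counted as running-but-not-busy, dragging this ratio down
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"
      scaleDownThreshold: "0.25"
      scaleUpFactor: "2"
      scaleDownFactor: "0.5"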

@toast-gear toast-gear added the bug Something isn't working label Mar 31, 2022
@nehalkpatel

nehalkpatel commented Mar 31, 2022

I'm seeing something similar, too. I set up a new cluster using controller version 0.22.1. When demand is idle, I see many (or all) pods go into the CrashLoopBackOff state until there is sufficient demand to scale up beyond the crashing pods (which don't seem to recover?).

State at idle:

╰─ k get pods                                 
NAME                            READY   STATUS             RESTARTS   AGE
devices-gh-runner-clhpm-bvvl2   1/2     CrashLoopBackOff   123        11h
devices-gh-runner-clhpm-fstkl   1/2     CrashLoopBackOff   119        11h
devices-gh-runner-clhpm-g4k7h   1/2     CrashLoopBackOff   119        11h
devices-gh-runner-clhpm-h2xth   1/2     CrashLoopBackOff   118        11h
devices-gh-runner-clhpm-kp8ht   1/2     CrashLoopBackOff   119        11h
devices-gh-runner-clhpm-ksskn   1/2     CrashLoopBackOff   124        11h
devices-gh-runner-clhpm-ktqp2   1/2     CrashLoopBackOff   124        11h
devices-gh-runner-clhpm-kxvt6   1/2     CrashLoopBackOff   124        11h
devices-gh-runner-clhpm-m8qkd   1/2     CrashLoopBackOff   118        11h
devices-gh-runner-clhpm-qn9sq   1/2     CrashLoopBackOff   124        11h
devices-gh-runner-clhpm-qrh4z   1/2     CrashLoopBackOff   119        11h
devices-gh-runner-clhpm-t9rgm   1/2     CrashLoopBackOff   119        11h
devices-gh-runner-clhpm-vzn8z   1/2     CrashLoopBackOff   124        11h
devices-gh-runner-clhpm-wm9bv   1/2     CrashLoopBackOff   119        11h

After loading up a bunch of jobs:

╰─ k get pods
NAME                            READY   STATUS             RESTARTS   AGE
devices-gh-runner-clhpm-27vrb   2/2     Running            0          8m14s
devices-gh-runner-clhpm-7d2zj   2/2     Running            0          8m14s
devices-gh-runner-clhpm-bj7ps   2/2     Running            0          8m14s
devices-gh-runner-clhpm-bvvl2   1/2     CrashLoopBackOff   127        11h
devices-gh-runner-clhpm-f8tp9   2/2     Running            0          8m14s
devices-gh-runner-clhpm-fj7z2   2/2     Running            0          8m14s
devices-gh-runner-clhpm-fstkl   1/2     CrashLoopBackOff   124        11h
devices-gh-runner-clhpm-g4k7h   1/2     CrashLoopBackOff   123        11h
devices-gh-runner-clhpm-gcvxt   2/2     Running            0          8m14s
devices-gh-runner-clhpm-h2xth   1/2     CrashLoopBackOff   123        11h
devices-gh-runner-clhpm-jhrb9   2/2     Running            0          8m13s
devices-gh-runner-clhpm-js4s7   2/2     Running            0          8m13s
devices-gh-runner-clhpm-kp8ht   2/2     Running            124        11h
devices-gh-runner-clhpm-ksskn   1/2     CrashLoopBackOff   129        11h
devices-gh-runner-clhpm-ktqp2   1/2     CrashLoopBackOff   129        11h
devices-gh-runner-clhpm-kxvt6   1/2     CrashLoopBackOff   129        11h
devices-gh-runner-clhpm-m8qkd   1/2     CrashLoopBackOff   123        11h
devices-gh-runner-clhpm-mvrt2   2/2     Running            0          8m14s
devices-gh-runner-clhpm-qmbwl   2/2     Running            0          8m13s
devices-gh-runner-clhpm-qn9sq   1/2     CrashLoopBackOff   129        11h
devices-gh-runner-clhpm-qrh4z   1/2     CrashLoopBackOff   123        11h
devices-gh-runner-clhpm-t9rgm   1/2     CrashLoopBackOff   123        11h
devices-gh-runner-clhpm-vgmxt   2/2     Running            0          8m14s
devices-gh-runner-clhpm-vsjf4   2/2     Running            0          8m14s
devices-gh-runner-clhpm-vwcgk   1/2     NotReady           0          8m14s
devices-gh-runner-clhpm-vzn8z   1/2     CrashLoopBackOff   129        11h
devices-gh-runner-clhpm-wm9bv   1/2     CrashLoopBackOff   123        11h
devices-gh-runner-clhpm-zw7vc   2/2     Running            0          8m13s

Logs from a CrashLoopBackOff runner:

Configuring the runner.

...
|                                                                              |
|                       Self-hosted runner registration                        |
...

# Authentication

Object reference not set to an instance of an object.
Configuration failed. Retrying
Configuration failed!

@toast-gear
Collaborator

To be clear @nehalkpatel, are you talking about the same situation, where runner registration happens after the runner registration token has expired? This issue isn't for general "I have runners getting into a CrashLoopBackOff state" reports.

@nehalkpatel

I'm not entirely sure what the cause is. Perhaps it is an issue with the token expiring and the runner no longer being able to authenticate.

I've reverted to controller version 0.17.0 and things seem a bit more stable (though k8s did complain about deprecated APIs v1beta1 --> v1) when replacing the config.

@mumoshu
Collaborator

mumoshu commented Apr 1, 2022

I think this might be a bug in the runner-replicaset controller that should be responsible for recreating the runner and runner pod as the token approaches the expiration date.

In 0.21.x, it was the responsibility of the runner controller: once the runner controller detected that a runner token was close enough to expiration, it recreated the runner pod with the same name and an updated registration token. This also triggered a race condition that sometimes resulted in a workflow job being pending forever.

As part of the fix made in 0.22.0, I moved most of the runner pod management logic to the runner-replicaset controller (and its library code). Almost certainly I missed moving the runner token update logic to the new place.

@mumoshu mumoshu added this to the v0.22.2 milestone Apr 1, 2022
@mumoshu
Collaborator

mumoshu commented Apr 1, 2022

things seem a bit more stable (though k8s did complain about deprecated APIs v1beta1 --> v1) when replacing the config.

@nehalkpatel Just to be extra clear, did downgrading to 0.17.0 completely resolve this specific issue?

@mumoshu
Collaborator

mumoshu commented Apr 1, 2022

My current theory is that this has been broken from the start, but the runner controller's ability to restart runner pods on token expiration was silently fixing it, so we never noticed.

Since 0.22.0, ARC doesn't automatically restart a runner pod. Assuming a runner can keep running forever without issues once it's successfully registered, we don't need to change ARC so that the controller recreates the runner pod on token expiration.

Instead, we should just raise the hard-coded startup timeout to something more practical. I thought 3 minutes would be enough, but apparently not.

@toast-gear
Copy link
Collaborator

toast-gear commented Apr 1, 2022

@s4nji @nehalkpatel

Object reference not set to an instance of an object.

suggests that the runner software itself doesn't handle this scenario and should be fixed on their end too, so the pod can be gracefully terminated by the process exiting rather than crashing. Would you mind raising an issue in actions/runner and referencing this issue?

@mumoshu
Collaborator

mumoshu commented Apr 1, 2022

@s4nji To be extra sure: can you try to reproduce the issue with older versions of ARC (like 0.21.x)?

@mumoshu
Collaborator

mumoshu commented Apr 1, 2022

@s4nji Also, what does the pod log look like for the pod in CrashLoopBackOff in your case?

@s4nji
Author

s4nji commented Apr 1, 2022

@mumoshu
I will try to see how ARC behaves with an expired registration token on 0.21.x locally.

CrashLoopBackOff pod logs:
Docker enabled runner detected and Docker daemon wait is enabled
Waiting until Docker is avaliable or the timeout is reached
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
Github endpoint URL https://github.com/
Passing --ephemeral to config.sh to enable the ephemeral runner.
Passing --disableupdate to config.sh to disable automatic runner updates.
Configuring the runner.

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication

Object reference not set to an instance of an object.
Configuration failed. Retrying
[the "Configuring the runner." banner, "# Authentication", "Object reference not set to an instance of an object.", and "Configuration failed. Retrying" repeat nine more times]
Configuration failed!

@toast-gear
Collaborator

toast-gear commented Apr 1, 2022

Yeah, so it looks like they aren't handling the scenario in the runner software, so at least part of the fix involves GitHub making some changes on their end to handle the exception so that, at an absolute minimum, a helpful message is printed. Please raise an issue in actions/runner for that and feel free to reference this issue in it @s4nji 🙏

@s4nji
Author

s4nji commented Apr 1, 2022

@mumoshu
It does not seem to happen in ARC 0.21.1: after a pod fails to start due to an expired registration token (identical logs to my previous message), the pod is removed, I see a RegistrationTokenUpdated event firing on the runner resource, the runner gets updated with a valid token (expiresAt field included), a new pod gets created with a valid registration token, and all is fine.

@toast-gear
Looks like several issues have been filed already (actions/runner#1739, actions/runner#1748) and a fix has been merged (actions/runner#1741) 4 days ago, but it is not yet included in the latest release (v2.289.2, released 2 days ago).

@nehalkpatel

@mumoshu - yes, downgrading to 0.17.0 does seem to have addressed the issue (no CrashLoopBackOff). However, I'm seeing other (likely unrelated) issues with failing to scale down.

mumoshu added a commit that referenced this issue Apr 3, 2022
@toast-gear toast-gear modified the milestones: v0.22.2, v0.22.3 Apr 6, 2022
@toast-gear
Collaborator

toast-gear commented Apr 6, 2022

@s4nji could you try v0.22.2 please, as this release contains @mumoshu's startup timeout fix

However, I'm seeing other (likely unrelated) issues with failing to scale down.

@nehalkpatel could you raise a new issue for the other failing-to-scale-down problems with full details (versions, YAML snippets, logs, kubectl describe output, etc.)? As a first step, we ask that you upgrade to the latest version before raising the issue.

@toast-gear toast-gear reopened this Apr 6, 2022
@s4nji
Author

s4nji commented Apr 6, 2022

@toast-gear yes, we are planning to update to 0.22.2 this week and will write back here with the result (the 30m timeout should really fix it though!)

@nehalkpatel

@nehalkpatel could you raise a new issue for the other failing-to-scale-down problems with full details (versions, YAML snippets, logs, kubectl describe output, etc.)? As a first step, we ask that you upgrade to the latest version before raising the issue.

I think this was due to a mismatch across our clusters with the ARC version. Once I rebuilt the new cluster, deleted the old one, and used a consistent version of ARC, I'm no longer seeing scale-down issues.

@ebeigarts

ebeigarts commented Apr 7, 2022

Downgrading to 0.17.0 and setting dockerEnabled: false (docker was stuck in NotReady on 0.17.0) fixed this for me.

@toast-gear
Collaborator

@s4nji any updates? v0.22.3 is out now with other fixes, could you upgrade and let us know if this issue is now resolved?

@s4nji
Author

s4nji commented Apr 8, 2022

@toast-gear we have upgraded to 0.22.3 and while we stopped getting random pods stuck on CrashLoopBackOff every hour, we still encounter some stuck pods — due to other reasons causing some pods to take longer than 30 minutes to start — so at the moment we still need to manually clean things up.

I think the current fix is sufficient (if your pod takes longer than 30 minutes to start, you have other problems), but it would perhaps be better if the controller could assign a new valid registration token to runners/pods with expired tokens that are still in the NotReady state.

@toast-gear
Collaborator

I think the current fix is sufficient (if your pod takes longer than 30 minutes to start, you have other problems), but it would perhaps be better if the controller could assign a new valid registration token to runners/pods with expired tokens that are still in the NotReady state.

That would be nice on paper, however it's not really doable with the new architecture tbh. We now rely solely on the mutating webhook to inject registration tokens. A mutating webhook isn't a regular k8s controller that works like "check if the pod spec contains an expired token in an env var and update it". It works more like "the pod is being updated/created for whatever reason; I'm going to inject the token, but I don't care about other fields or the lifecycle of the pod".
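
Conceptually, something like this (a rough Go sketch, not ARC's actual webhook code; the helper name and env var here are illustrative):

package sketch

import corev1 "k8s.io/api/core/v1"

// fetchFreshRegistrationToken stands in for ARC's call to GitHub's
// registration-token API; it is a made-up helper for this sketch.
func fetchFreshRegistrationToken() string { return "<fresh-token>" }

// mutate runs once, at pod admission time: it injects whatever token is
// fresh at that moment. The webhook never sees the pod again afterwards,
// so it cannot refresh a token that expires while the pod is still pending.
func mutate(pod *corev1.Pod) *corev1.Pod {
	for i := range pod.Spec.Containers {
		pod.Spec.Containers[i].Env = append(pod.Spec.Containers[i].Env,
			corev1.EnvVar{Name: "RUNNER_TOKEN", Value: fetchFreshRegistrationToken()})
	}
	return pod
}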

@toast-gear we have upgraded to 0.22.3 and while we stopped getting random pods stuck on CrashLoopBackOff every hour, we still encounter some stuck pods — due to other reasons causing some pods to take longer than 30 minutes to start — so at the moment we still need to manually clean things up.

I'm going to close this off seeing as we've resolved the core problem.

@Bruce6X

Bruce6X commented May 18, 2022

Hi @mumoshu @toast-gear I still see the same "token expired" and pod CrashLoopBackOff error in ARC version 0.23.0.

I installed ARC version 0.20 in Dec 2021. Everything was working fine until May 13, 2022, when pods got stuck in the CrashLoopBackOff state. I saw the following error messages in the pod log.

2022-05-18 07:28:38.330  DEBUG --- Configuration failed. Retrying
2022-05-18 07:28:39.333  ERROR --- Configuration failed!
Http response code: Unauthorized from 'POST https://<our-ghes-domain>/api/v3/actions/runner-registration'
{"message":"Token expired.","documentation_url":"https://docs.github.com/enterprise/3.2/rest"}
Response status code does not indicate success: 401 (Unauthorized).

Then I uninstalled the actions-runner-controller Helm release, deleted all the relevant summerwind CRDs, and installed ARC version 0.23.0. However, the same error still exists.

I know this solution requires GHES version >= 3.3.0. In our case, it was working fine on GHES 3.2 until last week. Any help would be much appreciated.

Environment

  • Controller Version: 0.23.0
  • Helm chart: actions-runner-controller-0.18.0
  • Cert-manager: v1.8.0
  • Kubernetes: 1.22.6
  • GHES: 3.2.11

@mumoshu
Collaborator

mumoshu commented May 18, 2022

@bl02 Hey! If the issue still persists after reinstalling ARC and all the runner pods have been recreated, it's more likely that something has gone wrong in your GHES instance.
Could you verify that all the runner pods are actually gone after you uninstalled ARC? If not, please remove the runner pods and see how newly created runner pods behave. Thanks.
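
For example (substitute your actual runner namespace and pod names):

kubectl get runners,pods -n <runner-namespace>
kubectl delete pod <stuck-runner-pod> -n <runner-namespace>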

@mumoshu
Collaborator

mumoshu commented May 18, 2022

@bl02 Also, you'd better ask GitHub support as well. If it reproduces after recreating the runner pods, it's more likely that it's not specific to ARC or K8s.

@mumoshu
Collaborator

mumoshu commented May 18, 2022

Just curious, but how did you confirm you actually installed ARC 0.23.0? Can you share the relevant part of your values.yaml?

@Bruce6X

Bruce6X commented May 18, 2022

Thanks for your fast response. Yes, all the runner pods were removed successfully before I reinstalled ARC. I also installed ARC 0.23 in a fresh new K8s cluster; same error. I just wonder why GHES 3.2 was working fine with ARC before. Our GHES instance hasn't been updated recently.

@mumoshu
Collaborator

mumoshu commented May 18, 2022

@bl02 Thanks. ARC works by calling a GitHub API to obtain a registration token, which is then passed to each runner pod so that the actions/runner process running within the pod's container can use it to actually call the runner registration API.
If the runner pod was recreated after a fresh install of ARC, it can't have grabbed an outdated registration token. The only viable cause could be that your GHES instance is returning an outdated registration token to ARC, and ARC just passes it as-is to the runner pod (as... it believes GHES will never return an outdated token!).
That's why I think it's rather an issue in your GHES instance, not ARC.
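
For reference, this is the kind of call involved (a sketch; on GHES the REST API lives under /api/v3, and the host, owner, repo, and PAT below are placeholders):

curl -X POST \
  -H "Authorization: token <PAT>" \
  -H "Accept: application/vnd.github.v3+json" \
  https://<ghes-host>/api/v3/repos/<owner>/<repo>/actions/runners/registration-token

The response includes both the token and its expires_at timestamp, so you can check directly whether your GHES is handing out already-expired tokens.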

I don't know much about how a real GHES instance is deployed. But anyway... do you manage a VM or a bare-metal machine to run your GHES? Are you sure the system clock of your GHES machine is not skewed a lot?
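
A quick way to check for skew from any machine is to compare the Date header GHES returns against your local UTC clock:

curl -sI https://<ghes-host> | grep -i '^date:'
date -u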

@Bruce6X

Bruce6X commented May 18, 2022

Just curious, but how did you confirm you actually installed ARC 0.23.0? Can you share the relevant part of your values.yaml?

I executed "helm repo update" before reinstalling. Now in helm list, I can see the APP VERSION is 0.23.0:

NAME                                    NAMESPACE                       REVISION        UPDATED                                         STATUS          CHART                              APP VERSION
actions-runner-controller    github-actions       1               2022-05-18 10:12:34.993704438 +0200 CEST        deployed        actions-runner-controller-0.18.0   0.23.0   

If I describe the deployment actions-runner-controller, I can see the following information:

  labels:
    app.kubernetes.io/instance: actions-runner-controller
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: actions-runner-controller
    app.kubernetes.io/version: 0.23.0
    helm.sh/chart: actions-runner-controller-0.18.0

@mumoshu
Collaborator

mumoshu commented May 18, 2022

@bl02 Thanks. Could you also make sure that you don't have image.tag set in your values.yaml?

@Bruce6X

Bruce6X commented May 18, 2022

@mumoshu I don't have image.tag set. In the pod description, the image tag is v0.23.0:

Containers:
  manager:
    Image:         summerwind/actions-runner-controller:v0.23.0

@mumoshu
Collaborator

mumoshu commented May 18, 2022

@bl02 Thanks. Fine! Then it's even more likely that something went wrong in your GHES instance.

@MichaelSp

MichaelSp commented May 18, 2022

@mumoshu @bl02 Regarding the GHES instance: it runs on a GCP-hosted VM. NTP is enabled, so the system clock shouldn't be too far off. Anything else we can check?

/cc @stoe

@mumoshu
Collaborator

mumoshu commented May 19, 2022

@MichaelSp Hey! Unfortunately, none from my end. Have you already asked GitHub support about that? The only possible reason I can come up with is that your GHES instance is returning an outdated registration token in the first place, which shouldn't happen and is not something ARC can handle.

@MichaelSp

MichaelSp commented May 19, 2022

I found something interesting in the GHES logs:

"ExceptionMessage":"Ephemeral runners are not supported in this version of GHES or GitHub Actions"

We are running GHES 3.2. Looks like we have to wait for 3.3.

/cc @stoe

@toast-gear
Collaborator

toast-gear commented May 19, 2022

@MichaelSp we're removing support for the --once flag extremely soon (#1196); you will need to implement one of the suggested solutions soon to avoid an outage. Upgrading GHES to >= 3.3 ASAP is the simplest solution to avoid an outage.

@mumoshu
Collaborator

mumoshu commented May 20, 2022

@MichaelSp Ah, thanks! That makes sense. We recently made --ephemeral the default in our runner image. Although we'd highly recommend you upgrade GHES ASAP, to be clear and to be fair, we did add an escape hatch for GHES 3.2 users: you can set RUNNER_FEATURE_FLAG_ONCE=true in runners to force the use of --once (#1384). EDIT: This escape hatch is being removed entirely, however, so it can only be used very short-term.
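
For example, in a RunnerDeployment (short-term only):

spec:
  template:
    spec:
      env:
        - name: RUNNER_FEATURE_FLAG_ONCE
          value: "true"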

@mumoshu
Collaborator

mumoshu commented May 20, 2022

// BTW, unfortunately, this turned out to be an umbrella issue covering 3 different issues. This happens so often that I enabled the lock app on this repo (https://github.com/actions-runner-controller/actions-runner-controller/blob/master/.github/lock.yml) so that people are encouraged to open dedicated issues (adding links to "similar" issues is very helpful tho). But apparently, the lock app isn't working as expected? 🤔
