Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipeline gets stuck randomly with "Job is waiting for a runner from XXX to come online" #3420

Closed
4 tasks done
wasd171 opened this issue Apr 9, 2024 · 14 comments · Fixed by #3426
Closed
4 tasks done
Assignees
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode

Comments

@wasd171
Copy link

wasd171 commented Apr 9, 2024

Checks

Controller Version

0.9.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Start a GitHub Actions pipeline

Describe the bug

Part of the pipeline works well, but at some point it gets stuck with the "Job is waiting for a runner from XXX to come online.". Restarting the pipeline fixes it. We have started experiencing it roughly since last week while still being on 0.6.1 but the issue has persisted on 0.9.0
SCR-20240409-mzbh

Describe the expected behavior

Pipeline does not get stuck

Additional Context

# values.yaml

arcOperator:
  values:
    replicaCount: 3
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: gha-rs-controller
          topologyKey: kubernetes.io/hostname

arcRunners:
  values:
    githubConfigUrl: "https://github.com/CareerLunch/career-lunch"
    githubConfigSecret:
      github_token: {{ requiredEnv "ARC_CHART_GITHUB_TOKEN" | quote }}
    containerMode:
      type: kubernetes
      kubernetesModeWorkVolumeClaim:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "career-lunch-bootstrap-aws-ebs-gp3"
        resources:
          requests:
            storage: 10Gi
    template:
      metadata:
        annotations:
          karpenter.sh/do-not-evict: "true"
      spec:
        securityContext:
          fsGroup: 123
        nodeSelector:
          workload: arc_runners_v1
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  actions-ephemeral-runner: "True"
              topologyKey: kubernetes.io/hostname
        tolerations:
          - key: workload
            operator: Equal
            value: arc_runners_v1
            effect: NoSchedule
        containers:
          - name: runner
            image: ghcr.io/actions/actions-runner:latest
            command: ["/home/runner/run.sh"]
            env:
              - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
                value: "false"
              - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
                value: /mnt/pod-template/pod-template.yml
            volumeMounts:
              - name: pod-template
                mountPath: /mnt/pod-template
                readOnly: true
            resources:
              requests:
                memory: "250Mi"
                cpu: "100m"
              limits:
                memory: "250Mi"
        volumes:
          - name: pod-template
            configMap:
              name: career-lunch-bootstrap-pod-template

---
# pod-template.yml

{{- $name := ( include "career-lunch-library.fullname" . ) }}
{{- $labels := ( include "career-lunch-library.labels" . ) }}

{{- /* See https://github.com/actions/runner-container-hooks/blob/main/docs/adrs/0096-hook-extensions.md */}}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $name }}-pod-template
  namespace: {{ .Values.arcRunners.namespace }}
  labels:
{{ $labels | indent 4 }}
data:
  pod-template.yml: |
    {{- /*
      Technically, we do not need a PodTemplate, see https://github.com/actions/runner-container-hooks/blob/main/examples/extension.yaml
      But this communicates the intent better
    */}}
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        karpenter.sh/do-not-evict: "true"
    spec:
      serviceAccountName: {{ include "career-lunch-library.fullname" . }}-arc-runners-sa
      securityContext:
        {{- /* Provides access to /home/runner/_work directory in ephemeral volume */}}
        fsGroup: 123
      nodeSelector:
        workload: arc_runners_v1
      tolerations:
        - key: workload
          operator: Equal
          value: arc_runners_v1
          effect: NoSchedule
      containers:
        {{- /* Overrides job container */}}
        - name: $job
          resources:
            {{- /* Memory values are rather high, but we need them for webpack / Cypress jobs */}}
            requests:
              memory: 4Gi
              cpu: 1
            {{- /* 
              Best practice is to add limits too
              However, we run these jobs in isolated instances and want them to use all resources available
              Instance resources are configured via Karpenter
            */}}

Controller Logs

Controller: https://gist.github.com/wasd171/fdfda5ab0484782f0593eef4364e6811
Listener: https://gist.github.com/wasd171/113c7b11a876da64977170f09e878185

Runner Pod Logs

None, no runner pod gets created
@wasd171 wasd171 added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Apr 9, 2024
Copy link
Contributor

github-actions bot commented Apr 9, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@wasd171 wasd171 changed the title <Please write what didn't work for you here> Pipeline gets stuck randomly with "Job is waiting for a runner from XXX to come online" Apr 9, 2024
@rteeling-evernorth
Copy link

rteeling-evernorth commented Apr 9, 2024

not sure if this is the same issue, but im seeing similar behavior from 0.8.2 and 0.9.0. When i went to take at the cluster, i was seeing the runner pods try to start but immediately die with following error event:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/home/runner/run.sh": stat /home/runner/run.sh: no such file or directory: unknown

This began last week when we were forced off of runner version 2.314.0

@wasd171 If you have the chance, try to start the runner and watch the k8s event log for your runner namespace

if this isnt the case for you i'll file a similar issue

@wasd171
Copy link
Author

wasd171 commented Apr 9, 2024

@rteeling-evernorth in our case the pods don't even get created

@rteeling-evernorth
Copy link

gotcha. you have a different problem from me - my issue was caused by the fact that i had erroneously mirrored the arm actions-runner image for a bunch of intel runners. that shim task error was a bit of a red herring.

anyways, good luck with your issue

@mhuijgen
Copy link

We are seeing this as well with 0.9.0 after upgrading from 0.8.2

It seems to occur for us when there is one job running (job that takes like 1h) and then another repo requests a runner. From what Ive quickly found in the logs is that it does scale out, but then for some reason that next runner pod starts up and then disappears again without running the job. The runnerset on GH GUI then shows 1 pending 1 running 2 assigned. But the autoscaling runnerset on ARC just shows 1 running 0 pending.

Only when this long running job finishes it seems to notice again that there is still one pending job and fires up a pod.

This does not happen always though, it can happily scale up to more than pod in other occasions.

Due to an issue on our AKS I cannot access historical pod logs however, so its hard to check if there are any errors in that runner pod that seems to have gotten created and then disappearing again.

From the changelog on 0.9.0 I see noticed #3371 changed something with respect to scaling. Perhaps it had unintended side effects? @nikola-jokic

@nikola-jokic
Copy link
Collaborator

Great summary @mhuijgen, thanks!

I think I might know what is going on here. I'll try to confirm my assumption and get back to you all with my findings. Thank you all for helping me to better understand the problem!

If my assumption is right, the scale will self correct as @mhuijgen pointed out on job completed event, but that may take a long time.

@nikola-jokic nikola-jokic self-assigned this Apr 10, 2024
@nikola-jokic nikola-jokic removed the needs triage Requires review from the maintainers label Apr 10, 2024
@jcorreia-forbesadvisor
Copy link

jcorreia-forbesadvisor commented Apr 10, 2024

We're seeing the same since upgrading to 0.9.0.

Just to add that restarting the listener will make a new (proper) runner pod to be created and the build will go through just fine.

@mhuijgen
Copy link

Killing the listener pod indeed corrects the issue as well.

Just had a case where the autoscalingrunner set stated 0 pending/current/running and on GH GUI 1 job was assigned and never got picked up until I killed the listener and the listener restarted.

Downgrading to 0.8.2 now until this is fixed...

@nikola-jokic
Copy link
Collaborator

Hey @mhuijgen, the issue is with the controller. So you can use 0.8.3 in the meantime. Again, sorry this has happened, I personally tested the new scaling process many times, and I couldn't bump into this problem... 😞

I'm working on the fix right now.

@Hazmi35
Copy link

Hazmi35 commented Apr 12, 2024

Not sure if this related, but I have the same issue on 0.9.0

The runner pod is created, but it's immediately deleted, then the controller never recreates the runner pod, resulting in pipeline get stuck.

Most of the times, the error happens on runner initialization, one of them is:

WRITE ERROR: Failed to create a session. The runner registration has been deleted from the server, please re-configure. Runner registrations are automatically deleted for runners that have not connected to the service recently.

Here's full log from Loki https://bin.hzmi.xyz/sadoneponi.log

@clementnero
Copy link

Same symptoms with GKE v1.29.1 and gha-runner-scale-set-0.9.0

it does scale out, but then for some reason that next runner pod starts up and then disappears again without running the job

The runner pod is created, but it's immediately deleted, then the controller never recreates the runner pod, resulting in pipeline get stuck.

I made the same observations as @mhuijgen and Hazmi35, and collected the runner and controller logs as well. Going a step further, here's what I can say.

On the runner side, it fails when creating the session and quits immediately, indicating that the runner registration has been deleted from the server.

Runner
Waiting for docker to be ready.
Docker is ready.
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] No proxy settings were found based on environmental variables (http_proxy/https_proxy/HTTP_PROXY/HTTPS_PROXY)
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known config file 'Credentials': '/home/runner/.credentials'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known config file 'Runner': '/home/runner/.runner'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Adding extra user agent 'actions-runner-controller/0.9.0' to all HTTP requests.
[RUNNER 2024-04-12 10:43:32Z INFO Listener] Runner is built for Linux (X64) - linux-x64.
[RUNNER 2024-04-12 10:43:32Z INFO Listener] RuntimeInformation: Linux 6.1.58+ #1 SMP PREEMPT_DYNAMIC Mon Jan 29 15:19:25 UTC 2024.
[RUNNER 2024-04-12 10:43:32Z INFO Listener] Version: 2.315.0
[RUNNER 2024-04-12 10:43:32Z INFO Listener] Commit: 86858e39e42e25e23332fea3ef7d459ff10f30f7
[RUNNER 2024-04-12 10:43:32Z INFO Listener] Culture:
[RUNNER 2024-04-12 10:43:32Z INFO Listener] UI Culture:
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO Listener] Validating directory permissions for: '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO CommandLineParser] Parse
[RUNNER 2024-04-12 10:43:32Z INFO CommandLineParser] Parsing 1 args
[RUNNER 2024-04-12 10:43:32Z INFO CommandLineParser] parsing argument
[RUNNER 2024-04-12 10:43:32Z INFO CommandLineParser] HasArgs: False
[RUNNER 2024-04-12 10:43:32Z INFO CommandLineParser] Adding Command: run
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Removing env var: 'ACTIONS_RUNNER_INPUT_JITCONFIG'
[RUNNER 2024-04-12 10:43:32Z INFO Listener] Arguments parsed
[RUNNER 2024-04-12 10:43:32Z INFO Runner] ExecuteCommand
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] currentAssemblyLocation: /home/runner/bin/Runner.Listener.dll
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] binPath: /home/runner/bin
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] RootFolder: /home/runner
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known config file 'Runner': '/home/runner/.runner'
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] ConfigFilePath: /home/runner/.runner
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known config file 'Credentials': '/home/runner/.credentials'
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] CredFilePath: /home/runner/.credentials
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known config file 'MigratedCredentials': '/home/runner/.credentials_migrated'
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] MigratedCredFilePath: /home/runner/.credentials_migrated
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known config file 'Service': '/home/runner/.service'
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] ServiceConfigFilePath: /home/runner/.service
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Flag 'help': 'False'
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Flag 'version': 'False'
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Flag 'commit': 'False'
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Flag 'check': 'False'
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Command 'configure': 'False'
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Command 'remove': 'False'
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Command 'warmup': 'False'
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Env arg 'jitconfig': '***'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO Runner] Saved 361 bytes to '/home/runner/.runner'.
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO Runner] Saved 225 bytes to '/home/runner/.credentials'.
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO Runner] Saved 1631 bytes to '/home/runner/.credentials_rsaparams'.
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationManager] LoadSettings
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] IsConfigured()
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] IsConfigured: True
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationManager] Is configured: True
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] Read setting file: 361 chars
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationManager] Settings Loaded
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] IsServiceConfigured()
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] IsServiceConfigured: False
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Command 'run': 'True'
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] IsConfigured()
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] IsConfigured: True
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationManager] Is configured: True
[RUNNER 2024-04-12 10:43:32Z INFO Runner] Could not parse the argument value '' for StartupType. Defaulting to Manual
[RUNNER 2024-04-12 10:43:32Z INFO Runner] Set runner startup type - Manual
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Flag 'once': 'False'
[RUNNER 2024-04-12 10:43:32Z INFO CommandSettings] Flag 'once': 'False'
[RUNNER 2024-04-12 10:43:32Z INFO Runner] RunAsync
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationManager] LoadSettings
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] IsConfigured()
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] IsConfigured: True
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationManager] Is configured: True
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationManager] Settings Loaded
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener] {
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener]   "AgentId": 435793,
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener]   "AgentName": "infra-gcp-core-2-default-vpj25-runner-nm48c",
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener]   "PoolId": 21,
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener]   "DisableUpdate": true,
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener]   "Ephemeral": true,
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener]   "ServerUrl": "https://pipelinesghubeus22.actions.githubusercontent.com/NGLyubjj83NX6GK0kkdaaXAXLsAulAc1weFnvjL9fCcgzFIdVe/",
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener]   "WorkFolder": "_work"
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener] }
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener] Loading Credentials
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] HasCredentials()
[RUNNER 2024-04-12 10:43:32Z INFO ConfigurationStore] stored True
[RUNNER 2024-04-12 10:43:32Z INFO CredentialManager] GetCredentialProvider
[RUNNER 2024-04-12 10:43:32Z INFO CredentialManager] Creating type OAuth
[RUNNER 2024-04-12 10:43:32Z INFO CredentialManager] Creating credential type: OAuth
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known directory 'Root': '/home/runner'
[RUNNER 2024-04-12 10:43:32Z INFO HostContext] Well known config file 'RSACredentials': '/home/runner/.credentials_rsaparams'
[RUNNER 2024-04-12 10:43:32Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener] Attempt to create session.
[RUNNER 2024-04-12 10:43:32Z INFO MessageListener] Connecting to the Runner Server...
[RUNNER 2024-04-12 10:43:32Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2024-04-12 10:43:32Z INFO RunnerServer] Establish connection with 100 seconds timeout.
[RUNNER 2024-04-12 10:43:32Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2024-04-12 10:43:32Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2024-04-12 10:43:32Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2024-04-12 10:43:32Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2024-04-12 10:43:32Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2024-04-12 10:43:32Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2024-04-12 10:43:32Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2024-04-12 10:43:33Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2024-04-12 10:43:33Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2024-04-12 10:43:33Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2024-04-12 10:43:34Z INFO MessageListener] VssConnection created
[RUNNER 2024-04-12 10:43:34Z INFO Terminal] WRITE LINE:

√ Connected to GitHub
[RUNNER 2024-04-12 10:43:34Z INFO Terminal] WRITE LINE:

[RUNNER 2024-04-12 10:43:34Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2024-04-12 10:43:34Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2024-04-12 10:43:35Z ERR  GitHubActionsService] POST request to https://pipelinesghubeus22.actions.githubusercontent.com/NGLyubjj83NX6GK0kkdaaXAXLsAulAc1weFnvjL9fCcgzFIdVe/_apis/oauth2/token failed. HTTP Status: BadRequest
[RUNNER 2024-04-12 10:43:35Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener] Catch exception during create session.
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener] GitHub.Services.OAuth.VssOAuthTokenRequestException: Registration c9a9807b-9ff9-410e-8cb5-f089835c4f34 was not found.
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener]    at GitHub.Services.OAuth.VssOAuthTokenProvider.OnGetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener]    at GitHub.Services.Common.IssuedTokenProvider.GetTokenOperation.GetTokenAsync(VssTraceActivity traceActivity)
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener]    at GitHub.Services.Common.IssuedTokenProvider.GetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener]    at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener]    at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener]    at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken)
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
[RUNNER 2024-04-12 10:43:35Z ERR  MessageListener]    at GitHub.Runner.Listener.MessageListener.CreateSessionAsync(CancellationToken token)
[RUNNER 2024-04-12 10:43:35Z ERR  Terminal] WRITE ERROR: Failed to create a session. The runner registration has been deleted from the server, please re-configure. Runner registrations are automatically deleted for runners that have not connected to the service recently.
[RUNNER 2024-04-12 10:43:35Z INFO Listener] Runner execution has finished with return code 1
Runner listener exit with terminated error, stop the service, no retry needed.
Exiting runner...

On the controller side, the ephemeral runner registration is created once as expected, but a second time immediately afterwards (which is not expected), resulting in a detected conflict (the runner with a same name exists in GitHub) and causing the controller to delete the existing ephemeral runner registration from GitHub (the one created in the first attempt) and requeue reconciliation.
The next reconciliation ends up creating directly a new ephemeral runner secret for the JIT configuration (with the same registration that has just been deleted) and the runner pod, without recreating a new ephemeral runner registration first, thus resulting in immediate runtime failure for the runner.

First successful attempt to create an ephemeral runner registration:

2024-04-12T10:43:04Z	INFO	EphemeralRunner	Creating new ephemeral runner registration and updating status with runner config	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:04Z	INFO	EphemeralRunner	Creating ephemeral runner JIT config	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:04Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2024-04-12T10:43:04Z	INFO	actions-clients	retrieve actions client	{"githubConfigURL": "https://github.com/camunda", "namespace": "infra"}
2024-04-12T10:43:04Z	INFO	actions-clients	using cache client	{"githubConfigURL": "https://github.com/camunda", "namespace": "infra"}
2024-04-12T10:43:04Z	INFO	actions-clients	refreshing token	{"githubConfigUrl": "https://github.com/camunda"}
2024-04-12T10:43:04Z	INFO	actions-clients	getting access token for GitHub App auth	{"accessTokenURL": "https://api.github.com/app/installations/40503011/access_tokens"}
2024-04-12T10:43:04Z	INFO	AutoscalingRunnerSet	Find existing ephemeral runner set	{"autoscalingrunnerset": {"name":"infra-gcp-core-2-default","namespace":"infra"}, "name": "infra-gcp-core-2-default-vpj25", "specHash": "6759f567c8"}
2024-04-12T10:43:04Z	INFO	actions-clients	getting runner registration token	{"registrationTokenURL": "https://api.github.com/orgs/camunda/actions/runners/registration-token"}
2024-04-12T10:43:04Z	INFO	actions-clients	getting Actions tenant URL and JWT	{"registrationURL": "https://api.github.com/actions/runner-registration"}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Created ephemeral runner JIT config	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "runnerId": 435793}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Updating ephemeral runner status with runnerId and runnerJITConfig	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Updated ephemeral runner status with runnerId and runnerJITConfig	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}

Second unsuccessful attempt to create an ephemeral runner registration:

2024-04-12T10:43:05Z	INFO	EphemeralRunner	Creating new ephemeral runner registration and updating status with runner config	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Creating ephemeral runner JIT config	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2024-04-12T10:43:05Z	INFO	actions-clients	retrieve actions client	{"githubConfigURL": "https://github.com/camunda", "namespace": "infra"}
2024-04-12T10:43:05Z	INFO	actions-clients	using cache client	{"githubConfigURL": "https://github.com/camunda", "namespace": "infra"}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Getting runner jit config failed with conflict error, trying to get the runner by name	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "runnerName": "infra-gcp-core-2-default-vpj25-runner-nm48c"}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Found the runner with the same name	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "runnerId": 435793, "runnerScaleSetId": 136}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Removing the runner with the same name	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Removed the runner with the same name, re-queuing the reconciliation	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Creating new ephemeral runner secret for jitconfig.	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}

More interestingly, subsequently as the controller do not find the ephemeral runner (registration) in GitHub, it concludes immediately that the runner has finished, whereas it has failed.

2024-04-12T10:43:35Z	INFO	EphemeralRunner	Checking if runner exists in GitHub service	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "runnerId": 435793}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Runner does not exist in GitHub service	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "runnerId": 435793}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Ephemeral runner has finished since it does not exist in the service anymore	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Updating ephemeral runner status to Finished	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	EphemeralRunner status is marked as Finished	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
Controller
2024-04-12T10:43:03Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 0, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2024-04-12T10:43:03Z	INFO	EphemeralRunnerSet	Scaling comparison	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "current": 0, "desired": 1}
2024-04-12T10:43:03Z	INFO	EphemeralRunnerSet	Creating new ephemeral runners (scale up)	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "count": 1}
2024-04-12T10:43:03Z	INFO	EphemeralRunnerSet	Creating new ephemeral runner	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "progress": 1, "total": 1}
2024-04-12T10:43:04Z	INFO	AutoscalingRunnerSet	Find existing ephemeral runner set	{"autoscalingrunnerset": {"name":"infra-gcp-core-2-default","namespace":"infra"}, "name": "infra-gcp-core-2-default-vpj25", "specHash": "6759f567c8"}
2024-04-12T10:43:04Z	INFO	EphemeralRunnerSet	Created new ephemeral runner	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "runner": "infra-gcp-core-2-default-vpj25-runner-nm48c"}
2024-04-12T10:43:04Z	INFO	EphemeralRunner	Adding runner registration finalizer	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:04Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2024-04-12T10:43:04Z	INFO	EphemeralRunnerSet	Updating status with current runners count	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "count": 1}
2024-04-12T10:43:04Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2024-04-12T10:43:04Z	INFO	EphemeralRunner	Successfully added runner registration finalizer	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:04Z	INFO	EphemeralRunner	Adding finalizer	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:04Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2024-04-12T10:43:04Z	INFO	AutoscalingRunnerSet	Find existing ephemeral runner set	{"autoscalingrunnerset": {"name":"infra-gcp-core-2-default","namespace":"infra"}, "name": "infra-gcp-core-2-default-vpj25", "specHash": "6759f567c8"}
2024-04-12T10:43:04Z	INFO	EphemeralRunner	Successfully added finalizer	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:04Z	INFO	EphemeralRunner	Creating new ephemeral runner registration and updating status with runner config	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:04Z	INFO	EphemeralRunner	Creating ephemeral runner JIT config	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:04Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2024-04-12T10:43:04Z	INFO	actions-clients	retrieve actions client	{"githubConfigURL": "https://github.com/camunda", "namespace": "infra"}
2024-04-12T10:43:04Z	INFO	actions-clients	using cache client	{"githubConfigURL": "https://github.com/camunda", "namespace": "infra"}
2024-04-12T10:43:04Z	INFO	actions-clients	refreshing token	{"githubConfigUrl": "https://github.com/camunda"}
2024-04-12T10:43:04Z	INFO	actions-clients	getting access token for GitHub App auth	{"accessTokenURL": "https://api.github.com/app/installations/40503011/access_tokens"}
2024-04-12T10:43:04Z	INFO	AutoscalingRunnerSet	Find existing ephemeral runner set	{"autoscalingrunnerset": {"name":"infra-gcp-core-2-default","namespace":"infra"}, "name": "infra-gcp-core-2-default-vpj25", "specHash": "6759f567c8"}
2024-04-12T10:43:04Z	INFO	actions-clients	getting runner registration token	{"registrationTokenURL": "https://api.github.com/orgs/camunda/actions/runners/registration-token"}
2024-04-12T10:43:04Z	INFO	actions-clients	getting Actions tenant URL and JWT	{"registrationURL": "https://api.github.com/actions/runner-registration"}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Created ephemeral runner JIT config	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "runnerId": 435793}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Updating ephemeral runner status with runnerId and runnerJITConfig	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Updated ephemeral runner status with runnerId and runnerJITConfig	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Creating new ephemeral runner registration and updating status with runner config	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Creating ephemeral runner JIT config	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2024-04-12T10:43:05Z	INFO	actions-clients	retrieve actions client	{"githubConfigURL": "https://github.com/camunda", "namespace": "infra"}
2024-04-12T10:43:05Z	INFO	actions-clients	using cache client	{"githubConfigURL": "https://github.com/camunda", "namespace": "infra"}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Getting runner jit config failed with conflict error, trying to get the runner by name	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "runnerName": "infra-gcp-core-2-default-vpj25-runner-nm48c"}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Found the runner with the same name	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "runnerId": 435793, "runnerScaleSetId": 136}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Removing the runner with the same name	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Removed the runner with the same name, re-queuing the reconciliation	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Creating new ephemeral runner secret for jitconfig.	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Creating new secret for ephemeral runner	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Created new secret spec for ephemeral runner	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Created ephemeral runner secret	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "secretName": "infra-gcp-core-2-default-vpj25-runner-nm48c"}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Creating new EphemeralRunner pod.	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Creating new pod for ephemeral runner	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Created new pod spec for ephemeral runner	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Created ephemeral runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "runnerScaleSetId": 136, "runnerName": "infra-gcp-core-2-default-vpj25-runner-nm48c", "runnerId": 435793, "configUrl": "https://github.com/camunda", "podName": "infra-gcp-core-2-default-vpj25-runner-nm48c"}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Updating ephemeral runner status with pod phase	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "statusPhase": "Pending", "statusReason": "", "statusMessage": ""}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Updated ephemeral runner status with pod phase	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:05Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2024-04-12T10:43:05Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:28Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:31Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:32Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:32Z	INFO	EphemeralRunner	Updating ephemeral runner status with pod phase	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "statusPhase": "Running", "statusReason": "", "statusMessage": ""}
2024-04-12T10:43:32Z	INFO	EphemeralRunner	Updated ephemeral runner status with pod phase	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:32Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 0, "running": 1, "finished": 0, "failed": 0, "deleting": 0}
2024-04-12T10:43:32Z	INFO	EphemeralRunnerSet	Updating status with current runners count	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "count": 1}
2024-04-12T10:43:32Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:32Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 0, "running": 1, "finished": 0, "failed": 0, "deleting": 0}
2024-04-12T10:43:32Z	INFO	AutoscalingRunnerSet	Find existing ephemeral runner set	{"autoscalingrunnerset": {"name":"infra-gcp-core-2-default","namespace":"infra"}, "name": "infra-gcp-core-2-default-vpj25", "specHash": "6759f567c8"}
2024-04-12T10:43:35Z	INFO	actions-clients	retrieve actions client	{"githubConfigURL": "https://github.com/camunda", "namespace": "infra"}
2024-04-12T10:43:35Z	INFO	actions-clients	using cache client	{"githubConfigURL": "https://github.com/camunda", "namespace": "infra"}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Checking if runner exists in GitHub service	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "runnerId": 435793}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Runner does not exist in GitHub service	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "runnerId": 435793}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Ephemeral runner has finished since it does not exist in the service anymore	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Updating ephemeral runner status to Finished	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	EphemeralRunner status is marked as Finished	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Deleting the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 0, "running": 0, "finished": 1, "failed": 0, "deleting": 0}
2024-04-12T10:43:35Z	INFO	EphemeralRunnerSet	Updating status with current runners count	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "count": 0}
2024-04-12T10:43:35Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"infra-gcp-core-2-default-vpj25","namespace":"infra"}, "pending": 0, "running": 0, "finished": 1, "failed": 0, "deleting": 0}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	AutoscalingRunnerSet	Find existing ephemeral runner set	{"autoscalingrunnerset": {"name":"infra-gcp-core-2-default","namespace":"infra"}, "name": "infra-gcp-core-2-default-vpj25", "specHash": "6759f567c8"}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	AutoscalingRunnerSet	Find existing ephemeral runner set	{"autoscalingrunnerset": {"name":"infra-gcp-core-2-default","namespace":"infra"}, "name": "infra-gcp-core-2-default-vpj25", "specHash": "6759f567c8"}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:35Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:36Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:36Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:36Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:36Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:36Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:36Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:37Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:37Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:37Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Pod is deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Cleaning up the runner jitconfig secret	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Deleting the jitconfig secret	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:38Z	INFO	EphemeralRunner	Waiting for ephemeral runner owned resources to be deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:43Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}
2024-04-12T10:43:43Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:43Z	INFO	EphemeralRunner	Pod is deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:43Z	INFO	EphemeralRunner	Cleaning up the runner jitconfig secret	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:43Z	INFO	EphemeralRunner	Secret is deleted	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}}
2024-04-12T10:43:43Z	INFO	EphemeralRunner	EphemeralRunner has already finished. Stopping reconciliation and waiting for EphemeralRunnerSet to clean it up	{"ephemeralrunner": {"name":"infra-gcp-core-2-default-vpj25-runner-nm48c","namespace":"infra"}, "phase": "Succeeded"}

In short, if I'm not mistaken:

  • 1st reconciliation
    • runnerId in status is not set
    • try to create an ephemeral runner registration
    • new JIT config is generated
    • status is updated with runnerId and JIT config
  • 2nd reconciliation
    • updated value for runnerId (1st reconciliation) is not read (do not understand why) and is considered not set (the runner is seen as not registered)
    • try to create an ephemeral runner registration
    • conflict is detected when generating the JIT config
    • registration from first reconciliation is deleted
    • reconciliation is requeue
  • 3rd reconciliation
    • updated value for runnerId (1st reconciliation) is (now) read, the runner is considered as registered
    • new ephemeral runner secret for the JIT config is created (with the registration deleted in 2nd reconciliation)
    • runner pod is created and fails (registration has been deleted)
  • The controller does not find the ephemeral runner registration in GitHub and marks the ephemeral runner as finished

Let's consider this as an assumption for now, based on logs and source code. I have struggled a bit to reproduce it on a local setup, since the code base is relatively new to me and I lack time (I'll be on vacation next week).
Hope this will help solve the issue.

@nikola-jokic
Copy link
Collaborator

Hey @clementnero,

Thank you for this great investigation! The PR I created should take care of this problem as well (Long poll restarts the sequence, but during that 50s period, the controller will re-create a session and start the pod successfully), but I'm wondering why the controller didn't pick up the status update. I will dig into it as soon as I can.

@nikola-jokic
Copy link
Collaborator

I'll keep an eye on this issue after the next 0.9.1 release

Thank you all!

@alecor191
Copy link

alecor191 commented May 3, 2024

@nikola-jokic we're experiencing the same issue as described here (randomly jobs get stuck waiting for a runner; killing listener pod unblocks it).

For us, it started yesterday on 0.9.0. So today we upgraded to 0.9.1. Unfortunately, we hit the same issue several times today with 0.9.1.

Do I understand correctly that the above issue should be fixed in 0.9.1? If so, is there any info I can share to support troubleshooting?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode
Projects
None yet
8 participants