Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent runners from being stuck in Terminating when pod disappeared without standard termination process #1318

Merged
merged 1 commit into from
Apr 8, 2022

Conversation

mumoshu
Copy link
Collaborator

@mumoshu mumoshu commented Apr 8, 2022

This fixes the said issue by additionally treating any runner pod whose phase is Failed or the runner container exited with non-zero code as "complete" so that ARC gives up unregistering the runner from Actions, deletes the runner pod anyway.

Note that there might be plenty of possible causes for that. If you are deploying runner pods on AWS spot instances or GCE preemptive instances and a job assigned to a runner took more time than the shutdown grace period provided by your cloud provider (2 minutes for AWS spot instances), the runner pod would be terminated prematurely without letting actions/runner unregisters itself from Actions. If your VM or hypervisor failed then runner pods that were running on the node will become PodFailed without unregistering runners from Actions.

Please beware that ARC leaves dangling runners on Actions in such circumstances, with or without this patch, because ARC can't unregister runners because Actions doesn't automatically change disappeared runners from busy to non-busy, at least for several hours, and there's no API to forcefully change runners to non-busy or force-delete runners(DeleteRunner API returns 422 on busy runners). It is currently the users' responsibility to clean up any dangling runner resources on GitHub Actions.

Ref #1307
Might also relate to #1273

…t standard termination process

This fixes the said issue by additionally treating any runner pod whose phase is Failed or the runner container exited with non-zero code as "complete" so that ARC gives up unregistering the runner from Actions, deletes the runner pod anyway.

Note that there are a plenty of causes for that. If you are deploying runner pods on AWS spot instances or GCE preemptive instances and a job assigned to a runner took more time than the shutdown grace period provided by your cloud provider (2 minutes for AWS spot instances), the runner pod would be terminated prematurely without letting actions/runner unregisters itself from Actions. If your VM or hypervisor failed then runner pods that were running on the node will become PodFailed without unregistering runners from Actions.

Please beware that it is currently users responsibility to clean up any dangling runner resources on GitHub Actions.

Ref #1307
Might also relate to #1273
@mumoshu mumoshu changed the title Prevent runners from stuck in Terminating when pod disappeared without standard termination process Prevent runners from being stuck in Terminating when pod disappeared without standard termination process Apr 8, 2022
@mumoshu mumoshu added this to the v0.22.3 milestone Apr 8, 2022
@mumoshu mumoshu merged commit b09c540 into master Apr 8, 2022
@mumoshu mumoshu deleted the give-up-runner-unregistration-on-pod-failure branch April 8, 2022 01:17
billimek added a commit to homedepot/actions-runner-controller that referenced this pull request Apr 12, 2022
* master: (30 commits)
  chore(deps): update azure/setup-helm action to v2.1 (actions#1328) [skip ci]
  Add more usages of RUNNER_VERSION to Renovate config. (actions#1313)
  chore: fix typo (actions#1316) [skip ci]
  chore: bump chart to latest (actions#1319)
  Fix release workflow
  Prevent runners from stuck in Terminating when pod disappeared without standard termination process (actions#1318)
  ci: pin go version to the known working version (actions#1303)
  chore: bump chart to latest (actions#1300)
  Fix runner pod to be cleaned up earlier regardless of the sync period (actions#1299)
  Make the hard-coded runner startup timeout to avoid race on token expiration longer (actions#1296)
  docs: highlight why persistent are not ideal
  fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.2
  chore(deps): update dependency actions/runner to v2.289.2
  chore(deps): update helm/chart-releaser-action action to v1.4.0 (actions#1287)
  refactor(runner/entrypoint): check for externalstmp (actions#1277)
  docs: redundant words
  docs: wording
  docs: add autoscaling also causes problems
  chore: new line for consistency
  docs: use the right font
  ...
@mumoshu mumoshu mentioned this pull request Apr 14, 2022
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant