Graceful job termination #26311

jupe · 2020-07-03T09:50:46Z

jupe
Jul 3, 2020

When cancelling job there is chance that running software handle some resources that needs to release when job is cancelled. To make this robust way GitHub Actions should send signals for running process so that application can tear-down tasks properly. Gentle terminator would send first SIGINT, wait some seconds and if app still doesn’t die it would send another signal SIGTERM and finally SIGKILL to force terminate it eventually.

Probably GH already manage this someway, but at least I didn’t find any documentation about subject. would be good to document it properly how it behaviours at the moment so it’s easier to propose changes if needed.

Here is nice document for Jenkins about same issue: https://gist.github.com/datagrok/dfe9604cb907523f4a2f

Answered by BrightRan

Jul 20, 2020

@jupe,

According the introduction from the engineering team, after the user click “Cancel workflow”:

The server will re-evaluate job-if condition on all running jobs.
If the job condition is always(), it will not get canceled.
For the rest of the jobs that need cancellation, the server will send a cancellation message to all the runners.
Each runner has 5 minutes to finish the cancellation process before the server force terminate the job.
The runner will re-evaluate if condition on the current running step.
If the step condition is always(), it will not get canceled.
Otherwise, the runner will send Ctrl-C to the action entry process (node for javascript action, docker for c…

View full answer

BrightRan · 2020-07-06T02:21:19Z

BrightRan
Jul 6, 2020

@jupe,

Thanks for your feedback.
I have created an internal ticket to help you report this question to the appropriate engineering team for further discussion and evaluation. If they have any update, I will notify you in time, and sometimes the appropriate engineers may directly reply you here.

1 reply

lundstrj Oct 11, 2023

Hello @BrightRan any movement since 2020?

BrightRan · 2020-07-20T01:34:38Z

BrightRan
Jul 20, 2020

@jupe,

According the introduction from the engineering team, after the user click “Cancel workflow”:

The server will re-evaluate job-if condition on all running jobs.
If the job condition is always(), it will not get canceled.
For the rest of the jobs that need cancellation, the server will send a cancellation message to all the runners.
Each runner has 5 minutes to finish the cancellation process before the server force terminate the job.
The runner will re-evaluate if condition on the current running step.
If the step condition is always(), it will not get canceled.
Otherwise, the runner will send Ctrl-C to the action entry process (node for javascript action, docker for container action, and bash/cmd/pwd for run action), if the process doesn’t exit within 7500ms, the runner will send Ctrl-Break to the process, then wait for 2500ms for the process to exit. the runner will terminate the process tree if the process is still running.
The runner will try all the following steps that have condition sets to always() as many as it can within the 5 minutes cancellation timeout.

Hope this can help you understand better.

3 replies

briceburg Jun 20, 2023

this does not seem correct. the cancelled job is not sent SIGINT/Ctrl-C , it exits immediately without sending any termination messages. https://github.com/orgs/community/discussions/26311#discussioncomment-6231898

also, is the 7500ms configurable? if each runner has 5minutes to finish cancellation process, 7.5s for process termination seems way to low.

justinmchase Dec 4, 2023

We are using shell: bash and run: terraform ... and I am seeing the same thing as @briceburg, the process is not gracefully terminated it just abruptly exits and leaves our terraform state file in a locked state.

Do we need to use traps or what is the proper way to terminate a process in a bash shell action?

justinmchase Mar 21, 2024

I ended up using a node20 action and making a simple js file which just uses nodes child_process module to invoke terraform and process.on('SIGINT', ...) to propagate cancellation signals. For some reason I just could not make it work with the bash action.

acvejic · 2021-07-01T09:02:32Z

acvejic
Jul 1, 2021

Hi @brightran

I have a terraform/terragrunt action that when it is canceled it leaves state file locked in AWS dynamoDB. Then I need to force unlock it and most of the resources are not in state file so my state is corrupted.

jobs:
    terragrunt:  
      runs-on: ubuntu-18.04
      if: "!(contains(github.event.head_commit.message, 'skip all') || \
            startsWith(github.event.head_commit.message, 'Merge branch ''develop'' into') \
          )"
...
          - name: terragrunt apply
            if: "!contains(github.event.head_commit.message, 'skip tf') && env.DIR_EXISTS=='true' && env.DESTROY_QA_ENV!='true'"
            working-directory: .terraform/environments/${{ steps.parse_branch.outputs.qa_name }}
            run: |
                 terragrunt apply -auto-approve

And when I cancel the job all I see in action console is:

module.sisCodeDeploy.null_resource.wait_for_admin[0]: Still creating... [2m30s elapsed]
Error: The operation was canceled.

So it does not look like job allowed terraform to do all the necessary steps, to save state, release lock etc.
Any suggestions what to do?

0 replies

james-portman-az · 2022-02-10T11:14:35Z

james-portman-az
Feb 10, 2022

Hi,

This is happening for us too, terraform does not have chance to release the state locks.
It does not appear to be working as documented, it just immediately terminates.

If it is working as documented then it would be good to be able to extend the 7500ms shutdown time

0 replies

ruohola · 2022-03-27T15:14:15Z

ruohola
Mar 27, 2022

I can also confirm that using cancel-in-progress: true with a GitHub Actions CI pipeline that runs terraform plan does not work well. If the plan step is canceled it will just immediately terminate and not gracefully release the lock.

I’ve tested locally sending SIGINT and SIGTERM signals to a running terraform plan and it does exit gracefully and releases the lock. GitHub Actions seems to be sending some other, more aggressive signal, when killing the workflow.

0 replies

jsoref · 2022-03-28T15:05:31Z

jsoref
Mar 28, 2022

If you’re using terraform, I’d suggest using atlantis:

  <a href="https://www.runatlantis.io/" target="_blank" rel="noopener">runatlantis.io</a>

Terraform Pull Request Automation | Atlantis

Atlantis: Terraform Pull Request Automation

Yes, it means you’re running a small VM somewhere, but, it also means you don’t have to worry about it being killed.

0 replies

pauldraper · 2022-06-07T20:39:57Z

pauldraper
Jun 7, 2022

I confirmed; Github runner is using SIGKILL for cancel-in-progress.

1 reply

briceburg Jun 20, 2023

Ideally the runner would attempt a SIGINT, to allow the executing process time to cleanup, before SIGKILL / halting...

@BrightRan suggests that a SIGINT is sent, but clearly this is not happening per the job logs (terraform will log that it is trying to gracefully exit)... again, this is requesting we try to gracefully terminate processes in jobs where concurrency has cancel-in-progress: true

CharlesLiu-TOPNetwork · 2022-10-25T08:43:41Z

CharlesLiu-TOPNetwork
Oct 25, 2022

Can we make job termination operate more gentle?
Uncompleted compilation file won't be clean after cancel-in-progress KILL make process.

0 replies

cornerman · 2023-02-24T21:50:04Z

cornerman
Feb 24, 2023

This seems like a common issue for everybody running tasks like terraform - that need to gracefully exit. You can and should disable fail-fast in the github action, so jobs don't get cancelled if a new one arrives. But having the cancel button in the UI which could be accidentally clicked by some user is really dangerous. In case of terraform that means potentially leaking resources - that need to be cleaned up manually afterwards.

Is there any solution for this?

0 replies

ringerc · 2023-06-21T04:07:00Z

ringerc
Jun 21, 2023

I wrote a demo to illustrate and confirm the behaviour described above, which really should be in the docs.

At least on a Linux runner, "CTRL-C" means SIGINT and "CTRL-Break" (also CTRL-) means SIGTERM.

.github/workflows/cancel-test.yaml:

name: Cancel test
on:
  - workflow_dispatch
jobs:
  cancel-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: |
          exec python3 signaller.py
      - if: always()
        run: |
          cat signaller.log

signaller.py:

#!/usr/bin/env python3

import time
import signal
import sys

signaller_log = open("signaller.log", "w")

def handler(signum, frame):
  signame = signal.Signals(signum).name
  print(f'Signal handler called with signal {signame} ({signum}) at {time.time()}', file=sys.stderr)
  print(f'Signal handler called with signal {signame} ({signum}) at {time.time()}', file=signaller_log)
  sys.stderr.flush()
  signaller_log.flush()

for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGHUP, signal.SIGUSR1, signal.SIGUSR2, signal.SIGABRT, signal.SIGQUIT, signal.SIGPIPE):
  signal.signal(sig, handler)

while True:
  print(f"tick {time.time()}", file=sys.stderr)
  print(f"tick {time.time()}", file=signaller_log)
  sys.stderr.flush()
  signaller_log.flush()
  time.sleep(1);

in this simple case the result observed was consistent with above docs - SIGINT, about 7.5s, SIGTERM, about 2.5s, and presumably SIGKILL + cleanup.

But ... note that I've exec'd the Python interpreter running the script so it's the leader process.

If I don't exec the interpreter, the Python script dies without warning or any chance to clean anything up!

It looks like the github actions runner probably waits for the session leader process to exit, then hard-kills anything under it when it exits.

It doesn't appear to deliver signals to the process tree by signalling the process group; AFAICS it only signals the leader process. So the leader must install a signal handler that explicitly propagates signals to child processes then waits for them to exit.

0 replies

ringerc · 2023-06-22T00:55:44Z

ringerc
Jun 22, 2023

I wrote this up better in a demo at https://github.com/ringerc/github-actions-signal-handling-demo since I wasn't satisfied with the answers from @BrightRan above, nor my earlier quick tests.

It's a right mess. It looks like you really need to rely on if: always() blocks, and assume the whole process tree of a step can be nuked without warning. It's possible to implement signal-propagation to child procs, but it's ugly, and without any way to change the timeouts it's still going to be fragile.

2 replies

briceburg Jun 29, 2023

✅ this is super helpful @ringerc . I've opened a discussion to track a configurable grace period here: https://github.com/orgs/community/discussions/59652

ringerc Jul 24, 2023

Thanks @briceburg . It's especially difficult to correctly handle failures because a given step's processes could be SIGKILL'd without a chance to clean up, but per the test case I supplied it's also possible for processes from a step to keep running after the step is cancelled, so they're still running when an if: always() cleanup or recovery step runs. And it's easy to leak such processes into the background if, say, you're using shell job control to run two coupled commands in a step.

You can imagine that this would be bad if, say, your if: always() or if: failed() step does a terraform force-unlock to make sure that a terraform lock is released when a job gets cancelled, but the terraform process run by the cancelled step got detached from the step's process group so it kept running in the background when the step got killed...

AFAICS the only correct way to do it is to:

have your critical commands write pid-files when they start, recording the pid and process start time
have them delete the pid file on termination
in an if: always() or if: failure() step, test if the pid file exists, and if it does:
- test the pid it points to is still running, and the pid it points to has the same process start time as in the pid file (to detect pid reuse). If still running, kill -KILL the process tree of the target process (and hope nothing detached/daemonized off it).
- now that you know for sure the critical process is terminated, perform your cleanup action, which must not take longer than 7.5 seconds allowing for possible worker load, scheduler delays, network I/O etc, so it'd better be simple and minimal

Even then, this is subject to a race between the critical process succeeding and when it then deletes its pid file on completion or otherwise locally records success. So your cleanup step must be idempotent and not do anything wrong if the critical step actually succeeded, then crashed or timed out in the moment before it could delete its pid file.

kedare · 2023-07-28T10:00:37Z

kedare
Jul 28, 2023

Has something changed in the way Github processing of the job termination ?
I am not receiving any signal that can be catched anymore, everything just gets killed directly.

1 reply

ringerc Aug 2, 2023

@kedare you won't get a signal unless your process is the session leader. If you run it via a shell, it won't get a signal. See the detailed demos I wrote in the repo linked above.

restfulhead · 2023-09-13T14:40:16Z

restfulhead
Sep 13, 2023

Why is this marked as answered when it isn't really answered? How to avoid e.g. terraform apply to be killed without cleaning up?

3 replies

ringerc Sep 17, 2023

Because GH don't really expect Actions to work reliably I guess. It's not even documented. Maybe they figure if they make it bad enough people will be driven to try Azure Devops.

jsoref Sep 18, 2023

Here's a way to avoid the problem: #26311 (comment)

gerbyzation Oct 25, 2023

Yes this is quite annoying. For terraform we are using terraform cloud, so that the actual apply is triggered but not executed on github actions runners. If a job is terminated this won't brick the terraform operation halfway through.

We have a similar problem with helm releases, which can get stuck in an upgrade-pending state when the job is killed while updating a release. Our current workaround is to have another step that will run if a job is cancelled and rollback if the release is in a bad state.

      - name: Rollback if cancelled
        if: always() && (steps.helm-update.outcome == 'cancelled' || steps.helm-update.outcome == 'failure')
        run: |
          helm status ${{ inputs.chart_values_file_name }} | \
            grep -qi 'pending-upgrade' && \
            helm rollback ${{ inputs.chart_values_file_name }}

breathe · 2023-11-15T00:31:47Z

breathe
Nov 15, 2023

Could we maybe adjust the wrapper so that it invokes terraform via tini -- krallin/tini#8 -- or something like it which 'does a good job' (I'm guessing?) of forwarding signals to child processes ...

0 replies

breathe · 2023-11-15T02:02:45Z

breathe
Nov 15, 2023

This is the workaround I'm using now ...

No terraform wrapper because the node process which implements the wrapper doesn't (and can't?) forward signals

    - uses: hashicorp/setup-terraform@v2
      with:
        terraform_wrapper: false

dumb bash script wrapper: forward as much as possible to terraform and only respond to the first signal to maximize the time terraform will spend trying to cleanup. Save this to a file in the repo (${{ github.workspace }}/terraform/terraform-ci.bash in example code below)

#!/bin/bash 

COUNTER=0
_term() { 
  echo "Caught SIGTERM signal!" 

  if [[ $COUNTER -lt 1 ]] ; then
    echo "Passing signal to terraform"
    kill -TERM "$child" 2>/dev/null
  else
    echo "Already passed signal to terraform"  
  fi

  let COUNTER++
}

_int() { 
  echo "Caught SIGINT signal!"

  if [[ $COUNTER -lt 1 ]] ; then
    echo "Passing signal to terraform"
    kill -INT "$child" 2>/dev/null
  else
    echo "Already passed signal to terraform"  
  fi

  let COUNTER++
}

_other() { 
  echo "Caught OTHER signal!" 

  if [[ $COUNTER -lt 1 ]] ; then
    echo "Passing signal to terraform"
    kill -INT "$child" 2>/dev/null
  else
    echo "Already passed signal to terraform"  
  fi

  let COUNTER++
}

trap _term SIGTERM
trap _int SIGINT

trap _other SIGHUP
trap _other SIGUSR1
trap _other SIGUSR2
trap _other SIGABRT
trap _other SIGQUIT
trap _other SIGPIPE

terraform "$@" &
child=$! 
wait "$child"

use exec when invoking terraform to avoid an extra shell process in the signal propogation chain

    - name: Terraform Apply
      run: exec ${{ github.workspace }}/terraform/terraform-ci.bash apply -input=false ...

2 replies

justinmchase Dec 4, 2023

Very nice, this is the right answer.

What signals are you seeing actually trapped in _other?

breathe Dec 5, 2023

Not seeing any output through the _other handlers so far. Put that there initially to try and get visibility into what github actions does -- and then left it there for a little extra visibility in case something changes

npwolf · 2024-03-20T23:03:28Z

npwolf
Mar 20, 2024

Here is my workaround. Like @breathe , I don't use the terraform wrapper. Aside from that, I instead use tini to make sure all signals get propagated:

      - name: Setup terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false

      - name: Initialize terraform environment
        run: |
          terraform init
          terraform validate
          sudo apt install -y tini  

      - name: terraform apply
        # Need tini so terraform gets SIGTERM appropriately 
        run: |
          exec tini -s -g -- terraform apply -auto-approve

Which results in something like the following when the job gets cancelled:

Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...

Stopping operation...
╷
│ Error: execution halted
│ 
│ 
╵
╷
│ Error: execution halted
│ 
│ 
╵
╷
│ Error: waiting for ECS Service (arn:aws:ecs:....) create: context canceled
│ 
│   <error about resource>
│ 
╵
Error: The operation was canceled.

3 replies

r34son Mar 21, 2024

Great! Thanks a lot

BenJackGill May 22, 2024

This solution is working great for me. But I have a question because my logs look like this:

Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...

Stopping operation...
module.firebase.google_firestore_index.main["8"]: Still creating... [2m20s elapsed]
module.firebase.google_firestore_index.main["1"]: Still creating... [2m20s elapsed]
module.firebase.google_firestore_index.main["4"]: Still creating... [2m20s elapsed]
module.github.github_actions_environment_variable.main["PROJECT_ID"]: Creation complete after 6s [id=seoturbo:orange-9813a:PROJECT_ID]
module.github.github_actions_environment_variable.main["CATEGORY_PASCAL_CASE"]: Creation complete after 7s [id=seoturbo:orange-9813a:CATEGORY_PASCAL_CASE]
module.github.github_actions_environment_variable.main["FB_API_KEY"]: Creation complete after 7s [id=seoturbo:orange-9813a:FB_API_KEY]
module.github.github_actions_environment_variable.main["FB_MEASUREMENT_ID"]: Creation complete after 5s [id=seoturbo:orange-9813a:FB_MEASUREMENT_ID]
module.firebase.google_firestore_index.main["0"]: Still creating... [20s elapsed]

Two interrupts received. Exiting immediately. Note that data loss may have
occurred.


Error: operation canceled

No lock file is produced, so I can start up the Terraform process again successfully, which is great.

But I am a little worried about the log messages.

It seems like I am getting a graceful shutdown at Gracefully shutting down... but then later it says Two interrupts received. Exiting immediately. Note that data loss may have occurred.. Should I be worried about that? Is Terraform exiting before all data can be saved to state?

breathe May 22, 2024

Yes -- that second message indicates that terraform is shutting down without completing its safe shutdown process. The lock file may not get cleaned up and/or there could be resources created by terraform which are not recorded as being owned by terraform -- which means that on a subsequent execution of terraform, terraform may observe some resources that it actually did create but think that they are not owned by the terraform stack and fail with an error about refusing to modify resources that it doesn't own.

In my experience (at least with the resources in the stacks I use) - within github actions it's better to give terraform as much time as it can possibly have than it is to send it that second shutdown signal -- that's why the script I posted has the else clause like this to disregard all signals past the first

else
echo "Already passed signal to terraform"
fi

Both forms are technically wrong in terms of there still exists an unresolved race condition between terraform getting shut down and terraform saving state -- but in practice from my observations at least -- giving it more time to complete safe shutdown seems to be more worth it than sending the signal and shutting it down faster.

I also include a statement in my terraform github action to guarantee that the lock file has been removed whenever a cancellation occurs

In my case it looks like this (but obviously depends on where you store the state as to where the lock file is).

       - name: Ensure lockfile doesn't exist if the job was cancelled
        # its possible the cancellation may have occurred while terraform was running and held a lock
        # the cancellation signal is delivered in such a way that terraform is not able to safely clean its lock up
        # simple solution: if the github action is cancelled, then just ensure any lock associated with this task doesn't exist
        if: ${{ cancelled() }}
        run: |
          gcloud storage rm gs://terraform-state-${{inputs.TF_STATE_PROJECT}}/${{github.repository}}-${{ inputs.working_dir }}/${{inputs.TF_WORKSPACE}}.tflock || echo "No Lock Removed -- it must not exist"

If you manage the lock like this tho -- then you'll also want another mechanism to guarantee that only a single github action runner is running the workflow at a time ... I use logic like this (but my env is likely unique in that we auto deploy new environments per PR -- so the aim with the below is to ensure at most one action running terraform per pr at a time -- if you don't have separate terraform workspaces for every pr then you'd want the below to look different

# The goal is to cancel workflows in progress when they are superseded by new commits to a branch with an open PR
# We don't ever want to cancel:
# - workflows triggered to deploy from the main branch after the main branch is modified.  Commits to main should always trigger actions which run sequentially to completion
# - workflows triggered when a PR is closed/merged -- these actions clean up resources previously created for testing that PR and should run to completion
concurrency:
  group: workflow-name-${{ github.workflow }} pr-${{ github.event.pull_request.number || 0}} event_name-${{github.event_name}}
  cancel-in-progress: ${{ github.event_name != 'push' && github.event.action != 'closed'}}

dimikot · 2024-04-11T20:24:35Z

dimikot
Apr 11, 2024

I built a simple tool which speeds up the signals propagation based on the above material:
dimikot/signal-fanout

It can be just put to shell: key of the step specification. And then, all of the processes launched from that step (including nested and called without exec) will receive SIGTERM immediately on the workflow cancellation.

...
jobs:
  my-job:
    runs-on: ubuntu-latest
    steps:
      - name: Long-running step
        shell: signal-fanout {0}
        run: |
          for i in $(seq 1 30); do echo "$(date): $i"; sleep 1; done

5 replies

BenJackGill May 21, 2024

Sorry for the basic questions, but how exactly do we use this?

Download the signal-fanout file from your repo
Upload to somewhere in my repo. At the root or somewhere else?
Do we also need to stuff like chmod +x to make it runnable?

I put mine in root and added to the job step as you show above. But I keep getting signal-fanout: command not found.

Also shouldn't the file have a .sh extension?

jsoref May 21, 2024

On unix/posix, file extensions are conventions and totally optional.

Yes, the file needs to be +x for it to work.

The file also needs to somehow be in your path (you can use the $GITHUB_PATH file to add it).

The easiest way would be to wrap it into an action (I'm playing with that now).

BenJackGill May 21, 2024

Thanks, I'll look into all that now but I'm new to all this so that Action would be amazing if you could share once you get it working :)

jsoref May 21, 2024

dimikot/signal-fanout#1

You should be able to do:

- uses: jsoref/signal-fanout@c4aaa067cb278157a5f03488579f2a435cabaa5a
- shell: signal-fanout {0}

Note that GitHub today isn't working particularly well, so testing this was hard...

BenJackGill May 22, 2024

Amazing. Thank you. I did some testing and this was my result.

Used like this in my job:

      - name: Signal Fanout
        uses: jsoref/signal-fanout@c4aaa067cb278157a5f03488579f2a435cabaa5a

      - name: Terraform - Apply
        id: terraform-apply
        working-directory: ./apps/infra
        shell: signal-fanout {0}
        run: |
          terraform apply -auto-approve -no-color \
            -var="github_token=${{ secrets.GH_PAT }}" \
            -var="github_repository=${{ github.repository }}" \
            -var="github_repository_id=${{ github.repository_id }}" \
            -var="project_id=${{ env.PROJECT_ID }}"

But unfortunately it didn't work as expected...

There are two main problems where the lock file comes into play:

PROBLEM: If other jobs in the workflow fail while Terraform is running, then the Terraform job also gets cancelled abruptly and locked. This is because Github Actions has fail-fast which defaults to true.
ACTION RESULT: Using the provided Action Terraform does not stop at all when other jobs fail. This is unexpected because fail-fast was not changed in my workflow.
EXPECTED RESULT: What should happen is if other jobs fail then the Terraform job should also stop, but do so gracefully without locking.
PROBLEM: If a user clicks the Cancel button while Terraform is running, then Terraform job gets cancelled abruptly and locked. ACTION RESULT: Using this Action Terraform does stop the job, but it doesn't do it gracefully because it becomes locked.
EXPECTED RESULT: It should stop the job but do so gracefully without locking.

Did you experience the same?

kamatama41 · 2024-11-15T06:04:04Z

kamatama41
Nov 15, 2024

FYI: We are using the following snippet as a workaround.

    - name: Terraform apply
      id: apply
      run: terraform apply -no-color -auto-approve

    - name: Release lock if exists
      if: ${{ steps.apply.outcome == 'cancelled' && always() }}
      run: |
        lock_id=$(terraform plan -no-color -refresh=false 2>&1 | grep ' ID: ' | cut -d: -f2 | tr -d ' ' || true)
        if [[ -n "${lock_id}" ]]; then
          terraform force-unlock -force ${lock_id}
        fi

0 replies

Graceful job termination #26311

Replies: 18 comments · 21 replies

Replies: 18 comments 21 replies