In case of interrupt signal, Task does not give time to a subprocess to cleanup #458

marco-m · 2021-03-25T10:19:28Z

Background

Although Task has logic to intercept a SIGINT in getSignalContext():

Line 209 in d3cd9f1

func getSignalContext() context.Context {

the execution of a subprocess is interrupted immediately, without giving it time to clean up. For some subprocesses such as Terraform, this can have severe consequences as explained below.

The cancellable context obtained from getSignalContext() is passed everywhere:

Executor.Run()
Executor.RunTask()
Executor.runCommand()
internal/execext/exec.go:RunCommand()

and finally to

mvdan.cc/sh/v3/interp/interp.New()

The failing use case

I found this problem invoking terraform apply from Task.

Terraform is a tool to provision cloud infrastructure as code. When it runs, it can modify state in a remote backend, for example S3. If interrupted with CTRL-C (so SIGINT), it takes around 10 seconds to clean up in a way that leaves the remote state safe to use again.

When run under Task, this grace period is skipped and Terraform is terminated too quickly. This can leave the remote state corrupted and require non-obvious manual intervention to recover.

One could argue that Terraform should have a more resilient approach, but I think that in general this is a good example of the general problem.

Proposal 1

In my opinion Task should actually do as Terraform, which reacts differently depending on how many SIGINT signals it has received:

on reception of the first SIGINT, enter cleanup state. For Task, this would mean simply recording that a SIGINT has been received and waiting as usual for the subprocess to terminate.
on reception of a second SIGINT, assume that the user decided that the wait is over, and terminate abruptly (the current behavior).

Proposal 2

If proposal 1 is too complicated to implement, a fallback would be to introduce a per-task timeout: when a SIGINT is received, start a timer, and terminate abruptly only when the timer expires.

How to reproduce

Taskfile:

version: '3'

tasks:
  default:
    cmds:
      - ./ctrl_c

Go code to simulate a program that needs time to cleanup on reception of SIGINT:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"time"
)

func main() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, os.Interrupt) // Ctrl-C -> SIGINT

	fmt.Printf("Do one of the following:\n")
	fmt.Printf("- From the Task shell: type CTRL-C\n")
	fmt.Printf("- From another shell:  kill -INT %d\n", os.Getpid())

	// block waiting for a signal
	s := <-c
	fmt.Printf("Got signal: %v\n", s)
	fmt.Printf("Simulating cleanup\n")
	time.Sleep(5 * time.Second)
	fmt.Printf("Cleanup done\n")  // <== NOTE THIS LINE
}

When running Task and typing CTRL-C we get:

$ task
task: ./ctrl_c
Do one of the following:
- Type CTRL-C
- kill -INT 5614
^Ctask: signal received: interrupt
Got signal: interrupt
Simulating cleanup
task: Failed to run task "default": context canceled

the exit status of Task is 1 and the line "Cleanup done" is not printed, confirming that the subprocess has been terminated abruptly.

On the other hand, when running Task and sending an INT signal with kill (kill -INT <pid>), we see:

$ task
task: ./ctrl_c
Do one of the following:
- Type CTRL-C
- kill -INT 5599
Got signal: interrupt
Simulating cleanup
Cleanup done

the exit status of Task is 0 and the line "Cleanup done" is printed, confirming that the subprocess terminated by itself cleanly.

marco-m-pix4d · 2021-04-20T12:15:57Z

FYI, I am working on a PR.

Helper (sleepit) and test code based on https://github.com/marco-m/timeit go-task/task/go-task#458

andreynering · 2021-04-23T20:56:04Z

Hi @marco-m (and @marco-m-pix4d 😅), thanks for opening this issue!

We currently use the default timeout logic given by the mvdan.cc/sh library. The implementation is around here:

https://github.com/mvdan/sh/blob/f33507475241da6fc37b972d825c351b94300bab/interp/handler.go#L92-L104

What it does is: when Task receives a SIGINT or SIGTERM signal, we forward that signal to the process immediately but force a SIGKILL after a given timeout. The library default timeout is 2 seconds and we just use that.

I think it'd be possible to increase that if we think it's too low, and/or make it configuration per Taskfile/task/command as well.

marco-m · 2021-04-24T07:57:43Z

Hello @andreynering :-)

The library default timeout is 2 seconds and we just use that.

Ah, now I understand why, in my repro test, I had to use a timeout longer than 2s to be able to reliably replicate the problem :-D

What it does is: when Task receives a SIGINT or SIGTERM signal, we forward that signal to the process immediately but force a SIGKILL after a given timeout.

I saw the code path with a cancellable context.Context, yes. Currently I am experimenting without any forwarding at all (I pass an empty context to the mvdan.cc/sh Runner). On Unix there is no need to forward anything: upon receiving CTRL-C, the TTY driver sends a SIGINT to the whole foreground process group (https://en.wikipedia.org/wiki/Process_group). On the other hand, I am not sure what will happen on Windows with my current approach.

I will then revisit the approach following your suggestion, thanks.

See go-task/task/go-task#458 Helper (sleepit) and test code based on https://github.com/marco-m/timeit

…c/sh We used to pass to mvdan.cc/sh/interp.Runner a context that was cancelled on reception of an OS signal. This caused the Runner to terminate the subprocess abruptly. The correct behavior instead is for us to completely ignore the signal and let the subprocess deal with it. If the subprocess doesn't handle the signal, it will be terminated. If the subprocess does handle the signal, it knows better than us whether it wants to clean up and terminate or do something different. So now we pass an empty context just to make the API of interp.Runner happy Fixes go-task/task/go-task#458

marco-m-pix4d · 2021-04-29T08:56:56Z

After further thinking and experimenting, I don't think we can forward anything at all. This is explained in detail in the PR #479.

Still, I don't have a solution for Windows.

olegstepura · 2021-08-16T20:55:44Z

So what about simply making cleanup timeout configurable? Is that possible already?

marco-m-pix4d · 2021-08-17T06:36:22Z

@olegstepura as I wrote in #458 (comment), I don't think it is actually possible. The best way to understand is to check out the PR and experiment with it.

olegstepura · 2021-08-17T07:55:43Z

I'm switching from Makefile to Taskfile on macOS and noticed this issue. If a task starts a Docker container and I press Ctrl+C, the container keeps running; it does not exit with Taskfile, while with Makefile it did. Now I wonder how they solved it.

HeCorr · 2021-11-19T06:08:46Z

Hello! I'm having the same issue on Windows where Task is not waiting for my app to cleanly shutdown.

From what I've seen, there's no such option to disable the sending of a Kill signal which would be ideal in my scenario since my app handles interrupts by itself.

Or at least, change the current implementation so that it still waits for the timeout but sends a Kill signal instead of an Interrupt.

HeCorr · 2021-12-26T18:37:40Z

Can this be fixed please? It turned out to be a big deal for me. I keep encountering this issue again and again on any long-running app, pretty much rendering Task useless for such tasks :/

I would fix it myself but I have no idea of how.

Maybe this is causing the issue? Why would the process be instantly killed only on windows instead of waiting for the timeout as well?

I tried removing that and recompiling Task but it didn't work. Maybe I modified the incorrect file.

HeCorr · 2021-12-26T19:11:38Z

Actually it seems Task doesn't even use that code. That explains why my changes didn't have an effect.

I think I found the actual code responsible for handling signals and adding a delay didn't solve the issue, which tells me Task might be exiting early without waiting for goroutines to exit or something..

marco-m-pix4d · 2022-01-03T09:35:10Z

In case it is not clear: there is a PR to fix this, see #479.

olegstepura · 2022-01-04T23:42:16Z

@marco-m-pix4d what is the current status of that PR?

ghostsquad · 2022-05-16T00:08:37Z

this should be fixed now with #479 merged. Though we still need to do a release for this feature. Closing this for now.

ssbarnea · 2022-06-01T14:02:51Z

Any chance to get a new release for this? I just encountered this bug with a simple npm compile that fails with:

ERROR: got KeyboardInterrupt signal
Interrupted (^C): KeyboardInterrupt:

I should mention that the user did not press Ctrl-C at all.

Task will now give time for the processes running to do cleanup work Ref #458 Ref #479 Fixes #728 Co-authored-by: Marco Molteni <marco.molteni@pix4d.com> Co-authored-by: aliculPix4D <aleksandar.licul_ext@pix4d.com>

marco-m added the type: bug Something not working as intended. label Mar 25, 2021

marco-m-pix4d mentioned this issue Apr 20, 2021

signals: do not try to catch uncatchable signals #476

Merged

marco-m-pix4d added a commit to Pix4D/task that referenced this issue Apr 22, 2021

add regression test for SIGINT behavior

b994f39

Helper (sleepit) and test code based on https://github.com/marco-m/timeit go-task/task/go-task#458

marco-m-pix4d added a commit to Pix4D/task that referenced this issue Apr 22, 2021

add regression test for SIGINT behavior

ba69548

Helper (sleepit) and test code based on https://github.com/marco-m/timeit go-task/task/go-task#458

marco-m-pix4d added a commit to Pix4D/task that referenced this issue Apr 22, 2021

add regression test for SIGINT behavior

a5de3da

Helper (sleepit) and test code based on https://github.com/marco-m/timeit go-task/task/go-task#458

marco-m-pix4d pushed a commit to Pix4D/task that referenced this issue Apr 28, 2021

add regression test for SIGINT behavior

a496a75

See go-task/task/go-task#458 Helper (sleepit) and test code based on https://github.com/marco-m/timeit

marco-m-pix4d mentioned this issue Apr 28, 2021

Fix signal handling for subprocess #479

Merged

marco-m-pix4d mentioned this issue Apr 29, 2021

Fix signal handling for subprocess Pix4D/task#1

Merged

ghostsquad pushed a commit that referenced this issue May 14, 2022

add regression test for SIGINT behavior

bffb6e1

See go-task/task/#458 Helper (sleepit) and test code based on https://github.com/marco-m/timeit

ghostsquad mentioned this issue May 14, 2022

Ctx/Signal Handling still not totally correct #728

Closed

ghostsquad closed this as completed May 16, 2022

andreynering mentioned this issue Jun 12, 2022

Fix behavior of interrupt (SIGINT, SIGTERM) signals #769

Merged
