Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In case of interrupt signal, Task does not give time to a subprocess to cleanup #458

Closed
marco-m opened this issue Mar 25, 2021 · 14 comments
Closed
Labels
type: bug Something not working as intended.

Comments

@marco-m
Copy link
Contributor

marco-m commented Mar 25, 2021

Background

Although Task has logic to intercept a SIGINT in getSignalContext():

func getSignalContext() context.Context {

the execution of a subprocess is interrupted immediately, not giving time to cleanup. For some subprocesses such as Terraform, this can have severe consequences as explained below.

The cancellable context obtained from getSignalContext() is passed everywhere:

Executor.Run()
Executor.RunTask()
Executor.runCommand()
internal/execext/exec.go:RunCommand()

and finally to

mvdan.cc/sh/v3/interp/interp.New()

The failing use case

I found this problem invoking terraform apply from Task.

Terraform is a tool to provision cloud infrastructure as code. When it runs, it can modify state in a remote backend, for example S3. If interrupted with CTRL-C (so SIGINT), it will take around 10 seconds to cleanup in a way that is safe for the remote state to be used again.

When run under task, this grace period is skipped and Terraform is terminated too fast. This can leave the remote state in a corrupted configuration and require non-obvius manual intervention to recover.

One could argue that Terraform should have a more resilient approach, but I think that in general this is a good example of the general problem.

Proposal 1

In my opinion Task should actually do as Terraform, which reacts differently depending on how many SIGINT signals it has received:

  • on reception of the first SIGINT, enter cleanup state. For Task, this would mean simply record that a SIGINT has been received and wait as usual for the subprocess to terminate.
  • on reception of a second SIGINT, assume that the user decided that the wait is over, and terminate abruptly (the current behavior).

Proposal 2

If proposal 1 is too complicated to implement, a fallback would be to introduce a per-task timeout: when a SIGINT is received, start a timer, and terminate abruptly only on timer expiration

How to reproduce

Taskfile:

version: '3'

tasks:
  default:
    cmds:
      - ./ctrl_c

Go code to simulate a program that needs time to cleanup on reception of SIGINT:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"time"
)

func main() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, os.Interrupt) // Ctrl-C -> SIGINT

	fmt.Printf("Do one of the following:\n")
	fmt.Printf("- From the Task shell: type CTRL-C\n")
	fmt.Printf("- From another shell:  kill -INT %d\n", os.Getpid())

	// block waiting for a signal
	s := <-c
	fmt.Printf("Got signal: %v\n", s)
	fmt.Printf("Simulating cleanup\n")
	time.Sleep(5 * time.Second)
	fmt.Printf("Cleanup done\n")  // <== NOTE THIS LINE
}

When running Task and typing CTRL-C we get:

$ task
task: ./ctrl_c
Do one of the following:
- Type CTRL-C
- kill -INT 5614
^Ctask: signal received: interrupt
Got signal: interrupt
Simulating cleanup
task: Failed to run task "default": context canceled

the exit status of Task is 1 and the line "Cleanup done" is not printed, confirming that the subprocess has been terminated abruptly.

On the other hand, when running Task and sending a INT signal with kill (kill -INT ), we see:

$ task
task: ./ctrl_c
Do one of the following:
- Type CTRL-C
- kill -INT 5599
Got signal: interrupt
Simulating cleanup
Cleanup done

the exit status of Task is 0 and the line "Cleanup done" is printed, confirming that the subprocess terminated by itself cleanly.

@marco-m marco-m added the type: bug Something not working as intended. label Mar 25, 2021
@marco-m-pix4d
Copy link
Contributor

FYI, I am working on a PR.

marco-m-pix4d added a commit to Pix4D/task that referenced this issue Apr 22, 2021
marco-m-pix4d added a commit to Pix4D/task that referenced this issue Apr 22, 2021
marco-m-pix4d added a commit to Pix4D/task that referenced this issue Apr 22, 2021
@andreynering
Copy link
Member

Hi @marco-m (and @marco-m-pix4d 😅), thanks for opening this issue!

We currently use the default timeout logic given by the mvdan.cc/sh library. The implementation is around here:

https://github.com/mvdan/sh/blob/f33507475241da6fc37b972d825c351b94300bab/interp/handler.go#L92-L104

What it does is: when Task receives a SIGINT or SIGTERM signal, we forward that signal to the process immediately but force a SIGKILL after a given timeout. The library default timeout is 2 seconds and we just use that.

I think it'd be possible to increase that if we think it's too low, and/or make it configuration per Taskfile/task/command as well.

@marco-m
Copy link
Contributor Author

marco-m commented Apr 24, 2021

Hello @andreynering :-)

The library default timeout is 2 seconds and we just use that.

Ah, now I understand because in my repro test, to be able to reliably replicate the problem, I had to use a timeout longer than 2s :-D

What it does is: when Task receives a SIGINT or SIGTERM signal, we forward that signal to the process immediately but force a SIGKILL after a given timeout.

I saw the code path with a cancellable context.Context, yes. Currently I am experimenting without any forwarding at all (I pass an empty context to the mvdan.cc/sh Runner), because on Unix there is no need to forward anything, it is the TTY driver that upon receiving CTRL-C sends a SINGINT to the whole process group (https://en.wikipedia.org/wiki/Process_group). On the other hand I am not sure of what will happen on Windows with my current approach.

I will then revisit the approach following your suggestion, thanks.

marco-m-pix4d pushed a commit to Pix4D/task that referenced this issue Apr 28, 2021
marco-m-pix4d pushed a commit to Pix4D/task that referenced this issue Apr 28, 2021
…c/sh

We used to pass to mvdan.cc/sh/interp.Runner a context that was cancelled on
reception of a OS signal. This caused the Runner to terminate the subprocess
abruptly.

The correct behavior instead is for us to completely ignore the signal and let
the subprocess deal with it. If the subprocess doesn't handle the signal, it
will be terminated. If the subprocess does handle the signal, it knows better
than us wether it wants to cleanup and terminate or do something different.

So now we pass an empty context just to make the API of interp.Runner happy

Fixes go-task/task/go-task#458
@marco-m-pix4d
Copy link
Contributor

After further thinking and experimenting, I don't think we can forward anything at all. This is explained in details in the PR #479.

Still, I don't have a solution for Windows.

@olegstepura
Copy link

So what about simply making cleanup timeout configurable? Is that possible already?

@marco-m-pix4d
Copy link
Contributor

@olegstepura as I wrote in #458 (comment), I don't think it is actually possible. Best way to understand is to checkout the PR and experiment with it.

@olegstepura
Copy link

I'm switching from Makefile to Taskfile. I do this on Mac OS. And noticed this issue. If a task starts some docker container and I press ctrl+c container is kept, it does not exit with Taskfile. While with Makefile it did. Now I wonder how they solved it.

@HeCorr
Copy link
Contributor

HeCorr commented Nov 19, 2021

Hello! I'm having the same issue on Windows where Task is not waiting for my app to cleanly shutdown.

From what I've seen, there's no such option to disable the sending of a Kill signal which would be ideal in my scenario since my app handles interrupts by itself.

Or at least, change the current implementation so that it still waits for the timeout but sends a Kill signal instead of an Interrupt.

@HeCorr
Copy link
Contributor

HeCorr commented Dec 26, 2021

Can this be fixed please? It turned out to be a big deal for me. I keep encountering this issue again and again on any long-running app, pretty much running Task useless for such tasks :/

I would fix it myself but I have no idea of how.

Maybe this is causing the issue? Why would the process be instantly killed only on windows instead of waiting for the timeout as well?
image

I tried removing that and recompiling Task but it didn't work. Maybe I modified the incorrect file.

@HeCorr
Copy link
Contributor

HeCorr commented Dec 26, 2021

Actually it seems Task doesn't even use that code. That explains why my changes didn't have an effect.

I think I found the actual code responsible for handling signals and adding a delay didn't solve the issue, which tells me Task might be exiting early without waiting for goroutines to exit or something..

image

image

@marco-m-pix4d
Copy link
Contributor

In case it is not clear: there is a PR to fix this, see #479.

@olegstepura
Copy link

@marco-m-pix4d what is the current status of that PR?

ghostsquad pushed a commit that referenced this issue May 14, 2022
ghostsquad pushed a commit that referenced this issue May 14, 2022
…c/sh

We used to pass to mvdan.cc/sh/interp.Runner a context that was cancelled on
reception of a OS signal. This caused the Runner to terminate the subprocess
abruptly.

The correct behavior instead is for us to completely ignore the signal and let
the subprocess deal with it. If the subprocess doesn't handle the signal, it
will be terminated. If the subprocess does handle the signal, it knows better
than us wether it wants to cleanup and terminate or do something different.

So now we pass an empty context just to make the API of interp.Runner happy

Fixes go-task/task/#458
@ghostsquad
Copy link
Contributor

this should be fixed now with #479 merged. Though we still need to do a release for this feature. Closing this for now.

@ssbarnea
Copy link
Contributor

ssbarnea commented Jun 1, 2022

Any chance to get a new release for this? I just encountred this bug with a simple npm compile that fails with:

ERROR: got KeyboardInterrupt signal
Interrupted (^C): KeyboardInterrupt:

I mentioned that there was no Ctrl-C made by the user at all.

andreynering added a commit that referenced this issue Jun 12, 2022
Task will now give time for the processes running to do cleanup work

Ref #458
Ref #479
Fixes #728

Co-authored-by: Marco Molteni <marco.molteni@pix4d.com>
Co-authored-by: aliculPix4D <aleksandar.licul_ext@pix4d.com>
andreynering added a commit that referenced this issue Jun 12, 2022
Task will now give time for the processes running to do cleanup work

Ref #458
Ref #479
Fixes #728

Co-authored-by: Marco Molteni <marco.molteni@pix4d.com>
Co-authored-by: aliculPix4D <aleksandar.licul_ext@pix4d.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something not working as intended.
Projects
None yet
Development

No branches or pull requests

7 participants