-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
os/exec: add fields for managing termination signals and pipes #50436
Comments
Change https://golang.org/cl/373005 mentions this issue: |
I think this may be innately race-prone in that theoretically, the process can terminate and be replaced by a new process with that PID between the check and the kill. I have vague recollections of someone having told me there's a fancy new feature in some Linuxes that could bypass this, and it's admittedly a very small window for a very large amount of stuff to happen. |
I'm a bit unclear on the meaning of the exported fields. Are they required to be set prior to launching a command, or can they be changed during the lifespan of the command? What happens if I change c.Interrupt after starting a process? What happens if I write a new value into c.Context while the process is running, and then after doing that close the previous context? etc. I'm not sure what the answers should be, I'm pretty sure that any reasonably plausible answer will be fine, but I think it should be explicit. I think I'd favor "Context is a read-only field once the command is started, Interrupt and WaitDelay are both writeable". |
That is #13987, which seems orthogonal. |
They are like the other (existing) exported fields of |
It is the same issue as 13987, but it's a new circumstance in which we hit that which might be non-obvious to a user, who is assuming that if we are killing "the" process, we'll pick the right one. |
That's true, but applies equally to all of the (numerous) workarounds that build the same behavior on top of |
The Go standard library already uses that feature (the |
I share this concern enthusiastically. Depending on what program you're executing, and in what environment and context, you may need very different timeouts. I also agree that we should let the user decide. As much as I would love for Go to just "do the right thing" without having to extend the API, I don't think we can do that without introducing footguns.
I had originally given #23019 a thumbs up, but I've moved my thumbs up here now instead :) I even think #23019 might qualify as a breaking change, given how it may break some unlucky programs where the timeout may lead to losing some output. |
Thinking about this some more, there is one other interesting caveat: error behavior. If we stop reading the program's output before EOF, it may be useful for the caller to know that. Stopping output before EOF seems orthogonal to the exit status of the program, which is what the Should setting |
The Interrupt field seems super useful to me. However, IMHO adding the Context field to simplify the documentation might end up hurting more than helping as now I've to understand why there are two ways to pass context. Especially after reading this:
|
Someone might ask themselves: is it only if I pass Context as a field or also when using CommandContext? |
Note that without the Context field, we would still have to explain that
Probably we could update the |
I gave this some thought and arrived at:
|
I thought some more about the appropriate Windows behaviors. I added an “Alternatives considered” section and removed the future option for Instead, on Windows the |
This proposal has been added to the active column of the proposals project |
It seems like the docs need to distinguish what we call cmd.Stdin from cmd.Stdout/cmd.Stderr. cmd.Stdin is already closed when we finish writing whatever was standard input, independent of exiting/waiting/cancellation. I'm also confused by the WaitDelay description mixing cancellation and ordinary exits. Maybe it would help to see the diff that would implement this? I looked at CL 373005 but it's a whole package, not a diff. |
A hang due to So, at least in that case, we really do need to treat the I/O pipes symmetrically. |
Thanks for the clarification about stdin. I still feel like it would help me to see the diff in the package, as opposed to a fork as in CL 373005. I'm still not 100% sure I understand the proposal. |
Found another “run or kill” function in the main Go repo, and this one doesn't even bother using an appropriate signal (it skips straight to That manifests as a less-than-useful failure mode in #50973. |
I'm still a little confused about the implications here. I'd like to see a concrete CL to understand it better. Will leave this open for this week, but let me know if you'd rather we put it on hold for now. |
Change https://go.dev/cl/446876 mentions this issue: |
I noticed some test failures in the build dashboard after CL 445597 that made me realize the grace period should be based on the test timeout, not the Context timeout: if the test itself sets a short timeout for a command, we still want to give the test process enough time to consume and log its output. I also put some more thought into how one might debug a test hang, and realized that in that case we don't want to set a WaitDelay at all: instead, we want to leave the processes in their stuck state so that they can be investigated with tools like `ps` and 'lsof'. Updates #50436. Change-Id: I65421084f44eeaaaec5dd2741cd836e9e68dd380 Reviewed-on: https://go-review.googlesource.com/c/go/+/446875 Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com> Run-TryBot: Bryan Mills <bcmills@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
ErrWaitDelay is not expected to occur in this test, but if it does it indicates a failure mode very different from the “failed to start” catchall that we log for other non-ExitError errors. Updates #50436. Change-Id: I3f4d87d502f772bf471fc17303d5a6b483446f8f Reviewed-on: https://go-review.googlesource.com/c/go/+/446876 Reviewed-by: Ian Lance Taylor <iant@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Bryan Mills <bcmills@google.com> Run-TryBot: Bryan Mills <bcmills@google.com>
This permits us to safely support concurrent access to files on Plan 9. Concurrent access was already safe on other systems. This does introduce a change: if one goroutine calls a blocking read on a pipe, and another goroutine closes the pipe, then before this CL the close would occur. Now the close will be delayed until the blocking read completes. Also add tests that concurrent I/O and Close on a pipe are OK. For golang#50436 For golang#56043 Change-Id: I969c869ea3b8c5c2f2ef319e441a56a3c64e7bf5 Reviewed-on: https://go-review.googlesource.com/c/go/+/438347 Reviewed-by: Bryan Mills <bcmills@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com> Reviewed-by: David du Colombier <0intro@gmail.com> Run-TryBot: Ian Lance Taylor <iant@google.com> Auto-Submit: Ian Lance Taylor <iant@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Rob Pike <r@golang.org>
Fixes golang#50436. Change-Id: I9dff8caa317a04b7b2b605f810b8f12ef8ca485d Reviewed-on: https://go-review.googlesource.com/c/go/+/401835 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Bryan Mills <bcmills@google.com> Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com>
This adds a testenv.CommandContext function, with timeout behavior based on the existing logic in cmd/go.TestScript: namely, the command is terminated with SIGQUIT (if supported) with an arbitrary grace period remaining until the test's deadline. If the test environment does not support executing subprocesses, CommandContext skips the test. If the command is terminated due to the timout expiring or the test fails to wait for the command after starting it, CommandContext marks the test as failing. For tests where a shorter timeout is desired (such as for fail-fast behavior), one may be supplied using context.WithTimeout. The more concise Command helper is like CommandContext but without the need to supply an explicit Context. Updates golang#50436. Change-Id: Ifd81fb86c402f034063c9e9c03045b4106eab81a Reviewed-on: https://go-review.googlesource.com/c/go/+/445596 Run-TryBot: Bryan Mills <bcmills@google.com> Reviewed-by: Austin Clements <austin@google.com> Auto-Submit: Bryan Mills <bcmills@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@google.com>
For most tests, the test's deadline itself is more appropriate than an arbitrary timeout layered atop of it (especially once golang#48157 is implemented), and testenv.Command already adds cleaner timeout behavior when a command would run too close to the test's deadline. That makes RunWithTimeout something of an attractive nuisance. For now, migrate the two existing uses of it to testenv.CommandContext, with a shorter timeout implemented using context.WithTimeout. As a followup, we may want to drop the extra timeouts from these invocations entirely. Updates golang#50436. Updates golang#37405. Change-Id: I16840fd36c0137b6da87ec54012b3e44661f0d08 Reviewed-on: https://go-review.googlesource.com/c/go/+/445597 Reviewed-by: Ian Lance Taylor <iant@google.com> Run-TryBot: Bryan Mills <bcmills@google.com> Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Austin Clements <austin@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
I noticed some test failures in the build dashboard after CL 445597 that made me realize the grace period should be based on the test timeout, not the Context timeout: if the test itself sets a short timeout for a command, we still want to give the test process enough time to consume and log its output. I also put some more thought into how one might debug a test hang, and realized that in that case we don't want to set a WaitDelay at all: instead, we want to leave the processes in their stuck state so that they can be investigated with tools like `ps` and 'lsof'. Updates golang#50436. Change-Id: I65421084f44eeaaaec5dd2741cd836e9e68dd380 Reviewed-on: https://go-review.googlesource.com/c/go/+/446875 Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com> Run-TryBot: Bryan Mills <bcmills@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
ErrWaitDelay is not expected to occur in this test, but if it does it indicates a failure mode very different from the “failed to start” catchall that we log for other non-ExitError errors. Updates golang#50436. Change-Id: I3f4d87d502f772bf471fc17303d5a6b483446f8f Reviewed-on: https://go-review.googlesource.com/c/go/+/446876 Reviewed-by: Ian Lance Taylor <iant@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Bryan Mills <bcmills@google.com> Run-TryBot: Bryan Mills <bcmills@google.com>
The implementation is merged and I've added a couple of call sites that use this API in tests to shake out any loose ends. I think all that's left at this point is a release note! |
Change https://go.dev/cl/449076 mentions this issue: |
Updates #50436. Change-Id: Ib6771221bda1c81d5593b29d7287ebcf169882ff Reviewed-on: https://go-review.googlesource.com/c/go/+/449076 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@google.com> Auto-Submit: Bryan Mills <bcmills@google.com> Run-TryBot: Bryan Mills <bcmills@google.com>
Change https://go.dev/cl/377835 mentions this issue: |
The function is derived from the testenv.Command function in the main repo as of CL 446875, with a couple of modifications to allow it to build (with more limited functionality) with Go versions as old as 1.16 (currently needed in order to test gopls with such versions). testenv.Command sets up an exec.Cmd with more useful termination behavior in the context of a test: namely, it is terminated with SIGQUIT (to get a goroutine dump from the subprocess) shortly before the test would otherwise time out. Assuming that the test logs the output from the command appropriately, this should make deadlocks and unexpectedly slow operations easier to diagnose in the builders. For golang/go#50014. Updates golang/go#50436. Change-Id: I872d4b24e63951bf9b7811189e672973d366fb78 Reviewed-on: https://go-review.googlesource.com/c/tools/+/377835 Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org> Reviewed-by: Joedian Reid <joedian@golang.org> TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Bryan Mills <bcmills@google.com> Run-TryBot: Bryan Mills <bcmills@google.com> Reviewed-by: Robert Findley <rfindley@google.com> gopls-CI: kokoro <noreply+kokoro@google.com>
Change https://go.dev/cl/456116 mentions this issue: |
Change https://go.dev/cl/460997 mentions this issue: |
Change https://go.dev/cl/460998 mentions this issue: |
In CL 436655 I added a GODEBUG setting to this test process to verify that Wait is eventually called for every exec.Cmd before it becomes unreachable. However, the cmdHang test helpers in TestWaitInterrupt/Exit-hang and TestWaitInterrupt/SIGKILL-hang intentially leak a subprocess in order to simulate a leaky third-party program, as Go users might encounter in practical use. To avoid tripping over the leak check, we call Wait on the leaked subprocess in a background goroutine. Since we expect the process running cmdHang to exit before its subprocess does, the call to Wait should have no effect beyond suppressing the leak check. Fixes #57596. Updates #52580. Updates #50436. Change-Id: Ia4b88ea47fc6b605c27ca6d9d7669c874867a900 Reviewed-on: https://go-review.googlesource.com/c/go/+/460998 Run-TryBot: Bryan Mills <bcmills@google.com> Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Updates #50436. Updates #56163. Fixes #24050. Change-Id: I1b00eb8fb60e0879f029642b5bad97b2e139fee6 Reviewed-on: https://go-review.googlesource.com/c/go/+/456116 Reviewed-by: Ian Lance Taylor <iant@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Bryan Mills <bcmills@google.com> Auto-Submit: Bryan Mills <bcmills@google.com>
This uses the new Cancel and WaitDelay fields of os/exec.Cmd (added in #50436) to interrupt or kill the 'hg serve' command when its incoming http.Request is canceled. This should keep the vcweb hg handler from getting stuck if 'hg serve' hangs after the request either completes or is canceled. Fixes #57597 (maybe). Change-Id: I53cf58e8ab953fd48c0c37f596f99e885a036d9b Reviewed-on: https://go-review.googlesource.com/c/go/+/460997 Reviewed-by: Ian Lance Taylor <iant@google.com> Run-TryBot: Bryan Mills <bcmills@google.com> Auto-Submit: Bryan Mills <bcmills@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Add WaitDelay to ensure cmd.Wait() returns in a reasonable timeframe if the goroutines that cmd.Start() uses to copy Stdin/Stdout/Stderr are blocked when copying due to a sub-subprocess holding onto them. Read more details in these issues: - golang/go#23019 - golang/go#50436 This isn't the original intent of kill-delay, but it seems reasonable to reuse it in this context. Fixes canonical#149
…/out/err (#275) Use os.exec's Cmd.WaitDelay to ensure cmd.Wait returns in a reasonable timeframe if the goroutines that cmd.Start() uses to copy stdin/out/err are blocked when copying due to a sub-subprocess holding onto them. Read more details about the issue in golang/go#23019 and the proposed solution (that was added in Go 1.20) in golang/go#50436. This solves issue #149, where Patroni wasn't restarting properly even after a `KILL` signal was sent to it. I had originally mis-diagnosed this problem as an issue with Pebble not tracking the process tree of processes that daemonise and change their process group (which is still an issue, but is not causing this problem). The Patroni process wasn't being marked as finished at all due to being blocked on the `cmd.Wait()`. Patroni starts sub-processes and "forwards" stdin/out/err, so the copy goroutines block. Thankfully Go 1.20 introduced `WaitDelay` to allow you to easily work around this exact problem. The fix itself is [this one-liner] (#275): s.cmd.WaitDelay = s.killDelay() * 9 / 10 // 90% of kill-delay This will really only be a problem for services, but we make the same change for exec and exec health checks as it won't hurt there either. Also, as a drive-by, this PR also canonicalises some log messages: our style is to start with an uppercase letter (for logs, not errors) and to use "Cannot X" rather than "Error Xing". Fixes #149.
Background
#23019 (accepted but not yet implemented; CC @ianlancetaylor @bradfitz) proposed to change
exec.Cmd.Wait
to stop the goroutines that are copying I/O to and from a completedexec.Cmd
; see that proposal for further background on the problem it aims to address. However, as noted in #23019 (comment) and #23019 (comment), any feasible implementation of the proposal requires the use of an arbitrary timeout, and the proposal does not include a mechanism to adjust that timeout. (Given our history with the Go project's builders, I am extremely skeptical that any particular hard-coded timeout can strike an appropriate balance between robustness and latency.)#31774, #22757, and #21135 proposed to allow users of
exec.CommandContext
to customize the signal sent to the command when the context is canceled. They were all declined due to lack of concrete demand for the feature (#21135 (comment), #22757 (comment), #31774 (comment)). We have since accrued a number of copies of functions that work around the feature's absence. In the Go project alone, we have:I'm attempting to add yet another variation (in CL 373005) in order to help diagnose #50014. However, for this variation (prompted by discussions with @aclements and @prattmic) I have tried to make this variation a minimally-invasive change on top of the
exec.Cmd
API.I believe I have achieved that goal: the API requires the addition of only 2–3 new fields and no new methods or top-level functions. You can view (and try out) a prototype as
github.com/bcmills/more/os/moreexec
, which provides a drop-in replacement for a subset of theexec.Cmd
API.Proposal
I propose the addition of the following fields to the
exec.Cmd
struct, along with their corresponding implementation:The new
Context
field is exported only in order to simplify the documentation for theInterrupt
andWaitDelay
fields. (It was requested and rejected in #46699, but the objection there was my own — due to concerns about the interactions with the API in this proposal. It could be excised from this proposal without damaging anything but documentation clarity.)The new
Interrupt
field sets the signal to be sent when theContext
is done.exec.CommandContext
explicitly sets it toos.Kill
in order to maintain the existing behavior ofexec.CommandContext
, but I expect many users on Unix platforms will want to set it toos.Interrupt
orsyscall.SIGQUIT
instead.The new
WaitDelay
field sets the interval to wait for input and output after process termination or an interrupt signal. That interval turns out to be important for many testing applications (such as the Go Playground implementation and thecmd/go
test suite). It also generalizes nicely to the use-cases in #23019: settingWaitDelay
withoutContext
provides bounded I/O wait times without sending a preceding signal.Compatibility
I believe that this proposal is entirely backward-compatible (in contrast with #23019). The zero-values for the new fields provide exactly the same behavior as a
Cmd
returned byexec.Command
orexec.CommandContext
today.Caveats
This proposal does not address graceful shutdown on Windows (#22757 (comment); CC @mvdan). However, it may be possible to extend it to do so by providing special-case Windows behavior when the
Interrupt
field is set toos.Interrupt
, or by adding anInterruptFunc func(*Cmd)
callback that would also be invoked whenContext
is done.The proposed API also does not provide a mechanism to send an
Interrupt
signal followed byos.Kill
after a delay but still wait for subprocesses to close all I/O pipes. I believe the use-cases for that scenario are sufficiently niche to be provided only by third-party libraries: sendingSIGKILL
to the parent process makes it likely that subprocesses will not know to shut down, so in the vast majority of cases users should either not sendSIGKILL
at all (WaitDelay
== 0), forcibly terminate the pipes to try to kill the subprocesses withSIGPIPE
(WaitDelay
> 0), or do something platform-specific to try to forcibly shut down an entire process group (outside the scope of this proposal).Alternatives considered
In #31774 (comment), @bradfitz suggested a field
Kill func(*os.Process)
, which would presumably be added instead of theInterrupt
field in this proposal. However, I believe that such a field would be simultaneously too complex and not powerful enough:The
Kill
field would be too complex for most Unix applications, which overwhelmingly only need to send one ofSIGTERM
,SIGINT
,SIGQUIT
, orSIGKILL
— why pass a whole callback when you really just want to say which signal you need?A
*os.Process
callback would still not be powerful enough for Windows applications. If I understand the discussion in os: Interrupt is not sendable on Windows #6720 correctly (CC @alexbrainman),CTRL_BREAK_EVENT
is sent to an entire process group, not a single*os.Process
, so Windows users would also need a mechanism for creating (or determining) such a group, or some completely separate out-of-band way to request that the process terminate (such as by sending it a particular input or IPC message).Given the above, the
Interrupt
field seems more ergonomic: it gives the right behavior for Unix users, and if Windows users want to do something more complex they can setInterrupt
tonil
and start a separate goroutine in between the calls to(*Cmd).Start
and(*Cmd).Wait
to implement whatever custom logic they want.The text was updated successfully, but these errors were encountered: