Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: add fields for managing termination signals and pipes #50436

Closed
bcmills opened this issue Jan 4, 2022 · 71 comments
Closed

os/exec: add fields for managing termination signals and pipes #50436

bcmills opened this issue Jan 4, 2022 · 71 comments

Comments

@bcmills
Copy link
Contributor

bcmills commented Jan 4, 2022

Background

#23019 (accepted but not yet implemented; CC @ianlancetaylor @bradfitz) proposed to change exec.Cmd.Wait to stop the goroutines that are copying I/O to and from a completed exec.Cmd; see that proposal for further background on the problem it aims to address. However, as noted in #23019 (comment) and #23019 (comment), any feasible implementation of the proposal requires the use of an arbitrary timeout, and the proposal does not include a mechanism to adjust that timeout. (Given our history with the Go project's builders, I am extremely skeptical that any particular hard-coded timeout can strike an appropriate balance between robustness and latency.)

#31774, #22757, and #21135 proposed to allow users of exec.CommandContext to customize the signal sent to the command when the context is canceled. They were all declined due to lack of concrete demand for the feature (#21135 (comment), #22757 (comment), #31774 (comment)). We have since accrued a number of copies of functions that work around the feature's absence. In the Go project alone, we have:

I'm attempting to add yet another variation (in CL 373005) in order to help diagnose #50014. However, for this variation (prompted by discussions with @aclements and @prattmic) I have tried to make this variation a minimally-invasive change on top of the exec.Cmd API.

I believe I have achieved that goal: the API requires the addition of only 2–3 new fields and no new methods or top-level functions. You can view (and try out) a prototype as github.com/bcmills/more/os/moreexec, which provides a drop-in replacement for a subset of the exec.Cmd API.

Proposal

I propose the addition of the following fields to the exec.Cmd struct, along with their corresponding implementation:

	// Context is the context that controls the lifetime of the command
	// (typically the one passed to CommandContext).
	Context context.Context

	// If Interrupt is non-nil, Context must also be non-nil and Interrupt will be
	// sent to the child process when Context is done.
	//
	// If the command exits with a success code after the Interrupt signal has
	// been sent, Wait and similar methods will return Context.Err()
	// instead of nil.
	//
	// If the Interrupt signal is not supported on the current platform
	// (for example, if it is os.Interrupt on Windows), Start may fail
	// (and return a non-nil error).
	Interrupt os.Signal

	// If WaitDelay is non-zero, the command's I/O pipes will be closed after
	// WaitDelay has elapsed after either the command's process has exited or
	// (if Context is non-nil) Context is done, whichever occurs first.
	// If the command's process is still running after WaitDelay has elapsed,
	// it will be terminated with os.Kill before the pipes are closed.
	//
	// If the command exits with a success code after pipes are closed due to
	// WaitDelay and no Interrupt signal has been sent, Wait and similar methods
	// will return ErrWaitDelay instead of nil.
	//
	// If WaitDelay is zero (the default), I/O pipes will be read until EOF,
	// which might not occur until orphaned subprocesses of the command have
	// also closed their descriptors for the pipes.
	WaitDelay time.Duration

The new Context field is exported only in order to simplify the documentation for the Interrupt and WaitDelay fields. (It was requested and rejected in #46699, but the objection there was my own — due to concerns about the interactions with the API in this proposal. It could be excised from this proposal without damaging anything but documentation clarity.)

The new Interrupt field sets the signal to be sent when the Context is done. exec.CommandContext explicitly sets it to os.Kill in order to maintain the existing behavior of exec.CommandContext, but I expect many users on Unix platforms will want to set it to os.Interrupt or syscall.SIGQUIT instead.

The new WaitDelay field sets the interval to wait for input and output after process termination or an interrupt signal. That interval turns out to be important for many testing applications (such as the Go Playground implementation and the cmd/go test suite). It also generalizes nicely to the use-cases in #23019: setting WaitDelay without Context provides bounded I/O wait times without sending a preceding signal.

Compatibility

I believe that this proposal is entirely backward-compatible (in contrast with #23019). The zero-values for the new fields provide exactly the same behavior as a Cmd returned by exec.Command or exec.CommandContext today.

Caveats

This proposal does not address graceful shutdown on Windows (#22757 (comment); CC @mvdan). However, it may be possible to extend it to do so by providing special-case Windows behavior when the Interrupt field is set to os.Interrupt, or by adding an InterruptFunc func(*Cmd) callback that would also be invoked when Context is done.

The proposed API also does not provide a mechanism to send an Interrupt signal followed by os.Kill after a delay but still wait for subprocesses to close all I/O pipes. I believe the use-cases for that scenario are sufficiently niche to be provided only by third-party libraries: sending SIGKILL to the parent process makes it likely that subprocesses will not know to shut down, so in the vast majority of cases users should either not send SIGKILL at all (WaitDelay == 0), forcibly terminate the pipes to try to kill the subprocesses with SIGPIPE (WaitDelay > 0), or do something platform-specific to try to forcibly shut down an entire process group (outside the scope of this proposal).

Alternatives considered

In #31774 (comment), @bradfitz suggested a field Kill func(*os.Process), which would presumably be added instead of the Interrupt field in this proposal. However, I believe that such a field would be simultaneously too complex and not powerful enough:

  • The Kill field would be too complex for most Unix applications, which overwhelmingly only need to send one of SIGTERM, SIGINT, SIGQUIT, or SIGKILL — why pass a whole callback when you really just want to say which signal you need?

  • A *os.Process callback would still not be powerful enough for Windows applications. If I understand the discussion in os: Interrupt is not sendable on Windows #6720 correctly (CC @alexbrainman), CTRL_BREAK_EVENT is sent to an entire process group, not a single *os.Process, so Windows users would also need a mechanism for creating (or determining) such a group, or some completely separate out-of-band way to request that the process terminate (such as by sending it a particular input or IPC message).

Given the above, the Interrupt field seems more ergonomic: it gives the right behavior for Unix users, and if Windows users want to do something more complex they can set Interrupt to nil and start a separate goroutine in between the calls to (*Cmd).Start and (*Cmd).Wait to implement whatever custom logic they want.

@gopherbot gopherbot added this to the Proposal milestone Jan 4, 2022
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/373005 mentions this issue: internal/moreexec: add a utility package for manipulating os/exec.Cmd

@seebs
Copy link
Contributor

seebs commented Jan 4, 2022

// If the command's process is still running after WaitDelay has elapsed,
// it will be terminated with os.Kill before the pipes are closed.

I think this may be innately race-prone in that theoretically, the process can terminate and be replaced by a new process with that PID between the check and the kill. I have vague recollections of someone having told me there's a fancy new feature in some Linuxes that could bypass this, and it's admittedly a very small window for a very large amount of stuff to happen.

@seebs
Copy link
Contributor

seebs commented Jan 4, 2022

I'm a bit unclear on the meaning of the exported fields. Are they required to be set prior to launching a command, or can they be changed during the lifespan of the command? What happens if I change c.Interrupt after starting a process? What happens if I write a new value into c.Context while the process is running, and then after doing that close the previous context? etc.

I'm not sure what the answers should be, I'm pretty sure that any reasonably plausible answer will be fine, but I think it should be explicit. I think I'd favor "Context is a read-only field once the command is started, Interrupt and WaitDelay are both writeable".

@bcmills
Copy link
Contributor Author

bcmills commented Jan 4, 2022

the process can terminate and be replaced by a new process with that PID between the check and the kill.

That is #13987, which seems orthogonal.

@bcmills
Copy link
Contributor Author

bcmills commented Jan 4, 2022

I'm a bit unclear on the meaning of the exported fields. Are they required to be set prior to launching a command, or can they be changed during the lifespan of the command?

They are like the other (existing) exported fields of exec.Cmd: they must be set prior to Start (or equilavent) and then not modified until Wait (or equivalent) returns.

@seebs
Copy link
Contributor

seebs commented Jan 4, 2022

It is the same issue as 13987, but it's a new circumstance in which we hit that which might be non-obvious to a user, who is assuming that if we are killing "the" process, we'll pick the right one.

@bcmills
Copy link
Contributor Author

bcmills commented Jan 4, 2022

That's true, but applies equally to all of the (numerous) workarounds that build the same behavior on top of os/exec.Cmd today. (I don't think this proposal makes that race any worse for anybody than it already is.)

@ianlancetaylor
Copy link
Contributor

theoretically, the process can terminate and be replaced by a new process with that PID between the check and the kill. I have vague recollections of someone having told me there's a fancy new feature in some Linuxes that could bypass this, and it's admittedly a very small window for a very large amount of stuff to happen.

The Go standard library already uses that feature (the waitid or wait6 system call with an option of WNOWAIT) on Linux, DragonFly, and FreeBSD. (For a while we used waitid on macOS but stopped doing so due to #19314.)

@mvdan
Copy link
Member

mvdan commented Jan 4, 2022

However, as noted in #23019 (comment) and #23019 (comment), any feasible implementation of the proposal requires the use of an arbitrary timeout, and the proposal does not include a mechanism to adjust that timeout. (Given our history with the Go project's builders, I am extremely skeptical that any particular hard-coded timeout can strike an appropriate balance between robustness and latency.)

I share this concern enthusiastically. Depending on what program you're executing, and in what environment and context, you may need very different timeouts. I also agree that we should let the user decide. As much as I would love for Go to just "do the right thing" without having to extend the API, I don't think we can do that without introducing footguns.

I believe that this proposal is entirely backward-compatible (in contrast with #23019). The zero-values for the new fields provide exactly the same behavior as a Cmd returned by exec.Command or exec.CommandContext today.

I had originally given #23019 a thumbs up, but I've moved my thumbs up here now instead :) I even think #23019 might qualify as a breaking change, given how it may break some unlucky programs where the timeout may lead to losing some output.

@bcmills
Copy link
Contributor Author

bcmills commented Jan 4, 2022

Thinking about this some more, there is one other interesting caveat: error behavior.

If we stop reading the program's output before EOF, it may be useful for the caller to know that. Stopping output before EOF seems orthogonal to the exit status of the program, which is what the Wait and similar methods currently report.

Should setting WaitDelay cause Wait and similar methods to return a different error if end up forcibly closing I/O pipes?

@henvic
Copy link
Contributor

henvic commented Jan 4, 2022

The Interrupt field seems super useful to me.

However, IMHO adding the Context field to simplify the documentation might end up hurting more than helping as now I've to understand why there are two ways to pass context. Especially after reading this:

// If Interrupt is non-nil, Context must also be non-nil and Interrupt will be
// sent to the child process when Context is done.

@henvic
Copy link
Contributor

henvic commented Jan 4, 2022

Someone might ask themselves: is it only if I pass Context as a field or also when using CommandContext?

@bcmills
Copy link
Contributor Author

bcmills commented Jan 5, 2022

IMHO adding the Context field to simplify the documentation might end up hurting more than helping as now I've to understand why there are two ways to pass context.

Note that without the Context field, we would still have to explain that Interrupt can only be used (or only has an effect) with Cmd instances returned by CommandContext. So I'm not sure that removing it would help with that confusion much. 😅

Someone might ask themselves: is it only if I pass Context as a field or also when using CommandContext?

Probably we could update the CommandContext doc comment to explicitly state that it sets the Context and Interrupt fields?

@bcmills
Copy link
Contributor Author

bcmills commented Jan 5, 2022

Should setting WaitDelay cause Wait and similar methods to return a different error if end up forcibly closing I/O pipes?

I gave this some thought and arrived at:

  • Setting WaitDelay should not change the exit code if the command fails. The fact that the command failed is almost always enough for the caller to know that its output may be suspecct.
  • However, if the command exits with a successful status (that is, if Wait would otherwise return nil) but we had to forcibly close the pipes, we should return a distinguished error (perhaps a new exec.ErrWaitDelay?) so that the caller knows the input and/or output may have been truncated.

@bcmills
Copy link
Contributor Author

bcmills commented Jan 5, 2022

I thought some more about the appropriate Windows behaviors. I added an “Alternatives considered” section and removed the future option for CTRL_BREAK_EVENT, since per #6720 it seems not to be feasible.

Instead, on Windows the Start method should fail with an error wrapping syscall.EWINDOWS if the specified Interrupt signal is not supported (that is, if it is not either nil or os.Kill).

@rsc
Copy link
Contributor

rsc commented Jan 5, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Jan 19, 2022

It seems like the docs need to distinguish what we call cmd.Stdin from cmd.Stdout/cmd.Stderr. cmd.Stdin is already closed when we finish writing whatever was standard input, independent of exiting/waiting/cancellation.

I'm also confused by the WaitDelay description mixing cancellation and ordinary exits.

Maybe it would help to see the diff that would implement this? I looked at CL 373005 but it's a whole package, not a diff.

@bcmills
Copy link
Contributor Author

bcmills commented Jan 19, 2022

It seems like the docs need to distinguish what we call cmd.Stdin from cmd.Stdout/cmd.Stderr. cmd.Stdin is already closed when we finish writing whatever was standard input, independent of exiting/waiting/cancellation.

A hang due to stdin copying analogous to the one involving stdout and stderr can occur if the command spawns its own subprocess with cmd.Stdin = os.Stdin and then exits without either process reading stdin to completion. (The copying goroutine is started in (*Cmd).stdin, and it only closes the pipe once copying is complete — which may be arbitrarily long if the writes don't fit in whatever buffer the kernel provides.)

So, at least in that case, we really do need to treat the I/O pipes symmetrically.

@rsc
Copy link
Contributor

rsc commented Jan 26, 2022

Thanks for the clarification about stdin.

I still feel like it would help me to see the diff in the package, as opposed to a fork as in CL 373005. I'm still not 100% sure I understand the proposal.

@bcmills
Copy link
Contributor Author

bcmills commented Feb 2, 2022

Found another “run or kill” function in the main Go repo, and this one doesn't even bother using an appropriate signal (it skips straight to SIGKILL):
https://cs.opensource.google/go/go/+/master:test/run.go;l=834-854;drc=f4aa021985e9ae4a9a395f8fbe32ad08d2bfda3b

That manifests as a less-than-useful failure mode in #50973.

@rsc
Copy link
Contributor

rsc commented Feb 2, 2022

I'm still a little confused about the implications here. I'd like to see a concrete CL to understand it better. Will leave this open for this week, but let me know if you'd rather we put it on hold for now.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/446876 mentions this issue: runtime: check for ErrWaitDelay is runBuiltTestProg

gopherbot pushed a commit that referenced this issue Nov 1, 2022
I noticed some test failures in the build dashboard after CL 445597
that made me realize the grace period should be based on the test
timeout, not the Context timeout: if the test itself sets a short
timeout for a command, we still want to give the test process enough
time to consume and log its output.

I also put some more thought into how one might debug a test hang, and
realized that in that case we don't want to set a WaitDelay at all:
instead, we want to leave the processes in their stuck state so that
they can be investigated with tools like `ps` and 'lsof'.

Updates #50436.

Change-Id: I65421084f44eeaaaec5dd2741cd836e9e68dd380
Reviewed-on: https://go-review.googlesource.com/c/go/+/446875
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Nov 1, 2022
ErrWaitDelay is not expected to occur in this test, but if it does
it indicates a failure mode very different from the “failed to start”
catchall that we log for other non-ExitError errors.

Updates #50436.

Change-Id: I3f4d87d502f772bf471fc17303d5a6b483446f8f
Reviewed-on: https://go-review.googlesource.com/c/go/+/446876
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
This permits us to safely support concurrent access to files on Plan 9.
Concurrent access was already safe on other systems.

This does introduce a change: if one goroutine calls a blocking read
on a pipe, and another goroutine closes the pipe, then before this CL
the close would occur. Now the close will be delayed until the blocking
read completes.

Also add tests that concurrent I/O and Close on a pipe are OK.

For golang#50436
For golang#56043

Change-Id: I969c869ea3b8c5c2f2ef319e441a56a3c64e7bf5
Reviewed-on: https://go-review.googlesource.com/c/go/+/438347
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: David du Colombier <0intro@gmail.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Rob Pike <r@golang.org>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
Fixes golang#50436.

Change-Id: I9dff8caa317a04b7b2b605f810b8f12ef8ca485d
Reviewed-on: https://go-review.googlesource.com/c/go/+/401835
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
This adds a testenv.CommandContext function, with timeout behavior
based on the existing logic in cmd/go.TestScript: namely, the command
is terminated with SIGQUIT (if supported) with an arbitrary grace
period remaining until the test's deadline.

If the test environment does not support executing subprocesses,
CommandContext skips the test.

If the command is terminated due to the timout expiring or the test
fails to wait for the command after starting it, CommandContext marks
the test as failing.

For tests where a shorter timeout is desired (such as for fail-fast
behavior), one may be supplied using context.WithTimeout.

The more concise Command helper is like CommandContext but without
the need to supply an explicit Context.

Updates golang#50436.

Change-Id: Ifd81fb86c402f034063c9e9c03045b4106eab81a
Reviewed-on: https://go-review.googlesource.com/c/go/+/445596
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Austin Clements <austin@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
For most tests, the test's deadline itself is more appropriate than an
arbitrary timeout layered atop of it (especially once golang#48157 is
implemented), and testenv.Command already adds cleaner timeout
behavior when a command would run too close to the test's deadline.

That makes RunWithTimeout something of an attractive nuisance. For
now, migrate the two existing uses of it to testenv.CommandContext,
with a shorter timeout implemented using context.WithTimeout.

As a followup, we may want to drop the extra timeouts from these
invocations entirely.

Updates golang#50436.
Updates golang#37405.

Change-Id: I16840fd36c0137b6da87ec54012b3e44661f0d08
Reviewed-on: https://go-review.googlesource.com/c/go/+/445597
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
I noticed some test failures in the build dashboard after CL 445597
that made me realize the grace period should be based on the test
timeout, not the Context timeout: if the test itself sets a short
timeout for a command, we still want to give the test process enough
time to consume and log its output.

I also put some more thought into how one might debug a test hang, and
realized that in that case we don't want to set a WaitDelay at all:
instead, we want to leave the processes in their stuck state so that
they can be investigated with tools like `ps` and 'lsof'.

Updates golang#50436.

Change-Id: I65421084f44eeaaaec5dd2741cd836e9e68dd380
Reviewed-on: https://go-review.googlesource.com/c/go/+/446875
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
ErrWaitDelay is not expected to occur in this test, but if it does
it indicates a failure mode very different from the “failed to start”
catchall that we log for other non-ExitError errors.

Updates golang#50436.

Change-Id: I3f4d87d502f772bf471fc17303d5a6b483446f8f
Reviewed-on: https://go-review.googlesource.com/c/go/+/446876
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
@bcmills
Copy link
Contributor Author

bcmills commented Nov 3, 2022

The implementation is merged and I've added a couple of call sites that use this API in tests to shake out any loose ends.

I think all that's left at this point is a release note!

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/449076 mentions this issue: doc/go1.20: add a release note for os/exec API changes

gopherbot pushed a commit that referenced this issue Nov 9, 2022
Updates #50436.

Change-Id: Ib6771221bda1c81d5593b29d7287ebcf169882ff
Reviewed-on: https://go-review.googlesource.com/c/go/+/449076
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/377835 mentions this issue: internal/testenv: add a Command function that replaces exec.Command

gopherbot pushed a commit to golang/tools that referenced this issue Nov 21, 2022
The function is derived from the testenv.Command function in the main
repo as of CL 446875, with a couple of modifications to allow it to
build (with more limited functionality) with Go versions as old as
1.16 (currently needed in order to test gopls with such versions).

testenv.Command sets up an exec.Cmd with more useful termination
behavior in the context of a test: namely, it is terminated with
SIGQUIT (to get a goroutine dump from the subprocess) shortly before
the test would otherwise time out.

Assuming that the test logs the output from the command appropriately,
this should make deadlocks and unexpectedly slow operations easier to
diagnose in the builders.

For golang/go#50014.
Updates golang/go#50436.

Change-Id: I872d4b24e63951bf9b7811189e672973d366fb78
Reviewed-on: https://go-review.googlesource.com/c/tools/+/377835
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Joedian Reid <joedian@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Robert Findley <rfindley@google.com>
gopls-CI: kokoro <noreply+kokoro@google.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/456116 mentions this issue: cmd/go: use Cancel and WaitDelay to terminate test subprocesses

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/460997 mentions this issue: cmd/go/internal/vcweb: simplify hgHandler cancellation

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/460998 mentions this issue: os/exec: avoid leaking an exec.Cmd in TestWaitInterrupt

gopherbot pushed a commit that referenced this issue Jan 10, 2023
In CL 436655 I added a GODEBUG setting to this test process to verify
that Wait is eventually called for every exec.Cmd before it becomes
unreachable. However, the cmdHang test helpers in
TestWaitInterrupt/Exit-hang and TestWaitInterrupt/SIGKILL-hang
intentially leak a subprocess in order to simulate a leaky third-party
program, as Go users might encounter in practical use.

To avoid tripping over the leak check, we call Wait on the leaked
subprocess in a background goroutine. Since we expect the process
running cmdHang to exit before its subprocess does, the call to Wait
should have no effect beyond suppressing the leak check.

Fixes #57596.
Updates #52580.
Updates #50436.

Change-Id: Ia4b88ea47fc6b605c27ca6d9d7669c874867a900
Reviewed-on: https://go-review.googlesource.com/c/go/+/460998
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Jan 19, 2023
Updates #50436.
Updates #56163.
Fixes #24050.

Change-Id: I1b00eb8fb60e0879f029642b5bad97b2e139fee6
Reviewed-on: https://go-review.googlesource.com/c/go/+/456116
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
gopherbot pushed a commit that referenced this issue Jan 19, 2023
This uses the new Cancel and WaitDelay fields of os/exec.Cmd
(added in #50436) to interrupt or kill the 'hg serve' command
when its incoming http.Request is canceled.

This should keep the vcweb hg handler from getting stuck if 'hg serve'
hangs after the request either completes or is canceled.

Fixes #57597 (maybe).

Change-Id: I53cf58e8ab953fd48c0c37f596f99e885a036d9b
Reviewed-on: https://go-review.googlesource.com/c/go/+/460997
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
benhoyt added a commit to benhoyt/pebble that referenced this issue Aug 15, 2023
Add WaitDelay to ensure cmd.Wait() returns in a reasonable timeframe if
the goroutines that cmd.Start() uses to copy Stdin/Stdout/Stderr are
blocked when copying due to a sub-subprocess holding onto them. Read
more details in these issues:

- golang/go#23019
- golang/go#50436

This isn't the original intent of kill-delay, but it seems reasonable
to reuse it in this context.

Fixes canonical#149
benhoyt added a commit to canonical/pebble that referenced this issue Aug 18, 2023
…/out/err (#275)

Use os.exec's Cmd.WaitDelay to ensure cmd.Wait returns in a reasonable
timeframe if the goroutines that cmd.Start() uses to copy stdin/out/err
are blocked when copying due to a sub-subprocess holding onto them.
Read more details about the issue in
golang/go#23019 and the proposed solution
(that was added in Go 1.20) in
golang/go#50436.

This solves issue #149, where Patroni
wasn't restarting properly even after a `KILL` signal was sent to it.
I had originally mis-diagnosed this problem as an issue with Pebble
not tracking the process tree of processes that daemonise and change
their process group (which is still an issue, but is not causing this
problem). The Patroni process wasn't being marked as finished at all
due to being blocked on the `cmd.Wait()`. Patroni starts
sub-processes and "forwards" stdin/out/err, so the copy goroutines
block. Thankfully Go 1.20 introduced `WaitDelay` to allow you to
easily work around this exact problem.

The fix itself is [this one-liner]
(#275):

s.cmd.WaitDelay = s.killDelay() * 9 / 10  // 90% of kill-delay

This will really only be a problem for services, but we make the same
change for exec and exec health checks as it won't hurt there
either.

Also, as a drive-by, this PR also canonicalises some log messages: our
style is to start with an uppercase letter (for logs, not errors) and
to use "Cannot X" rather than "Error Xing".

Fixes #149.
@rsc rsc removed this from Proposals Oct 26, 2023
@golang golang locked and limited conversation to collaborators Jan 10, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

10 participants