Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testing: test process hangs beyond -timeout if a child process holds I/O streams open #24050

Closed
psanford opened this issue Feb 22, 2018 · 22 comments
Assignees
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@psanford
Copy link

What version of Go are you using (go version)?

go version go1.10 linux/amd64

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/psanford/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/psanford/projects/go"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build953969108=/tmp/go-build -gno-record-gcc-switches"

What did you do?

After upgrading to 1.10 we had one test that started to hang intermittently. The test in question starts a child process which it kills by canceling a context object at the end of the test method. It does not do an explicit cmd.Wait().

Here is a minimal test case that demonstrates the problem:

https://play.golang.org/p/8rq41A5Khsm

I can get this to hang consistently by running it in a bash while loop:

$ while true; do go test -timeout 5s -v -count 1 .; sleep 0.1; done
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok      _/tmp   0.012s
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok      _/tmp   0.012s
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok      _/tmp   0.012s
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS

<hangs here indefinitely>

If I explicitly call cmd.Wait() the test does not hang. If I don't attach the child process' Stdout and Stderr to os.Std{out,err} the test does not hang.

On 1.9.4 the test does not hang.

Its also interesting that even though I specified -timeout 5s the test runner hangs forever.

@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Feb 23, 2018
@ianlancetaylor ianlancetaylor added this to the Go1.10.1 milestone Feb 23, 2018
@crvv
Copy link
Contributor

crvv commented Feb 23, 2018

the test runner hangs forever.

It only hangs 60 seconds on my machine.

go test hangs at

done <- cmd.Wait()

And cmd.Wait() is waiting at

_, err := io.Copy(w, pr)

This io.Copy() is reading a pipe reader. It won't return until the pipe writer is closed.
The pipe writer will be closed when the sleep 60 exits. So it hangs 60 seconds.

func TestOSExecNoWait(t *testing.T) {
...
	cancel()
}

After the cancel() returns, the main function will also return.
The child process may or may not be killed, so it hangs intermittently.

@psanford
Copy link
Author

It only hangs 60 seconds on my machine.

Yes in the example provided it only hangs for 60 seconds because the child process exits.

In the test where I first found this issue the child process never exits so it hangs forever.

@crvv
Copy link
Contributor

crvv commented Feb 24, 2018

I don't think this issue is a regression introduced in Go 1.10.
I can use Go 1.9 to reproduce it but with a different condition.

This difference was introduced in bd95f88.
bd95f88#diff-acaf53a9cd478507ebbcf85037940b4dL1080

If the go command needs a bytes.Buffer to save the output, os/exec will open a pipe.
And go will hang until the pipe is closed.

The timeout doesn't work because it isn't handled by go.
There is a testKillTimeout in go, but it also hangs at cmd.Wait().

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/97497 mentions this issue: cmd/go/internal/test: don't wait for pending I/O if child process has gone

@ianlancetaylor
Copy link
Contributor

See also #23019.

@FiloSottile
Copy link
Contributor

@gopherbot please open backport tracking issues. This might be a 1.10 regression, or also a 1.9 issue.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #25042 (for 1.10), #25043 (for 1.9).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

arvados-bot pushed a commit to arvados/arvados that referenced this issue Apr 26, 2018
In go 1.10.1, these seem to make "go test" hang sometimes.

golang/go#24050

No issue #

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tclegg@veritasgenetics.com>
@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jun 29, 2018
@andybons andybons modified the milestones: Go1.12, Go1.13 Feb 12, 2019
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@mvdan
Copy link
Member

mvdan commented Sep 9, 2020

Also, before anyone spends significant time into a fix, I think we could just fix #23019 instead. It would save far more time in the long run, because it's a change we likely want to do anyway and should fix other cascading bugs. But it's also a bit more controversial.

@bcmills
Copy link
Contributor

bcmills commented Sep 9, 2020

I'm really not sure about #23019: the copying is a symptom (of child processes left running with open file handles), not the root cause here.

I think the main problem in this case is that the child process is not being terminated when the test process is. I think go test should make a best effort to terminate all subprocesses of the test process, unless they were explicitly started in a way that would suggest otherwise (such as by setting Setpgid in the SysProcAttr field).

However, I'm not sure what the most appropriate way to achieve that would be. I could imagine:

  • Hooking the os package from the testing package to have it set cmd.SysProcAttr.Pdeathsig implicitly where available (but I think “where available” is only Linux)?
  • Hooking the os package so that the testing.M.startAlarm callback can enumerate and signal all subprocesses started from the test?
  • Changing cmd/go to start each test in its own process group, forward signals to that process group explicitly, and signal the whole group on timeout?
  • Changing testing and/or cmd/go to enumerate child processes using procfs or similar?
  • Something else that I've missed?

And I'm not even sure where to start on Windows, since I assume there is no procfs there.

@mvdan
Copy link
Member

mvdan commented Sep 9, 2020

I think you're right, we should terminate all children processes. It's a bit alarming that that's not the default behavior, and more so that there doesn't seem to be a portable way to do it.

You could blame this on the written test, since technically I could wrap exec calls with a timeout from testing.T.Deadline. Still, you'd be relying on the tests being written properly that way, which is not very realistic, particularly since that Deadline method is very new.

Shouldn't os/exec have some sort of option to apply this "kill all child processes" logic in the most portable way possible? It seems like such a common source of problems, and I've personally had to implement non-portable hacks like Pdeathsig before.

@bcmills
Copy link
Contributor

bcmills commented Sep 9, 2020

I don't think the go test behavior should rely on folks using os/exec in any particular way. (I would rather we find a more transparent fix in cmd/go.)

@bcmills
Copy link
Contributor

bcmills commented Sep 9, 2020

But yes: I think it would be handy for the os package to have some straightforward way to express “send this signal to this process and all of its children”. Maybe that's the process-group solution?

@ianlancetaylor
Copy link
Contributor

The problem with a general "signal process and its children" mechanism on Unix systems is that the only options are "send signal to process" and "send signal to thread" and "send signal to process group". And we definitely don't want to start every new process in a different process group, as that will have surprising effects on the use of ^Z from the terminal. And of course we have no idea what processes a child process may start.

That said we could in principle invoke the ps program or (on GNU/Linux) read the /proc file system to look for all processes, build a process tree, and use that to identify the children of our child process. But that seems like a rather complex mechanism to implement a standard library function.

For testing specifically, it might be somewhat acceptable for go test to start the test binary in a separate process group, in which case it would be straightforward for the go command to kill the entire test binary process group.

@mvdan
Copy link
Member

mvdan commented Sep 9, 2020

Interesting, I wasn't thinking of the implications of doing this in general. I agree that the use case for testing is pretty narrow, especially because I don't think we should ever leak processes or run in the background.

@bcmills
Copy link
Contributor

bcmills commented Nov 19, 2020

See previously #28039.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/400877 mentions this issue: cmd/go/internal/test: used lockedStdout instead of os.Stdout

gopherbot pushed a commit that referenced this issue Apr 21, 2022
There is an issue where 'go test' will hang after the tests complete if
a test starts a sub-process that does not exit (see #24050).

However, go test only exhibits that behavior when a package name is
explicitly passed as an argument. If 'go test' is invoked without any
package arguments then the package in the working directory is assumed,
however in that case (and only that case) os.Stdout is used as the test
process's cmd.Stdout, which does *not* cause 'go test' wait for the
sub-process to exit (see #23019).

This change wraps os.Stdout in an io.Writer struct in this case, hiding
the *os.File from the os/exec package, causing cmd.Wait to always wait
for the full output from the test process and any of its sub-processes.

In other words, this makes 'go test' exhibit the same behavior as
'go test .' (or 'go test ./...' and so on).

Update #23019
Update #24050

Change-Id: Ica09bf156f3b017f9a31aad91ed0f16a7837195b
Reviewed-on: https://go-review.googlesource.com/c/go/+/400877
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Andrew Gerrand <adg@golang.org>
Auto-Submit: Andrew Gerrand <adg@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Andrew Gerrand <adg@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
@josharian
Copy link
Contributor

This just bit me as well. Ran through a fair few hours of GitHub CI time before I noticed something was wrong. (Tip: set timeout-minutes in your Actions even if you don't need it.)

For those finding this thread and hoping for some code to copy/paste to work around the issue, here's the suggestion from @mvdan above, spelled out.

Change:

cmd := exec.Command("YOUR_PROGRAM")

to something along these lines:

ctx := context.Background()
deadline, ok := t.Deadline()
if ok {
	// Give ourselves 100ms to kill the command before the actual deadline arrives.
	deadline = deadline.Add(-100 * time.Millisecond)
	var cancel func()
	ctx, cancel = context.WithDeadline(ctx, deadline)
	t.Cleanup(cancel)
}
cmd := exec.CommandContext(ctx, "YOUR_PROGRAM")

(@mvdan @bcmills @ianlancetaylor feel free to edit this comment directly if you'd suggest a different formulation)

@bcmills
Copy link
Contributor

bcmills commented May 25, 2022

@josharian, if your subprocess is a Go program I suggest sending SIGQUIT on timeout instead of SIGTERM (so that you get a goroutine dump), but that's currently a bit complicated.

(My proposal #50436 aims to make that somewhat smoother, but the fact that SIGQUIT is not defined on all platforms remains a bit of a rough edge.)

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/456116 mentions this issue: cmd/go: use Cancel and WaitDelay to terminate test subprocesses

@bcmills bcmills self-assigned this Dec 14, 2022
@bcmills bcmills modified the milestones: Backlog, Go1.21 Dec 14, 2022
@bcmills bcmills added the NeedsFix The path to resolution is known, but the work has not been done. label Dec 14, 2022
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 14, 2022
@bcmills
Copy link
Contributor

bcmills commented Dec 14, 2022

I've been thinking about this some more.
I've added a reproducer (and proposed fix) in https://go.dev/cl/456116.

Notably, it is possible to reproduce this behavior even for a passing test, not just a failing one: the test may start and leak a subprocess, and then return from all of the Test functions, print PASS, and exit with code 0 while the subprocess is still running (and still holding open the pipes).

I don't think tests should be in the habit of orphaning subprocesses, and perhaps we should fail (or terminate the subprocesses) of tests that do so. On the other hand, that's a much harder problem, and programs that start subprocesses that may run indefinitely should arrange for those processes to be terminated anyway — whether by using Pdeathsig, or having the subprocess periodically probe stderr or stdout, or having it exit when stdin is closed.

(Note that in the integration test, I used a periodic write to stderr to terminate the subprocess. It's not perfect — it can leave the orphan running for up to 100ms after everything else has returned — but it at least does eventually clear up on its own.)

So, for now I think we should set a WaitDelay on the test process, and mark the test as failed if that WaitDelay expires (even if it would have passed otherwise).

@bcmills bcmills changed the title testing: test with child process sometimes hangs on 1.10; -timeout not respected testing: test process hangs beyond -timeout if a child process holds I/O streams open Jan 20, 2023
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/464555 mentions this issue: cmd/go/internal/test: scale the WaitDelay for test subprocesses

gopherbot pushed a commit that referenced this issue Feb 2, 2023
Prior to CL 456116 we had an arbitrary 5-second delay after a test
times out before we kill the test. In CL 456116, I reused that
arbitrary 5-second delay as the WaitDelay as well, but on slower
builders it does not seem to be generous enough.

Instead of hard-coding the delay, for tests with a finite timout we
now use a hard-coded fraction of the overall timeout. That will
probably give delays that are longer than strictly necessary for very
long timeouts, but if the user is willing to wait for a very long
timeout they can probably wait a little longer for I/O too.

Fixes #58230.
Updates #24050.

Change-Id: Ifbf3e576c034c721aa00cd17bf88563474b09955
Reviewed-on: https://go-review.googlesource.com/c/go/+/464555
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
johanbrandhorst pushed a commit to Pryz/go that referenced this issue Feb 12, 2023
Prior to CL 456116 we had an arbitrary 5-second delay after a test
times out before we kill the test. In CL 456116, I reused that
arbitrary 5-second delay as the WaitDelay as well, but on slower
builders it does not seem to be generous enough.

Instead of hard-coding the delay, for tests with a finite timout we
now use a hard-coded fraction of the overall timeout. That will
probably give delays that are longer than strictly necessary for very
long timeouts, but if the user is willing to wait for a very long
timeout they can probably wait a little longer for I/O too.

Fixes golang#58230.
Updates golang#24050.

Change-Id: Ifbf3e576c034c721aa00cd17bf88563474b09955
Reviewed-on: https://go-review.googlesource.com/c/go/+/464555
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
@golang golang locked and limited conversation to collaborators Feb 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

10 participants