-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
os/exec: exec.Cmd fails to cancel with non-*os.File outputs on linux #18874
Comments
I initially thought that this was caused by Context cancelation, but it turns out to be the same when directly killing the process. This also hangs/panics
A second call to wait rightly returns |
It seems to work fine for me. Both programs end immediately with Can you post a complete program, and post the output of the program? Thanks. |
I reproduce the problem on:
|
@dsnet Can you attach your complete program? |
It is quite literally the program from the original post: https://play.golang.org/p/ovO_hVDW49 Interestingly, I'm not able to repro the issue on my work computer. |
Interesting. I've now tried that program on three different GNU/Linux distros, kernel versions 2.6.32, 2.6.35, 4.2.0. For me it always completes immediately. @jbardin Tell us more about the system you are running. |
When the context is cancelled we kill the child process, in this case I'm not even sure this is wrong. |
If it helps in any way: Does Panic: Does Panic: Does NOT Panic: |
I think I see the difference, I bet all the systems where this fails have I'm also note sure what the correct behavior is now, but dash is the only common shell I can find that does this. |
confirming the only system where it does not panic is not dash: ll -a /bin/sh |
I'm inclined to close this as working as expected. You are asking for the processes's standard output to be collected. That means that the |
@ianlancetaylor , I think this is a documentation bug: The problem is
Consider this example: package main
import (
"context"
"fmt"
"io"
"os/exec"
"time"
)
type Hanger time.Duration
func (h Hanger) Read(b []byte) (int, error) {
time.Sleep(time.Duration(h))
return 0, io.EOF
}
func main() {
ctx, _ := context.WithTimeout(context.Background(), 5*time.Second)
cmd := exec.CommandContext(ctx, "/bin/sh", "-c", "sleep 7")
cmd.Stdin = Hanger(10 * time.Second)
cmd.Start()
now := time.Now()
go func() {
<-ctx.Done()
fmt.Printf("context: %v\n", time.Since(now))
}()
cmd.Wait()
fmt.Printf("wait: %v\n", time.Since(now))
} This prints
|
I don't see why this has anything to do with |
Yes. And that works as expected. But the code shouldn't use |
But, yes, you're right. The |
The difference in shell behavior here is that bash, zsh, etc will exec a single command in place, where as dash always executes in a sub process. Giving bash a series of commands causes it to exhibit this behavior too, and I assume it will be the same on all unix platforms.
This explains a number of issues people have asked me about in os/exec, and I didn't make the connection that the io loop for non *os.File stdin/stdout/stderr was the difference. I wonder if this could be cleaned up without having interruptible IO on regular file descriptors. We could poll the output with select, and check if we've killed the process in the meantime. We are killing the first child process, and we can wait on that immediately which is all I would expect. I wouldn't expect to be waiting on an FD passed to another process (which will die as soon as we close it). Or maybe because this is a particular behavior of shells, we just document it as such. |
You always need to call Wait on a child process. Waiting on the context doesn't do that for you. |
@jbardin , sure. I mean something like that: go func() {
<-ctx.Done()
cancel_input_producers
cancel_output_consumers
}()
....
cmd.Wait() |
@evverx: That still hangs. The Wait still may never return -- I've only put it in the goroutine to be able to show the timeout and panic. |
It depends. Your example doesn't control Anyway, that's unrelated to this issue. |
I think the underlying issue actually is directly related to that example. What's causing the hang here is the hidden goroutine (referenced via Cmd.goroutine, communicating over Cmd.errch) that's launched to copy from the process file descriptor to the io.Writer, which equates to the Using an os.Pipe for the output means the process can write directly to the pipe fd, and the blocked Read is lifted into my code where I can choose to stop waiting on it (though I still can't cancel that Read without polling the file descriptor). My feeling here is that Wait is only expected to call the wait syscall, and has done so at this point. The child process has been terminated and reaped, and now we're blocking on file descriptors which are being used by a new process that we don't intend kill or wait on. @ianlancetaylor, with the current file io implementation, would polling the FDs or selecting along with a self-pipe be an acceptable solution? (also related to #18507) Without an immediate solution, I think it would be good to document that setting the process stdin/stdout/stderr to non-*os.Files could cause Wait to block indefinitely. |
I don't understand your suggestion about polling. Right now if you point I agree that the documentation needs to be better. |
I guess it depends on the interpretation of "the program's standard output". I was expecting Wait to wait on the process and nothing more, but it's blocked waiting on the output from another process which isn't going to be killed. By the time Wait is blocked, the process referenced by Cmd.Process is gone and reaped. Short of a delay in the kernel buffering the output, there won't ever be any more data coming from the process that is being waited on. If we're continuing to block there, it must be because another process has taken over the fd. Some might consider that output part of the "program" launched by exec, which is why I'm not certain that aborting the Read is 100% correct, but it also seems strange that we sent a SIGKILL to a process yet Wait is blocked for some reason. By polling, I mean syscall.Select with a timeout to check for cancellation, but the self-pipe trick works just as well, in order to not have a blocked Read call. |
Would it make sense to have I'm not sure whether that would actually change the behavior of |
Sending |
I can confirm this with go tip ( to consistently reproduce with any shell use 'sleep 60; sleep 60' as suggested above: complete code https://play.golang.org/p/hORl4wxoQk). The backtrace is slightly different (due to c05b06a I think): https://gist.github.com/ronin13/16a328dd56846a28d25feb8a2140f858 |
CL https://golang.org/cl/42270 mentions this issue. |
Quoting @eZanmoto from #20730 :
|
CL https://golang.org/cl/47650 mentions this issue. |
It already implied that Cmd.Wait is more than os.Process.Wait, but say so explicitly. See #18874 (comment) Updates #18874 Change-Id: Iaa46defd776ae0be817d9f4466a99ac78cfd672b Reviewed-on: https://go-review.googlesource.com/47650 Reviewed-by: Russ Cox <rsc@golang.org>
I suspect Go is calling `/bin/sh` when using os.exec Command, but this will spawn `dash` on LX zones. Incorrect behavior expected as shown on golang/go#18874
* fails: work around golang/go/issues/18874 * fails: have curl output server errors
The docs have been improved as part of #22610. It's not clear that there is anything else to do here. |
It was copied from my local go version at: go version go1.9.2 linux/amd64 This is so that we can export a helper to close IO explicitly when we need it. For more info, see this issue: golang/go#18874
I'm going to close this. The documentation is clearer in the upcoming 1.10 release. I don't see how we can change anything in the behavior. |
implement workaround for golang/go#18874
go1.7, go1.8rc3
When using a
Context
to cancel anexec.Cmd
, if either of the outputs are anio.Writer
which isn't an*os.File
the command will fail to terminate.The following panics on amd64 linux
The full stack trace is:
The text was updated successfully, but these errors were encountered: