Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pty.Start seems to close the terminal too early #127

Closed
mvdan opened this issue Sep 27, 2021 · 16 comments
Closed

pty.Start seems to close the terminal too early #127

mvdan opened this issue Sep 27, 2021 · 16 comments
Assignees

Comments

@mvdan
Copy link

mvdan commented Sep 27, 2021

Thanks for writing this package! I've been using it successfully for some time to test a shell interpreter over at https://github.com/mvdan/sh :)

I've been running into sporadic flakes with pty.Start for some time; see mvdan/sh#513. It stumped me for a long time, but today it just clicked why it's happening: mvdan/sh#513 (comment)

The culprit seems to be this line:

pty/run.go

Line 50 in 8ac0cc1

defer func() { _ = tty.Close() }() // Best effort.

As proof, see how swapping pty.Start with a more manual pty.Open that closes after cmd.Wait() never fails: mvdan/sh@f7684ec

I'm not sure why the deferred close is there, or why I'm the only one running into this issue. But given that the API is called pty.Start, not pty.Run, it feels wrong to close the terminal as soon as the function returns.

Would you accept a patch with the fix? That is, removing line 50, and relying on the caller to close the primary/master terminal (pty) after they have done cmd.Wait(), just like in my updated test.

@mvdan
Copy link
Author

mvdan commented Sep 27, 2021

@myitcv @leitzler I actually wonder if this affects you too, because some of your https://github.com/govim/govim tests also use pty.Start. I seem to recall you mentioned some CI flakes involving pseudo-terminals as well, though I was unable to find any mention of signal: hangup in your issue tracker.

@creack
Copy link
Owner

creack commented Sep 27, 2021

I'll try to take a look this week.

@creack creack self-assigned this Sep 27, 2021
@kr
Copy link
Collaborator

kr commented Sep 27, 2021

That is, removing line 50, and relying on the caller to close the primary/master terminal (pty) after they have done cmd.Wait(),

I'm not sure if there is a bug or not, but note that line 50 closes the tty, not the pty. The client package is already responsible for closing the pty when it's done with the terminal.

Line 50 exists because only the child process uses the tty; the parent process doesn't need an open fd for the tty, it only needs the pty.

At least that is my understanding.

@kr
Copy link
Collaborator

kr commented Sep 27, 2021

As a side note, the variable names pty and tty might not have been the best choice; I do not know if those are standard names to label the two sides of a terminal. Mea culpa.

@mvdan
Copy link
Author

mvdan commented Sep 28, 2021

@kr I personally went with primary and secondary in my code, to remind myself of their roles. It seems like the upstream Linux docs say master and slave, but that feels a bit extra :)

As for your mention that the tty (secondary) is only used by the child process - intuitively, that makes sense. In practice, though, it seems to flake predictably, at least with -race. Perhaps that's a bug in our understanding of how the forking happens.

If @creack arrives at the same conclusion as you, that the code seems correct, then I might try to provide a small isolated reproducer for the error. I already did provide a reproducer test in the original post, but it's in a test within a third party module, so I get that there are more moving pieces.

@creack
Copy link
Owner

creack commented Oct 1, 2021

@mvdan,

As @kr mentioned, closing the tty in the defer is quite intentional. If there is some unexpected behavior, it would be interesting to not discard the error (currently stamped "best effort") and see if there is some kind of failure.

The idea behind primary/secondary is that the kernel entangles the two so what happens on one side happens on the other.

The pty/tty pair is initialized before the fork. When forking, all FDs are duplicated, in the child, we use the tty FD, so in the parent, we don't need it anymore. In other words, as soon as c.Start l70 returns, the tty is "garbage" for the parent, if it was not closed in the defer, it would be a memory leak (unless maybe, although unlikely, Go's Garbage collection would be able to detect it is an unused FD and close it)

That being said, looking at the stdlib code, it is quite different than when @kr initially created this library, and I see some new logic around the descriptor closing, which may be related to the issue. Looks like changes are from 2013 and 2016, which is quite old, but maybe nobody tried what you are doing as people tend to skip tests when dealing with terminal.

For context, forking in Go is a bit more tricky than in other languages because of the runtime and how it handles threads/goroutine, the initial "fix" to get it working was the global ForkLock, but as some new logic has been added which, to be honest, I don't fully understand yet, we may need to revisit how we deal with our pty/tty pair.

Unfortunately, I have not been able to reproduce, if you have a self-contained snippet that would showcase the error, I can investigate further and find a way to fix it.

Naming is tricky, I am not a fan of "primary" / "secondary" as it implies it is a 1-1 relation, while there is only one primary with many secondaries, but it may be my own bias due to not being a native speaker. On the other hand, I don't have a better idea, and even if not perfect, it may be clearer than pty/tty.
The idea for pty/tty was that from the parent's perspective, the important piece is the pty to be able to control the child's terminal, and because it is indeed a pseudo-terminal, it made sense to keep it as pty.
From the child's perspective, the "secondary", for all intent and purpose, is the tty.

Open to suggestions / PRs to change the naming :)

@mvdan
Copy link
Author

mvdan commented Oct 1, 2021

Thank you for the detailed reply! It could well be that the forking and FD duplication is more subtle than you and I assume. I haven't looked into it myself yet.

Naming is tricky, I am not a fan of "primary" / "secondary" as it implies it is a 1-1 relation

I don't have a strong opinion here, but I did want some way to remember which one is the parent and which is the child.

if you have a self-contained snippet that would showcase the error, I can investigate further and find a way to fix it.

I'll try to put one together then :)

@mvdan
Copy link
Author

mvdan commented Oct 2, 2021

In the process of writing a minimal reproducer that was reliable, I think I've nailed what's happening. Here's the reproducer: https://play.golang.org/p/QK7kD4elAA1

And here's my output:

$ go run main.go
signal: hangup
signal: hangup
signal: hangup
[...]

Note that I needed that runtime.GC call to deterministically reproduce the error nearly every time. In practice, I encountered garbage collection runs as part of my tests because go test runs hundreds of subtests in parallel, so other parallel tests were generating garbage and triggering GC runs shortly after pty.Start ran.

Assuming that possible GC run happens, here's what leads to the error:

  1. pty.Start calls os.OpenFile("/dev/ptmx", os.O_RDWR, 0)
  2. OpenFile sets up a finalizer, so that when the *os.File gets GCed, it gets automatically closed
  3. The child process starts.
  4. My code does not hold onto the primary/pty file for long. The GC run thus calls the finalizer, closing the parent terminal.
  5. The child process attempts to use its other end of the terminal to print something. That gets an error, as the parent terminal has been closed.

So my original fix suggestion was wrong, as you both pointed out. However, I still think there's a bug in the pty library: pty.Start should always, always ensure that the parent/primary/pty remains alive until the child process (i.e. the command) has finished executing. In my sample code above, you can see that I can manually accomplish that via runtime.KeepAlive, which makes the error never appear.

It's unclear to me where or how the "keep alive" fix would fit into the pty library - I'll leave that to you two :) A potential fix might be in the form of the following at the end of StartWithAttrs:

go func() {
    // Keep pty alive until the child process has finished.
    // Otherwise, a GC run might call pty's finalizer, closing the file too early.
    c.Process.Wait()
    runtime.KeepAlive(pty)
}

I'm happy to send a patch if that sounds good. I could also massage the demo above into a small test.

@mvdan
Copy link
Author

mvdan commented Oct 2, 2021

The explanation above also seems to explain why my downstream commit, mvdan/sh@f7684ec, seemed to remove the test flakes. Note how I now keep primary alive right until I call cmd.Wait, for the sake of closing the file. The child process has already printed all it needs to by that point. So my change happened to fix the flake by keeping the parent file alive for long enough.

mvdan added a commit to mvdan/sh that referenced this issue Oct 2, 2021
All we need to do is close the primary terminal after cmd.Wait.
This also ensures that file is kept alive for long enough,
preventing a GC run from closing the file before cmd.Wait.

After this change, I am still unable to reproduce a failure.

For more, see: creack/pty#127
@creack
Copy link
Owner

creack commented Oct 2, 2021

Thank you for digging into this and for the snippet!

I think your diagnostic is correct indeed. However, I am not sure I agree with

pty.Start should always, always ensure that the parent/primary/pty remains alive until the child process (i.e. the command) has finished executing

At the moment, it is just a gut feeling, I will dig into this today or tomorrow and get back to you with a more concrete reasoning.

When the parent terminal closes, it may be better to leave it up to the child to decide whether to die or to detach, or maybe it would make sense to kill the child, as if the parent doesn't have a reference to it's terminal, then it is a dangling orphan (may even be zombie).

In any case, what you did is incredibly helpful and a great way to dig deeper

@mvdan
Copy link
Author

mvdan commented Oct 2, 2021

I think I'm on the right track with the diagnostic, but perhaps my conclusion and proposed fix are wrong. Note how my last commit to "go back to pty.Start" actually triggered a failure on Mac :) As these things go, I can't reproduce that failure on Linux.

@creack
Copy link
Owner

creack commented Oct 2, 2021

I'll check on a mac as well.

Note that from the snippet, replacing the KeepAlive with a defer primary.Close() also 'fixes' the issue as it would keep a reference to the primary and prevent the gc to close it.

@creack
Copy link
Owner

creack commented Oct 2, 2021

Random side note, looking into your package made me think of this, in case you are curious to see the most complete and awesome shell in go: https://github.com/creack/goshell :)

@creack
Copy link
Owner

creack commented Oct 2, 2021

I did reproduce on OSX, and while I didn't find anything conclusive just yet, I am more and more thinking it is localized to the testing scenario and is unlikely to cause issues in a "real world" scenario, even for large supervisors like Docker.

Monitoring the goroutines and memory, it doesn't seem to be a leak.
Verified your solution using pty.Open directly and not closing the secondary, it gets properly closed by the finalizer, so it is indeed a viable solution.

Looking back at the self-contained snippet to reproduce, it is actually not the same use-case. In the snippet, the hangup is indeed expected, even if not obvious. When closing the primary before the end of 'Wait', it may or may not have been completed so getting a hangup signal is to be expected sometimes.
It would be the same in C or in other languages, it is a bad usage of the lib. To mitigate this, we could indeed improve the documentation to highlight this gotcha.

Thinking more about the idea of having a KeepAlive or something to ensure the parent stays alive, it would cause some unexpected / unintended behaviors as it is something existing code might depend on.
If the caller "looses" the reference to the child, then there is nothing actionable on this side, if the caller decides it is expected, then it would be up to it to call the KeepAlive or to maintain a reference to the files.
On the child side, SIGHUP is the expected signal when the controlling terminal is closed, so the child has the option to handle it and detach itself, or just die.
I guess it circles back to just being an obscure corner of unix and adding more docs would help.

I suspect it may be related to the stdlib ForkExec race, or it could be a bug in ioctl call, or maybe a bug in the kernel. As mentioned earlier, there is only one "primary", which is a device, not a regular file. It looks like that closing one of the opened "primary" results in closing another and/or the wrong "secondary".

I will keep digging tomorrow.

@mvdan
Copy link
Author

mvdan commented Oct 7, 2021

Bug or not, I agree that more docs around "be careful about the primary terminal being closed too early" would be useful, including how files being GCed means they'll get closed too.

@creack
Copy link
Owner

creack commented Oct 28, 2023

Closing with #167

@creack creack closed this as completed Oct 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants