pty.Start seems to close the terminal too early #127
@myitcv @leitzler I actually wonder if this affects you too, because some of your https://github.com/govim/govim tests also use
I'll try to take a look this week.
I'm not sure if there is a bug or not, but note that line 50 closes the tty, not the pty. The client package is already responsible for closing the pty when it's done with the terminal. Line 50 exists because only the child process uses the tty; the parent process doesn't need an open fd for the tty, it only needs the pty. At least that is my understanding.
As a side note, the variable names
@kr I personally went with

As for your mention that the tty (secondary) is only used by the child process: intuitively, that makes sense. In practice, though, it seems to flake predictably, at least with

If @creack arrives at the same conclusion as you, that the code seems correct, then I might try to provide a small isolated reproducer for the error. I did already provide a reproducer test in the original post, but it's in a test within a third-party module, so I get that there are more moving pieces.
As @kr mentioned, closing the

The idea behind primary/secondary is that the kernel entangles the two, so what happens on one side happens on the other. The pty/tty pair is initialized before the fork. When forking, all FDs are duplicated; in the child, we use the

That being said, the stdlib code looks quite different from when @kr initially created this library, and I see some new logic around descriptor closing that may be related to the issue. The changes seem to date from 2013 and 2016, which is quite old, but maybe nobody has tried what you are doing, as people tend to skip tests when dealing with terminals.

For context, forking in Go is trickier than in other languages because of the runtime and how it handles threads and goroutines. The initial "fix" to get it working was the global ForkLock, but since new logic has been added that, to be honest, I don't fully understand yet, we may need to revisit how we deal with our pty/tty pair.

Unfortunately, I have not been able to reproduce it. If you have a self-contained snippet that showcases the error, I can investigate further and find a way to fix it.

Naming is tricky. I am not a fan of "primary"/"secondary", as it implies a 1-1 relation, while there is only one primary with many secondaries, but that may be my own bias as a non-native speaker. On the other hand, I don't have a better idea, and even if not perfect, it may be clearer than

Open to suggestions / PRs to change the naming :)
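To make the ownership described above concrete, here is a minimal, Linux-only sketch that uses only the standard library rather than this package. The helpers `openPTY` and `runInPTY` are hypothetical; the ioctls (`TIOCSPTLCK` to unlock, `TIOCGPTN` to find `/dev/pts/N`) are the usual Linux way to open a pair and are not presented as this package's exact code. The pair is opened before the fork, the child gets the secondary as stdin/stdout/stderr and controlling terminal, the parent closes its own copy of the secondary right after `Start` (the child holds a duplicate), and the primary is only closed after `Wait` returns. It assumes an `echo` binary on `PATH`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"os/exec"
	"strconv"
	"syscall"
	"unsafe"
)

// openPTY is a hypothetical, Linux-only helper: it opens a new primary via
// /dev/ptmx and then the matching secondary under /dev/pts.
func openPTY() (primary, secondary *os.File, err error) {
	primary, err = os.OpenFile("/dev/ptmx", os.O_RDWR|syscall.O_NOCTTY, 0)
	if err != nil {
		return nil, nil, err
	}
	fd := primary.Fd() // Fd also puts the file in blocking mode, which is fine here
	var unlock int32   // TIOCSPTLCK with a zero value unlocks the secondary
	if _, _, e := syscall.Syscall(syscall.SYS_IOCTL, fd, syscall.TIOCSPTLCK, uintptr(unsafe.Pointer(&unlock))); e != 0 {
		primary.Close()
		return nil, nil, e
	}
	var n uint32 // TIOCGPTN returns the number N of /dev/pts/N
	if _, _, e := syscall.Syscall(syscall.SYS_IOCTL, fd, syscall.TIOCGPTN, uintptr(unsafe.Pointer(&n))); e != 0 {
		primary.Close()
		return nil, nil, e
	}
	secondary, err = os.OpenFile("/dev/pts/"+strconv.Itoa(int(n)), os.O_RDWR|syscall.O_NOCTTY, 0)
	if err != nil {
		primary.Close()
		return nil, nil, err
	}
	return primary, secondary, nil
}

// runInPTY (hypothetical) runs a command on the secondary and returns what it
// wrote to the terminal.
func runInPTY(name string, args ...string) (string, error) {
	primary, secondary, err := openPTY()
	if err != nil {
		return "", err
	}
	// Deferred, so the primary is closed only after cmd.Wait below has returned.
	defer primary.Close()

	cmd := exec.Command(name, args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = secondary, secondary, secondary
	// New session; the child's fd 0 (the secondary) becomes its controlling terminal.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true, Setctty: true}
	err = cmd.Start()
	// The parent no longer needs the secondary: the child has its own duplicate.
	secondary.Close()
	if err != nil {
		return "", err
	}

	var out []byte
	var readErr error
	buf := make([]byte, 1024)
	for {
		n, err := primary.Read(buf)
		out = append(out, buf[:n]...)
		if err != nil {
			// Linux reports EIO once no secondary fd is open (typically: the child exited).
			if !errors.Is(err, syscall.EIO) && !errors.Is(err, io.EOF) {
				readErr = err
			}
			break
		}
	}
	if err := cmd.Wait(); err != nil {
		return string(out), err
	}
	return string(out), readErr
}

func main() {
	out, err := runInPTY("echo", "hello")
	fmt.Printf("output=%q err=%v\n", out, err)
}
```

Because the primary stays open and referenced until after `Wait`, it can be neither closed explicitly nor collected by the GC while the child is still running.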
Thank you for the detailed reply! It could well be that the forking and FD duplication are more subtle than you and I assume. I haven't looked into it myself yet.
I don't have a strong opinion here, but I did want some way to remember which one is the parent and which is the child.
I'll try to put one together then :)
While writing a reliable minimal reproducer, I think I've nailed down what's happening. Here's the reproducer: https://play.golang.org/p/QK7kD4elAA1

And here's my output:
Note that I needed that

Assuming that possible GC run happens, here's what leads to the error:
So my original fix suggestion was wrong, as you both pointed out. However, I still think there's a bug in the pty library:

It's unclear to me where or how the "keep alive" fix would fit into the pty library; I'll leave that to you two :) A potential fix might be in the form of the following at the end of StartWithAttrs:
I'm happy to send a patch if that sounds good. I could also massage the demo above into a small test.
The explanation above would also account for why my downstream commit, mvdan/sh@f7684ec, seemed to remove the test flakes. Note how I now keep
All we need to do is close the primary terminal after cmd.Wait. This also ensures that the file is kept alive long enough, preventing a GC run from closing the file before cmd.Wait. After this change, I am still unable to reproduce a failure. For more, see: creack/pty#127
Thank you for digging into this and for the snippet! I think your diagnosis is indeed correct. However, I am not sure I agree with
At the moment it is just a gut feeling; I will dig into this today or tomorrow and get back to you with more concrete reasoning. When the parent terminal closes, it may be better to leave it up to the child to decide whether to die or to detach. Or maybe it would make sense to kill the child, since if the parent no longer has a reference to its terminal, the child is a dangling orphan (it may even become a zombie). In any case, what you did is incredibly helpful and a great way to dig deeper.
I think I'm on the right track with the diagnosis, but perhaps my conclusion and proposed fix are wrong. Note how my last commit to "go back to pty.Start" actually triggered a failure on Mac :) As these things go, I can't reproduce that failure on Linux.
I'll check on a Mac as well. Note that in the snippet, replacing the KeepAlive with a
Random side note: looking into your package made me think of this, in case you are curious to see the most complete and awesome shell in Go: https://github.com/creack/goshell :)
I did reproduce it on OSX, and while I haven't found anything conclusive yet, I am increasingly thinking the problem is localized to the testing scenario and is unlikely to cause issues in a "real world" scenario, even for large supervisors like Docker. Monitoring the goroutines and memory, it doesn't seem to be a leak.

Looking back at the self-contained reproducer snippet, it is actually not the same use case. In the snippet, the hangup is indeed expected, even if not obvious: when the primary is closed before 'Wait' returns, the child may or may not have completed, so getting a hangup signal is to be expected sometimes.

Thinking more about having a KeepAlive or something similar to ensure the parent stays alive, it would cause some unexpected/unintended behaviors, since existing code might depend on the current behavior. I suspect the problem may be related to the stdlib ForkExec race, or it could be a bug in the ioctl call, or maybe a bug in the kernel. As mentioned earlier, there is only one "primary", which is a device, not a regular file. It looks like closing one of the opened "primary" files results in closing another and/or the wrong "secondary".

I will keep digging tomorrow.
Bug or not, I agree that more docs around "be careful about the primary terminal being closed too early" would be useful, including the fact that files are closed when they get garbage-collected.
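As a stdlib-only illustration of the garbage-collection point (not code from this package), the sketch below uses `os.Pipe` as a stand-in for any `*os.File`, such as the primary. Once nothing references the `*os.File`, its finalizer may close the underlying descriptor after a GC cycle, even though the raw fd number is still held as an `int`. `runtime.KeepAlive` keeps the file reachable, so its descriptor stays open at least until that call. The helper names are hypothetical, and the polling loop exists only because finalizers run asynchronously.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
	"time"
)

// fdIsOpen reports whether fd is still an open descriptor (F_GETFD fails with EBADF otherwise).
func fdIsOpen(fd int) bool {
	_, _, errno := syscall.Syscall(syscall.SYS_FCNTL, uintptr(fd), syscall.F_GETFD, 0)
	return errno != syscall.EBADF
}

// gcUntilClosed runs the GC up to attempts times, pausing so the finalizer
// goroutine can run, and reports whether fd was closed in the meantime.
func gcUntilClosed(fd, attempts int) bool {
	for i := 0; i < attempts; i++ {
		runtime.GC()
		time.Sleep(10 * time.Millisecond)
		if !fdIsOpen(fd) {
			return true
		}
	}
	return false
}

// openAndDrop opens a pipe but keeps only the raw fd of the write end, so the
// write end's *os.File is unreachable once this function returns.
//
//go:noinline
func openAndDrop() (*os.File, int) {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	return r, int(w.Fd())
}

// droppedFileGetsClosed: nothing references the *os.File, so its finalizer closes the fd.
func droppedFileGetsClosed() bool {
	r, fd := openAndDrop()
	defer r.Close()
	return gcUntilClosed(fd, 100)
}

// keptAliveFileStaysOpen: runtime.KeepAlive keeps w reachable until that call,
// so the GC cannot finalize it (and close its fd) during the loop.
func keptAliveFileStaysOpen() bool {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	defer r.Close()
	fd := int(w.Fd())
	closed := gcUntilClosed(fd, 20)
	runtime.KeepAlive(w)
	return !closed
}

func main() {
	fmt.Println("dropped file closed by GC:", droppedFileGetsClosed())
	fmt.Println("kept-alive file still open:", keptAliveFileStaysOpen())
}
```

In practice, any later use of the file, such as closing it after `Wait`, keeps it reachable just as well; `runtime.KeepAlive` is the explicit form for when no such use exists.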
Closing with #167 |
Thanks for writing this package! I've been using it successfully for some time to test a shell interpreter over at https://github.com/mvdan/sh :)
I've been running into sporadic flakes with pty.Start for some time; see mvdan/sh#513. It stumped me for a long time, but today it just clicked why it's happening: mvdan/sh#513 (comment)

The culprit seems to be this line:
pty/run.go
Line 50 in 8ac0cc1
As proof, see how swapping pty.Start with a more manual pty.Open that closes after cmd.Wait() never fails: mvdan/sh@f7684ec

I'm not sure why the deferred close is there, or why I'm the only one running into this issue. But given that the API is called pty.Start, not pty.Run, it feels wrong to close the terminal as soon as the function returns.

Would you accept a patch with the fix? That is, removing line 50, and relying on the caller to close the primary/master terminal (pty) after they have done cmd.Wait(), just like in my updated test.