Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ptys unexpectedly interrupted and exit early in GitHub Actions, Linux with io_uring enabled #630

Closed
lydell opened this issue Sep 23, 2023 · 9 comments
Labels
bug Issue identified by VS Code Team member as probable bug

Comments

@lydell
Copy link

lydell commented Sep 23, 2023

Environment details

  • OS: ubuntu-latest on GitHub Actions
  • node-pty version: 1.0.0, 1.1.0-beta4
  • Node.js version: 20

Summary of the below updates: If Node.js has io_uring enabled, the bugs occurs.
Update: It works with Node.js 18! https://github.com/lydell/node-pty-bug/actions/runs/6283383630/job/17063792075
And 19: https://github.com/lydell/node-pty-bug/actions/runs/6283390976/job/17063807172
And 20.0.0: https://github.com/lydell/node-pty-bug/actions/runs/6283396940/job/17063819451
So it must have been introduced in some 20.x release … the search continues.
Aha! It’s Node.js 20.3.0 that introduced it: https://github.com/lydell/node-pty-bug/actions/runs/6283409434
Switched the child process from node to bash and it got a bit flaky: https://github.com/lydell/node-pty-bug/actions/runs/6283443471/attempts/1 vs https://github.com/lydell/node-pty-bug/actions/runs/6283443471/attempts/2
Update: Can happen on Node.js 18.18.0 too: https://github.com/lydell/node-pty-bug/actions/runs/6415765977

Issue description

When running ptys in parallel, some of them unexpectedly receive “signal 1” (SIGHUP?) and exit early.

This happens in ubuntu-latest on GitHub Actions. But not in macOS-latest. And not my own Ubuntu machine (and not on my own macOS machine either).

I made a minimal repo to show it happening: https://github.com/lydell/node-pty-bug

Example failing output (ubuntu-latest): https://github.com/lydell/node-pty-bug/actions/runs/6283275300/job/17063579148

Example successful output (macOS-latest): https://github.com/lydell/node-pty-bug/actions/runs/6283275300/job/17063579184

This is so strange! I wonder what’s special about Linux on GitHub Actions that makes this happen. Any ideas?

@Tyriar
Copy link
Member

Tyriar commented Sep 25, 2023

node-pty is only really tested currently with node up to v18 as it's what vscode ships as part of its Electron builds.

@williamstein
Copy link

This bug has also been significantly impacting us on cocalc.com, and I made an issue in our repo about it here with some more details, including some strace output: sagemathinc/cocalc#6963

In particular, node-pty has this bug on Node.js 18.18.0 (released about 2 weeks ago), but not with Node 18.17.1 (the previous version). My guess is you'll hit this with vscode in December (?), when they might upgrade to Node v18.18. Right now electron is on v18.16, I think.

@lydell
Copy link
Author

lydell commented Oct 5, 2023

I can confirm it happens on 18.18.0: https://github.com/lydell/node-pty-bug/actions/runs/6415765977

(As you can see, that was on attempt 2 of that run – first time it didn’t happen. Seems to not be 100 % reproducible. Also note how 20.8.0 succeeded there, but fails here: https://github.com/lydell/node-pty-bug/actions/runs/6415782042/job/17418282266.)

@williamstein
Copy link

williamstein commented Oct 5, 2023

Thanks for testing on 18.18.0.

Seems to not be 100 % reproducible.

For me at least this problem really is 100% reproducible via the exact same steps. What it takes to reproduce it is extremely weird, since it involves doing something once, then restarting a server, then doing the exact same thing again and then it fails 100% after that. It should be the other way around, where restarting the server should clear any state and make the problem go away, so I'm extremely puzzled by this bug.

I also have no idea if this is a bug in nodejs or in node-pty. It seems more likely to be a nodejs bug, but I didn't find anything clearly relevant when searching their issue tracker. I tried reverting to older versions of node-pty and they also exhibit this behavior. The changelog from 18.17.x --> 18.18.0 includes many updates of dependencies, and also three changes to child_process. This change looks potentially relevant:

nodejs/node@b5df084e1e

They rewrote bits of code specifically related to how they handle subprocess termination in these changes, and when I reproduce this bug, I basically see:

  • terminate the subprocess by hitting ctrl+d to exit,
  • suddenly cpu spikes to 100% as io_uring_enter, etc., get called in an infinite loop (according to strace)

So I bet one of those child_process commits related to subrocesses exiting introduced a bug in node.js.
They are all Javascript code (not C code), so it might be easy to manually revert them and see if the bug goes away.

@kkocdko
Copy link
Contributor

kkocdko commented Dec 7, 2023

Please set export UV_USE_IO_URING=0 as workaround and waiting for a fix (from upsteam maybe).

I'm debugging for this, yeah it's 100% reproducible, the core path is: enable io_uring in libuv, create pty1, create pty2, destory pty1. The order is important, destory by stack order will not hit this bug.

@lydell
Copy link
Author

lydell commented Dec 7, 2023

Thanks for finding that!

UV_USE_IO_URING=1 ➡️ 🚨 failure (https://github.com/lydell/node-pty-bug/actions/runs/7132159128/job/19422149709) (Node.js 20.10.0)

UV_USE_IO_URING=0 ➡️ ✅ expected output (https://github.com/lydell/node-pty-bug/actions/runs/7132168279/job/19422177649) (Node.js 20.10.0)

@lydell
Copy link
Author

lydell commented Mar 10, 2024

Note for people trying to reproduce: Node.js 20.11.1 and 21.6.2 disabled io_uring by default due to a security issue. So right now, the latest Node.js versions don’t have the bug by default, but it can be enabled by setting UV_USE_IO_URING=1.

@lydell
Copy link
Author

lydell commented Mar 10, 2024

Is there anyone who has managed to make a shorter reproduction than mine, that also reproduces outside of GitHub Actions? I think having such a reproduction here would increase the chances of fixing this.

@lydell lydell changed the title ptys unexpectedly interrupted and exit early in GitHub Actions ptys unexpectedly interrupted and exit early in GitHub Actions, Linux with io_uring enabled Apr 17, 2024
@lydell
Copy link
Author

lydell commented Sep 7, 2024

Here is my conclusion of the issue. I’m not 100 % certain I’m right, but I think I have a plausible explanation.

  1. Node.js enables io_uring.
  2. This causes high CPU usage in node-pty on Linux.
  3. I notice the problem with node-pty in GitHub Actions. What I think happens, is that the process uses 100 % CPU and that GitHub Actions notices that and kills the process with SIGHUP. (I have no idea if GitHub Actions actually works like that, but it seems like something a CI could do to avoid wasting resources.)
  4. My CI job still succeeded, because I only looked at exit code and not signals. Since then, I’ve switched to exiting with 128+signal (see https://tldp.org/LDP/abs/html/exitcodes.html). If I had done that from the start, I would have noticed this issue through CI failures instead of silently ignoring failures.
  5. We learn about the UV_USE_IO_URING=0 workaround in this issue. (ptys unexpectedly interrupted and exit early in GitHub Actions, Linux with io_uring enabled #630 (comment))
  6. Node.js disables io_uring by default due to security concerns. (ptys unexpectedly interrupted and exit early in GitHub Actions, Linux with io_uring enabled #630 (comment))
  7. Node.js has a tracking issue for enabling io_uring by default again: Enable io_uring support in libuv by default nodejs/node#52156

I looks like there is nothing node-pty can do about this currently. I think it’s better to wait for Node.js to keep working on their io_uring implementation, and see how things work when (if) they enable it by default again. So I’m going to close this issue. If there is somebody in this thread who still has a problem, open a new issue. I consider the problem I reported at the top of this issue done for now: It doesn’t happen by default anymore, there is a workaround, and Node.js seems to be working on things.

@lydell lydell closed this as not planned Won't fix, can't repro, duplicate, stale Sep 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue identified by VS Code Team member as probable bug
Projects
None yet
Development

No branches or pull requests

4 participants