100% CPU native-only (no JS runs) loop for 2-5 minutes on Mac OS 10.15 in libuv.stream.uv__try_write #43916
Comments
PR #42340 should fix this when merged and back-ported to v16.x. Unfortunately, it's not exactly been smooth sailing...
@bnoordhuis - Can we cherry pick a patch for For example, here is what is needed to resolve the issue on v16 (same actually on v18, possibly v14 too): https://github.com/nodejs/node/compare/v16.x...huntharo:node:v16-mac-loop-fix?expand=1
Cherry-picking libuv commits is hardly ever done. I can't remember the last time we did that.
Hrm... bummer... on an internal survey we got ~30 responses and I think 28 of the 30 people are hitting this 100% CPU loop several times a day. Some have said they are surprised people are not leaving in droves as it's so frustrating. Some report that it happens every time they start node within 2 minutes. It's very hard to get work done like that. We applied the patches in my diff above and the issue is resolved. From the next.js ticket it seems that quite a few other developers in the wild are hitting this too. For the longest time we suspected our own code, then next.js, then webpack hmr, then the I would really encourage a quick patch if possible for 14 and 16. I think the reputation of the ecosystem is being negatively impacted by this. Thanks for looking at this and all that node is!
I can also attest this has been a major blocker and pain point on our team. Exploring options to implement a targeted patch would be greatly appreciated!
Original commit log follows:

darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405)

It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error.

Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop.

Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up.

Refs: libuv/libuv#482
Fixes: nodejs#43916

Original commit log follows:

darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413)

macOS versions 10.10 and 10.15 - and presumably 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET.

Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop.

Refs: libuv/libuv#482
Refs: libuv/libuv#3405
Fixes: nodejs#43916
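For anyone following along, here is a rough sketch in C of the two behaviors those commit logs describe. It is not the libuv source: the names `write_fn`, `write_with_retry`, `write_translate` and `always_eprototype` are made up for illustration, and the `max_attempts` bound exists only so the demonstration terminates (the loop described in this issue had no such bound). The fake writer stands in for a socket that fails permanently with EPROTOTYPE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for write(2) so the sketch runs without a real socket. */
typedef ssize_t (*write_fn)(int fd, const void *buf, size_t len);

/* Old behavior (simplified): keep retrying while the error is EINTR or
 * EPROTOTYPE. If EPROTOTYPE is permanent, only max_attempts stops the loop.
 * Returns bytes written, or a negated errno value on failure. */
ssize_t write_with_retry(write_fn do_write, int fd, const void *buf,
                         size_t len, unsigned max_attempts,
                         unsigned *attempts) {
  ssize_t n;

  assert(max_attempts > 0);
  *attempts = 0;
  do {
    n = do_write(fd, buf, len);
    (*attempts)++;
  } while (n == -1 && (errno == EINTR || errno == EPROTOTYPE) &&
           *attempts < max_attempts);

  return n == -1 ? -errno : n;
}

/* New behavior (simplified): retry only on EINTR and report EPROTOTYPE to the
 * caller as ECONNRESET instead of spinning. */
ssize_t write_translate(write_fn do_write, int fd, const void *buf,
                        size_t len) {
  ssize_t n;

  do
    n = do_write(fd, buf, len);
  while (n == -1 && errno == EINTR);

  if (n == -1 && errno == EPROTOTYPE)
    return -ECONNRESET;

  return n == -1 ? -errno : n;
}

/* Simulated socket that fails permanently with EPROTOTYPE. */
ssize_t always_eprototype(int fd, const void *buf, size_t len) {
  (void) fd;
  (void) buf;
  (void) len;
  errno = EPROTOTYPE;
  return -1;
}
```

With `always_eprototype` as the writer, `write_with_retry` burns through every attempt it is allowed and still fails, while `write_translate` returns `-ECONNRESET` after a single call, so the error can bubble up to the caller like any other reset connection.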
I've opened #43950 to cherry-pick the fixes but be aware that the way the release process works means they're first released in the next v18.x before getting back-ported to v16 and finally v14 - and that's of course assuming they actually get merged.
THANK YOU!!! I understand it may take some time. Thanks so much!
Original commit log follows:

darwin: translate EPROTOTYPE to ECONNRESET (libuv/libuv#3413)

macOS versions 10.10 and 10.15 - and presumably 10.11 to 10.14, too - have a bug where a race condition causes the kernel to return EPROTOTYPE because the socket isn't fully constructed. It's probably the result of the peer closing the connection and that is why libuv translates it to ECONNRESET.

Previously, libuv retried until the EPROTOTYPE error went away but some VPN software causes the same behavior except the error is permanent, not transient, turning the retry mechanism into an infinite loop.

Refs: libuv/libuv#482
Refs: libuv/libuv#3405
Fixes: #43916
PR-URL: #43950
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Should we keep this open to track the backport PRs? Or do we normally let this stay closed and just use the PRs for those specific versions to do the tracking?
Back-porting is normally an automated process unless the commits don't apply cleanly (but I expect they will.)
Original commit log follows:

darwin: remove EPROTOTYPE error workaround (libuv/libuv#3405)

It's been reported in the past that OS X 10.10, because of a race condition in the XNU kernel, sometimes returns a transient EPROTOTYPE error when trying to write to a socket. Libuv handles that by retrying the operation until it succeeds or fails with a different error.

Recently it's been reported that current versions of the operating system formerly known as OS X fail permanently with EPROTOTYPE under certain conditions, resulting in an infinite loop.

Because Apple isn't exactly forthcoming with bug fixes or even details, I'm opting to simply remove the workaround and have the error bubble up.

Refs: libuv/libuv#482
Fixes: #43916
PR-URL: #43950
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Version
v16.13.1
Platform
Darwin hhunt-mbp-m1pro 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000 arm64
Subsystem
libuv
What steps will reproduce the bug?
libuv project may have notes on that as they fixed the problem:
How often does it reproduce? Is there a required condition?
Within my team, every few minutes.
We do not know how to reproduce this in isolation.
The libuv ticket mentions "specific VPN software" that can exacerbate the problem and we are all using the same VPN software on Mac OS (both Intel and ARM) so that could be the key.
A major key is that the loop is stuck 100% in native code in a tight loop. As a result, Ctrl-C will not interrupt the application and it must be forced to exit.
What is the expected behavior?
No infinite loops in native code.
What do you see instead?
Infinite loops in native code (stepped through in debugger).
Additional information
This next.js issue appears to be due to this. All the reports are from Mac OS users and the behavior described is "does not respond to Ctrl-C until a few minutes have passed" (paraphrasing), so it seems this is happening in the wild but nobody looked at this in a debug build to find that the loop was already known and fixed :)
vercel/next.js#10061
These two fixes cherry-picked from libuv will address the issue:
libuv/libuv#3405
libuv/libuv#3413
I can prepare PRs if desired.
We'd really like to have this fix for at least node16 and later versions.
Thanks!