Uncaught exception thrown in a way that can't be intercepted in userland #3848
Comments
This is indeed a bad bug. However, it'd be impossible to fix without a reproduction. Would you mind sending a PR?
Looks to me like the body is missing an error handler.
@ronag we've tried to cover that by systematically attaching an 'error' handler.
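For illustration, a minimal sketch (not the reporter's actual code; the origin and path are made up) of the kind of systematic handler attachment being discussed, using undici's promise API:

```js
const { Pool } = require('undici')

const pool = new Pool('http://localhost:3000') // hypothetical origin

async function fetchData () {
  const { statusCode, body } = await pool.request({ path: '/data', method: 'GET' })

  // Attach the 'error' handler right away so a late body error has a listener
  // instead of becoming an unhandled 'error' event.
  body.on('error', (err) => {
    console.error('response body errored', err)
  })

  if (statusCode !== 200) {
    await body.dump() // drain and discard the body
    return null
  }
  return body.text()
}
```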
Further debugging shows some additional symptoms. Given those symptoms, I wonder if another potential avenue of exploration is worth pursuing. Unfortunately, our local repro is statistically frequent but not 100%. I'll try using a different type of setup to narrow it down.
I've been able to reproduce the symptoms in our system where things would normally break, but without the process-level crash. To do so, I stopped using long-lived `Pool` instances. This observation strengthens the hypothesis that a potential bug might lie within the long-lived `Pool`.
Without a repro there is not much we can do here...
Understood. Hopefully this issue can provide context just in case someone's spelunking through the codebase and a spark of brilliance emerges. In the interim, we'll be taking a bit of a perf hit and using per-operation `Pool` instances.
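A sketch of what such a per-operation workaround could look like (assumed shape, hypothetical endpoint), trading connection reuse for isolation from any long-lived dispatcher state:

```js
const { Pool } = require('undici')

async function runOperation (payload) {
  // One short-lived pool per operation instead of a shared long-lived one.
  const pool = new Pool('http://localhost:3000') // hypothetical origin
  try {
    const { body } = await pool.request({
      path: '/batch', // hypothetical path
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(payload)
    })
    return await body.json()
  } finally {
    await pool.close() // waits for in-flight requests, then tears down sockets
  }
}
```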
Note that I think you should try adding something at undici/lib/dispatcher/client.js, line 434 (at a427e4b).
Seems like the issue was already noticed and fixed (by my colleague @metcoder95 💪🏼), but the auto-backport failed. I put up a PR in #3855, but Node 22 CI seems broken on a particular test when targeting the backport branch.
False alarm: #3855 doesn't actually fix the problem, but it seems like a good idea nonetheless.
Here are some logs. There are multiple PIDs because our service is using multiple worker processes. PID 36 is the process that crashed with the stack:
Hey @mcollina, I'm pretty sure I've found the root cause, and I don't think it would be addressed by your PR. I've been working on putting together a test case to try to prove my understanding. The crux of it is this:
A work-around would be to add an 'error' handler earlier. It makes me ask the question: is it actually helpful for users of this library for response body errors to become uncaught exceptions? The body is already a single-use, stateful stream, so it's not as though such errors put the system into an undefined state. Instead, it seems like pure risk with little upside.
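To make the suspected window concrete, here is an illustrative sketch (hypothetical code, not undici internals): anything awaited between receiving the body and attaching its 'error' listener leaves a gap in which an emitted 'error' has no listener and therefore escalates.

```js
const { Pool } = require('undici')

const pool = new Pool('http://localhost:3000') // hypothetical origin

async function risky () {
  const { body } = await pool.request({ path: '/slow', method: 'GET' })

  // Anything awaited here opens a window: if the socket dies now, the body can
  // emit 'error' while no listener is attached, and the error escalates.
  await new Promise((resolve) => setTimeout(resolve, 100))

  body.on('error', () => {}) // attached too late if the error already fired
  return body.text()
}
```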
Destroy should be deferred by a setImmediate, so I'm not sure that's the cause.
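A generic illustration of that deferral argument (plain Node streams, not undici source): deferring the destroy with setImmediate gives the consumer the rest of the current tick to attach an 'error' listener first.

```js
const { Readable } = require('node:stream')

const body = new Readable({ read () {} })

// Simulated internal failure path: the destroy is deferred rather than run
// synchronously in the tick that handed the stream to the caller.
setImmediate(() => body.destroy(new Error('socket reset')))

// Consumer code later in the same tick still wins the race to attach a listener.
body.on('error', (err) => console.error('handled:', err.message))
```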
OK, new evidence that this is actually not a bug, as you have probably long suspected.
Apologies for this wild goose chase. I've tracked down the issue, and it was indeed related to a window of time in which no 'error' handler was attached to the response body. I'm closing this as not an issue 🤦🏼. I do, however, appreciate every bit of support you folks brought while we were trying to track this down. I apologize for leading you down this dead end. 😊
Bug Description
We had a recent incident wherein a specific workload was able to cause Node.js to crash due to an uncaught exception.
We saw two distinct cases of uncaught exceptions:
Both of these were caught via `process.on('uncaughtException')` and had the `origin` of `uncaughtException` (these were not unhandled rejections, AFAICT). To the best of our knowledge, all opportunities for exhaustive error handling have been exercised, though we've been unable to produce a minimal repro outside our codebase. Within our codebase, we've been able to get deeper stack traces through Chrome DevTools.
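For reference, the process-level hook mentioned above has this shape in Node.js; the second argument distinguishes uncaught exceptions from unhandled rejections:

```js
process.on('uncaughtException', (err, origin) => {
  // origin is either 'uncaughtException' or 'unhandledRejection';
  // in the incident described above it was 'uncaughtException'.
  console.error(`caught at process level, origin=${origin}`, err)
})
```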
In that longer stack trace, you can see that we call `request` on a `Pool` instance in the `executeValidatedActionsBatch` function, which is declared as an `async function`. While that usually allows us to capture errors via `Promise` rejection, in the case of the incident a customer's workload was reliably causing these exceptions to bubble up to the process level. We have some weak hypotheses:
- Something about the `end` event and gaining access to the request's `.body` `Readable`; we don't have a reference to the `Readable` by the time the error happens.
- Something about the `AbortSignal` and other clean-up (a usage sketch follows this list).
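A sketch of what those two areas could look like in usage (assumed code, not the reporter's): a per-request AbortSignal for clean-up plus an early-retained reference to the response body.

```js
const { Pool } = require('undici')

const pool = new Pool('http://localhost:3000') // hypothetical origin

async function fetchWithCleanup () {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), 30_000)
  try {
    const { body } = await pool.request({
      path: '/actions', // hypothetical path
      method: 'GET',
      signal: controller.signal // undici aborts the request when this fires
    })
    // Keep a reference to the body and give it an 'error' listener immediately.
    body.on('error', (err) => console.error('body error', err))
    return await body.json()
  } finally {
    clearTimeout(timer)
  }
}
```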
Reproducible By
This is reproduced (sometimes) when the process tree within which the target server is running is suddenly OOM-killed. The undici `Pool` is connected via a Unix domain socket, which might present unique ways of blowing up compared to the usual TCP sockets.
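For context, a minimal sketch of a `Pool` over a Unix domain socket via undici's `socketPath` option (the path is made up):

```js
const { Pool } = require('undici')

// Connections go over the socket instead of TCP; the origin URL still
// provides the host used in the requests.
const pool = new Pool('http://localhost', {
  socketPath: '/var/run/target-server.sock' // hypothetical path
})
```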
Expected Behavior
We expect that no process-level uncaught exceptions or unhandled rejections are possible in pseudocode like this:
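The original snippet was not captured here; the following is a rough reconstruction of the described pattern (names beyond `executeValidatedActionsBatch` and `Pool` are assumptions), under which all failures were expected to surface as `Promise` rejections:

```js
const { Pool } = require('undici')

const pool = new Pool('http://localhost', {
  socketPath: '/var/run/target-server.sock' // hypothetical path
})

async function executeValidatedActionsBatch (batch) {
  try {
    const { statusCode, body } = await pool.request({
      path: '/batch', // hypothetical path
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(batch)
    })
    body.on('error', (err) => console.error('response body error', err))
    if (statusCode >= 400) {
      await body.dump()
      throw new Error(`batch request failed with status ${statusCode}`)
    }
    return await body.json()
  } catch (err) {
    // Every failure was expected to land here as a Promise rejection,
    // never as a process-level uncaught exception.
    console.error('batch failed', err)
    return null
  }
}
```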
Logs & Screenshots
Added in description
Environment
Docker image `node:18.19.1-bullseye` via Docker for Mac on macOS 14.17.1.
Additional context
Sort of terrible drawing of what our stuff is doing: