
Uncaught exception thrown in way that can't be intercepted in userland #3848

Closed
ggoodman opened this issue Nov 20, 2024 · 15 comments

@ggoodman (Contributor)

Bug Description

We had a recent incident wherein a specific workload was able to cause Node.js to crash due to an uncaught exception.

We saw two distinct cases of uncaught exceptions:

Error: read ECONNRESET
    at Pipe.onStreamRead (node:internal/stream_base_commons:217:20)
SocketError: other side closed
    at Socket.<anonymous> (/data/node_modules/undici/lib/dispatcher/client-h1.js:701:24)
    at Socket.emit (node:events:529:35)
    at Socket.emit (node:domain:489:12)
    at endReadableNT (node:internal/streams/readable:1400:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

Both of these were caught via process.on('uncaughtException') with an origin of uncaughtException (i.e. they were not unhandled rejections, AFAICT).
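For reference, a minimal sketch of the process-level handler that caught these (the real handler lives in graceful_termination.js, referenced in the trace below; this body is illustrative):

// Illustrative only: log the error and its origin, then begin graceful shutdown.
process.on('uncaughtException', (err, origin) => {
  // `origin` was 'uncaughtException' (not 'unhandledRejection') for both errors above.
  console.error(`fatal (${origin}):`, err);
  process.exitCode = 1;
});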

To the best of our knowledge, every opportunity for exhaustive error handling has been exercised, though we've been unable to produce a minimal repro outside our codebase. Within our codebase, we've been able to get deeper stack traces through Chrome DevTools:

onUncaughtException (graceful_termination.js:53)
emit (node:events:529)
emit (node:domain:489)
(anonymous) (node:internal/process/execution:158)
TickObject
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
emitInitScript (node:internal/async_hooks:503)
nextTick (node:internal/process/task_queues:132)
onDestroy (node:internal/streams/destroy:103)
(anonymous) (readable.js:68)
processImmediate (node:internal/timers:476)
topLevelDomainCallback (node:domain:161)
callbackTrampoline (node:internal/async_hooks:126)
Immediate
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
emitInitScript (node:internal/async_hooks:503)
initAsyncResource (node:internal/timers:164)
Immediate (node:internal/timers:620)
setImmediate (node:timers:307)
_destroy (readable.js:67)
_destroy (node:internal/streams/destroy:109)
destroy (node:internal/streams/destroy:71)
destroy (readable.js:58)
destroy (util.js:290)
(anonymous) (api-request.js:176)
(anonymous) (node:internal/process/task_queues:140)
runInAsyncScope (node:async_hooks:203)
runMicrotask (node:internal/process/task_queues:137)
Microtask
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
emitInitScript (node:internal/async_hooks:503)
AsyncResource (node:async_hooks:186)
queueMicrotask (node:internal/process/task_queues:152)
onError (api-request.js:175)
onError (request.js:299)
errorRequest (util.js:638)
(anonymous) (client-h1.js:740)
emit (node:events:529)
emit (node:domain:489)
(anonymous) (node:net:350)
callbackTrampoline (node:internal/async_hooks:128)
PIPEWRAP
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
Socket.connect (node:net:1218)
connect (node:net:249)
connect (connect.js:126)
socket (client.js:428)
connect (client.js:427)
_resume (client.js:600)
resume (client.js:534)
Client.<computed> (client.js:259)
[dispatch] (client.js:314)
Intercept (redirect-interceptor.js:11)
dispatch (dispatcher-base.js:177)
[dispatch] (pool-base.js:143)
dispatch (dispatcher-base.js:177)
request (api-request.js:203)
(anonymous) (api-request.js:196)
request (api-request.js:195)
executeValidatedActionsBatch (execute_actions_batch.js:493) // Where we invoke `.request()` of a `Pool` instance.
// ... snip ...

In that longer stack trace, you can see that we call request on a Pool instance in the executeValidatedActionsBatch function, which is declared as an async function. While that usually allows us to capture errors via Promise rejection, during the incident a customer's workload was reliably causing these exceptions to bubble up to the process level.

We have some weak hypotheses:

  1. A timing issue between a socket end event and our gaining access to the request's .body Readable; we don't have a reference to the Readable by the time the error happens.
  2. A timing issue between the AbortSignal and other cleanup.
  3. 😕 ❓

Reproducible By

This is reproduced (sometimes) when the process tree in which the target server is running is suddenly OOM-killed. The undici Pool is connected over a unix domain socket, which might present unique failure modes compared to the usual TCP sockets.
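For context, the Pool is constructed roughly like this (a sketch; the origin and socket path are placeholders, and socketPath is assumed to be undici's documented option for IPC endpoints):

const { Pool } = require('undici');

// Placeholder origin and path. `socketPath` routes the connection over a unix
// domain socket while the origin is still used for the request URL and Host header.
const pool = new Pool('http://service.containers.local', {
  socketPath: '/data/ports/service/io/port.sock',
});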

Expected Behavior

We expect that no process-level uncaught exceptions or unhandled rejections are possible in pseudocode like this:

// Wrap undici in an async function so that we can handle all rejections in the async
// continuation. `pool`, `signal`, and `consumeStreamUpToDelimiter` come from the
// enclosing scope.
const { pipeline } = require('node:stream/promises');

async function doRequest(options, sink) {
  const res = await pool.request({ ...options, signal });

  // Avoid uncaughts via EventEmitter legacy footguns.
  res.body.on('error', () => {});

  const firstChunk = await consumeStreamUpToDelimiter(res.body, '\n', { signal });

  // Do stuff with the first chunk. Conditionally do another client request whose body
  // uses `res.body`.

  // After we've done all the client requests, pipe the tail of the last response
  // back into a supplied sink (Writable stream).
  await pipeline([res.body, sink], { signal });
}
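consumeStreamUpToDelimiter is our own helper; for completeness, a minimal sketch of its shape (illustrative only, the real implementation differs):

// Read from a paused Readable until `delimiter` appears, resolve with everything
// before it, and unshift the remainder so later consumers still see it.
function consumeStreamUpToDelimiter(readable, delimiter, { signal } = {}) {
  return new Promise((resolve, reject) => {
    let buffered = Buffer.alloc(0);

    const onReadable = () => {
      let chunk;
      while ((chunk = readable.read()) !== null) {
        buffered = Buffer.concat([buffered, chunk]);
        const idx = buffered.indexOf(delimiter);
        if (idx !== -1) {
          cleanup();
          const rest = buffered.subarray(idx + delimiter.length);
          if (rest.length > 0) readable.unshift(rest); // leave the tail in the stream
          return resolve(buffered.subarray(0, idx));
        }
      }
    };
    const onError = (err) => { cleanup(); reject(err); };
    const onEnd = () => { cleanup(); reject(new Error('delimiter not found before end of stream')); };
    const onAbort = () => onError(signal.reason ?? new Error('aborted'));
    const cleanup = () => {
      readable.off('readable', onReadable);
      readable.off('error', onError);
      readable.off('end', onEnd);
      signal?.removeEventListener('abort', onAbort);
    };

    readable.on('readable', onReadable);
    readable.on('error', onError);
    readable.on('end', onEnd);
    signal?.addEventListener('abort', onAbort, { once: true });
  });
}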

Logs & Screenshots

Added in description

Environment

Docker image node:18.19.1-bullseye via Docker for Mac on macOS 14.17.1.

Additional context

Sort of terrible drawing of what our stuff is doing:

Server request --> Repeat N times --> await pipeline(lastClientRes.body, server response)
                          |--> lastClientRes = await Pool.request({ socketPath })
                          |--> for (i in [0...M]) { await consumeStreamUpToDelimiter(lastClientRes) }
@ggoodman added the bug label on Nov 20, 2024
@mcollina (Member)

This is indeed a bad bug. However, it'd be impossible to fix without a reproduction. Would you mind sending a PR?

@ronag (Member) commented Nov 20, 2024

Looks to me like the body is missing an error handler, e.g. through consumeStreamUpToDelimiter

@ggoodman (Contributor, Author)

Looks to me like the body is missing an error handler, e.g. through consumeStreamUpToDelimiter

@ronag we've tried to cover that by systematically attaching an "error" handler to clientRes.body in the same microtask in which we first get a reference to it.
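Roughly, the pattern looks like this (a sketch; requestWithSafeBody and noop are illustrative names):

const noop = () => {};

// Attach the 'error' listener in the same microtask in which we first receive
// the response, before any other await.
async function requestWithSafeBody(pool, options, signal) {
  const clientRes = await pool.request({ ...options, signal });
  clientRes.body.on('error', noop); // attached synchronously; no await in between
  return clientRes;
}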
Are there additional forms of error handling we should be considering?

@ggoodman (Contributor, Author)

Additional debugging shows some additional symptoms:

  • Prior to the actual breakage, we see several requests fail in a consistent way. The target endpoint to which undici's Pool is managing connections gets into a broken state (AFAICT, totally our fault, but nonetheless a contributing factor here).
  • The behaviour manifests after several consecutive failures of this sort on the same Pool instance. Other Pool instances are unaffected (we have many pools in this multi-tenant system).

Given those symptoms, I wonder whether the Pool's socket management might be another avenue worth exploring. Are there windows of opportunity in which a Socket owned by a Pool has no "error" listener?

Unfortunately, our local repro is statistically frequent but not 100%. I'll try using a different type of Dispatcher that doesn't use keepAlive and see if I can still repro.

@ggoodman (Contributor, Author)

I've been able to reproduce the symptoms in our system where things would normally break, but without the process-level crash. To do so, I stopped using long-lived Pool instances and instead swapped to per-request Client instances (with pipelining=0, which is a possible red herring).

This observation strengthens the hypothesis that the bug might lie in the Pool's socket lifecycle management, outside the critical request path.
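For reference, the per-operation shape we swapped to looks roughly like this (a sketch; origin, socket path, and options are placeholders, and pipelining: 0 is assumed here to disable keep-alive on the socket):

const { Client } = require('undici');
const { pipeline } = require('node:stream/promises');

// One short-lived Client per operation instead of a long-lived Pool.
async function doRequestOnce(origin, socketPath, options, sink, signal) {
  // pipelining: 0 means the socket is not kept alive between requests.
  const client = new Client(origin, { socketPath, pipelining: 0 });
  try {
    const res = await client.request({ ...options, signal });
    res.body.on('error', () => {}); // same defensive handler as with the Pool
    await pipeline(res.body, sink, { signal });
  } finally {
    await client.close(); // close only once the body has been fully consumed
  }
}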

@ronag (Member) commented Nov 20, 2024

Without a repro there is not much we can do here...

@ggoodman (Contributor, Author)

Understood. Hopefully this issue can provide context just in case someone's spelunking through the codebase and a spark of brilliance emerges.

In the interim, we'll take a bit of a perf hit and use per-operation Client instances, as we've been unable to reproduce the issue when we avoid socket re-use.

@mcollina (Member)

To do so, I stopped using long-lived Pool instances and instead swapped to per-request Client instances (with pipelining=0, which is a possible red-herring).

Note that Client holds the socket; Pool keeps a list of Clients around. Switching to one Client per request means that the problem is in the logic that keeps the socket long-lived in Client.

I think you should try adding a socket.on('error', console.error) somewhere before here:

util.destroy(socket.on('error', noop), new ClientDestroyedError())
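For example, a debugging hook along these lines attaches a logger to every socket undici establishes, without patching undici itself (a sketch, assuming undici's documented undici:client:connected diagnostics channel, whose message includes the socket):

const diagnosticsChannel = require('node:diagnostics_channel');

// Log every socket error undici's clients see. Note that adding a listener also
// prevents a socket 'error' with no other listeners from becoming uncaught.
diagnosticsChannel.channel('undici:client:connected').subscribe(({ socket }) => {
  socket.on('error', (err) => {
    console.error('undici socket error:', err);
  });
});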

@ggoodman (Contributor, Author)

It seems the issue was already noticed and fixed (by my colleague @metcoder95 💪🏼), but the auto-backport failed. I put up a PR in #3855, but Node 22 CI seems broken on a particular test when targeting v6.x.

@ggoodman (Contributor, Author)

False alarm: #3855 doesn't actually fix the problem, but it seems like a good idea nonetheless.

@ggoodman (Contributor, Author) commented Nov 21, 2024

Here are some NODE_DEBUG=net,undici logs immediately before the crash:

UNDICI 36: connected to ac689308-1430-4214-93ec-e845383e982a.containers.local using http:h1
UNDICI 36: sending request to POST http://ac689308-1430-4214-93ec-e845383e982a.containers.local//service2/execute_batch
UNDICI 36: connecting to ac689308-1430-4214-93ec-e845383e982a.containers.local using http:undefined
NET 36: createConnection [
  {
    highWaterMark: 65536,
    path: '/data/ports/service1-ac689308-1430-4214-93ec-e845383e982a/io/port.sock',
    localAddress: null,
    port: 80,
    host: 'ac689308-1430-4214-93ec-e845383e982a.containers.local'
  },
  null,
  [Symbol(normalizedArgs)]: true
]
NET 36: pipe true /data/ports/service1-ac689308-1430-4214-93ec-e845383e982a/io/port.sock
NET 36: connect: attempting to connect to /data/ports/service1-ac689308-1430-4214-93ec-e845383e982a/io/port.sock:NaN (addressType: NaN)
NET 36: afterConnect
NET 36: _read - n 65536 isConnecting? false hasHandle? true
NET 36: Socket._handle.readStart
UNDICI 36: connected to ac689308-1430-4214-93ec-e845383e982a.containers.local using http:h1
UNDICI 36: sending request to POST http://ac689308-1430-4214-93ec-e845383e982a.containers.local//service2/execute_batch
NET 36: _read - n 65536 isConnecting? false hasHandle? true
UNDICI 36: received response to POST http://ac689308-1430-4214-93ec-e845383e982a.containers.local//service2/execute_batch - HTTP 200
NET 36: destroy
NET 36: close
NET 36: close handle
NET 36: emit close
UNDICI 36: request to POST http://ac689308-1430-4214-93ec-e845383e982a.containers.local//service2/execute_batch errored - other side closed

There are multiple PIDs because our service is using undici as a client in cluster workers.

PID 36 is the process that crashed with the stack:

SocketError: other side closed
    at Socket.<anonymous> (/data/node_modules/undici/lib/dispatcher/client-h1.js:701:24)
    at Socket.emit (node:events:529:35)
    at Socket.emit (node:domain:489:12)
    at endReadableNT (node:internal/streams/readable:1400:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

@ggoodman (Contributor, Author) commented Nov 21, 2024

Hey @mcollina, I'm pretty sure I've found the root cause, and I don't think it would be addressed by your PR. I was working on putting together a test case to try to prove my understanding.

The crux of it is this:

  1. When the res is created here, it gets handed off to the callback via async_hooks.AsyncResource.
  2. Because we're using the Promise-based API for request, the res gets handed off to a future microtask tick here.
  3. That opens a window in which any already-queued microtask that triggers a "close" event on the Socket will result in res.destroy() being called BEFORE userspace ever gets a reference to res (a minimal illustration follows this list).
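A minimal illustration of that failure mode with generic Node streams (not undici itself):

const { Readable } = require('node:stream');

// Destroying a Readable with an error emits 'error'; if nothing has attached an
// 'error' listener by the time it is emitted, Node raises an uncaughtException.
const body = new Readable({ read() {} });
body.destroy(new Error('other side closed'));
// No body.on('error', ...) was ever attached => process-level uncaughtException.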

A work-around would be to add res.on("error", util.nop) after constructing the response body stream.

It makes me ask the question: is it actually helpful for users of this library for response body errors to become uncaught exceptions? This is a single-use stateful stream already so it's not like such errors are putting the system in an undefined state. Instead, it seems like pure risk with little upside.

@ronag (Member) commented Nov 21, 2024

Destroy should be deferred by a setImmediate, so I'm not sure that's the cause.

@ggoodman (Contributor, Author)

OK, new evidence that this is actually not a bug, as you have probably long suspected.

@ggoodman (Contributor, Author)

Apologies for this wild goose chase. I've tracked down the issue, and it was indeed related to a window of time in which no "error" handler was attached to res.body. However, it was a self-created window, hidden in a finally block containing an await.
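For posterity, one plausible shape of such a window (all names here are invented; the real code differed, but the structure was similar):

const { pipeline } = require('node:stream/promises');

const noop = () => {};
async function flushMetrics() {} // stand-in for whatever our finally block awaited

async function handleOneRequest(res, sink, signal) {
  res.body.on('error', noop);
  try {
    await consumeStreamUpToDelimiter(res.body, '\n', { signal });
  } finally {
    res.body.off('error', noop);
    // This await yields control while res.body has NO 'error' listener attached.
    // Any 'error' emitted during that window becomes an uncaughtException.
    await flushMetrics();
  }
  await pipeline(res.body, sink, { signal });
}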

I'm closing this as not an issue 🤦🏼.

I do, however, appreciate every bit of support you folks brought while we were trying to track this down. I apologize for leading you down this dead end. 😊

@ggoodman closed this as not planned on Nov 21, 2024