Properly surfacing errors to shut down process when appropriate #257
Comments
Decided on this a couple days ago but will write it down here: We're going to make the callback passed to
This will be a change in behavior, but for the better in making us consistent with when the process would exit in the absence of Raven. This should make it easy for users to do the right thing in the common case and gives us control over timing to ensure errors are reported to Sentry before the process shuts down. Note that if someone wants to be notified on every capture, they can use We'll also provide a |
Implemented this and more in #308. Did not implement the |
An important part of node's error handling scheme is that when an exception bubbles up to the top level uncaught (no try/catch, not in a domain, not an `error` event with a listener, etc), the process shuts down to avoid continuing to run in an unexpected and unknown state. If the program was meant to continue running under those conditions, there should have been some code somewhere to handle the exception. This philosophy is documented in various places.

For this reason, adding raven-node to a node program should not add or remove any cases where the program shuts down due to an unhandled exception. In other words, the "when does this process crash" behavior of a program should be unaffected by the presence of raven-node, its instrumentation, handlers, context wraps, etc.
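To make the baseline concrete, here is a small sketch (no raven-node involved) that runs two one-line programs in child processes and compares their exit codes. The `exitCodeOf` helper is a name made up for this illustration; it only relies on `child_process.spawnSync` and `process.execPath` from Node core.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: run a snippet in a fresh node process, return its exit code.
function exitCodeOf(script) {
  const result = spawnSync(process.execPath, ['-e', script], { encoding: 'utf8' });
  return result.status;
}

// Nothing handles the exception: it reaches the top level and node exits non-zero.
const unhandled = exitCodeOf("throw new Error('boom');");

// An uncaughtException listener counts as handling it, so the process keeps
// running and exits normally once there is no more work to do.
const handled = exitCodeOf(
  "process.on('uncaughtException', function () {}); throw new Error('boom');"
);

console.log({ unhandled: unhandled, handled: handled });
```

The goal stated above amounts to saying that adding raven-node to either program should leave these two exit codes as they are.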
This is not currently the case and has not been the case since almost two years ago when raven-node started to catch and capture asynchronous errors in express apps via the `requestHandler` middleware (to associate them with the current request). The recent context/wrap functionality is a generalization of that. raven-node's current usage of domains results in errors that happen inside a raven-node context domain wrapper being captured, but they are then swallowed rather than making it to the top level and shutting down the process:

We can break this down into a few requirements for what we want raven-node to do/capture:
Before we started to capture asynchronous errors in express apps via the requestHandler middleware, we did (1) and (2) but not (3). Now with context/wrap, we do (2) and (3) but not always (1). We'd ultimately like to do all 3, and this issue is meant to propose/discuss the most reasonable way to get there.
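As a rough illustration of the swallowing behavior, the sketch below uses a plain node `domain` (not raven-node itself) with an `error` listener that does nothing but receive the error. This is an assumption about the mechanism based on the description above, not a copy of raven-node's wrapper; `exitCodeOf` is a hypothetical helper.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: run a snippet in a fresh node process, return its exit code.
function exitCodeOf(script) {
  const result = spawnSync(process.execPath, ['-e', script], { encoding: 'utf8' });
  return result.status;
}

// An asynchronous throw that no try/catch can see.
const asyncThrow = "setImmediate(function () { throw new Error('boom'); });";

// Without a domain the error hits the top level and the process exits non-zero.
const withoutDomain = exitCodeOf(asyncThrow);

// Inside a domain with an error listener, the listener receives the error
// (this is where a capture call would go) and the process carries on.
const withDomain = exitCodeOf(
  "var d = require('domain').create();" +
  "d.on('error', function (err) { /* capture would happen here */ });" +
  "d.run(function () { " + asyncThrow + " });"
);

console.log({ withoutDomain: withoutDomain, withDomain: withDomain });
```

The second program is the shape of the problem: the error is seen, but the "does this process crash" behavior has changed.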
A couple additional considerations:
`process.on('uncaughtException', function () { ... })` does, for clean up, making a best effort to serve remaining requests and close existing connections, etc

For a simple example of where we don't do (1), the following snippet does not currently result in a top-level exception:
But the same code minus raven does result in a top-level exception:
Resolving this is not as easy as rethrowing like so:
because any exception which is caught by a domain/uncaughtException handler first goes through `process._fatalException` (source), which dispatches exceptions for handling by either `domain.on('error')` or `process.on('uncaughtException')` and, to avoid crazy loops/error conditions/etc, proceeds straight to shutting down the process if any of those handlers throw (which is what we'd be doing here).

So, proposals:
Two main ideas
Raven-wide fatal exception pipeline
We can basically have our own catch-all "there was a fatal exception, now what?" pipeline where we capture the exception and then call our catch-all-cleanup-and-exit callback sort of like a `process.on('uncaughtException')` handler. That callback can just be the callback passed to `install()`, and if none is passed it defaults to something like `function (err) { console.error(err.stack); process.exit(1); }` so the process exits, just like if there's no `uncaughtException` handler. Then we just have to drop into that capture+callback pipeline everywhere that we think an exception should otherwise bubble to the top level, namely the following flows:
- `uncaughtException` handler catches an error, captures, invokes callback
- `unhandledRejection` handler (based on config)

The union of these cases is effectively “times when an exception would hit the top level and make node exit”, so we should meet requirement (1) and will exit the process whenever it would exit if we were absent. If a user would want to attach their own `uncaughtException` handler, they instead provide it to us in `install`.

Downsides:
`process.exit()`ing `process.on('uncaughtException')` handlers

On the other hand, this would:
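The capture-then-callback pipeline described in this section could look roughly like the sketch below. `createFatalPipeline`, `defaultOnFatal`, and the `capture(err, done)` signature are all hypothetical names for illustration, not raven-node's actual API; `capture` stands in for whatever sends the error to Sentry and calls `done` when finished.

```javascript
// Default callback: behave like node does when nothing handles the error.
function defaultOnFatal(err) {
  console.error(err && err.stack ? err.stack : err);
  process.exit(1);
}

// Build the "there was a fatal exception, now what?" pipeline.
// capture(err, done): hypothetical stand-in for reporting to Sentry.
// onFatal(err): the callback a user would pass to install(); optional.
function createFatalPipeline(capture, onFatal) {
  const callback = onFatal || defaultOnFatal;
  return function handleFatal(err) {
    // Report first, and only hand off to the callback once the report is done,
    // so the process cannot exit before the error has been sent.
    capture(err, function onCaptured() {
      callback(err);
    });
  };
}

// Each flow that would otherwise bubble to the top level drops into the same
// pipeline, for example (left commented out so the sketch has no side effects):
//   const handleFatal = createFatalPipeline(sendToSentry, userCallback);
//   process.on('uncaughtException', handleFatal);
```

Routing every flow through one function is what gives control over timing: the exit decision lives in a single callback that runs after capture completes.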
Monkeypatch `process._fatalException` internal method
This is where all potentially-fatal exceptions go and follow these steps:
- `uncaughtException` from `process` and check if a handler caught it

Hooking into this would let us guarantee capturing any potentially-fatal exception without caring where it came from, and would let us avoid worrying about all individual cases where we use domains. It'd be sort of like a souped up `process.on('uncaughtException')` handler that also deals with domain errors. At first it seemed like we'd have the downside of also capturing domain-level errors from any other uses of domain error handling, but that should be avoidable by checking if the active domain was created by Raven.

Downside: timing is sort of weird to say "hey we're in the middle of dispatching an exception to either process/domain handler, but let's wait and capture it first before we do that"; if we dispatch first and capture in background, I'm not sure if we can guarantee the capture finishes should the process end up shutting down from the exception being unhandled.
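A minimal sketch of the monkeypatch idea, written against a generic `target` object so it can be tried without touching the real `process`. Whether `process._fatalException` exists and can be patched this way depends on the node version, so treat that part as an assumption. `patchFatalException`, `RAVEN_DOMAIN`, `getActiveDomain`, and `capture` are hypothetical names.

```javascript
// Marker that a (hypothetical) Raven-side wrapper would set on domains it creates.
const RAVEN_DOMAIN = Symbol('ravenDomain');

// Wrap target._fatalException so we see every potentially-fatal error first.
// getActiveDomain(): returns the currently active domain, or null/undefined.
// capture(err): hypothetical fire-and-forget report to Sentry.
function patchFatalException(target, getActiveDomain, capture) {
  const original = target._fatalException;
  target._fatalException = function patchedFatalException(err) {
    const active = getActiveDomain();
    // Capture top-level errors and errors in Raven-created domains, but leave
    // errors that belong to somebody else's domain alone.
    if (!active || active[RAVEN_DOMAIN]) {
      capture(err);
    }
    // Then let the normal dispatch to domain/process handlers proceed.
    return original.apply(this, arguments);
  };
  // Return a function that restores the original method.
  return function unpatch() {
    target._fatalException = original;
  };
}
```

Note that this version captures in the background and dispatches immediately, which is exactly the timing concern raised above: nothing here guarantees the capture finishes before the process exits.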
Note: I came up with these two ideas mostly independently, but I think the optimal result might be some sort of happy medium between them. It seems like maybe some downsides of the first idea could be polished away by some `_fatalException` monkeypatching, but I haven't considered it in much detail yet.

Two probably-not-ideal alternate possibilities
Monkeypatch `emit()` on `domain` and `process`
This would essentially be a step below patching `_fatalException`; these are the two key calls that `_fatalException` might make. This would give us easy control over what errors we capture, but the downside is that these two calls by `_fatalException` are made synchronously. I don't think we'd have any way at that point to guarantee capturing an exception (i.e. completing the request to the Sentry server) before process shutdown. Seems we might be able to pull it off by also monkeypatching `process.exit()`, but that's pretty sketchy.

Just rethrow from context domain wrap error handler, but asynchronously
This works pretty nicely in a basic example and isn't hard to implement (just `setImmediate(function () { throw err; })`), but it gives us a one-tick gap between the domain-level catch and the rethrowing to top level which feels weird. I don't think I trust it to not behave weirdly in some situation, even using `setImmediate`, but I haven't thought about it enough to be certain either way.

Open to thoughts for/against any of these ideas, alternative possibilities, clarification q's, etc.
/cc @MaxBittker @benvinegar
also /cc @Qard - while researching `_fatalException` I came across some node irc logs with your gist here, seems you were investigating similar things a while back; curious if you have any thoughts on what seems reasonable here