Skip to content
This repository has been archived by the owner on Oct 23, 2023. It is now read-only.

Properly surfacing errors to shut down process when appropriate #257

Closed
LewisJEllis opened this issue Jan 19, 2017 · 2 comments
Closed

Properly surfacing errors to shut down process when appropriate #257

LewisJEllis opened this issue Jan 19, 2017 · 2 comments

Comments

@LewisJEllis
Copy link
Contributor

LewisJEllis commented Jan 19, 2017

An important part of node's error handling scheme is that when an exception bubbles up to the top level uncaught (no try/catch, not in a domain, not an error event with a listener, etc), the process shuts down to avoid continuing to run in an unexpected and unknown state. If the program was meant to continue running under those conditions, there should have been some code somewhere to handle the exception. This philosophy is documented in various places.

For this reason, adding raven-node to a node program should not add or remove any cases where the program shuts down due to an unhandled exception. In other words, the "when does this process crash" behavior of a program should be unaffected by the presence of raven-node, its instrumentation, handlers, context wraps, etc.

This is not currently the case and has not been the case since almost two years ago when raven-node started to catch and capture asynchronous errors in express apps via the requestHandler middleware (to associate them with the current request). The recent context/wrap functionality is a generalization of that. raven-node's current usage of domains results in errors that happen inside a raven-node context domain wrapper being captured, but they are then swallowed rather than making it to the top level and shutting down the process:

wrap: function (options, func, onErr) {
  ...
  var wrapDomain = domain.create();
  wrapDomain.sentryContext = options;

  var self = this;
  if (typeof onErr !== 'function') {
    onErr = function (err) {
      // note we just capture here, we don't rethrow or otherwise propagate
      self.captureException(err);
    };
  }

  wrapDomain.on('error', onErr);
  var wrapped = wrapDomain.bind(func);

We can break this down into a few requirements for what we want raven-node to do/capture:

  1. We want to capture any unhandled exceptions that happen, without then swallowing or ignoring them in a way that would not happen without raven present (preserve existing exit cases)
  2. We want to capture some recoverable errors (like 500s in an express app), but we don't want to exit the process when it otherwise wouldn't (don't add new exit cases)
  3. We want to associate any available context data like the current request, breadcrumbs, etc with captured exceptions when they happen, and avoid associating the wrong context data with the wrong exception (context collection)

Before we started to capture asynchronous errors in express apps via the requestHandler middleware, we did (1) and (2) but not (3). Now with context/wrap, we do (2) and (3) but not always (1). We'd ultimately like to do all 3, and this issue is meant to propose/discuss the most reasonable way to get there.

A couple additional considerations:

  • We should allow the user to define their own handling for otherwise fatal exceptions, the same way process.on('uncaughtException', function () { ... }) does, for clean up, making a best effort to serve remaining requests and close existing connections, etc
  • We should continue to work well with Express and other frameworks that have a bit of their own error handling process going on

For a simple example of where we don't do (1), the following snippet does not currently result in a top-level exception:

var Raven = require('raven');
Raven.config(myDsn).install();

process.on('uncaughtException', function () {
  console.log('I do not print');
});

// runs in a wrapper domain, error handler captures but swallows error
Raven.context(function () {
  // captured but does not cause process exit
  throw new Error('boom');
});

But the same code minus raven does result in a top-level exception:

process.on('uncaughtException', function () {
  console.log('I do print');
});

throw new Error('boom');

Resolving this is not as easy as rethrowing like so:

wrap: function (options, func, onErr) {
  ...ƒ
  if (typeof onErr !== 'function') {
    onErr = function (err) {
      self.captureException(err);
      throw err;
    };
  }

because any exception which is caught by a domain/uncaughtException handler first goes through process._fatalException (source), which dispatches exceptions for handling by either domain.on('error') or process.on('uncaughtException') and, to avoid crazy loops/error conditions/etc, proceeds straight to shutting down the process if any of those handlers throw (which is what we'd be doing here).

So, proposals:

Two main ideas

Raven-wide fatal exception pipeline

We can basically have our own catch-all "there was a fatal exception, now what?" pipeline where we capture the exception and then call our catch-all-cleanup-and-exit callback sort of like a process.on(‘uncaughtException’) handler. That callback can just be the callback passed to install(), and if none is passed it defaults to something like function (err) { console.error(err.stack); process.exit(1); } so the process exits, just like if there's no uncaughtException handler. Then we just have to drop into that capture+callback pipeline everywhere that we think an exception should otherwise bubble to the top level, namely the following flows:

  • express middleware domain catches an error, attaches req context, captures, invokes callback
  • raven context function catches an error, attaches context, captures, invokes callback
  • uncaughtException handler catches an error, captures, invokes callback
  • optionally, unhandledRejection handler (based on config)

The union of these cases is effectively “times when an exception would hit the top level and make node exit”, so we should meet requirement (1) and will exit the process whenever it would exit if we were absent. If a user would want to attach their own uncaughtException handler, they instead provide it to us in install.

Downsides:

  • it'd be nice not to be explicitly process.exit()ing
  • a bit heavy on contact points/take-over-everything-ness
  • might not play super well with additional process.on('uncaughtException') handlers
  • asks a decent bit of investment from users: "here trust us with your when-program-blows-up-cleanup-and-exit-callback and learn our specific error handling setup"

On the other hand, this would:

  • unify a few different “we caught something blowing up here, what do we do in this case vs elsewhere" cases to just go down one pipeline
  • give us complete control over timing to ensure exceptions are captured etc
  • potentially allow a simpler user story for users who don't have a deep knowledge of error handling in node, but are willing to trust us ('the callback you pass here will be invoked when we catch a fatal error, use it to clean up and exit, don't worry about it past that, we're catching stuff')
    • could help some less advanced users achieve reasonable error/crash-handling when they might otherwise do it poorly

Monkeypatchprocess._fatalException internal method

This is where all potentially-fatal exceptions go and follow these steps:

  • if active domain, emit from domain and check if a handler caught it
  • else emit uncaughtException from process and check if a handler caught it
  • if not handled by either of above, exit process

Hooking into this would let us guarantee capturing any potentially-fatal exception without caring where it came from, and would let us avoid worrying about all individual cases where we use domains. It'd be sort of like a souped up process.on('uncaughtException') handler that also deals with domain errors. At first it seemed like we'd have the downside of also capturing domain-level errors from any other uses of domain error handling, but that should be avoidable by checking if the active domain was created by Raven.

Downside: timing is sort of weird to say "hey we're in the middle of dispatching an exception to either process/domain handler, but let's wait and capture it first before we do that"; if we dispatch first and capture in background, I'm not sure if we can guarantee the capture finishes should the process end up shutting down from the exception being unhandled.

Note: I came up with these two ideas mostly independently, but I think the optimal result might be some sort of happy medium between them. It seems like maybe some downsides of the first idea could be polished away by some _fatalException monkeypatching, but I haven't considered it in much detail yet.

Two probably-not-ideal alternate possibilities

Monkeypatch emit() on domain and process

This would essentially be a step below patching _fatalException; these are the two key calls that _fatalException might make. This would give us easy control over what errors we capture, but downside is these two calls by fatalException are made synchronously. I don't think we'd have any way at that point to guarantee capturing an exception (i.e. completing request to sentry server) before process shutdown. Seems we might be able to pull it off by also monkeypatching process.exit() but that's pretty sketchy.

Just rethrow from context domain wrap error handler, but asynchronously

This works pretty nicely in a basic example and isn't hard to implement (just setImmediate(function () { throw err; }), but it gives us a one-tick gap between the domain-level catch and the rethrowing to top level which feels weird. I don't think I trust it to not behave weirdly in some situation, even using setImmediate, but I haven't thought about it enough to be certain either way.

Open to thoughts for/against any of these ideas, alternative possibilities, clarification q's, etc.

/cc @MaxBittker @benvinegar

also /cc @Qard - while researching _fatalException I came across some node irc logs with your gist here, seems you were investigating similar things a while back; curious if you have any thoughts on what seems reasonable here

@LewisJEllis
Copy link
Contributor Author

LewisJEllis commented Jan 26, 2017

Decided on this a couple days ago but will write it down here:

We're going to make the callback passed to Raven.install be thought of as the onFatalError callback. It will be invoked:

  • after uncaughtException handler captures and reports an exception
  • after one of our domain context error handlers captures and reports an exception
    • and thus also after express requestHandler domain context error handlers capture+report
  • not after just express errorHandler middleware captures + reports an error
  • not after unhandled rejections: see Proper handling of unhandledRejection event #262 for more details
  • not just for any old (potentially manual) capture - only for captures that would, in the absence of Raven, be fatal and result in the process shutting down

This will be a change in behavior, but for the better in making us consistent with when the process would exit in the absence of Raven. This should make it easy for users to do the right thing in the common case and gives us control over timing to ensure errors are reported to Sentry before the process shuts down. Note that if someone wants to be notified on every capture, they can use Raven.on('logged').

We'll also provide a Raven.shutdown(afterSentCb) method which will ensure any in-progress captures finish and then invoke its callback. This won't be used by or necessary for most users, but for users needing advanced control or with their own error/shutdown handling, it can be useful. The capabilities of these two callbacks combined should provide sufficient flexibility when necessary. This should also make interop with other instrumentation-heavy things like NewRelic work okay.

@LewisJEllis
Copy link
Contributor Author

Implemented this and more in #308. Did not implement the Raven.shutdown idea mentioned above because I realized it could be a bit tricky to make it as reliable as I'd like it to be and the use case for it is potentially theoretical; gonna wait for the use case/motivation to become more clear.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant