Better error handling in daemon + gateway #1325

victorb · 2018-04-24T12:46:02Z

Today, if there is a error in a stream or any other error within the daemon or the gateway, the entire process crashes. While not a problem for testing, it's a problem when deploying and trying to run a long-term daemon (for example, for the js-ipfs daemons).

I raised this before in #1243

Basically, we should be able to gracefully handle errors down the stack without having the daemon and gateway crash.

Current issue is the following error that happens every ~10 minutes:

/usr/src/app/node_modules/pull-pair/index.js:11
      throw new Error('already piped')
      ^

Error: already piped
    at sink (/usr/src/app/node_modules/pull-pair/index.js:11:13)
    at consume (/usr/src/app/node_modules/pull-defer/sink.js:7:17)
    at consume (/usr/src/app/node_modules/pull-defer/sink.js:7:17)
    at Connection.consume (/usr/src/app/node_modules/pull-defer/sink.js:7:17)
    at pull (/usr/src/app/node_modules/pull-stream/pull.js:43:9)
    at Dialer.handle (/usr/src/app/node_modules/multistream-select/src/dialer/index.js:47:5)
    at attemptMuxerUpgrade (/usr/src/app/node_modules/libp2p-switch/src/dial.js:164:10)
    at gotWarmedUpConn (/usr/src/app/node_modules/libp2p-switch/src/dial.js:55:7)
    at Switch.dial (/usr/src/app/node_modules/libp2p-switch/src/dial.js:41:9)
    at _getPeerInfo (/usr/src/app/node_modules/libp2p/src/index.js:238:19)

I'm not saying that we should (or should not) fix this particular error, but rather when this error (and similar) happens, the daemon and gateway should not die but continue running.

The text was updated successfully, but these errors were encountered:

victorb · 2018-04-25T15:07:57Z

@fsdiogo as you're the current OKR owner for this, I've assigned this to you

victorb · 2018-04-25T15:08:31Z

Another error that happens directly after starting daemon right now:

Initializing daemon...
Using wrtc for webrtc support
/usr/src/app/src/cli/commands/daemon.js:40
        throw err
        ^

Error: websocket error
    at WS.Transport.onError (/usr/src/app/node_modules/engine.io-client/lib/transport.js:64:13)
    at WebSocket.ws.onerror (/usr/src/app/node_modules/engine.io-client/lib/transports/websocket.js:150:10)
    at WebSocket.onError (/usr/src/app/node_modules/engine.io-client/node_modules/ws/lib/EventTarget.js:109:16)
    at WebSocket.emit (events.js:160:13)
    at WebSocket.finalize (/usr/src/app/node_modules/engine.io-client/node_modules/ws/lib/WebSocket.js:182:41)
    at ClientRequest._req.on (/usr/src/app/node_modules/engine.io-client/node_modules/ws/lib/WebSocket.js:653:12)
    at ClientRequest.emit (events.js:160:13)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:539:21)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:112:17)
    at TLSSocket.socketOnData (_http_client.js:439:20)

fsdiogo · 2018-05-10T14:02:10Z

@victorbjelkholm I can't seem to be able to reproduce the error in your last comment, can you share the steps or configs you have that produces the error?

victorb · 2018-05-14T15:39:39Z

@fsdiogo this issue (#1325) is not about a specific issue but rather having having a error handler for uncaught exceptions from within the daemon. Currently, any error that happens down the stack (in js-bitswap for example) crashes the daemon, which is why it's so unstable currently. We need to make it so the daemon won't fully be crashed and needing a restart just because a error in one stream or for one operation.

fsdiogo · 2018-05-14T16:31:11Z

Alright, so it's best to start off on the daemon itself, i.e. try to reproduce an error and make it so that the daemon doesn't crash?

victorb · 2018-05-14T16:34:34Z

@fsdiogo I think just adding if (Math.random() > 0.5) throw new Error('Test error') in various levels in the stack and testing manually will let you discover one or two places where we can catch the errors appropriately.

fsdiogo · 2018-05-16T15:22:22Z

I've been thinking about a better approach to this problem. Instead of a band aid approach, we should really understand how we're dealing with errors and have a robust solution to propagate them.

We are using async throughout the project, but check this:

/*
  With async the stack trace gets lost: it only tells the line
  where we throw the error, but it came from one of the calls
  in the series, we don't have a way to know from which one.
 */

const async = require('async')

async.series([
  (cb) => cb(null, 'async x'),
  (cb) => cb('async error'),
  (cb) => cb(null, 'async xxx'),
], (err, res) => {
  if (err) {
	  throw new Error(err)
	}
})

/*
> node async.js
  Users/diogo/async-promises/async.js:15
  	throw new Error(err);
  	^

  Error: async error
    at async.series (/Users/diogo/async-promises/async.js:15:10)
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:3874:9
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:473:16
    at iterateeCallback (/Users/diogo/async-promises/node_modules/async/dist/async.js:976:17)
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:958:16
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:3871:13
    at async.series (/Users/diogo/async-promises/async.js:11:11)
    at /Users/diogo/async-promises/node_modules/async/dist/async.js:3866:24
    at replenish (/Users/diogo/async-promises/node_modules/async/dist/async.js:998:17)
    at iterateeCallback (/Users/diogo/async-promises/node_modules/async/dist/async.js:983:17)
*/

A solution for this is Promises:

/*
  With promises the stack trace is intact, being possible
  to observe all the lines and modules that lead to the error.
 */

new Promise((resolve, reject) => {
  setTimeout(() => resolve('promise x'), 100)
})
.then((res) => res + 'x')
.then((res) => randomFn())
.then((res) => res + 'xxx')
.catch((err) => console.log(err))

const randomFn = () => {
  throw new Error('randomFn error')
}

/*
> node promises.js
  Error: randomFn error
    at randomFn (/Users/diogo/async-promises/promises.js:14:9)
    at Promise.then.then (/Users/diogo/async-promises/promises.js:9:16)
    at <anonymous>
*/

Promises will vastly improve our debug capability and therefore help us with the handling of future errors. Moreover, we standardise the way we handle errors and improve the codebase by handling errors that are being lost along the way right now.

PR #1335 by @hugomrdias is a good example of that.

I had already opened the issue #1311 to start using promises, but I think this KR is the perfect case of why we should really do that. It's not a matter of callbacks vs promises, but of what works vs what doesn't.

I've been thinking of possible solutions but this is the best way that I've found, I've talked with @vasco-santos and @hugomrdias and they agree with this approach.

What do you guys think? @victorbjelkholm @diasdavid

fsdiogo · 2018-05-16T15:29:30Z

@dryajov want to chime in as you've dealt with this too?

dryajov · 2018-05-16T17:06:20Z

@fsdiogo so the issue here is not really with async or Promise. The resason the stack is lost in your async example is because you are instantiating the Error object inside the series done callback:

async.series([
  (cb) => cb(null, 'async x'),
  (cb) => cb('async error'),
  (cb) => cb(null, 'async xxx'),
], (err, res) => {
  if (err) {
	  throw new Error(err) // --> error instantiated here
	}
})

The Error object will capture the stack at the current place of instantiation, so if we change you're example to:

const async = require('async')

async.series([
  (cb) => cb(null, 'async x'),
  (cb) => cb(new Error('async error')),  // --> error instantiated here
  (cb) => cb(null, 'async xxx'),
], (err, res) => {
  if (err) {
    throw err
  }
})

We get the correct stacktrace:

/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/test.js:11
    throw err
    ^

Error: async error
    at async.series (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/test.js:7:14)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:3866:24
    at replenish (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:998:17)
    at iterateeCallback (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:983:17)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:958:16
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:3871:13
    at async.series (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/test.js:6:11)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:3866:24
    at replenish (/Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:998:17)
    at /Users/dryajov/personal/projects/ipfs/ipfs/js-ipfsd-ctl/node_modules/async/dist/async.js:1002:9

Note that in you're promise example you do instantiate the error at the place of rejection, so the stack comes up correctly as well:

/*
  With promises the stack trace is intact, being possible
  to observe all the lines and modules that lead to the error.
 */

new Promise((resolve, reject) => {
  setTimeout(() => resolve('promise x'), 100)
})
.then((res) => res + 'x')
.then((res) => randomFn())
.then((res) => res + 'xxx')
.catch((err) => console.log(err))

const randomFn = () => {
  throw new Error('randomFn error')
}

/*
> node promises.js
  Error: randomFn error
    at randomFn (/Users/diogo/async-promises/promises.js:14:9)
    at Promise.then.then (/Users/diogo/async-promises/promises.js:9:16)
    at <anonymous>
*/

hugomrdias · 2018-05-16T17:35:38Z

@dryajov sure ur correct, but the problem isn't where you create the errors. It's propagation, with async, errors instances are just passed around (or teleported as i say it). With promises, errors are automatically re throwed and stack-traces follow the code path and with a lot less noise from async internals.

Not having full stack traces is really painful because we have to manually search the code base, commenting or whatever until we find the line we need. In contrast with complete stacks we can cmd+click BOOM editor jumped to the line we need perfect developer experience and full productivity.

dryajov · 2018-05-16T17:39:49Z

Aside from my comment above, this is still obviously a legitimate issue overall. Promises do help with that somewhat, but they are IMO not the final solution for this particular issue and I would argue that bringing them up here as an end all solution for this problem, might end up doing more harm than good.

I though about it for some time and I think some of the issues we have boil down to mixing error handling styles. We rely heavily on callback style async constructs in our code base, which occasionally get mixed with try/catch somewhere deep in the codebase or a third party module, add EventEmmiter error event types and Promises on top of that and we get a pretty wild compound of things to that can and do go wrong.

IMO, in order to handle errors correctly we need to:

find all the places where we're not handling errors appropriately and fix them
Standarise error handling across our code base
- look for throws and return them as callback errors where possible
- ditto for promises
EventEmmiters errors should also be appropriately handled - non fatal (error) type errors should be emited as warnings/user errors, since error is a special case and throws if there are no listeners attached
Finally, have a global error handler that catches unhandled errors and terminates the process gracefuly

hugomrdias · 2018-05-16T17:50:49Z

This stuff below also is tricky because callback might not be called, in some places we even have setImmediate(self.emit('error', err)) to solve that.

  const done = (err, res) => {
      if (err) {
        self.emit('error', err)
        return callback(err)
      }

      self.state.initialized()
      self.emit('init')
      callback(null, res)
    }

What do you think about removing event emitters ? moving stuff like boot() inside ipfs constructor to a instance method or a static factory method where we can return a promise or use callbacks.

dryajov · 2018-05-16T18:06:02Z

Ah, yeah - setImmediate(self.emit('error', err)) is mostly there because emit is sync, and the handler might actually never return which breaks the flow, hence the setImmediate to force it async. I think the same thing would happen with promises too?

Yeah, I think removing EventEmiter can be a good thing, we use it in js-ipfs (js-libp2p might still be apropriate?) primarily because we want to propagate errors during instantiation, so factory method seems like the right way to go.

hugomrdias · 2018-05-16T18:13:16Z

Yes it's true for promises, that's why I mention removing event emitters. IMO this should be one of the first things to do then all the stuff you said and finally add promises to help with stacks, debugging , back track cancelables, retry/timeouts etc

fsdiogo · 2018-05-17T10:23:15Z

Thanks for the input @dryajov and @hugomrdias!

Find where we're not handling errors appropriately and fix them
Standardise error handling across our code base
- look for throws and return them as callback errors where possible
- same for promises
Have a global error handler that catches unhandled errors and terminates the process gracefuly
Remove EventEmitter and move boot to a factory method (requires a breaking change)

Mr0grog · 2018-05-17T19:41:22Z

So, I realize that removing event emitting can simplify things (no argument from me that any critical error during startup should translate to an error in the callback/rejection in a promise), but we should keep in mind that the events are part of the public API. Various users may be relying on them. They at least deserve notice of the removal well before it happens, and ideally a release or two where the events still work but log a warning when used.

(That said, event emitters should all be wrapped in setImmediate, nextTick, or try/catch to ensure that a listener cannot break the emitting code.)

dryajov · 2018-05-17T20:55:58Z

@Mr0grog agree, this change should be treated as a breaking change and possibly a major version bump (0.3.x?).

daviddias · 2018-05-18T08:37:16Z

@fsdiogo before refactoring the code, make sure to have tests that check for the situations where the daemon errors out today. Try things like opening a connection and write trash at different levels of the dialing (transport -> secio -> stream muxing -> protocols).

fsdiogo · 2018-05-18T14:48:56Z

Yeah @Mr0grog @dryajov, it will be a breaking change and apps using ipfs will probably have to be refactored, so that's why in its first iteration we should offer both versions. I've opened a small WIP PR with that: #1359.

But as @diasdavid pointed out, I shall start with a different task. I've updated #1325 (comment) with the tasks order.

alanshaw · 2018-05-25T11:21:36Z

I have a couple more suggestions for your consideration:

When dealing with Node.js streams, use pump whenever possible to ensure errors are caught and handled
Add context to errors and allow our tests to assert on error messages we control in our modules by using explain-error

To address @victorbjelkholm's original request:

In Node.js the best practice on uncaught exceptions is to crash the process. We need to track down and fix any issues that cause the process to crash. In Node.js you can't meaningfully reason about the state of the system in the case of an unhandled exception so you need to restart the process to avoid leaking memory, holding open connections unnecessarily, corrupting shared state etc. etc.

It's worth having a read of https://www.joyent.com/node-js/production/design/errors - relevant section:

The best way to recover from programmer errors is to crash immediately.

The only downside to crashing on programmer errors is that connected clients may be temporarily disrupted, but remember:

By definition, these errors are always bugs. We're not talking about legitimate system or network failures, but actual bugs in the program. They should be rare in production, and the top priority has to be to debug and fix them.

For all the cases described above (and many more), the requests in flight are not necessarily going to complete successfully anyway. They may complete successfully, they may crash the server again, they may complete incorrectly in obvious ways, or they may complete wrongly in very subtle ways that are very hard to debug.

In a reliable distributed system, clients must be able to deal with server failure by reconnecting and retrying requests. Network and system failure are a reality, whether or not the Node.js program itself is allowed to crash.

If your production program is crashing so often that these disconnections are a problem, then the real problem is that the server is so buggy, not that it crashes in the case of a bug.

If disconnecting clients is a frequently problem because a server crashes so often, you should focus on the bugs that cause the service to crash — and make those exceptional — rather than trying to avoid crashing in cases where the code is obviously wrong

alanshaw · 2018-05-25T15:25:38Z

I'm not clear on the benefits of removing EventEmitter. It's used for boot but IPFS also emits init start and stop events in addition to ready and error (note that error is used for startup failures as well as boot failures). People are probably using all of those.

To avoid crashing the process on an error event you can check and emit the error event only if there are listeners for it.

I get the point about handlers throwing and not allowing the callback to be called, but they are unhandled exceptions - you don't really want to catch them (what would you do? - it's not your code that err'd) and if you put them on the next tick then it's just going to be an unhandled exception there too. Our code should be doing any cleanup and/or running critical code paths before emit is called.

BTW I'm +1 on all the other points that @fsdiogo has listed.

Mr0grog · 2018-05-25T15:44:41Z

When dealing with Node.js streams, use pump whenever possible

👍 pump and pumpify are both excellent and hugely useful little tools for streams. It’s a lot of work to make sure you do handle errors right without them.

In Node.js you can't meaningfully reason about the state of the system in the case of an unhandled exception so you need to restart the process to avoid leaking memory, holding open connections unnecessarily, corrupting shared state etc. etc.

I guess I read this issue more as “there are a lot of errors we could be catching and handling so as not to leave ourselves in a bad state, so we don’t need to crash the daemon as often.” Because otherwise I totally agree that simply ignoring any old error is a recipe for really confusing and possibly more dangerous disasters, like data loss, sometime later in the process’s lifetime.

I get the point about handlers throwing and not allowing the callback to be called, but they are unhandled exceptions - you don't really want to catch them (what would you do? - it's not your code that err'd)

In any other large system I’ve worked on, we’ve generally followed the logic that, if you catch them, you do you clean-up and then re-throw (or just use a finally without a catch). The point isn’t to stop the error; it’s to stop an external error from breaking your internal state. (e.g. you want to be as recoverable as possible if the external error is caught and recovered in some other way, like with a global error handler.)

and if you put them on the next tick then it's just going to be an unhandled exception there too.

Exactly! Same as above; that’s the point. It’s not your error, so you shouldn’t be preventing/stopping it. You should only be preventing it from breaking your internal state because, again, it’s not your error.

Our code should be doing any cleanup and/or running critical code paths before emit is called.

Wherever possible, 👍 ! It’s just not always reasonable/possible. Sometimes it’s important that the ordering of the event come before some other code (e.g. before[action]/will[action] events, which I know we don’t have here, but in the abstract, it still applies).

dryajov · 2018-05-25T16:26:09Z

👍 pump and pumpify are both excellent and hugely useful little tools for streams. It’s a lot of work to make sure you do handle errors right without them.

I would actually abstain of using node stream at all and instead use pull-stream (converting the node stream to pull stream with stream-to-pull-stream), the parts of our codebase that is still using node streams are most likely legacy leftovers that need to be reworked anyways.

Mr0grog · 2018-05-25T18:55:03Z

I would actually abstain of using node stream at all and instead use pull-stream

Yeah, you’re probably right for the case of all the stuff here. I think pump/pumpify are good light-weight tools for making built-in streams sane, while pull-stream is a heavier-weight replacement for built-in streams that already solves most of those problems :)

dryajov · 2018-05-25T20:53:00Z

Also, since we made the decision of using pull-stream exclusively a while back. Plus, although full featured, pull-stream are still light weight compared to node streams in general.

alanshaw · 2018-05-29T14:13:20Z

@dryajov point taken! The sentiment was to ensure stream errors do not cause the daemon to crash so as long as we're ensuring that I'm happy 😄

daviddias · 2018-05-30T08:49:19Z

@fsdiogo during last js-ipfs weekly, there were some question with regards on how to approach this. Do you feel it would be good to sync?

fsdiogo · 2018-05-30T09:26:51Z

Well, I was going to follow these steps, if you feel that's the way to go, awesome! If not, we should sync 🙂

daviddias · 2018-05-30T09:34:56Z

What is the approach to achieve this:

fsdiogo · 2018-05-30T10:27:21Z

Fixing the emitters as per Better error handling in daemon + gateway #1325 (comment).
Remove throws and return them as new errors with appropriate codes and messages a la Node style

daviddias · 2018-05-30T10:39:12Z

Ok, I see, so that is where we diverge. You should first of all create the tests that cause the crashes.

fsdiogo · 2018-05-30T10:54:29Z

You want that approach so we don't start changing the code blindly, right? I get it.

I just think that way the process will be a lot slower, no?

daviddias · 2018-05-30T11:31:11Z

@fsdiogo we are not optimizing for shipping fast here, the target should be correctness and ensuring that the same problems don't come back to haunt us.

dryajov · 2018-05-30T17:01:44Z

Agree, we should have a test for every failure we run into 👍

daviddias · 2018-06-07T23:59:27Z

@fsdiogo a good source of problems to solve is the list of bugs -- https://github.com/ipfs/js-ipfs/labels/bug --. Creating tests that reproduce those will help you nail down where the problems are coming from.

@alanshaw, @pgte and @satazor have a vast amount of experience in making large Node.js services robust. They can give a ton of tips on how to move forward the best way possible, reach out!

fsdiogo · 2018-06-08T10:24:18Z

Alright, thanks @diasdavid! I'll reach out then!

daviddias · 2018-07-29T09:47:21Z

No more "already piped" with 0.31 - #1458

daviddias · 2019-10-01T13:59:27Z

Duplicate/Same as #1406. Let's continue there

victorb assigned fsdiogo Apr 25, 2018

Mr0grog mentioned this issue May 4, 2018

fix: error event propagation ipfs/js-ipfsd-ctl#233

Merged

fsdiogo mentioned this issue May 14, 2018

Deploying js-ipfs to the Galaxy #804

Closed

fsdiogo mentioned this issue May 18, 2018

[WIP] remove boot from constructor #1359

Closed

fsdiogo added the exp/wizard Extensive knowledge (implications, ramifications) required label May 29, 2018

fsdiogo added help wanted Seeking public contribution on this issue awesome endeavour labels May 29, 2018

daviddias added the status/ready Ready to be worked label May 30, 2018

daviddias mentioned this issue Jun 7, 2018

Fuzz testing #1295

Closed

Mr0grog mentioned this issue Jun 11, 2018

Websocket connection errors on startup should be clearer and not crash IPFS #1326

Closed

Mr0grog mentioned this issue Jun 18, 2018

Using IPFS with electron (or muon) ipfs/notes#256

Open

daviddias removed the help wanted Seeking public contribution on this issue label Jun 20, 2018

fsdiogo mentioned this issue Jun 20, 2018

Taking control of Error Handling, reducing js-ipfs Daemon crashes #1406

Closed

fsdiogo removed their assignment Jul 4, 2018

daviddias closed this as completed Oct 1, 2019

Better error handling in daemon + gateway #1325

Better error handling in daemon + gateway #1325

Comments

victorb commented Apr 24, 2018 • edited Loading

victorb commented Apr 25, 2018

victorb commented Apr 25, 2018

fsdiogo commented May 10, 2018

victorb commented May 14, 2018

fsdiogo commented May 14, 2018

victorb commented May 14, 2018

fsdiogo commented May 16, 2018

fsdiogo commented May 16, 2018 • edited Loading

dryajov commented May 16, 2018

hugomrdias commented May 16, 2018

dryajov commented May 16, 2018

hugomrdias commented May 16, 2018 • edited Loading

dryajov commented May 16, 2018

hugomrdias commented May 16, 2018

fsdiogo commented May 17, 2018 • edited Loading

Mr0grog commented May 17, 2018

dryajov commented May 17, 2018

daviddias commented May 18, 2018

fsdiogo commented May 18, 2018 • edited Loading

alanshaw commented May 25, 2018

alanshaw commented May 25, 2018

Mr0grog commented May 25, 2018 • edited Loading

dryajov commented May 25, 2018

Mr0grog commented May 25, 2018

dryajov commented May 25, 2018

alanshaw commented May 29, 2018

daviddias commented May 30, 2018

fsdiogo commented May 30, 2018

daviddias commented May 30, 2018

fsdiogo commented May 30, 2018

daviddias commented May 30, 2018

fsdiogo commented May 30, 2018

daviddias commented May 30, 2018

dryajov commented May 30, 2018

daviddias commented Jun 7, 2018

fsdiogo commented Jun 8, 2018

daviddias commented Jul 29, 2018

daviddias commented Oct 1, 2019

victorb commented Apr 24, 2018 •

edited

Loading

fsdiogo commented May 16, 2018 •

edited

Loading

hugomrdias commented May 16, 2018 •

edited

Loading

fsdiogo commented May 17, 2018 •

edited

Loading

fsdiogo commented May 18, 2018 •

edited

Loading

Mr0grog commented May 25, 2018 •

edited

Loading