
fix: catch uncaught exceptions & gc handles request aborts #102

Merged: 3 commits into main from feat/signal-handling, Mar 29, 2024

Conversation

SgtPooki (Member)

Title

fix: catch uncaught exceptions & gc handles request aborts

Description

A few changes here from #18 (comment):

  1. catch uncaughtException errors and handle them if allowlisted
  2. catch unhandledRejection errors and handle them if allowlisted
  3. handle empty strings in FILE_DATASTORE_PATH & FILE_BLOCKSTORE_PATH
  4. pull out the logic for creating a request-aware signal (see the sketch after this list)
  5. helia.gc is given a request-aware signal
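
For items 4 and 5, a minimal sketch of what a request-aware signal can look like. This assumes Fastify types; the helper name, the timeout parameter, and the body are illustrative rather than the PR's actual code:

import type { FastifyRequest } from 'fastify'

// illustrative helper: returns a signal that aborts when the client
// disconnects, or when an optional timeout elapses
function getRequestAwareSignal (request: FastifyRequest, timeout?: number): AbortSignal {
  const controller = new AbortController()

  // the raw Node.js request emits 'close' when the client goes away
  request.raw.on('close', () => { controller.abort() })

  if (timeout != null) {
    // a bare timer like this should be cleared once the work finishes;
    // see the review discussion further down
    setTimeout(() => { controller.abort() }, timeout)
  }

  return controller.signal
}

Helia's gc accepts standard abort options, so a call shaped like await helia.gc({ signal: getRequestAwareSignal(request, 20_000) }) lets garbage collection stop when the request that triggered it is aborted.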

Notes & open questions

A lot of the fixes here come from discoveries made while investigating #18.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation if necessary (this includes comments as well)
  • I have added tests that prove my fix is effective or that my feature works

src/helia-server.ts (outdated review thread)
SgtPooki merged commit e70742f into main on Mar 29, 2024. 4 checks passed.

SgtPooki deleted the feat/signal-handling branch on Mar 29, 2024 at 22:45.
const uncaughtHandler = (error: any): void => {
  log.error('Uncaught Exception:', error)
  if (ALLOW_UNHANDLED_ERROR_RECOVERY && (RECOVERABLE_ERRORS === 'all' || RECOVERABLE_ERRORS.includes(error?.code) || RECOVERABLE_ERRORS.includes(error?.name))) {
    log.trace('Ignoring error')
achingbrain (Member)

Unhandled exceptions inherently mean that an application is in an undefined state; see the warning in the Node.js docs for 'uncaughtException'.

SgtPooki (Member Author)

I understand, but we need some kind of error recovery instead of allowing things to just die like they were in Tiros. We could probably default this to FALSE and add a warning in the readme. Tiros needs this, and probably some restarting of the server too (to follow best practices).

> Unhandled exceptions inherently mean that an application is in an undefined state.

I started out listening only for ERR_STREAM_PREMATURE_CLOSE, which we know is a recoverable state, so "in an undefined state" is not true in that instance.

This change allows us to recover from anything, which could cause problems in the future: some unrecoverable error in libp2p/helia/fastify/helia-server.ts could occur, and this would turn into an infinite loop of "on error resume next" that we don't find out about until money has been eaten up running a dead service.

However, we still need to keep this server from dying in instances where we know we can safely recover, and unblocking Tiros was foremost on my mind.

achingbrain (Member)

The point of the linked warning is that you can't tell whether you can safely recover, so the only safe thing to do is exit the process and restart.

If we have unhandled exceptions being thrown, these are bugs that should be fixed.
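
A minimal sketch of the exit-and-restart pattern being advocated, assuming a process supervisor (systemd, Docker, pm2, etc.) handles the restart:

// treat every uncaught exception as fatal: log it, then exit non-zero
// so the supervisor can bring the process back up in a known-clean state
process.on('uncaughtException', (err) => {
  console.error('Uncaught exception, exiting:', err)
  process.exit(1)
})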

SgtPooki (Member Author), Apr 4, 2024

I agree they're bugs that should be fixed, but should we allow a server to die given underlying library bugs if they're recoverable? Given the expectation of helia-http-server, I don't think so. It's supposed to fetch content using helia/libp2p and return a result. If those die when fetching, we should certainly return a 500 error instead, and recover, right?

BTW, the only other place I saw ERR_STREAM_PREMATURE_CLOSE was in https://github.com/ChainSafe/js-libp2p-yamux, which is listening for those errors, so I don't think the uncaught exception is coming from there; it's likely coming from somewhere else in the libp2p stack.

Edit: or the ERR_STREAM_PREMATURE_CLOSE error is coming from the fastify request/response.

SgtPooki (Member Author)

I just realized we should add a listener on the request and response streams for ERR_STREAM_PREMATURE_CLOSE...
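
A sketch of what such listeners might look like; everything beyond the error-code check is an assumption, since the comment only states the idea:

import type { FastifyReply, FastifyRequest } from 'fastify'

// attach 'error' listeners to the raw req/resp streams so a client
// disconnecting mid-stream does not surface as an uncaught exception
function ignorePrematureClose (request: FastifyRequest, reply: FastifyReply): void {
  const onError = (err: NodeJS.ErrnoException): void => {
    // ERR_STREAM_PREMATURE_CLOSE just means the client went away early
    if (err.code !== 'ERR_STREAM_PREMATURE_CLOSE') {
      console.error('stream error:', err)
    }
  }

  request.raw.on('error', onError)
  reply.raw.on('error', onError)
}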

SgtPooki (Member Author)

updated #112

achingbrain (Member)

> should we allow a server to die given underlying library bugs if they're recoverable?

Again, the point of the linked warning is that you can't tell whether you can safely recover.

Consider something like this:

import fs from 'node:fs/promises'

process.on('uncaughtException', (err) => {
  if (err.message === 'recoverable') {
    // it's ok, it's recoverable
    return
  }

  console.error(err)
  process.exit(1)
})

const fd = await fs.open('/path/to/file.txt')

// something that causes an error
throw new Error('recoverable')

// never reached: the handler swallows the error, but this cleanup is skipped
await fd.close()

It looks recoverable, but the file descriptor is never closed, so it leaks.

Basically, if you're in an uncaught exception handler, all bets are off.

@@ -272,8 +266,26 @@ export class HeliaServer {
   */
  request.raw.on('close', cleanupFn)

  if (timeout != null) {
    setTimeout(() => {
achingbrain (Member), Apr 4, 2024

Timeouts like this use resources and can keep the process running. Better to use AbortSignal.timeout(ms) and combine signals with any-signal.
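
A sketch of that suggestion, assuming any-signal v4 (whose anySignal returns a signal with a clear() method that detaches its internal listeners); requestSignal stands in for a signal derived from the request:

import { anySignal } from 'any-signal'

// placeholder for a signal derived from the incoming request
declare const requestSignal: AbortSignal

// fires on whichever comes first: the 20 s timeout or the request aborting;
// AbortSignal.timeout() will not keep the Node.js process alive by itself
const signal = anySignal([AbortSignal.timeout(20_000), requestSignal])

try {
  const res = await fetch('https://example.com/', { signal })
  console.log(res.status)
} finally {
  // remove the listeners anySignal attached to its input signals
  signal.clear()
}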

SgtPooki (Member Author)

I should have cleared the timeout here; good catch, thanks. any-signal does not prevent duplicate handlers from being added to the same signal, so I prefer not to use it.

I will open an issue to update this to AbortSignal.timeout(50) and addEventListener.
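
That alternative might look something like this sketch, where controller is a placeholder for whatever produces the request-aware signal:

// forward a timeout abort onto an existing controller, without any-signal
const controller = new AbortController()
const timeoutSignal = AbortSignal.timeout(50)

timeoutSignal.addEventListener('abort', () => {
  controller.abort()
}, { once: true })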

achingbrain (Member)

> any-signal does not prevent duplicate handlers from being added to the same signal, so I prefer not to use it.

Are you passing the same signal to anySignal multiple times? Sounds like a bug if so.

SgtPooki (Member Author)

I'm not, but I know libraries that do :)

achingbrain (Member)

If you are not, then it should be okay to use it.

If you know of libraries that do, can you please open issues or, better yet, PRs?
