Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve pg write queue #1878

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

bestmike007
Copy link
Contributor

@bestmike007 bestmike007 commented Mar 1, 2024

Description

This PR is to fix the potential issue described in https://medium.com/@alkor_shikyaro/transactions-and-promises-in-node-js-ca5a3aeb6b74

Type of Change

  • New feature
  • Bug fix
  • API reference/documentation update
  • Other

Does this introduce a breaking change?

No.

Are documentation updates required?

No.

Testing information

Current tests covered this.

Checklist

  • Code is commented where needed
  • Unit test coverage for new or modified code paths
  • npm run test passes
  • Changelog is updated
  • Tag 1 of @rafaelcr or @zone117x for review

Signed-off-by: bestmike007 <i@bestmike007.com>
@bestmike007
Copy link
Contributor Author

@zone117x

Copy link

codecov bot commented Mar 1, 2024

Codecov Report

Attention: Patch coverage is 90.00000% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 71.87%. Comparing base (a0856f9) to head (fdb956d).

Files Patch % Lines
src/datastore/helpers.ts 90.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #1878   +/-   ##
========================================
  Coverage    71.87%   71.87%           
========================================
  Files           92       92           
  Lines        11913    11920    +7     
  Branches      2624     2625    +1     
========================================
+ Hits          8562     8568    +6     
- Misses        3197     3198    +1     
  Partials       154      154           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

done(): Promise<void> {
return this.queue.onIdle();
async done(): Promise<void> {
// https://medium.com/@alkor_shikyaro/transactions-and-promises-in-node-js-ca5a3aeb6b74
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe the problem described in this post applies to the tasks handled by this queue for a couple of reasons:

  1. PQueue's onIdle already behaves like allSettled

    Returns a promise that settles when the queue becomes empty, and all promises have completed

  2. The API only enqueues work to this queue inside of a pre-existing SQL transaction (all the block processing code is contained in a single sqlWriteTransaction). If there's ever an exception thrown inside, the transaction aborts and rolls back all the pending writes. We've done extensive tests on this and it's worked correctly for a long time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. That's what I thought, but it was wrong; here's a demonstration: https://replit.com/@bestmike007/TroubledEarnestSandboxes#index.ts
  2. onIdle does not throw an exception, it simply waits all promises to complete.

What I'm seeing is, there was an unhandledRejection and then the process restarted.

@rafaelcr
Copy link
Collaborator

rafaelcr commented Mar 1, 2024

@bestmike007 I'm curious to now if this fix was submitted because of a problem you're experiencing? We'd be happy to test some edge case if you've found one we haven't covered yet

@bestmike007
Copy link
Contributor Author

Yes I was trying to resolve a problem, our nodes run into unrecoverable state, not sure if it's related to this PR.

Here's one of the situations I saw, which should not be happening:

telegram-cloud-photo-size-5-6273870458726235191-y

Another one, I couldn't found a screenshot, was reporting something like table tid from new index tuple (13374,63) overlaps with invalid duplicate tuple at offset 4 of block 6211 in inde, and can be fixed by 3 steps: remove the nft_custody_unique constraints, remove duplicates with lower block heights, and then rebuild the unique constraint.

Signed-off-by: bestmike007 <i@bestmike007.com>
@rafaelcr
Copy link
Collaborator

rafaelcr commented Mar 6, 2024

Hmm I don't believe that error is related to this change. It seems strange, though, because the ON CONFLICT ON CONSTRAINT is there precisely to avoid this error. Can you tell me more about your setup? Is this error occurring in an event replay, or a normal Stacks node sync?

}
enqueue(task: Parameters<PQueue['add']>[0]): void {
void this.queue.add(task);
const p = this.queue.add(task);
p.catch(e => logger.error(e, 'PgWriteQueue task failed'));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this catch needed, or would it be handled later in the throw firstRejected.reason code?

If the error is logged here, it would have no meaningful stack trace compared to the below code which preserves the async call stack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's no catch, the process will be terminated by an unhandledRejection.

The error is logged here because throw firstRejected.reason only throws the first one, we don't have any log for the other errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, would it make sense to use the p-queue error event instead? https://www.npmjs.com/package/p-queue#error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what we saw using a node built from this PR.

image

Copy link
Member

@zone117x zone117x Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we can manually capture the errors ourself with .catch to prevent a terminated by an unhandledRejection. But could we instead use the build-in p-queue errror handler, e.g.

queue.on('error', error => {
	console.error(error);
});

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bestmike007
Copy link
Contributor Author

Hmm I don't believe that error is related to this change. It seems strange, though, because the ON CONFLICT ON CONSTRAINT is there precisely to avoid this error. Can you tell me more about your setup? Is this error occurring in an event replay, or a normal Stacks node sync?

It it not related, it's another issue I saw when the node keeps crashing because of unhandledRejection when it fails to process a new_block event.

Our nodes are set up using the script in this repo: https://github.com/alexgo-io/stacks-node-mainnet, and nodes restoring from both hiro archive and the cold backup saw this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🏗 In Progress
Development

Successfully merging this pull request may close these issues.

3 participants