[MongoDB] Fix resume token handling when no events are received #251

Merged
merged 7 commits into from
Apr 30, 2025

Conversation

rkistner
Contributor

@rkistner rkistner commented Apr 30, 2025

Our current implementation uses resume tokens received from change stream events to resume the stream after a restart.

However, if the source collections do not receive writes, we do not receive any actual events, and the token is never advanced. If the source cluster receives many other oplog events, resuming the stream can become very slow after a while, or time out completely. Once it starts timing out, we don't have a way to recover other than restarting replication from scratch. The timeout typically shows up as an error "connection 2 to 159.41.94.47:27017 timed out". The error can be simulated by reducing socketTimeoutMS significantly (say 50ms).
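To reproduce the timeout locally, one option is to lower `socketTimeoutMS` through the connection string. The snippet below is a minimal sketch, not code from this PR: `withSocketTimeout` is a hypothetical helper, the URI is a placeholder, and it assumes the URI already has a path component (or trailing slash) and does not already set `socketTimeoutMS`.

```typescript
// Hypothetical helper (not part of this PR): append socketTimeoutMS to a
// MongoDB connection string so that slow change stream resumes time out quickly.
// Assumes the URI has a path or trailing slash and no existing socketTimeoutMS.
function withSocketTimeout(uri: string, timeoutMs: number): string {
  const separator = uri.indexOf("?") >= 0 ? "&" : "?";
  return uri + separator + "socketTimeoutMS=" + String(timeoutMs);
}

// Placeholder URI for a local test instance.
const testUri = withSocketTimeout("mongodb://localhost:27017/test", 50);
```

The resulting string can then be passed to the driver as the connection URI for a test run.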

MongoDB does however provide new resume tokens even when there are no events:
https://www.mongodb.com/docs/manual/changeStreams/#std-label-change-stream-resume-token

The $changeStream aggregation stage includes a resume token on the cursor.postBatchResumeToken field.
The getMore command includes a resume token on the cursor.postBatchResumeToken field.

The MongoDB driver exposes these as stream.resumeToken. There are some implementation details on exactly when this is updated, but in general it is safe to use it when stream.tryNext() returns null.
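The idea can be sketched roughly as follows. This is an illustration under assumptions, not the actual change in this PR: `ChangeStreamLike`, `EventWithId`, and `pollOnce` are hypothetical names, and the interface only models the two driver members mentioned above (`tryNext()` and `resumeToken`).

```typescript
// Minimal shape of the change stream members this sketch relies on.
// Hypothetical interface; the real driver's ChangeStream has more members.
interface ChangeStreamLike<TEvent> {
  tryNext(): Promise<TEvent | null>;
  readonly resumeToken: unknown;
}

// Change stream events carry their own resume token in _id.
interface EventWithId {
  _id: unknown;
}

// Hypothetical helper: poll once and return the resume token to persist.
async function pollOnce<TEvent extends EventWithId>(
  stream: ChangeStreamLike<TEvent>,
  handleEvent: (event: TEvent) => Promise<void>
): Promise<unknown> {
  const event = await stream.tryNext();
  if (event != null) {
    // An actual event: process it and resume from the event's own token.
    await handleEvent(event);
    return event._id;
  }
  // No event in this batch: use the token exposed on the stream, which keeps
  // advancing even when the watched collections receive no writes.
  return stream.resumeToken;
}
```

Persisting the returned token on every poll, rather than only after events, is what keeps the stored resume point close to the current oplog position on idle collections.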

This will avoid the case of the resume token falling too far behind, at least in the case where the service is running. For self-hosted cases, when the service is stopped for a while (or when there were connection errors for a while), it is still possible to run into the timeout. This doesn't implement automatic recovery from the timeout yet (since it is difficult to know whether the timeout is permanent or can just be retried), but does improve the error message.
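As a rough illustration of what a clearer error could look like (a sketch only; `explainStreamError` and the message wording are hypothetical and not the code in this PR), the timeout can be recognized by its message and wrapped with a hint about the likely cause:

```typescript
// Hypothetical helper: turn the raw socket timeout into a more descriptive error.
// The pattern matches messages like "connection 2 to 159.41.94.47:27017 timed out".
function explainStreamError(error: Error): Error {
  if (/connection \d+ to \S+ timed out/.test(error.message)) {
    return new Error(
      "Timeout while resuming the change stream. The resume token may be too far behind. " +
        "Original error: " + error.message
    );
  }
  // Anything else is passed through unchanged.
  return error;
}
```

This only changes the message; deciding whether to retry or restart replication is left to the caller.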

changeset-bot bot commented Apr 30, 2025

🦋 Changeset detected

Latest commit: 23dd6be

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 14 packages
| Name | Type |
| --- | --- |
| @powersync/service-errors | Patch |
| @powersync/service-module-mongodb | Patch |
| @powersync/lib-service-mongodb | Patch |
| @powersync/service-core | Patch |
| @powersync/service-image | Patch |
| @powersync/lib-services-framework | Patch |
| @powersync/service-module-mongodb-storage | Patch |
| @powersync/service-core-tests | Patch |
| @powersync/service-module-mysql | Patch |
| @powersync/service-module-postgres-storage | Patch |
| @powersync/service-module-postgres | Patch |
| test-client | Patch |
| @powersync/service-rsocket-router | Patch |
| @powersync/lib-service-postgres | Patch |

Rentacookie
Rentacookie previously approved these changes Apr 30, 2025
Contributor

@Rentacookie Rentacookie left a comment

This must have been challenging to debug 😬

@rkistner rkistner requested a review from Rentacookie April 30, 2025 11:45
@rkistner rkistner merged commit 08f6ae8 into main Apr 30, 2025
21 checks passed
@rkistner rkistner deleted the fix-changestream-resumetoken branch April 30, 2025 14:57
@rkistner rkistner mentioned this pull request Jul 17, 2025
1 task