Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor recovery to support Stores belonging to multiple topics #774

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

daniil-quix
Copy link
Collaborator

@daniil-quix daniil-quix commented Mar 5, 2025

Main changes

Updated the format of the "__processed_offsets__" headers in the changelog messages

  • The header name is now __processed_offsets__ (previously __processed_offset__)
  • The header contains JSON encoded dict with topic offsets for this partition (e.g. {"topic": 123})
  • During recovery, we check if all offsets in the dictionary are below the committed offsets for the same TPs. Currently, there's only one offset possible in the dict.

Upgrade considerations:

  • ALOS mode: the last checkpoint must be committed successfully before upgrading to the new version.
    Otherwise, the app won't be able to filter out over-produced changelog messages because it won't recognize the old header format used previously.
  • EOS mode: works fine as before.

Removed the warning about processed offset being behind the committed offset.

Long before implementing changelog recovery, we added a mechanism to check if the processed offset in the state is aligned with the committed offset as a best-effort to signal state inconsistency.
Since changelog topics are widely used now, we can remove the check without any harm, and also stop storing the processed offsets directly in the State.

Moved Recovery checks and validations to RecoveryPartition

Previously, StorePartition subclasses verified the changelog message format and checked the processed and committed offsets.
It was a repeatable code, and it now lives in RecoveryPartition.

Also, I added the StorePartition.write_changelog_offsets() method to roll the changelog offsets for the skipped overproduced changelogs.

@daniil-quix daniil-quix force-pushed the feature/recovery-refactoring-for-merge branch from 44b944e to 96b1249 Compare March 5, 2025 16:42
Copy link
Contributor

@tim-quix tim-quix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an initial pass, might look again tomorrow. Only one thing that feels like it could maybe use a second glance so far.

Either way, it was easy to follow and I like that it's actually a bit more simplified now! Really nice 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants