HDDS-6961. [Snapshot] Bootstrapping slow followers/new followers. #3980
Merged
What changes were proposed in this pull request?
This PR updates the OM follower bootstrap mechanism to include the snapshot state.
It considers snapshot state to be all files under metadataDir/db.snapshots.
That includes the OM snapshot directories, as well as the snapshot diff compaction logs and backup SST files (which this PR moves to the "db.snapshots" dir).
This PR adds the contents of the db.snapshots dir to the tarball sent to the follower. To reduce the size of the tarball, it includes only one copy of each hard-linked file. The remaining hard links are recorded in a list, and the follower recreates them after unpacking.
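The hard-link handling can be illustrated with a minimal sketch. This is not the code in this PR: `HardLinkPlan`, `scan`, and `restoreLinks` are hypothetical names, and the sketch assumes a file system where `BasicFileAttributes.fileKey()` identifies the inode (it may return null on some platforms, in which case every file is simply copied).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration only; not the class used in the PR. */
final class HardLinkPlan {
  /** Relative paths to copy into the tarball, one per distinct inode. */
  final List<Path> filesToCopy = new ArrayList<>();
  /** Relative link path -> relative path of the copied file it should link to. */
  final Map<Path, Path> linksToCreate = new LinkedHashMap<>();

  /** Leader side: walk the directory and split files into copies and links. */
  static HardLinkPlan scan(Path root) throws IOException {
    HardLinkPlan plan = new HardLinkPlan();
    Map<Object, Path> firstSeen = new HashMap<>();
    List<Path> files;
    try (Stream<Path> walk = Files.walk(root)) {
      // Sort so the choice of "first copy" is deterministic.
      files = walk.filter(p -> Files.isRegularFile(p)).sorted()
          .collect(Collectors.toList());
    }
    for (Path file : files) {
      Object key = Files.readAttributes(file, BasicFileAttributes.class).fileKey();
      Path rel = root.relativize(file);
      // A null key means the platform cannot identify inodes: copy the file.
      Path original = (key == null) ? null : firstSeen.putIfAbsent(key, rel);
      if (original == null) {
        plan.filesToCopy.add(rel);
      } else {
        plan.linksToCreate.put(rel, original);
      }
    }
    return plan;
  }

  /** Follower side: recreate the hard links after the tarball is unpacked. */
  static void restoreLinks(Path root, Map<Path, Path> links) throws IOException {
    for (Map.Entry<Path, Path> e : links.entrySet()) {
      Path link = root.resolve(e.getKey());
      Files.createDirectories(link.getParent());
      Files.createLink(link, root.resolve(e.getValue()));
    }
  }
}
```

In this sketch the leader would add `filesToCopy` to the tarball along with a serialized form of `linksToCreate`, and the follower would call `restoreLinks` once the files are in place.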
Design doc here: https://docs.google.com/document/d/1cFZj-7NRxiHaZ56ndcf1Z1EqapPFy4fo4dDVIy_aCx4/edit
Recon
Recon also uses the same tarball to initialize its copy of the OM RocksDB. Since it doesn't need the snapshot data, I've added the "includeSnapshotData" parameter to the HTTP request.
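As a rough sketch of how a caller might set the parameter and how a handler might read it: only the parameter name "includeSnapshotData" comes from this PR. The class `CheckpointRequest`, the host, the port, the endpoint path, and the caller-supplied default are all assumptions for illustration.

```java
import java.net.URI;

/** Hypothetical helper for illustration only; not the class used in the PR. */
final class CheckpointRequest {
  static final String PARAM = "includeSnapshotData";

  /** Client side: build the checkpoint URL with the flag set explicitly. */
  static URI build(String omHttpAddress, String endpointPath, boolean includeSnapshotData) {
    return URI.create("http://" + omHttpAddress + endpointPath
        + "?" + PARAM + "=" + includeSnapshotData);
  }

  /** Server side: read the flag from a raw query string, falling back to a default. */
  static boolean parse(String query, boolean defaultValue) {
    if (query == null) {
      return defaultValue;
    }
    for (String pair : query.split("&")) {
      if (pair.startsWith(PARAM + "=")) {
        return Boolean.parseBoolean(pair.substring(PARAM.length() + 1));
      }
    }
    return defaultValue;
  }
}
```

With these assumed values, a Recon-style caller would request `CheckpointRequest.build("om-host:9874", "/dbCheckpoint", false)`, while a bootstrapping follower would pass `true`.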
Renamed OzoneManagerSnapshotProvider
The Ratis code uses the term "snapshot" to mean something other than what we mean. It uses "snapshot" to refer to the tarball as a whole (which now includes all of the individual "OM snapshots").
In particular, this class in the "om/snapshot" directory is ambiguously named:
To reduce potential confusion, I've renamed it to:
Internal Consistency of Tarball
There are two areas of consistency I've thought about:
Snapshot Info Table Entries -> Snapshot Directories
There needs to be a directory for each snapshot info table entry. These directories sometimes appear a short while after the snapshot info table entry is created.
This PR addresses that by ensuring the directories exist before creating the tarball (pausing for a few seconds if needed).
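The wait can be sketched as a bounded polling loop. This is an illustration under assumptions, not the PR's implementation: `SnapshotDirWaiter` and `waitForDirs` are hypothetical names, and the timeout and poll interval are left to the caller.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;

/** Hypothetical helper for illustration only; not the class used in the PR. */
final class SnapshotDirWaiter {
  /**
   * Polls until every expected snapshot directory exists or the timeout expires.
   * Returns true if all directories were found, false on timeout.
   */
  static boolean waitForDirs(List<Path> expectedDirs, Duration timeout,
      Duration pollInterval) throws InterruptedException {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      boolean allExist = expectedDirs.stream().allMatch(p -> Files.isDirectory(p));
      if (allExist) {
        return true;
      }
      // Overflow-safe comparison against the deadline.
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      Thread.sleep(pollInterval.toMillis());
    }
  }
}
```

The list of expected directories would be derived from the snapshot info table entries. What to do on timeout (fail the checkpoint request or proceed without the missing directory) is a separate choice that this sketch leaves to the caller.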
Compaction Logs -> SST Files
If the tarball is created during compaction, the snap diff compaction logs for the most recent compaction may not be included. I'm not sure how bad a problem this is. Please consider it in your review of this PR.
Incremental checkpointing
The addition of snapshot data will increase the size of the tarball, exacerbating the problem described here: https://issues.apache.org/jira/browse/HDDS-6510
We'll need to decide whether incremental checkpointing needs to be a part of the initial snapshot release. If not, it may need to come soon afterwards; otherwise users could be stranded.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-6961
How was this patch tested?
Unit/integration tests added