Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover to exact latest seqno of data committed to MANIFEST #9305

Closed
wants to merge 8 commits into from

Conversation

ajkr
Copy link
Contributor

@ajkr ajkr commented Dec 17, 2021

The LastSequence field in the MANIFEST file is the baseline seqno for a recovered DB. Recovering WAL entries might cause the recovered DB's seqno to advance above this baseline, but the recovered DB will never use a smaller seqno.

Before this PR, we were writing the DB's seqno at the time of LogAndApply() as the LastSequence value. This works in the sense that it is a large enough baseline for the recovered DB that it'll never overwrite any records in existing SST files. At the same time, it's arbitrarily larger than what's needed. This behavior comes from LevelDB, where there was no tracking of largest seqno in an SST file.

Now we know the largest seqno of newly written SST files, so we can write an exact value in LastSequence that actually reflects the largest seqno in any file referred to by the MANIFEST. This is primarily useful for correctness testing with unsynced data loss, where the recovered DB's seqno needs to indicate what records were recovered.

Test Plan:

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@riversand963 riversand963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ajkr for the PR. LGTM with two minor comments.

@@ -4530,6 +4530,7 @@ Status VersionSet::ProcessManifestWrites(
// For each column family, update its log number indicating that logs
// with number smaller than this should be ignored.
uint64_t last_min_log_number_to_keep = 0;
uint64_t last_sequence = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe better to declare last_sequence as SequenceNumber (which is actually uint64_t).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

db/version_set.h Outdated
// (manifest file).
//
// Requires DB mutex held.
uint64_t DescriptorLastSequence() const { return descriptor_last_sequence_; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this called by anybody?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, removed it.

@ajkr
Copy link
Contributor Author

ajkr commented Dec 19, 2021

Thanks @ajkr for the PR. LGTM with two minor comments.

Thanks for the review! I think this PR is risky. One thing I plan to add is assertions on LastSequence being >= all FileMetaData::largest_seqno post-recovery. Another thing I plan to add is a crash test that does correctness testing with WAL disabled because that's the primary scenario where the MANIFEST's LastSequence will actually be used as the recovered DB's sequence number.

Do you see anything else interesting to test? Possible interactions with best-effort recovery or atomic flush, for example?

@riversand963
Copy link
Contributor

This is indeed a big change from the behavior since day 1, and we should be careful.
This change should not affect how version edits in the same atomic group are applied.
For best-efforts recovery, the exact seq set by this PR will still be an upper bound (not considering WAL recovery).

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@ajkr ajkr force-pushed the manifest-last-sequence-exact-bound branch from 617dfb3 to 82ac77e Compare January 5, 2022 00:46
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor Author

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

@@ -4530,6 +4530,7 @@ Status VersionSet::ProcessManifestWrites(
// For each column family, update its log number indicating that logs
// with number smaller than this should be ignored.
uint64_t last_min_log_number_to_keep = 0;
uint64_t last_sequence = 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

db/version_set.h Outdated
// (manifest file).
//
// Requires DB mutex held.
uint64_t DescriptorLastSequence() const { return descriptor_last_sequence_; }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, removed it.

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ajkr
Copy link
Contributor Author

ajkr commented Jan 5, 2022

Manual downgrade compatibility testing found something. When the MANIFEST tail looks like this:

VersionEdit {
  LogNumber: 8914
  PrevLogNumber: 0
  NextFileNumber: 9161
  LastSeq: 3201408
  AddFile: 0 9159 22430 '^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@  x' seq:3198869, type:1 .. '^@^@^@^@^@^@^@c^@^@^@^@^@^@^A+^@^@^@^@^@^@^BU' seq:3199820, type:1 oldest_ancester_time:1641352186 file_creation_time:1641352186 file_checksum: file_checks
um_func_name: Unknown
  ColumnFamily: 8
  AtomicGroup: 1 entries remains
}
VersionEdit {
  LogNumber: 8914
  PrevLogNumber: 0
  NextFileNumber: 9161
  LastSeq: 3201399
  AddFile: 0 9160 22438 '^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@^Oxxxxxxx' seq:3200501, type:1 .. '^@^@^@^@^@^@^@c^@^@^@^@^@^@^A+^@^@^@^@^@^@^B<98>' seq:3199305, type:1 oldest_ancester_time:1641352186 file_creation_time:1641352186 file_checksum: file_checksum_func_name: Unknown
  ColumnFamily: 9
  AtomicGroup: 0 entries remains
}

That is, the tail has a group of VersionEdits whose LastSeqs are descending. The new code recovers to the max LastSeq out of all VersionEdits. But the old code recovers to the final VersionEdit's LastSeq (3201399), which is too small. We need to only write the largest LastSeq in the group to make it downgrade safe.

@riversand963
Copy link
Contributor

I suspect the descending seq at the end of MANIFEST is caused by this part: https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_compaction_flush.cc#L495:L510.

For atomic flush, we process jobs[0] after all other jobs. Maybe I should change that.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@ajkr
Copy link
Contributor Author

ajkr commented Jan 5, 2022

I suspect the descending seq at the end of MANIFEST is caused by this part: https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_compaction_flush.cc#L495:L510.

For atomic flush, we process jobs[0] after all other jobs. Maybe I should change that.

Oh I see. Well that would help in the old code, but in the new code we're using FileDescriptor::largest_seqno to advance LastSeq, and it's kind of arbitrary which entry of jobs will have the file with the maximum largest_seqno.

In any case I changed the loop that calls LogAndApply{,CF}Helper() for each VersionEdit and made it track the maximum LastSeq of all the edits, and advance any smaller ones in the same group. So far it seems to work but I'll run the upgrade/downgrade test for like 24 hours to be sure.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963
Copy link
Contributor

Make sense, especially there are multiple column families.

facebook-github-bot pushed a commit that referenced this pull request Jan 6, 2022
…9338)

Summary:
Recently we added the ability to verify some prefix of operations are recovered (AKA no "hole" in the recovered data) (#8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so crash test always enables atomic flush when WAL is disabled.

To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, it is possible but requires further changes. I added TODOs in db_crashtest.py.

Depends on #9305.

Pull Request resolved: #9338

Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in Phabricator diff test plan.

Reviewed By: riversand963

Differential Revision: D33345333

Pulled By: ajkr

fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants