-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recover to exact latest seqno of data committed to MANIFEST #9305
Conversation
@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ajkr for the PR. LGTM with two minor comments.
db/version_set.cc
Outdated
@@ -4530,6 +4530,7 @@ Status VersionSet::ProcessManifestWrites( | |||
// For each column family, update its log number indicating that logs | |||
// with number smaller than this should be ignored. | |||
uint64_t last_min_log_number_to_keep = 0; | |||
uint64_t last_sequence = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe better to declare last_sequence
as SequenceNumber
(which is actually uint64_t
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
db/version_set.h
Outdated
// (manifest file). | ||
// | ||
// Requires DB mutex held. | ||
uint64_t DescriptorLastSequence() const { return descriptor_last_sequence_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this called by anybody?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, removed it.
Thanks for the review! I think this PR is risky. One thing I plan to add is assertions on LastSequence being >= all FileMetaData::largest_seqno post-recovery. Another thing I plan to add is a crash test that does correctness testing with WAL disabled because that's the primary scenario where the MANIFEST's LastSequence will actually be used as the recovered DB's sequence number. Do you see anything else interesting to test? Possible interactions with best-effort recovery or atomic flush, for example? |
This is indeed a big change from the behavior since day 1, and we should be careful. |
efceff8
to
617dfb3
Compare
@ajkr has updated the pull request. You must reimport the pull request before landing. |
617dfb3
to
82ac77e
Compare
@ajkr has updated the pull request. You must reimport the pull request before landing. |
@ajkr has updated the pull request. You must reimport the pull request before landing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
db/version_set.cc
Outdated
@@ -4530,6 +4530,7 @@ Status VersionSet::ProcessManifestWrites( | |||
// For each column family, update its log number indicating that logs | |||
// with number smaller than this should be ignored. | |||
uint64_t last_min_log_number_to_keep = 0; | |||
uint64_t last_sequence = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
db/version_set.h
Outdated
// (manifest file). | ||
// | ||
// Requires DB mutex held. | ||
uint64_t DescriptorLastSequence() const { return descriptor_last_sequence_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, removed it.
@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
1 similar comment
@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Manual downgrade compatibility testing found something. When the MANIFEST tail looks like this:
That is, the tail has a group of |
I suspect the descending seq at the end of MANIFEST is caused by this part: https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_compaction_flush.cc#L495:L510. For atomic flush, we process jobs[0] after all other jobs. Maybe I should change that. |
@ajkr has updated the pull request. You must reimport the pull request before landing. |
Oh I see. Well that would help in the old code, but in the new code we're using FileDescriptor::largest_seqno to advance LastSeq, and it's kind of arbitrary which entry of jobs will have the file with the maximum largest_seqno. In any case I changed the loop that calls |
@ajkr has updated the pull request. You must reimport the pull request before landing. |
@ajkr has updated the pull request. You must reimport the pull request before landing. |
@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Make sense, especially there are multiple column families. |
…9338) Summary: Recently we added the ability to verify some prefix of operations are recovered (AKA no "hole" in the recovered data) (#8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so crash test always enables atomic flush when WAL is disabled. To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, it is possible but requires further changes. I added TODOs in db_crashtest.py. Depends on #9305. Pull Request resolved: #9338 Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in Phabricator diff test plan. Reviewed By: riversand963 Differential Revision: D33345333 Pulled By: ajkr fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce
The LastSequence field in the MANIFEST file is the baseline seqno for a recovered DB. Recovering WAL entries might cause the recovered DB's seqno to advance above this baseline, but the recovered DB will never use a smaller seqno.
Before this PR, we were writing the DB's seqno at the time of LogAndApply() as the LastSequence value. This works in the sense that it is a large enough baseline for the recovered DB that it'll never overwrite any records in existing SST files. At the same time, it's arbitrarily larger than what's needed. This behavior comes from LevelDB, where there was no tracking of largest seqno in an SST file.
Now we know the largest seqno of newly written SST files, so we can write an exact value in LastSequence that actually reflects the largest seqno in any file referred to by the MANIFEST. This is primarily useful for correctness testing with unsynced data loss, where the recovered DB's seqno needs to indicate what records were recovered.
Test Plan:
VersionSet::descriptor_last_sequence_
consistency with filesdb_stress
built with and without this PR (for old code it must run with-disable_wal=0
)TEST_TMPDIR=/dev/shm/ DEBUG_LEVEL=0 python3 tools/db_crashtest.py blackbox --max_key=10000 --interval=10 --duration=86400 --value_size_mult=33 --write_buffer_size=262144 --compression_type=none