Recover to exact latest seqno of data committed to MANIFEST #9305

ajkr · 2021-12-17T07:02:01Z

The LastSequence field in the MANIFEST file is the baseline seqno for a recovered DB. Recovering WAL entries might cause the recovered DB's seqno to advance above this baseline, but the recovered DB will never use a smaller seqno.

Before this PR, we were writing the DB's seqno at the time of LogAndApply() as the LastSequence value. This works in the sense that it is a large enough baseline for the recovered DB that it'll never overwrite any records in existing SST files. At the same time, it's arbitrarily larger than what's needed. This behavior comes from LevelDB, where there was no tracking of largest seqno in an SST file.

Now we know the largest seqno of newly written SST files, so we can write an exact value in LastSequence that actually reflects the largest seqno in any file referred to by the MANIFEST. This is primarily useful for correctness testing with unsynced data loss, where the recovered DB's seqno needs to indicate what records were recovered.
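
To make the description above concrete, here is a minimal sketch of the two ideas: the exact baseline is the largest seqno over all files the MANIFEST refers to, and WAL replay can only move the recovered seqno upward from that baseline. `FileMeta`, `ExactLastSequence`, and `RecoveredSequence` are hypothetical names used for illustration only; they are not RocksDB's actual types or functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using SequenceNumber = uint64_t;

// Hypothetical stand-in for the per-file metadata recorded in the MANIFEST.
struct FileMeta {
  SequenceNumber smallest_seqno;
  SequenceNumber largest_seqno;
};

// Exact baseline: the largest seqno in any file the MANIFEST refers to.
// Returns 0 when there are no files.
SequenceNumber ExactLastSequence(const std::vector<FileMeta>& files) {
  SequenceNumber last = 0;
  for (const FileMeta& f : files) {
    last = std::max(last, f.largest_seqno);
  }
  return last;
}

// Recovery starts from the MANIFEST baseline. Replayed WAL entries may
// advance the seqno, but the result is never below the baseline.
SequenceNumber RecoveredSequence(SequenceNumber manifest_last_sequence,
                                 const std::vector<SequenceNumber>& wal_seqnos) {
  SequenceNumber seq = manifest_last_sequence;
  for (SequenceNumber s : wal_seqnos) {
    seq = std::max(seq, s);
  }
  assert(seq >= manifest_last_sequence);
  return seq;
}
```

With the WAL disabled there is nothing to replay, so in this model the recovered seqno equals the MANIFEST baseline, which is why an exact baseline matters for the correctness tests mentioned above.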

Test Plan:

- #9338 (Test correctness with WAL disabled in non-txn blackbox crash tests) adds crash-recovery correctness testing coverage for WAL-disabled use cases.
- #9357 (Enable IngestExternalFile() in crash test) will extend that testing to cover file ingestion.
- Added an assertion at the end of LogAndApply() that VersionSet::descriptor_last_sequence_ is consistent with the files.
- Manually tested upgrade/downgrade compatibility with a custom crash test that randomly picks between a db_stress built with and without this PR (the old code must run with -disable_wal=0).
  - Branch: https://github.com/ajkr/rocksdb/tree/638171597be8132c99cbba37de19b3ea8926194e
  - Command: TEST_TMPDIR=/dev/shm/ DEBUG_LEVEL=0 python3 tools/db_crashtest.py blackbox --max_key=10000 --interval=10 --duration=86400 --value_size_mult=33 --write_buffer_size=262144 --compression_type=none

facebook-github-bot · 2021-12-17T07:05:40Z

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

riversand963

Thanks @ajkr for the PR. LGTM with two minor comments.

riversand963 · 2021-12-19T04:47:17Z

db/version_set.cc

@@ -4530,6 +4530,7 @@ Status VersionSet::ProcessManifestWrites(
      // For each column family, update its log number indicating that logs
      // with number smaller than this should be ignored.
      uint64_t last_min_log_number_to_keep = 0;
+      uint64_t last_sequence = 0;


Nit: maybe better to declare last_sequence as SequenceNumber (which is actually uint64_t).

riversand963 · 2021-12-19T05:07:17Z

db/version_set.h

+  // (manifest file).
+  //
+  // Requires DB mutex held.
+  uint64_t DescriptorLastSequence() const { return descriptor_last_sequence_; }


Is this called by anybody?

Done, removed it.

ajkr · 2021-12-19T19:17:21Z

> Thanks @ajkr for the PR. LGTM with two minor comments.

Thanks for the review! I think this PR is risky. One thing I plan to add is assertions on LastSequence being >= all FileMetaData::largest_seqno post-recovery. Another thing I plan to add is a crash test that does correctness testing with WAL disabled because that's the primary scenario where the MANIFEST's LastSequence will actually be used as the recovered DB's sequence number.

Do you see anything else interesting to test? Possible interactions with best-effort recovery or atomic flush, for example?

riversand963 · 2021-12-20T21:34:59Z

This is indeed a big change from the behavior since day 1, and we should be careful.
This change should not affect how version edits in the same atomic group are applied.
For best-effort recovery, the exact seq set by this PR will still be an upper bound (not considering WAL recovery).

facebook-github-bot · 2022-01-02T00:34:18Z

@ajkr has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-01-05T00:46:33Z

@ajkr has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-01-05T02:02:16Z

@ajkr has updated the pull request. You must reimport the pull request before landing.

ajkr

Thanks for the review!

ajkr · 2022-01-05T02:00:43Z

db/version_set.cc

@@ -4530,6 +4530,7 @@ Status VersionSet::ProcessManifestWrites(
      // For each column family, update its log number indicating that logs
      // with number smaller than this should be ignored.
      uint64_t last_min_log_number_to_keep = 0;
+      uint64_t last_sequence = 0;


ajkr · 2022-01-05T02:00:52Z

db/version_set.h

+  // (manifest file).
+  //
+  // Requires DB mutex held.
+  uint64_t DescriptorLastSequence() const { return descriptor_last_sequence_; }


Done, removed it.

facebook-github-bot · 2022-01-05T02:06:26Z

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2022-01-05T02:18:30Z

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ajkr · 2022-01-05T05:06:46Z

Manual downgrade compatibility testing found something. When the MANIFEST tail looks like this:

VersionEdit {
  LogNumber: 8914
  PrevLogNumber: 0
  NextFileNumber: 9161
  LastSeq: 3201408
  AddFile: 0 9159 22430 '^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@  x' seq:3198869, type:1 .. '^@^@^@^@^@^@^@c^@^@^@^@^@^@^A+^@^@^@^@^@^@^BU' seq:3199820, type:1 oldest_ancester_time:1641352186 file_creation_time:1641352186 file_checksum: file_checksum_func_name: Unknown
  ColumnFamily: 8
  AtomicGroup: 1 entries remains
}
VersionEdit {
  LogNumber: 8914
  PrevLogNumber: 0
  NextFileNumber: 9161
  LastSeq: 3201399
  AddFile: 0 9160 22438 '^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@^Oxxxxxxx' seq:3200501, type:1 .. '^@^@^@^@^@^@^@c^@^@^@^@^@^@^A+^@^@^@^@^@^@^B<98>' seq:3199305, type:1 oldest_ancester_time:1641352186 file_creation_time:1641352186 file_checksum: file_checksum_func_name: Unknown
  ColumnFamily: 9
  AtomicGroup: 0 entries remains
}

That is, the tail has a group of VersionEdits whose LastSeqs are descending. The new code recovers to the max LastSeq out of all VersionEdits. But the old code recovers to the final VersionEdit's LastSeq (3201399), which is too small. We need to only write the largest LastSeq in the group to make it downgrade safe.
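
The difference between the two recovery behaviors described here can be sketched with a toy model. `Edit`, `RecoverNew`, and `RecoverOld` are hypothetical names; real recovery does much more than read one field per VersionEdit, so treat this only as an illustration of "max over all edits" versus "value from the final edit".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using SequenceNumber = uint64_t;

// Hypothetical, stripped-down VersionEdit carrying only its LastSeq field.
struct Edit {
  SequenceNumber last_seq;
};

// New behavior as described in the comment: the maximum LastSeq of all edits.
SequenceNumber RecoverNew(const std::vector<Edit>& edits) {
  SequenceNumber last = 0;
  for (const Edit& e : edits) {
    last = std::max(last, e.last_seq);
  }
  return last;
}

// Old behavior as described in the comment: the final edit's LastSeq wins.
SequenceNumber RecoverOld(const std::vector<Edit>& edits) {
  return edits.empty() ? 0 : edits.back().last_seq;
}
```

Fed the two LastSeq values from the dump in order (3201408, then 3201399), `RecoverNew` yields 3201408 while `RecoverOld` yields 3201399, which is the downgrade mismatch being reported.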

riversand963 · 2022-01-05T18:17:06Z

I suspect the descending seq at the end of MANIFEST is caused by this part: https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_compaction_flush.cc#L495:L510.

For atomic flush, we process jobs[0] after all other jobs. Maybe I should change that.

facebook-github-bot · 2022-01-05T20:50:39Z

@ajkr has updated the pull request. You must reimport the pull request before landing.

ajkr · 2022-01-05T20:57:25Z

> I suspect the descending seq at the end of MANIFEST is caused by this part: https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_compaction_flush.cc#L495:L510.
>
> For atomic flush, we process jobs[0] after all other jobs. Maybe I should change that.

Oh I see. Well that would help in the old code, but in the new code we're using FileDescriptor::largest_seqno to advance LastSeq, and it's kind of arbitrary which entry of jobs will have the file with the maximum largest_seqno.

In any case I changed the loop that calls LogAndApply{,CF}Helper() for each VersionEdit and made it track the maximum LastSeq of all the edits, and advance any smaller ones in the same group. So far it seems to work but I'll run the upgrade/downgrade test for like 24 hours to be sure.
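
One way to picture the fix described above is a running maximum over the group: each edit's LastSeq is raised to the largest value seen so far, so the written sequence is non-decreasing and the final edit carries the group maximum. This is a sketch under that assumption, not the code in the PR; `Edit` and `MakeLastSeqNonDecreasing` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using SequenceNumber = uint64_t;

// Hypothetical, stripped-down VersionEdit carrying only its LastSeq field.
struct Edit {
  SequenceNumber last_seq;
};

// Raise every edit's LastSeq to the running maximum, in place. Afterwards the
// values never decrease, so a reader that trusts only the final edit and a
// reader that takes the maximum over all edits see the same number.
void MakeLastSeqNonDecreasing(std::vector<Edit>& edits) {
  SequenceNumber max_seen = 0;
  for (Edit& e : edits) {
    max_seen = std::max(max_seen, e.last_seq);
    e.last_seq = max_seen;
  }
  assert(edits.empty() || edits.back().last_seq == max_seen);
}
```

Applied to the group from the earlier dump (3201408, then 3201399), both edits end up with LastSeq 3201408, so the old "final edit wins" rule no longer recovers to a value that is too small.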

facebook-github-bot · 2022-01-05T21:13:15Z

@ajkr has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-01-05T22:04:33Z

@ajkr has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-01-05T22:05:04Z

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

riversand963 · 2022-01-05T23:22:56Z

Makes sense, especially when there are multiple column families.

…9338)

Summary:
Recently we added the ability to verify some prefix of operations are recovered (AKA no "hole" in the recovered data) (#8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so crash test always enables atomic flush when WAL is disabled.

To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, it is possible but requires further changes. I added TODOs in db_crashtest.py.

Depends on #9305.

Pull Request resolved: #9338

Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in Phabricator diff test plan.

Reviewed By: riversand963

Differential Revision: D33345333

Pulled By: ajkr

fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce

facebook-github-bot added the CLA Signed label Dec 17, 2021

ajkr mentioned this pull request Dec 17, 2021

Fix unsynced data loss correctness test with mixed -test_batches_snapshots #9302

Closed

riversand963 approved these changes Dec 19, 2021

ajkr mentioned this pull request Dec 27, 2021

Test correctness with WAL disabled in non-txn blackbox crash tests #9338

Closed

ajkr force-pushed the manifest-last-sequence-exact-bound branch from efceff8 to 617dfb3 Compare January 2, 2022 00:34

ajkr added 3 commits January 4, 2022 16:46

Store/restore exact last sequence referred to by MANIFEST entries

f6e0794

make format

7740e1d

more comments about descriptor last sequence

82ac77e

ajkr force-pushed the manifest-last-sequence-exact-bound branch from 617dfb3 to 82ac77e Compare January 5, 2022 00:46

ajkr added 2 commits January 4, 2022 17:58

add the promised assertion

cd3d9fc

addressed comments

073650c

ajkr commented Jan 5, 2022

write non-decreasing LastSequences for downgrade compatibility

678b2ee

touchup comments

4be70dc

fix strange Java test expectation

6806fe8

facebook-github-bot closed this in b860a42 Jan 6, 2022
