format_version=6 and context-aware block checksums #9058

pdillinger · 2021-10-19T23:32:05Z

Summary:

Context checksum

All RocksDB checksums currently use 32 bits of checking
power, which should be 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:

Data larger than one SST block is replaced by data from elsewhere in
the same or another SST file. Especially with block_align=true, the
probability of exact block size match is probably around 1 in 100, making
the FN probability around that same. Without block_align=true the
probability of same block start location is probably around 1 in 10,000,
for FN probability around 1 in a million.

To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example

blocks within about 4GB of each other are guaranteed to use different context
blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
different context
files generated by the same process are guaranteed to use different context
for the same offsets, until wrap-around after 2^32 - 1 files

Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.

Footer checksum

This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to
defeat the footer checksum, we change much of the footer data format
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles in older footer
versions. We also change the encoding of handles to free up space for
other new data in footer.

More detail: making space in footer

In order to keep footer the same size in format_version=6 (avoid change to IO
patterns), we have to free up some space for new data. We do this two ways:

Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
it immediately precedes the footer, and by assuming it is < 4GB.
Index block handle is moved into metaindex. (I don't know why it was
in footer to begin with.)

Performance

In case of small performance penalty, I've made a "pay as you go" optimization
to compensate: replace MutableCFOptions in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: prefix_extractor.
This makes the PR an overall performance improvement (results below).

Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).

(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out

Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness

Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)

Test Plan: unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.

Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably usually true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.

The format compatibility test does not need updating because we have not updated the default
format_version.

pdillinger · 2021-10-20T02:59:29Z

It was easier to implement "with context" as a bitwise-or flag on existing
checksum algorithms than to duplicate/imbue only some algorithms with
"WithContext" variants.

For the user, checksum_with_context might make more sense as its own boolean option, but we still have to put that in the footer somehow, and the ChecksumType byte seems like the natural place. But it also seems weird to have a "hidden" internal-only flag on ChecksumType restricting its value range.

An alternative could be to use format_version=6 to enable context-aware checksum, so it doesn't have to be in ChecksumType in options or in the footer. I can't think of any reason this feature shouldn't be used by everyone eventually.

mrambacher

There should be a XXH3 test added to table_test/FooterTest. Also tests for the context cases.
There should be a test where an invalid checksum type is written to the file and validate that the code does the right thing
Do you need to do backward compatibility tests to verify that what happens in earlier versions if a file with XXH3 (or contexts) as the hash is used? This is likely true both from a normal start or an ingest.

mrambacher · 2021-10-19T23:58:58Z

include/rocksdb/table.h

+// "WithContext" checksums offer essentially "full" protection against
+// misplaced data because the checksum values depend on which file they are
+// in and the location within the file. There is essentially no performance
+// penalty for this enhanced checking.
 enum ChecksumType : char {


Just wondering if this would be better as an unsigned char.

table/format.cc

table/format.h

tools/db_bench_tool.cc

include/rocksdb/table.h

db/db_basic_test.cc

db/external_sst_file_test.cc

table/block_based/reader_common.h

mrambacher · 2021-10-20T01:37:07Z

table/format.cc

-//    <padding> to make the total size 2 * BlockHandle::kMaxEncodedLength + 1
+//    <padding> up to part2 size = 2 * BlockHandle::kMaxEncodedLength - 4
+//    base_context_checksum (4 bytes, 0s if unused)
+//  ---- part3
 //    footer version (4 bytes)
 //    table_magic_number (8 bytes)
 void Footer::EncodeTo(std::string* dst) const {


Should the Footer::ToString method also contain the context? And can the "ToString" method change to print out the string version of the enum (rather than its number)?

table/format.cc

Summary: XXH3 - latest hash function that is extremely fast on large data, easily faster than crc32c on most any x86_64 hardware. In integrating this hash function, I have handled the compression type byte in a non-standard way to avoid using the streaming API (extra data movement and active code size because of hash function complexity). This approach got a thumbs-up from Yann Collet. This change split off from facebook#9058 because context-aware checksum is likely to be handled through different configuration than ChecksumType. Test Plan: tests updated, including stress test. db_bench also updated for microbenchmarking checksums. ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3 crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op) xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op) xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op) xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op) crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op) xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op) xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op) xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op) crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op) xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op) xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op) xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op) As you can see, especially once warmed up, xxh3 is fastest. ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) Test for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done Results (ops/sec) for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done results-0 252118 # kNoChecksum results-1 251588 # kCRC32c results-2 251863 # kxxHash results-3 252016 # kxxHash64 results-4 252038 # kXXH3

Summary: XXH3 - latest hash function that is extremely fast on large data, easily faster than crc32c on most any x86_64 hardware. In integrating this hash function, I have handled the compression type byte in a non-standard way to avoid using the streaming API (extra data movement and active code size because of hash function complexity). This approach got a thumbs-up from Yann Collet. Existing functionality change: * reject bad ChecksumType in options with InvalidArgument This change split off from #9058 because context-aware checksum is likely to be handled through different configuration than ChecksumType. Pull Request resolved: #9069 Test Plan: tests updated, and substantially expanded. Unit tests now check that we don't accidentally change the values generated by the checksum algorithms ("schema test") and that we properly handle invalid/unrecognized checksum types in options or in file footer. DBTestBase::ChangeOptions (etc.) updated from two to one configuration changing from default CRC32c ChecksumType. The point of this test code is to detect possible interactions among features, and the likelihood of some bad interaction being detected by including configurations other than XXH3 and CRC32c--and then not detected by stress/crash test--is extremely low. Stress/crash test also updated (manual run long enough to see it accepts new checksum type). db_bench also updated for microbenchmarking checksums. ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3 crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op) xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op) xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op) xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op) crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op) xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op) xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op) xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op) crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op) xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op) xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op) xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op) As you can see, especially once warmed up, xxh3 is fastest. ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) Test for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done Results (ops/sec) for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done results-0 252118 # kNoChecksum results-1 251588 # kCRC32c results-2 251863 # kxxHash results-3 252016 # kxxHash64 results-4 252038 # kXXH3 Reviewed By: mrambacher Differential Revision: D31905249 Pulled By: pdillinger fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1

Summary: I'm working on a new format_version=6 to support context checksum (facebook#9058) and this includes much of the refactoring and test updates to support that change. Test coverage data and manual inspection agree on dead code in block_based_table_reader.cc (removed). Test Plan: tests enhanced to cover more cases etc.

Summary: I'm working on a new format_version=6 to support context checksum (#9058) and this includes much of the refactoring and test updates to support that change. Test coverage data and manual inspection agree on dead code in block_based_table_reader.cc (removed). Pull Request resolved: #9240 Test Plan: tests enhanced to cover more cases etc. Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible. TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true} (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness) Before w/ wal: 454512 After w/ wal: 444820 (-2.1%) Before w/o wal: 1004560 After w/o wal: 998897 (-0.6%) Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case. This regression will be corrected in a follow-up PR. Reviewed By: ajkr Differential Revision: D32813769 Pulled By: pdillinger fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f

Summary: Again, ahead of planned changes in facebook#9058. This change improves performance (vs. pre-facebook#9240 baseline) by separating a FooterBuilder from Footer, where FooterBuilder includes (inline owns) the serialized data so that it can be stack allocated. Test Plan: existing tests + performance testing below Extreme case performance testing as in facebook#9240 with TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness) Pre-facebook#9240 baseline (f577458): 436389 With facebook#9240 (653c392): 417946 (-4.2% vs. baseline) This change: 443762 (+1.7% vs. baseline)

Summary: Again, ahead of planned changes in #9058. This change improves performance (vs. pre-#9240 baseline) by separating a FooterBuilder from Footer, where FooterBuilder includes (inline owns) the serialized data so that it can be stack allocated. Pull Request resolved: #9280 Test Plan: existing tests + performance testing below Extreme case performance testing as in #9240 with TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness) Pre-#9240 baseline (f577458): 436389 With #9240 (653c392): 417946 (-4.2% vs. baseline) This change: 443762 (+1.7% vs. baseline) Reviewed By: ajkr Differential Revision: D33077220 Pulled By: pdillinger fbshipit-source-id: 7eaa6499589aac1693414a758e8c799216c5016c

facebook-github-bot · 2021-12-14T19:00:03Z

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-12-14T23:10:35Z

@pdillinger has updated the pull request. You must reimport the pull request before landing.

pdillinger · 2021-12-14T23:17:31Z

Baseline, format_version=5: 66396
This PR, format_version=5: 65419 (-1.5%)
This PR, format_version=6: 64849 (-2.3%)

This is after Small optimization. After that change, it's hard to explain any difference between format_version 5 and 6 on new code, because this test is dominated by reading blocks on already open files and there's (now) no difference in control flow path on checking block checksum with or without context.

facebook-github-bot · 2021-12-16T15:27:40Z

@pdillinger has updated the pull request. You must reimport the pull request before landing.

mrambacher

LGTM.

mrambacher · 2021-12-14T17:52:19Z

include/rocksdb/table.h

@@ -418,6 +421,9 @@ struct BlockBasedTableOptions {
  // 5 -- Can be read by RocksDB's versions since 6.6.0. Full and partitioned
  // filters use a generally faster and more accurate Bloom filter
  // implementation, with a different schema.
+  // 6 -- Modified the file footer and checksum matching so that SST data


Should the comment contain since version info?

pdillinger · 2022-01-05T22:34:10Z

Still not totally satisfied with performance, and simulation data on new cache key suggests little value from "improved" unique id. Will reconsider this some time.

Summary: Pull Request resolved: facebook#11624 <queue> must be included to use std::queue. Reviewed By: pdillinger Differential Revision: D47562433 fbshipit-source-id: 7c5b19fd9e411694c782dfc0dff0231d4f92ef24

Summary: An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema. This PR prepares for the specialized `RowCache` interface by updating the migration plan of facebook#11450. I added a release note for the removed APIs and didn't mention the added ones for now. Pull Request resolved: facebook#11620 Reviewed By: pdillinger Differential Revision: D47536962 Pulled By: ajkr fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691

…acebook#11575) Summary: This adds proper support for using rocksdb with FetchContent, without this PR the user must include the following with their own `CMakeLists.txt` file: ```cmake include_directories(./build/_deps/rocksdb-src/include) ``` Pull Request resolved: facebook#11575 Reviewed By: jowlyzhang Differential Revision: D47163520 Pulled By: ajkr fbshipit-source-id: a202dcf435ecc9dd8d51c88f90e98c04814721ca

…not executed in db_compaction_test (facebook#11583) Summary: In [db_impl_open.cc](https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc), the sync point `SanitizeOptions::AfterChangeMaxOpenFiles` is used to set `max_open_files` with some specified "**invalid**" value even if it has been sanitized. However, in [db_compaction_test.cc](https://github.com/facebook/rocksdb/blob/main/db/db_compaction_test.cc), `SanitizeOptions::AfterChangeMaxOpenFiles` would not be executed since `SyncPoint::EnableProcessing()` is run after `DBTestBase::Reopen()`. To enable `SanitizeOptions::AfterChangeMaxOpenFiles`, `SyncPoint::EnableProcessing()` should be put ahead of `DBTestBase::Reopen()`. Pull Request resolved: facebook#11583 Test Plan: run unit tests locally as below: ``` make J=1 check [ RUN ] DBCompactionTest.LevelTtlCascadingCompactions [ OK ] DBCompactionTest.LevelTtlCascadingCompactions (85 ms) [ RUN ] DBCompactionTest.LevelPeriodicCompaction [ OK ] DBCompactionTest.LevelPeriodicCompaction (57 ms) ``` Reviewed By: jowlyzhang Differential Revision: D47311827 Pulled By: ajkr fbshipit-source-id: 99165e87a8129e404af06fdf9b4c96eca540fd23

Summary: Fixes facebook#11604 Pull Request resolved: facebook#11605 Reviewed By: jowlyzhang Differential Revision: D47459254 Pulled By: ajkr fbshipit-source-id: 4420e443fbf4bd01ddaa2b47285fc4445bf36246

Summary: Add path existence check in the script to avoid script running even when db_bench executable does not exist or relative path is not right. Pull Request resolved: facebook#11621 Reviewed By: jowlyzhang Differential Revision: D47552590 Pulled By: ajkr fbshipit-source-id: f09ea069f69e067212b249a22ad755b76bc6063a

Summary: improvement code by std::move and c++17 Pull Request resolved: facebook#11614 Reviewed By: ajkr Differential Revision: D47599519 Pulled By: jowlyzhang fbshipit-source-id: 6b897876f4e87e94a74c53d8db2a01303d500bff

Summary: Pull Request resolved: facebook#11595 Reviewed By: ajkr Differential Revision: D47600701 Pulled By: jowlyzhang fbshipit-source-id: 22375b51c726b176e4bc502b49cf3343f45f8a0a

Summary: Pull Request resolved: facebook#11617 Test Plan: make check Reviewed By: ajkr Differential Revision: D47599209 Pulled By: jowlyzhang fbshipit-source-id: 00e96266c75128875663083a2877d27fd7392eea

Summary: Add `rocksdb_transactiondb_get_base_db` and `rocksdb_transactiondb_close_base_db` functions to the C API modeled after `rocksdb_optimistictransactiondb_get_base_db` and `rocksdb_optimistictransactiondb_close_base_db`: https://github.com/facebook/rocksdb/blob/ca50ccc71a1ce89008b4737e74f321b8df8a3b5b/include/rocksdb/c.h#L2711-L2716 With this pair of functions, it is possible to get a `rocksdb_t *` from a `rocksdb_transactiondb_t *`. The main goal is to be able to use the approximate memory usage API, only accessible to the `rocksdb_t *` type: https://github.com/facebook/rocksdb/blob/ca50ccc71a1ce89008b4737e74f321b8df8a3b5b/include/rocksdb/c.h#L2821-L2833 Pull Request resolved: facebook#11562 Reviewed By: ajkr Differential Revision: D47603343 Pulled By: jowlyzhang fbshipit-source-id: c70cf6af5834026e232fe7791634db3a396f7d5e

Summary: Pull Request resolved: facebook#11596 Reviewed By: ajkr Differential Revision: D47635614 Pulled By: jowlyzhang fbshipit-source-id: 651a06049a54d15fd4b4f010bb4b82f53ff9c9d4

…er non directIO usecase (facebook#11631) Summary: **Context/Summary** As titled. The benefit of doing so is to explicitly call readahead() instead of relying page cache behavior for compaction read when we know that we most likely need readahead as compaction read is sequential read . **Test** Extended the existing UT to cover compaction read case Pull Request resolved: facebook#11631 Reviewed By: ajkr Differential Revision: D47681437 Pulled By: hx235 fbshipit-source-id: 78792f64985c4dc44aa8f2a9c41ab3e8bbc0bc90

…book#11267) Summary: Plaintable will miss properties. It should have some behavior like blockbasedtable. Here is a unit test for reproduce this bug. ``` #include <gflags/gflags.h> #include "rocksdb/db.h" #include "rocksdb/options.h" #include "rocksdb/table.h" #include "rocksdb/slice_transform.h" #include <iostream> #include <thread> #include <csignal> const std::string kKey = "key"; DEFINE_bool(use_plaintable, true, "use plain table"); DEFINE_string(db_path, "/dev/shm/test_zyx_path", "db_path"); rocksdb::DB* db = nullptr; class NoopTransform : public rocksdb::SliceTransform { public: explicit NoopTransform() { } virtual const char* Name() const override { return "rocksdb.Noop"; } virtual rocksdb::Slice Transform(const rocksdb::Slice& src) const override { return src; } virtual bool InDomain(const rocksdb::Slice& src) const override { return true; } virtual bool InRange(const rocksdb::Slice& dst) const override { return true; } virtual bool SameResultWhenAppended(const rocksdb::Slice& prefix) const override { return false; } }; class TestPropertiesCollector : public ::rocksdb::TablePropertiesCollector { public: explicit TestPropertiesCollector() { } private: ::rocksdb::Status AddUserKey(const ::rocksdb::Slice& key, const ::rocksdb::Slice& value, ::rocksdb::EntryType type, ::rocksdb::SequenceNumber seq, uint64_t file_size) override { count++; return ::rocksdb::Status::OK(); } ::rocksdb::Status Finish(::rocksdb::UserCollectedProperties* properties) override { properties->insert({kKey, std::to_string(count)}); return ::rocksdb::Status::OK(); } ::rocksdb::UserCollectedProperties GetReadableProperties() const override { ::rocksdb::UserCollectedProperties properties; properties.insert({kKey, std::to_string(count)}); return properties; } const char* Name() const override { return "TestPropertiesCollector"; } int count = 0; }; class TestTablePropertiesCollectorFactory : public ::rocksdb::TablePropertiesCollectorFactory { public: explicit TestTablePropertiesCollectorFactory() { } private: ::rocksdb::TablePropertiesCollector* CreateTablePropertiesCollector( ::rocksdb::TablePropertiesCollectorFactory::Context context) override { return new TestPropertiesCollector(); } const char* Name() const override { return "test.TablePropertiesCollectorFactory"; } }; class TestFlushListener : rocksdb::EventListener { public: const char* Name() const override { return "TestFlushListener"; } void OnFlushCompleted(rocksdb::DB* /*db*/, const rocksdb::FlushJobInfo& flush_job_info) override { if (flush_job_info.table_properties.user_collected_properties.find(kKey) == flush_job_info.table_properties.user_collected_properties.end()) { std::cerr << "OnFlushCompleted: properties not found" << std::endl; return; } std::cerr << "OnFlushCompleted: properties found " << flush_job_info.table_properties.user_collected_properties.at(kKey) << std::endl; } explicit TestFlushListener() { } }; int main(int argc, char* argv[]) { gflags::ParseCommandLineFlags(&argc, &argv, true); rocksdb::DBOptions rocksdb_options; std::shared_ptr<rocksdb::EventListener> flush_offset; rocksdb_options.create_if_missing = true; rocksdb_options.create_missing_column_families = true; std::shared_ptr<::rocksdb::TablePropertiesCollectorFactory> properties_collector( new TestTablePropertiesCollectorFactory()); rocksdb::ColumnFamilyOptions cfoptions; cfoptions.table_properties_collector_factories.emplace_back(properties_collector); std::shared_ptr<rocksdb::EventListener> test_cleaner; test_cleaner.reset((rocksdb::EventListener*)new TestFlushListener()); rocksdb_options.listeners.emplace_back(test_cleaner); std::vector<rocksdb::ColumnFamilyDescriptor> cf_desc_; cf_desc_.emplace_back(rocksdb::kDefaultColumnFamilyName, cfoptions); std::vector<rocksdb::ColumnFamilyHandle*> cfhs; cfoptions.prefix_extractor.reset(new NoopTransform()); if (FLAGS_use_plaintable) { cfoptions.table_factory.reset(rocksdb::NewPlainTableFactory()); std::cerr << "use plaintable" << std::endl; } else { cfoptions.table_factory.reset(rocksdb::NewBlockBasedTableFactory()); std::cerr << "use blockbasedtable" << std::endl; } auto s = rocksdb::DB::Open(rocksdb_options, FLAGS_db_path, cf_desc_, &cfhs, &db); if (s.ok()) { rocksdb::WriteOptions wops; wops.disableWAL = true; for (int i = 0; i < 1000000; i++) { auto status = db->Put(wops, std::to_string(i), std::string(1024, '3')); if (!status.ok()) { std::cerr << "write fail " << status.getState() << std::endl; } } } else { std::cerr << "open rocksdb failed" << s.getState() << std::endl; } std::this_thread::sleep_for(std::chrono::seconds(1000)); delete db; } ``` Pull Request resolved: facebook#11267 Reviewed By: jowlyzhang Differential Revision: D47689943 Pulled By: hx235 fbshipit-source-id: 585589cc48f8b26c7dd2323fc7ac4a0c3d4df6bb

Summary: ... ahead of dynamic variant. * Introduce an Unref function for a common pattern. Cases that were previously using std::memory_order_acq_rel we doing so because we were saving the pre-updated value in case it might be used. Now we are explicitly throwing away the pre-updated value so do not need the acquire semantic, just release. * Introduce a reusable EvictionData struct and TrackAndReleaseEvictedEntry() function. * Based on a linter suggesting, use const Func& parameter type instead of Func for templated callable parameters. Pull Request resolved: facebook#11630 Test Plan: existing tests, and performance test with release build of cache_bench. Getting 1-2% difference between before & after from run to run, but inconsistent about which one is faster. Reviewed By: jowlyzhang Differential Revision: D47657334 Pulled By: pdillinger fbshipit-source-id: 5cf2377c0d47a39143b04be6735f98c550e8bdc3

Summary: Remove obsolete comment. Support for WriteBatchWithIndex::NewIteratorWithBase when overwrite_key=false is added in facebook#8135, as you can clearly see in the HISTORY.md. Pull Request resolved: facebook#11636 Reviewed By: jowlyzhang Differential Revision: D47722955 Pulled By: ajkr fbshipit-source-id: 4fa44a309d9708e9f4a1530918a9aaf7114c9032

ajkr

LGTM, thanks!

ajkr · 2023-07-24T20:58:17Z

table/format.cc

+    // Start populating Part 2
+    char* cur = data_.data() + /* part 1 size */ 1;
+    // Set extended magic of part2
+    std::copy(kExtendedMagic.begin(), kExtendedMagic.end(), cur);


Good idea to exploit format_version <= 5's odd use of varints plus padding to make part 2 a fixed size. I never would've guessed that old encoding has this advantage

… universal compaction (facebook#11552) Summary: this is stacked on facebook#11550 to further clarify usage of these two options for universal compaction. Similar to FIFO, the two options have the same meaning for universal compaction, which can be confusing to use. For example, for universal compaction, dynamically changing the value of `ttl` has no impact on periodic compactions. Users should dynamically change `periodic_compaction_seconds` instead. From feature matrix (https://fburl.com/daiquery/5s647hwh), there are instances where users set `ttl` to non-zero value and `periodic_compaction_seconds` to 0. For backward compatibility reason, instead of deprecating `ttl`, comments are added to mention that `periodic_compaction_seconds` are preferred. In `SanitizeOptions()`, we update the value of `periodic_compaction_seconds` to take into account value of `ttl`. The logic is documented in relevant option comment. Pull Request resolved: facebook#11552 Test Plan: * updated existing unit test `DBTestUniversalCompaction2.PeriodicCompactionDefault` Reviewed By: ajkr Differential Revision: D47381434 Pulled By: cbi42 fbshipit-source-id: bc41f29f77318bae9a96be84dd89bf5617c7fd57

Summary: Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval. This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`. For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`. Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it. Tet plan: ``` ./column_family_test --gtest_filter="*RetainUDT*" ./memtable_list_test --gtest_filter="*WithTimestamp*" ./flush_job_test --gtest_filter="*WithTimestamp*" ``` Pull Request resolved: facebook#11599 Reviewed By: ajkr Differential Revision: D47561586 Pulled By: jowlyzhang fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89

…facebook#11623) Summary: Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature. To do this, this PR includes: 1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry. 2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These type of settings and settings change are considered valid: a) no user comparator change and no effective `persist_user_defined_timestamp` flag change. b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag is false. c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is already false. 3) when an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp. 4) while enabling / disabling UDT feature, timestamp size inconsistency in existing WAL logs are handled to make it compatible with the running user comparator. Pull Request resolved: facebook#11623 Test Plan: ``` make all check ./db_with_timestamp_basic_test --gtest-filter="*EnableDisableUDT*" ./db_wal_test --gtest_filter="*EnableDisableUDT*" ``` Reviewed By: ltamasi Differential Revision: D47636862 Pulled By: jowlyzhang fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322

facebook-github-bot · 2023-07-27T17:23:22Z

@pdillinger has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2023-07-27T17:32:41Z

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2023-07-27T17:56:12Z

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2023-07-30T23:43:59Z

@pdillinger merged this pull request in 7a1b020.

facebook-github-bot added the CLA Signed label Oct 19, 2021

mrambacher reviewed Oct 20, 2021

View reviewed changes

pdillinger mentioned this pull request Oct 23, 2021

Implement XXH3 block checksum type #9069

Closed

pdillinger changed the title ~~Implement XXH3 checksum and context-aware checksums~~ [To revise] Context-aware block checksums Oct 28, 2021

pdillinger mentioned this pull request Nov 11, 2021

New stable, fixed-length cache keys #9126

Closed

pdillinger mentioned this pull request Dec 2, 2021

More refactoring ahead of footer & meta changes #9240

Closed

pdillinger force-pushed the context_checksum branch from 472f142 to aed9ad8 Compare December 3, 2021 23:51

pdillinger changed the title ~~[To revise] Context-aware block checksums~~ [Dependencies pending] format_version=6 and context-aware block checksums Dec 4, 2021

pdillinger mentioned this pull request Dec 10, 2021

Optimize & clean up footer code #9280

Closed

pdillinger force-pushed the context_checksum branch from 2b4f811 to eb71ee4 Compare December 10, 2021 22:01

pdillinger force-pushed the context_checksum branch from eb71ee4 to 6adb736 Compare December 14, 2021 05:11

pdillinger changed the title ~~[Dependencies pending] format_version=6 and context-aware block checksums~~ format_version=6 and context-aware block checksums Dec 14, 2021

mrambacher approved these changes Dec 16, 2021

View reviewed changes

pdillinger marked this pull request as draft January 5, 2022 22:34

chadaustin and others added 15 commits July 18, 2023 15:38

add a missing include (facebook#11624)

ff0d618

Summary: Pull Request resolved: facebook#11624 <queue> must be included to use std::queue. Reviewed By: pdillinger Differential Revision: D47562433 fbshipit-source-id: 7c5b19fd9e411694c782dfc0dff0231d4f92ef24

fix: VersionSet::DumpManifest (facebook#11605)

98d0f6e

Summary: Fixes facebook#11604 Pull Request resolved: facebook#11605 Reviewed By: jowlyzhang Differential Revision: D47459254 Pulled By: ajkr fbshipit-source-id: 4420e443fbf4bd01ddaa2b47285fc4445bf36246

optimized code (facebook#11614)

2f71223

Summary: improvement code by std::move and c++17 Pull Request resolved: facebook#11614 Reviewed By: ajkr Differential Revision: D47599519 Pulled By: jowlyzhang fbshipit-source-id: 6b897876f4e87e94a74c53d8db2a01303d500bff

fix typo (facebook#11595)

64b0439

Summary: Pull Request resolved: facebook#11595 Reviewed By: ajkr Differential Revision: D47600701 Pulled By: jowlyzhang fbshipit-source-id: 22375b51c726b176e4bc502b49cf3343f45f8a0a

Fix typo in comment (facebook#11617)

8663488

Summary: Pull Request resolved: facebook#11617 Test Plan: make check Reviewed By: ajkr Differential Revision: D47599209 Pulled By: jowlyzhang fbshipit-source-id: 00e96266c75128875663083a2877d27fd7392eea

Remove some useless qualifier (facebook#11596)

df54346

Summary: Pull Request resolved: facebook#11596 Reviewed By: ajkr Differential Revision: D47635614 Pulled By: jowlyzhang fbshipit-source-id: 651a06049a54d15fd4b4f010bb4b82f53ff9c9d4

ajkr approved these changes Jul 24, 2023

View reviewed changes

cbi42 and others added 6 commits July 26, 2023 11:31

Merge branch 'context_checksum'

b2e963d

Fix VerifyChecksum

309d978

Match old I/O pattern of VerifyChecksum, leave some FIXMEs

d51cacc

facebook-github-bot closed this in 7a1b020 Jul 30, 2023

facebook-github-bot added the Merged label Jul 30, 2023

igorcanadi mentioned this pull request Jan 17, 2024

[SYS-6913] Upgrade RocksDB-Cloud to 8.9.1 rockset/rocksdb-cloud#315

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

format_version=6 and context-aware block checksums #9058

format_version=6 and context-aware block checksums #9058

pdillinger commented Oct 19, 2021 •

edited

Loading

pdillinger commented Oct 20, 2021

mrambacher left a comment

mrambacher Oct 19, 2021

mrambacher Oct 20, 2021

facebook-github-bot commented Dec 14, 2021

facebook-github-bot commented Dec 14, 2021

pdillinger commented Dec 14, 2021

facebook-github-bot commented Dec 16, 2021

mrambacher left a comment

mrambacher Dec 14, 2021

pdillinger commented Jan 5, 2022

ajkr left a comment

ajkr Jul 24, 2023

facebook-github-bot commented Jul 27, 2023

facebook-github-bot commented Jul 27, 2023

facebook-github-bot commented Jul 27, 2023

facebook-github-bot commented Jul 30, 2023

format_version=6 and context-aware block checksums #9058

format_version=6 and context-aware block checksums #9058

Conversation

pdillinger commented Oct 19, 2021 • edited Loading

Context checksum

Footer checksum

More detail: making space in footer

Performance

pdillinger commented Oct 20, 2021

mrambacher left a comment

Choose a reason for hiding this comment

mrambacher Oct 19, 2021

Choose a reason for hiding this comment

mrambacher Oct 20, 2021

Choose a reason for hiding this comment

facebook-github-bot commented Dec 14, 2021

facebook-github-bot commented Dec 14, 2021

pdillinger commented Dec 14, 2021

facebook-github-bot commented Dec 16, 2021

mrambacher left a comment

Choose a reason for hiding this comment

mrambacher Dec 14, 2021

Choose a reason for hiding this comment

pdillinger commented Jan 5, 2022

ajkr left a comment

Choose a reason for hiding this comment

ajkr Jul 24, 2023

Choose a reason for hiding this comment

facebook-github-bot commented Jul 27, 2023

facebook-github-bot commented Jul 27, 2023

facebook-github-bot commented Jul 27, 2023

facebook-github-bot commented Jul 30, 2023

pdillinger commented Oct 19, 2021 •

edited

Loading