format_version=6 and context-aware block checksums #9058
Conversation
For the user, an alternative could be to use format_version=6 to enable context-aware checksums, so the setting doesn't have to be a ChecksumType in options or in the footer. I can't think of any reason this feature shouldn't be used by everyone eventually.
- There should be an XXH3 test added to table_test/FooterTest, along with tests for the context cases.
- There should be a test where an invalid checksum type is written to the file, validating that the code does the right thing.
- Do you need backward compatibility tests to verify what happens in earlier versions if a file with XXH3 (or contexts) as the hash is used? This likely applies both to a normal start and to an ingest.
// "WithContext" checksums offer essentially "full" protection against | ||
// misplaced data because the checksum values depend on which file they are | ||
// in and the location within the file. There is essentially no performance | ||
// penalty for this enhanced checking. | ||
enum ChecksumType : char { |
Just wondering if this would be better as an unsigned char.
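For reference, a sketch of the ChecksumType values as actually released (per include/rocksdb/table.h): the proposed WithContext variants were dropped once context awareness moved to format_version=6, and the enum stayed `char`.

```cpp
// Sketch of the released enum; comments are mine.
enum ChecksumType : char {
  kNoChecksum = 0x0,
  kCRC32c = 0x1,
  kxxHash = 0x2,
  kxxHash64 = 0x3,
  kXXH3 = 0x4,  // added by the split-off PR #9069
};
```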
table/format.cc (outdated)
// <padding> to make the total size 2 * BlockHandle::kMaxEncodedLength + 1 | ||
// <padding> up to part2 size = 2 * BlockHandle::kMaxEncodedLength - 4 | ||
// base_context_checksum (4 bytes, 0s if unused) | ||
// ---- part3 | ||
// footer version (4 bytes) | ||
// table_magic_number (8 bytes) | ||
void Footer::EncodeTo(std::string* dst) const { |
Should the Footer::ToString method also contain the context? And can the "ToString" method change to print out the string version of the enum (rather than its number)?
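A minimal sketch of what the tail of an EncodeTo-style serializer does under the commented layout, assuming RocksDB's PutFixed32/PutFixed64 helpers from util/coding.h and code living inside the rocksdb namespace; field sources are simplified relative to the real Footer::EncodeTo.

```cpp
#include <cstdint>
#include <string>
#include "util/coding.h"  // PutFixed32, PutFixed64 (RocksDB internal helpers)

// Sketch: dst already holds part 1 and the part-2 handles; part2_end is the
// fixed absolute offset where part 2 must end, per the layout comment above.
void AppendFooterTail(std::string* dst, size_t part2_end,
                      uint32_t base_context_checksum,
                      uint32_t footer_version, uint64_t magic) {
  dst->resize(part2_end - 4, '\0');        // pad handles region to fixed size
  PutFixed32(dst, base_context_checksum);  // 4 bytes, 0s if unused
  // ---- part3
  PutFixed32(dst, footer_version);         // footer version (4 bytes)
  PutFixed64(dst, magic);                  // table_magic_number (8 bytes)
}
```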
Summary: XXH3 - latest hash function that is extremely fast on large data, easily faster than crc32c on most any x86_64 hardware. In integrating this hash function, I have handled the compression type byte in a non-standard way to avoid using the streaming API (extra data movement and active code size because of hash function complexity). This approach got a thumbs-up from Yann Collet.

Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument

This change split off from #9058 because context-aware checksum is likely to be handled through different configuration than ChecksumType.

Pull Request resolved: #9069

Test Plan: tests updated, and substantially expanded. Unit tests now check that we don't accidentally change the values generated by the checksum algorithms ("schema test") and that we properly handle invalid/unrecognized checksum types in options or in file footer. DBTestBase::ChangeOptions (etc.) updated from two to one configuration changing from default CRC32c ChecksumType. The point of this test code is to detect possible interactions among features, and the likelihood of some bad interaction being detected by including configurations other than XXH3 and CRC32c--and then not detected by stress/crash test--is extremely low.

Stress/crash test also updated (manual run long enough to see it accepts new checksum type). db_bench also updated for microbenchmarking checksums.

### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)

As you can see, especially once warmed up, xxh3 is fastest.

### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

Test:

for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done

Results (ops/sec):

for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3

Reviewed By: mrambacher

Differential Revision: D31905249

Pulled By: pdillinger

fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
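The "non-standard" handling of the compression type byte isn't spelled out above; a hedged sketch of the general idea (the exact mixing function in RocksDB's format.cc may differ) is to hash the block payload in one shot and then fold the detached trailing byte into the result, rather than feeding payload plus byte through XXH3's streaming state:

```cpp
#include <cstddef>
#include <cstdint>
#include "xxhash.h"  // XXH3_64bits(): one-shot hash

// Sketch only: checksum a block whose last byte (the compression type) is
// logically part of the checksummed range but handled out-of-band, so the
// fast one-shot XXH3 API can be used. The mixing step below is a stand-in,
// not RocksDB's exact formula.
inline uint32_t XXH3ChecksumWithLastByte(const char* data, size_t n,
                                         char compression_type_byte) {
  uint64_t h = XXH3_64bits(data, n);  // no streaming state, no extra copy
  // Fold the extra byte in with a multiply-based mix (illustrative):
  h += static_cast<uint8_t>(compression_type_byte) *
       uint64_t{0x9E3779B97F4A7C15};
  return static_cast<uint32_t>(h ^ (h >> 32));  // reduce to 32 bits
}
```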
Force-pushed from 472f142 to aed9ad8
Summary: I'm working on a new format_version=6 to support context checksum (#9058) and this includes much of the refactoring and test updates to support that change. Test coverage data and manual inspection agree on dead code in block_based_table_reader.cc (removed).

Pull Request resolved: #9240

Test Plan: tests enhanced to cover more cases etc.

Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.

TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}

(Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
Before w/ wal: 454512
After w/ wal: 444820 (-2.1%)
Before w/o wal: 1004560
After w/o wal: 998897 (-0.6%)

Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case. This regression will be corrected in a follow-up PR.

Reviewed By: ajkr

Differential Revision: D32813769

Pulled By: pdillinger

fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
Force-pushed from 2b4f811 to eb71ee4
Summary: Again, ahead of planned changes in #9058. This change improves performance (vs. pre-#9240 baseline) by separating a FooterBuilder from Footer, where FooterBuilder includes (inline owns) the serialized data so that it can be stack allocated.

Pull Request resolved: #9280

Test Plan: existing tests + performance testing below

Extreme case performance testing as in #9240 with

TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000

(Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
Pre-#9240 baseline (f577458): 436389
With #9240 (653c392): 417946 (-4.2% vs. baseline)
This change: 443762 (+1.7% vs. baseline)

Reviewed By: ajkr

Differential Revision: D33077220

Pulled By: pdillinger

fbshipit-source-id: 7eaa6499589aac1693414a758e8c799216c5016c
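A minimal sketch of the FooterBuilder idea described above (class name, method names, and the size bound are simplified assumptions, not the real class): the builder inline-owns a fixed-size byte array, so serializing a footer allocates nothing on the heap and the buffer can live on the stack at the write site.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Sketch: a footer builder that inline-owns its serialized bytes.
class FooterBuilderSketch {
 public:
  // Assumed upper bound on an encoded footer, for this sketch only.
  static constexpr size_t kMaxEncodedFooter = 64;

  // The real Build() would encode handles, checksum type, version, magic,
  // etc.; here we just copy pre-encoded bytes to show the ownership model.
  void Build(const char* encoded, size_t n) {
    size_ = n < kMaxEncodedFooter ? n : kMaxEncodedFooter;
    std::memcpy(buf_.data(), encoded, size_);
  }

  const char* data() const { return buf_.data(); }
  size_t size() const { return size_; }

 private:
  std::array<char, kMaxEncodedFooter> buf_{};  // inline, stack-friendly
  size_t size_ = 0;
};
```

The write path can then declare the builder as a local variable and hand (data(), size()) to the file writer, avoiding the heap allocation a self-contained Footer carrying a std::string would imply.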
Force-pushed from eb71ee4 to 6adb736
@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pdillinger has updated the pull request. You must reimport the pull request before landing.
This is after
@pdillinger has updated the pull request. You must reimport the pull request before landing.
LGTM.
@@ -418,6 +421,9 @@ struct BlockBasedTableOptions {
// 5 -- Can be read by RocksDB's versions since 6.6.0. Full and partitioned
// filters use a generally faster and more accurate Bloom filter
// implementation, with a different schema.
// 6 -- Modified the file footer and checksum matching so that SST data
Should the comment contain "since version" info?
Still not totally satisfied with performance, and simulation data on new cache key suggests little value from "improved" unique id. Will reconsider this some time.
Summary: Pull Request resolved: facebook#11624 <queue> must be included to use std::queue. Reviewed By: pdillinger Differential Revision: D47562433 fbshipit-source-id: 7c5b19fd9e411694c782dfc0dff0231d4f92ef24
Summary: An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema. This PR prepares for the specialized `RowCache` interface by updating the migration plan of facebook#11450. I added a release note for the removed APIs and didn't mention the added ones for now. Pull Request resolved: facebook#11620 Reviewed By: pdillinger Differential Revision: D47536962 Pulled By: ajkr fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691
…acebook#11575) Summary: This adds proper support for using rocksdb with FetchContent; without this PR the user must include the following in their own `CMakeLists.txt` file:
```cmake
include_directories(./build/_deps/rocksdb-src/include)
```
Pull Request resolved: facebook#11575
Reviewed By: jowlyzhang
Differential Revision: D47163520
Pulled By: ajkr
fbshipit-source-id: a202dcf435ecc9dd8d51c88f90e98c04814721ca
…not executed in db_compaction_test (facebook#11583) Summary: In [db_impl_open.cc](https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc), the sync point `SanitizeOptions::AfterChangeMaxOpenFiles` is used to set `max_open_files` with some specified "**invalid**" value even if it has been sanitized. However, in [db_compaction_test.cc](https://github.com/facebook/rocksdb/blob/main/db/db_compaction_test.cc), `SanitizeOptions::AfterChangeMaxOpenFiles` would not be executed since `SyncPoint::EnableProcessing()` is run after `DBTestBase::Reopen()`. To enable `SanitizeOptions::AfterChangeMaxOpenFiles`, `SyncPoint::EnableProcessing()` should be put ahead of `DBTestBase::Reopen()`.
Pull Request resolved: facebook#11583
Test Plan: run unit tests locally as below:
```
make J=1 check
[ RUN      ] DBCompactionTest.LevelTtlCascadingCompactions
[       OK ] DBCompactionTest.LevelTtlCascadingCompactions (85 ms)
[ RUN      ] DBCompactionTest.LevelPeriodicCompaction
[       OK ] DBCompactionTest.LevelPeriodicCompaction (57 ms)
```
Reviewed By: jowlyzhang
Differential Revision: D47311827
Pulled By: ajkr
fbshipit-source-id: 99165e87a8129e404af06fdf9b4c96eca540fd23
Summary: Fixes facebook#11604 Pull Request resolved: facebook#11605 Reviewed By: jowlyzhang Differential Revision: D47459254 Pulled By: ajkr fbshipit-source-id: 4420e443fbf4bd01ddaa2b47285fc4445bf36246
Summary: Add path existence check in the script to avoid script running even when db_bench executable does not exist or relative path is not right. Pull Request resolved: facebook#11621 Reviewed By: jowlyzhang Differential Revision: D47552590 Pulled By: ajkr fbshipit-source-id: f09ea069f69e067212b249a22ad755b76bc6063a
Summary: improvement code by std::move and c++17 Pull Request resolved: facebook#11614 Reviewed By: ajkr Differential Revision: D47599519 Pulled By: jowlyzhang fbshipit-source-id: 6b897876f4e87e94a74c53d8db2a01303d500bff
Summary: Pull Request resolved: facebook#11595 Reviewed By: ajkr Differential Revision: D47600701 Pulled By: jowlyzhang fbshipit-source-id: 22375b51c726b176e4bc502b49cf3343f45f8a0a
Summary: Pull Request resolved: facebook#11617 Test Plan: make check Reviewed By: ajkr Differential Revision: D47599209 Pulled By: jowlyzhang fbshipit-source-id: 00e96266c75128875663083a2877d27fd7392eea
Summary: Add `rocksdb_transactiondb_get_base_db` and `rocksdb_transactiondb_close_base_db` functions to the C API modeled after `rocksdb_optimistictransactiondb_get_base_db` and `rocksdb_optimistictransactiondb_close_base_db`: https://github.com/facebook/rocksdb/blob/ca50ccc71a1ce89008b4737e74f321b8df8a3b5b/include/rocksdb/c.h#L2711-L2716 With this pair of functions, it is possible to get a `rocksdb_t *` from a `rocksdb_transactiondb_t *`. The main goal is to be able to use the approximate memory usage API, only accessible to the `rocksdb_t *` type: https://github.com/facebook/rocksdb/blob/ca50ccc71a1ce89008b4737e74f321b8df8a3b5b/include/rocksdb/c.h#L2821-L2833 Pull Request resolved: facebook#11562 Reviewed By: ajkr Differential Revision: D47603343 Pulled By: jowlyzhang fbshipit-source-id: c70cf6af5834026e232fe7791634db3a396f7d5e
Summary: Pull Request resolved: facebook#11596 Reviewed By: ajkr Differential Revision: D47635614 Pulled By: jowlyzhang fbshipit-source-id: 651a06049a54d15fd4b4f010bb4b82f53ff9c9d4
…er non directIO usecase (facebook#11631) Summary: **Context/Summary** As titled. The benefit of doing so is to explicitly call readahead() instead of relying on page cache behavior for compaction read, when we know that we most likely need readahead, as compaction read is sequential read. **Test** Extended the existing UT to cover the compaction read case. Pull Request resolved: facebook#11631 Reviewed By: ajkr Differential Revision: D47681437 Pulled By: hx235 fbshipit-source-id: 78792f64985c4dc44aa8f2a9c41ab3e8bbc0bc90
…book#11267) Summary: Plaintable will miss properties. It should have some behavior like blockbasedtable. Here is a unit test to reproduce this bug.
```cpp
#include <gflags/gflags.h>

#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"
#include "rocksdb/slice_transform.h"

#include <iostream>
#include <thread>
#include <csignal>

const std::string kKey = "key";
DEFINE_bool(use_plaintable, true, "use plain table");
DEFINE_string(db_path, "/dev/shm/test_zyx_path", "db_path");

rocksdb::DB* db = nullptr;

class NoopTransform : public rocksdb::SliceTransform {
 public:
  explicit NoopTransform() {}

  virtual const char* Name() const override { return "rocksdb.Noop"; }

  virtual rocksdb::Slice Transform(const rocksdb::Slice& src) const override {
    return src;
  }

  virtual bool InDomain(const rocksdb::Slice& src) const override {
    return true;
  }

  virtual bool InRange(const rocksdb::Slice& dst) const override {
    return true;
  }

  virtual bool SameResultWhenAppended(const rocksdb::Slice& prefix) const override {
    return false;
  }
};

class TestPropertiesCollector : public ::rocksdb::TablePropertiesCollector {
 public:
  explicit TestPropertiesCollector() {}

 private:
  ::rocksdb::Status AddUserKey(const ::rocksdb::Slice& key,
                               const ::rocksdb::Slice& value,
                               ::rocksdb::EntryType type,
                               ::rocksdb::SequenceNumber seq,
                               uint64_t file_size) override {
    count++;
    return ::rocksdb::Status::OK();
  }

  ::rocksdb::Status Finish(::rocksdb::UserCollectedProperties* properties) override {
    properties->insert({kKey, std::to_string(count)});
    return ::rocksdb::Status::OK();
  }

  ::rocksdb::UserCollectedProperties GetReadableProperties() const override {
    ::rocksdb::UserCollectedProperties properties;
    properties.insert({kKey, std::to_string(count)});
    return properties;
  }

  const char* Name() const override { return "TestPropertiesCollector"; }

  int count = 0;
};

class TestTablePropertiesCollectorFactory : public ::rocksdb::TablePropertiesCollectorFactory {
 public:
  explicit TestTablePropertiesCollectorFactory() {}

 private:
  ::rocksdb::TablePropertiesCollector* CreateTablePropertiesCollector(
      ::rocksdb::TablePropertiesCollectorFactory::Context context) override {
    return new TestPropertiesCollector();
  }

  const char* Name() const override { return "test.TablePropertiesCollectorFactory"; }
};

class TestFlushListener : rocksdb::EventListener {
 public:
  const char* Name() const override { return "TestFlushListener"; }

  void OnFlushCompleted(rocksdb::DB* /*db*/,
                        const rocksdb::FlushJobInfo& flush_job_info) override {
    if (flush_job_info.table_properties.user_collected_properties.find(kKey) ==
        flush_job_info.table_properties.user_collected_properties.end()) {
      std::cerr << "OnFlushCompleted: properties not found" << std::endl;
      return;
    }
    std::cerr << "OnFlushCompleted: properties found "
              << flush_job_info.table_properties.user_collected_properties.at(kKey)
              << std::endl;
  }

  explicit TestFlushListener() {}
};

int main(int argc, char* argv[]) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);
  rocksdb::DBOptions rocksdb_options;
  std::shared_ptr<rocksdb::EventListener> flush_offset;
  rocksdb_options.create_if_missing = true;
  rocksdb_options.create_missing_column_families = true;
  std::shared_ptr<::rocksdb::TablePropertiesCollectorFactory> properties_collector(
      new TestTablePropertiesCollectorFactory());
  rocksdb::ColumnFamilyOptions cfoptions;
  cfoptions.table_properties_collector_factories.emplace_back(properties_collector);
  std::shared_ptr<rocksdb::EventListener> test_cleaner;
  test_cleaner.reset((rocksdb::EventListener*)new TestFlushListener());
  rocksdb_options.listeners.emplace_back(test_cleaner);
  std::vector<rocksdb::ColumnFamilyDescriptor> cf_desc_;
  cf_desc_.emplace_back(rocksdb::kDefaultColumnFamilyName, cfoptions);
  std::vector<rocksdb::ColumnFamilyHandle*> cfhs;
  cfoptions.prefix_extractor.reset(new NoopTransform());
  if (FLAGS_use_plaintable) {
    cfoptions.table_factory.reset(rocksdb::NewPlainTableFactory());
    std::cerr << "use plaintable" << std::endl;
  } else {
    cfoptions.table_factory.reset(rocksdb::NewBlockBasedTableFactory());
    std::cerr << "use blockbasedtable" << std::endl;
  }
  auto s = rocksdb::DB::Open(rocksdb_options, FLAGS_db_path, cf_desc_, &cfhs, &db);
  if (s.ok()) {
    rocksdb::WriteOptions wops;
    wops.disableWAL = true;
    for (int i = 0; i < 1000000; i++) {
      auto status = db->Put(wops, std::to_string(i), std::string(1024, '3'));
      if (!status.ok()) {
        std::cerr << "write fail " << status.getState() << std::endl;
      }
    }
  } else {
    std::cerr << "open rocksdb failed" << s.getState() << std::endl;
  }
  std::this_thread::sleep_for(std::chrono::seconds(1000));
  delete db;
}
```
Pull Request resolved: facebook#11267 Reviewed By: jowlyzhang Differential Revision: D47689943 Pulled By: hx235 fbshipit-source-id: 585589cc48f8b26c7dd2323fc7ac4a0c3d4df6bb
Summary: ... ahead of dynamic variant.
* Introduce an Unref function for a common pattern. Cases that were previously using std::memory_order_acq_rel were doing so because we were saving the pre-updated value in case it might be used. Now we are explicitly throwing away the pre-updated value so do not need the acquire semantic, just release.
* Introduce a reusable EvictionData struct and TrackAndReleaseEvictedEntry() function.
* Based on a linter suggestion, use const Func& parameter type instead of Func for templated callable parameters.
Pull Request resolved: facebook#11630
Test Plan: existing tests, and performance test with release build of cache_bench. Getting 1-2% difference between before & after from run to run, but inconsistent about which one is faster.
Reviewed By: jowlyzhang
Differential Revision: D47657334
Pulled By: pdillinger
fbshipit-source-id: 5cf2377c0d47a39143b04be6735f98c550e8bdc3
Summary: Remove obsolete comment. Support for WriteBatchWithIndex::NewIteratorWithBase when overwrite_key=false was added in facebook#8135, as you can clearly see in the HISTORY.md. Pull Request resolved: facebook#11636 Reviewed By: jowlyzhang Differential Revision: D47722955 Pulled By: ajkr fbshipit-source-id: 4fa44a309d9708e9f4a1530918a9aaf7114c9032
LGTM, thanks!
// Start populating Part 2
char* cur = data_.data() + /* part 1 size */ 1;
// Set extended magic of part2
std::copy(kExtendedMagic.begin(), kExtendedMagic.end(), cur);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea to exploit format_version <= 5's odd use of varints plus padding to make part 2 a fixed size. I never would've guessed that old encoding has this advantage.
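The arithmetic behind that observation, as I understand it (a sketch; constant names follow RocksDB's coding conventions):

```cpp
#include <cstddef>

// A BlockHandle is two varint64s (offset, size), each at most 10 bytes,
// and format_version <= 5 footers pad the two handles out to their max:
constexpr size_t kMaxVarint64Length = 10;
constexpr size_t kBlockHandleMaxEncodedLength = 2 * kMaxVarint64Length;  // 20
constexpr size_t kHandlesRegion = 2 * kBlockHandleMaxEncodedLength;      // 40
// So the padded region is a fixed 40 bytes regardless of handle values,
// which format_version=6 can carve up deterministically, e.g. 36 bytes of
// handles + padding plus a 4-byte base_context_checksum, per the layout
// comment earlier in this diff.
static_assert(kHandlesRegion == (2 * kBlockHandleMaxEncodedLength - 4) + 4,
              "fixed part 2 splits into handles+padding and a 4-byte field");
```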
… universal compaction (facebook#11552) Summary: this is stacked on facebook#11550 to further clarify usage of these two options for universal compaction. Similar to FIFO, the two options have the same meaning for universal compaction, which can be confusing to use. For example, for universal compaction, dynamically changing the value of `ttl` has no impact on periodic compactions. Users should dynamically change `periodic_compaction_seconds` instead. From feature matrix (https://fburl.com/daiquery/5s647hwh), there are instances where users set `ttl` to non-zero value and `periodic_compaction_seconds` to 0. For backward compatibility reason, instead of deprecating `ttl`, comments are added to mention that `periodic_compaction_seconds` are preferred. In `SanitizeOptions()`, we update the value of `periodic_compaction_seconds` to take into account value of `ttl`. The logic is documented in relevant option comment. Pull Request resolved: facebook#11552 Test Plan: * updated existing unit test `DBTestUniversalCompaction2.PeriodicCompactionDefault` Reviewed By: ajkr Differential Revision: D47381434 Pulled By: cbi42 fbshipit-source-id: bc41f29f77318bae9a96be84dd89bf5617c7fd57
Summary: Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval. This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparators that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`.

For a non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing (`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest gets rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`.

Flush is continued nevertheless if there is risk of entering write stall mode had the flush been postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables needs to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it.

Test plan:
```
./column_family_test --gtest_filter="*RetainUDT*"
./memtable_list_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
```
Pull Request resolved: facebook#11599
Reviewed By: ajkr
Differential Revision: D47561586
Pulled By: jowlyzhang
fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89
…facebook#11623) Summary: Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature. To do this, this PR includes:
1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry.
2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These types of settings and settings changes are considered valid: a) no user comparator change and no effective `persist_user_defined_timestamp` flag change; b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag is false; c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is already false.
3) When an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp.
4) While enabling / disabling the UDT feature, timestamp size inconsistencies in existing WAL logs are handled to make them compatible with the running user comparator.
Pull Request resolved: facebook#11623
Test Plan:
```
make all check
./db_with_timestamp_basic_test --gtest-filter="*EnableDisableUDT*"
./db_wal_test --gtest_filter="*EnableDisableUDT*"
```
Reviewed By: ltamasi
Differential Revision: D47636862
Pulled By: jowlyzhang
fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322
@pdillinger has updated the pull request. You must reimport the pull request before landing.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pdillinger merged this pull request in 7a1b020.
Summary:
Context checksum
All RocksDB checksums currently use 32 bits of checking
power, which should be a 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:
* A valid block is replaced by data from elsewhere in the same or another
SST file. Especially with block_align=true, the probability of an exact
block size match is probably around 1 in 100, making the FN probability
around the same. Without block_align=true, the probability of the same
block start location is probably around 1 in 10,000, for an FN probability
around 1 in a million.
To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example:
* blocks at different offsets within the same file get a different context
* blocks in different files get a different context
for the same offsets, until wrap-around after 2^32 - 1 files

Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.
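A hedged sketch of that scheme (illustrative only; RocksDB's actual modifier function in table/format.h may differ in detail): derive a 32-bit modifier from a per-file base value and the block's offset, add it to the checksum on write, and subtract it before comparing on read.

```cpp
#include <cstdint>

// Sketch: combine a per-file base value (e.g. seeded from the SST file
// number) with the block's offset. Distinct offsets within ~4GB give
// distinct modifiers in one file; distinct base values separate files.
inline uint32_t ContextModifier(uint32_t base_context_checksum,
                                uint64_t block_offset) {
  return base_context_checksum + static_cast<uint32_t>(block_offset);
}

// On write: store checksum shifted by the context (arithmetic mod 2^32).
inline uint32_t StoreChecksum(uint32_t computed, uint32_t base,
                              uint64_t offset) {
  return computed + ContextModifier(base, offset);
}

// On read: undo the shift for this position before comparing, so the same
// bytes at another position or in another file no longer verify.
inline bool VerifyChecksum(uint32_t stored, uint32_t computed, uint32_t base,
                           uint64_t offset) {
  return stored - ContextModifier(base, offset) == computed;
}
```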
Footer checksum
This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) from
defeating the footer checksum, we change much of the footer data format,
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles by older footer
versions. We also change the encoding of handles to free up space for
other new data in the footer.
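To see why an extended magic number can be backward-benign: older footer versions decode that region as two BlockHandles, each a pair of varint64s (offset, size), so the magic bytes can be chosen to parse as handles with size 0, i.e. empty. A sketch with hypothetical magic bytes (not the actual kExtendedMagic):

```cpp
#include <cstdint>

// Minimal varint64 decoder of the kind used for BlockHandle fields.
static const char* GetVarint64(const char* p, uint64_t* v) {
  *v = 0;
  for (int shift = 0;; shift += 7) {
    uint8_t b = static_cast<uint8_t>(*p++);
    *v |= uint64_t{b & 0x7f} << shift;
    if ((b & 0x80) == 0) return p;
  }
}

int main() {
  // Hypothetical 4-byte extended magic: two (offset, size) varint pairs.
  const char magic[4] = {0x3e, 0x00, 0x7a, 0x00};
  uint64_t off, sz;
  const char* p = GetVarint64(magic, &off);  // offset = 0x3e
  p = GetVarint64(p, &sz);                   // size = 0 -> "empty" handle
  p = GetVarint64(p, &off);                  // offset = 0x7a
  GetVarint64(p, &sz);                       // size = 0 -> "empty" handle
  return 0;
}
```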
More detail: making space in footer
In order to keep the footer the same size in format_version=6 (avoiding a change to IO
patterns), we have to free up some space for new data. We do this in two ways:
* The metaindex block handle is encoded in fewer bytes by assuming
it immediately precedes the footer, and by assuming it is < 4GB (sketched after this list).
* The index block handle is moved into the metaindex block. (It's unclear why it was
in footer to begin with.)
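A sketch of the first trick under those assumptions (struct and function names are mine, illustrative only): because the metaindex block is taken to end exactly where the footer begins and to be smaller than 4GB, only a fixed 32-bit size needs storing, and the offset is recomputed at read time.

```cpp
#include <cstdint>

// Illustrative: a metaindex "handle" reduced to a fixed 32-bit size.
struct ImpliedMetaindexHandle {
  uint32_t size;  // < 4GB by assumption; 4 bytes instead of up to 20
};

// The offset is implied: the metaindex immediately precedes the footer.
inline uint64_t MetaindexOffset(uint64_t footer_offset,
                                ImpliedMetaindexHandle h) {
  return footer_offset - h.size;
}
```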
Performance
In case of a small performance penalty, I've made a "pay as you go" optimization
to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: `prefix_extractor`.
This makes the PR an overall performance improvement (results below).
Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).
(Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness.)
Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)
Test Plan: unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.
Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably usually true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.
The format compatibility test does not need updating because we have not updated the default
format_version.