Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

format_version=6 and context-aware block checksums #9058

Closed
wants to merge 56 commits into from

Conversation

pdillinger
Copy link
Contributor

@pdillinger pdillinger commented Oct 19, 2021

Summary:

Context checksum

All RocksDB checksums currently use 32 bits of checking
power, which should be 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:

  • Data larger than one SST block is replaced by data from elsewhere in
    the same or another SST file. Especially with block_align=true, the
    probability of exact block size match is probably around 1 in 100, making
    the FN probability around that same. Without block_align=true the
    probability of same block start location is probably around 1 in 10,000,
    for FN probability around 1 in a million.

To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example

  • blocks within about 4GB of each other are guaranteed to use different context
  • blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
    different context
  • files generated by the same process are guaranteed to use different context
    for the same offsets, until wrap-around after 2^32 - 1 files

Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.

Footer checksum

This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to
defeat the footer checksum, we change much of the footer data format
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles in older footer
versions. We also change the encoding of handles to free up space for
other new data in footer.

More detail: making space in footer

In order to keep footer the same size in format_version=6 (avoid change to IO
patterns), we have to free up some space for new data. We do this two ways:

  • Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
    it immediately precedes the footer, and by assuming it is < 4GB.
  • Index block handle is moved into metaindex. (I don't know why it was
    in footer to begin with.)

Performance

In case of small performance penalty, I've made a "pay as you go" optimization
to compensate: replace MutableCFOptions in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: prefix_extractor.
This makes the PR an overall performance improvement (results below).

Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).

(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out

Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness

Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)

Test Plan: unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.

Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably usually true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.

The format compatibility test does not need updating because we have not updated the default
format_version.

@pdillinger
Copy link
Contributor Author

It was easier to implement "with context" as a bitwise-or flag on existing
checksum algorithms than to duplicate/imbue only some algorithms with
"WithContext" variants.

For the user, checksum_with_context might make more sense as its own boolean option, but we still have to put that in the footer somehow, and the ChecksumType byte seems like the natural place. But it also seems weird to have a "hidden" internal-only flag on ChecksumType restricting its value range.

An alternative could be to use format_version=6 to enable context-aware checksum, so it doesn't have to be in ChecksumType in options or in the footer. I can't think of any reason this feature shouldn't be used by everyone eventually.

Copy link
Contributor

@mrambacher mrambacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • There should be a XXH3 test added to table_test/FooterTest. Also tests for the context cases.
  • There should be a test where an invalid checksum type is written to the file and validate that the code does the right thing
  • Do you need to do backward compatibility tests to verify that what happens in earlier versions if a file with XXH3 (or contexts) as the hash is used? This is likely true both from a normal start or an ingest.

// "WithContext" checksums offer essentially "full" protection against
// misplaced data because the checksum values depend on which file they are
// in and the location within the file. There is essentially no performance
// penalty for this enhanced checking.
enum ChecksumType : char {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if this would be better as an unsigned char.

table/format.cc Outdated Show resolved Hide resolved
table/format.h Outdated Show resolved Hide resolved
tools/db_bench_tool.cc Outdated Show resolved Hide resolved
include/rocksdb/table.h Outdated Show resolved Hide resolved
db/db_basic_test.cc Outdated Show resolved Hide resolved
db/external_sst_file_test.cc Outdated Show resolved Hide resolved
table/block_based/reader_common.h Outdated Show resolved Hide resolved
table/format.cc Outdated
// <padding> to make the total size 2 * BlockHandle::kMaxEncodedLength + 1
// <padding> up to part2 size = 2 * BlockHandle::kMaxEncodedLength - 4
// base_context_checksum (4 bytes, 0s if unused)
// ---- part3
// footer version (4 bytes)
// table_magic_number (8 bytes)
void Footer::EncodeTo(std::string* dst) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the Footer::ToString method also contain the context? And can the "ToString" method change to print out the string version of the enum (rather than its number)?

table/format.cc Outdated Show resolved Hide resolved
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Oct 23, 2021
Summary: XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.

This change split off from facebook#9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.

Test Plan: tests updated, including stress test. db_bench also updated
for microbenchmarking checksums.

 ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)

As you can see, especially once warmed up, xxh3 is fastest.

 ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

Test

    for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done

Results (ops/sec)

    for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done

results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3
@pdillinger pdillinger changed the title Implement XXH3 checksum and context-aware checksums [To revise] Context-aware block checksums Oct 28, 2021
facebook-github-bot pushed a commit that referenced this pull request Oct 29, 2021
Summary:
XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.

Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument

This change split off from #9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.

Pull Request resolved: #9069

Test Plan:
tests updated, and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in file footer.

DBTestBase::ChangeOptions (etc.) updated from two to one configuration
changing from default CRC32c ChecksumType. The point of this test code
is to detect possible interactions among features, and the likelihood of
some bad interaction being detected by including configurations other
than XXH3 and CRC32c--and then not detected by stress/crash test--is
extremely low.

Stress/crash test also updated (manual run long enough to see it accepts
new checksum type). db_bench also updated for microbenchmarking
checksums.

 ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)

As you can see, especially once warmed up, xxh3 is fastest.

 ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)

Test

    for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done

Results (ops/sec)

    for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done

results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3

Reviewed By: mrambacher

Differential Revision: D31905249

Pulled By: pdillinger

fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Dec 2, 2021
Summary: I'm working on a new format_version=6 to support context
checksum (facebook#9058) and this includes much of the refactoring and test
updates to support that change.

Test coverage data and manual inspection agree on dead code in
block_based_table_reader.cc (removed).

Test Plan: tests enhanced to cover more cases etc.
@pdillinger pdillinger changed the title [To revise] Context-aware block checksums [Dependencies pending] format_version=6 and context-aware block checksums Dec 4, 2021
facebook-github-bot pushed a commit that referenced this pull request Dec 10, 2021
Summary:
I'm working on a new format_version=6 to support context
checksum (#9058) and this includes much of the refactoring and test
updates to support that change.

Test coverage data and manual inspection agree on dead code in
block_based_table_reader.cc (removed).

Pull Request resolved: #9240

Test Plan:
tests enhanced to cover more cases etc.

Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.

TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}

(Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
Before w/ wal: 454512
After w/ wal: 444820 (-2.1%)
Before w/o wal: 1004560
After w/o wal: 998897 (-0.6%)

Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.

This regression will be corrected in a follow-up PR.

Reviewed By: ajkr

Differential Revision: D32813769

Pulled By: pdillinger

fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Dec 10, 2021
Summary: Again, ahead of planned changes in facebook#9058. This change improves
performance (vs. pre-facebook#9240 baseline) by separating a FooterBuilder from
Footer, where FooterBuilder includes (inline owns) the serialized data
so that it can be stack allocated.

Test Plan: existing tests + performance testing below

Extreme case performance testing as in facebook#9240 with

    TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000

(Each is ops/s averaged over 50 runs, run simultaneously with competing
configuration for load fairness)
Pre-facebook#9240 baseline (f577458): 436389
With facebook#9240 (653c392): 417946 (-4.2% vs. baseline)
This change: 443762 (+1.7% vs. baseline)
facebook-github-bot pushed a commit that referenced this pull request Dec 14, 2021
Summary:
Again, ahead of planned changes in #9058. This change improves
performance (vs. pre-#9240 baseline) by separating a FooterBuilder from
Footer, where FooterBuilder includes (inline owns) the serialized data
so that it can be stack allocated.

Pull Request resolved: #9280

Test Plan:
existing tests + performance testing below

Extreme case performance testing as in #9240 with

    TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000

(Each is ops/s averaged over 50 runs, run simultaneously with competing
configuration for load fairness)
Pre-#9240 baseline (f577458): 436389
With #9240 (653c392): 417946 (-4.2% vs. baseline)
This change: 443762 (+1.7% vs. baseline)

Reviewed By: ajkr

Differential Revision: D33077220

Pulled By: pdillinger

fbshipit-source-id: 7eaa6499589aac1693414a758e8c799216c5016c
@pdillinger pdillinger changed the title [Dependencies pending] format_version=6 and context-aware block checksums format_version=6 and context-aware block checksums Dec 14, 2021
@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@pdillinger
Copy link
Contributor Author

Baseline, format_version=5: 66396
This PR, format_version=5: 65419 (-1.5%)
This PR, format_version=6: 64849 (-2.3%)

This is after Small optimization. After that change, it's hard to explain any difference between format_version 5 and 6 on new code, because this test is dominated by reading blocks on already open files and there's (now) no difference in control flow path on checking block checksum with or without context.

@facebook-github-bot
Copy link
Contributor

@pdillinger has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@mrambacher mrambacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@@ -418,6 +421,9 @@ struct BlockBasedTableOptions {
// 5 -- Can be read by RocksDB's versions since 6.6.0. Full and partitioned
// filters use a generally faster and more accurate Bloom filter
// implementation, with a different schema.
// 6 -- Modified the file footer and checksum matching so that SST data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the comment contain since version info?

@pdillinger
Copy link
Contributor Author

Still not totally satisfied with performance, and simulation data on new cache key suggests little value from "improved" unique id. Will reconsider this some time.

@pdillinger pdillinger marked this pull request as draft January 5, 2022 22:34
chadaustin and others added 15 commits July 18, 2023 15:38
Summary:
Pull Request resolved: facebook#11624

<queue> must be included to use std::queue.

Reviewed By: pdillinger

Differential Revision: D47562433

fbshipit-source-id: 7c5b19fd9e411694c782dfc0dff0231d4f92ef24
Summary:
An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema.

This PR prepares for the specialized `RowCache` interface by updating the migration plan of facebook#11450. I added a release note for the removed APIs and didn't mention the added ones for now.

Pull Request resolved: facebook#11620

Reviewed By: pdillinger

Differential Revision: D47536962

Pulled By: ajkr

fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691
…acebook#11575)

Summary:
This adds proper support for using rocksdb with FetchContent, without this PR the user must include the following with their own `CMakeLists.txt` file:
```cmake
include_directories(./build/_deps/rocksdb-src/include)
```

Pull Request resolved: facebook#11575

Reviewed By: jowlyzhang

Differential Revision: D47163520

Pulled By: ajkr

fbshipit-source-id: a202dcf435ecc9dd8d51c88f90e98c04814721ca
…not executed in db_compaction_test (facebook#11583)

Summary:
In [db_impl_open.cc](https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc), the sync point `SanitizeOptions::AfterChangeMaxOpenFiles` is used to set `max_open_files` with some specified "**invalid**" value even if it has been sanitized.

However,  in [db_compaction_test.cc](https://github.com/facebook/rocksdb/blob/main/db/db_compaction_test.cc), `SanitizeOptions::AfterChangeMaxOpenFiles` would not be executed since `SyncPoint::EnableProcessing()` is run after `DBTestBase::Reopen()`.  To enable `SanitizeOptions::AfterChangeMaxOpenFiles`,  `SyncPoint::EnableProcessing()` should be put ahead of `DBTestBase::Reopen()`.

Pull Request resolved: facebook#11583

Test Plan:
run unit tests locally as below:
```
make J=1 check

[ RUN      ] DBCompactionTest.LevelTtlCascadingCompactions
[       OK ] DBCompactionTest.LevelTtlCascadingCompactions (85 ms)
[ RUN      ] DBCompactionTest.LevelPeriodicCompaction
[       OK ] DBCompactionTest.LevelPeriodicCompaction (57 ms)
```

Reviewed By: jowlyzhang

Differential Revision: D47311827

Pulled By: ajkr

fbshipit-source-id: 99165e87a8129e404af06fdf9b4c96eca540fd23
Summary:
Fixes facebook#11604

Pull Request resolved: facebook#11605

Reviewed By: jowlyzhang

Differential Revision: D47459254

Pulled By: ajkr

fbshipit-source-id: 4420e443fbf4bd01ddaa2b47285fc4445bf36246
Summary:
Add path existence check in the script to avoid script running even when db_bench executable does not exist or relative path is not right.

Pull Request resolved: facebook#11621

Reviewed By: jowlyzhang

Differential Revision: D47552590

Pulled By: ajkr

fbshipit-source-id: f09ea069f69e067212b249a22ad755b76bc6063a
Summary:
improvement code by std::move and c++17

Pull Request resolved: facebook#11614

Reviewed By: ajkr

Differential Revision: D47599519

Pulled By: jowlyzhang

fbshipit-source-id: 6b897876f4e87e94a74c53d8db2a01303d500bff
Summary: Pull Request resolved: facebook#11595

Reviewed By: ajkr

Differential Revision: D47600701

Pulled By: jowlyzhang

fbshipit-source-id: 22375b51c726b176e4bc502b49cf3343f45f8a0a
Summary: Pull Request resolved: facebook#11617

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D47599209

Pulled By: jowlyzhang

fbshipit-source-id: 00e96266c75128875663083a2877d27fd7392eea
Summary:
Add `rocksdb_transactiondb_get_base_db` and `rocksdb_transactiondb_close_base_db` functions to the C API modeled after `rocksdb_optimistictransactiondb_get_base_db` and `rocksdb_optimistictransactiondb_close_base_db`:
https://github.com/facebook/rocksdb/blob/ca50ccc71a1ce89008b4737e74f321b8df8a3b5b/include/rocksdb/c.h#L2711-L2716

With this pair of functions, it is possible to get a `rocksdb_t *` from a `rocksdb_transactiondb_t *`. The main goal is to be able to use the approximate memory usage API, only accessible to the `rocksdb_t *` type:
https://github.com/facebook/rocksdb/blob/ca50ccc71a1ce89008b4737e74f321b8df8a3b5b/include/rocksdb/c.h#L2821-L2833

Pull Request resolved: facebook#11562

Reviewed By: ajkr

Differential Revision: D47603343

Pulled By: jowlyzhang

fbshipit-source-id: c70cf6af5834026e232fe7791634db3a396f7d5e
Summary: Pull Request resolved: facebook#11596

Reviewed By: ajkr

Differential Revision: D47635614

Pulled By: jowlyzhang

fbshipit-source-id: 651a06049a54d15fd4b4f010bb4b82f53ff9c9d4
…er non directIO usecase (facebook#11631)

Summary:
**Context/Summary**
As titled. The benefit of doing so is to explicitly call readahead() instead of relying page cache behavior for compaction read when we know that we most likely need readahead as compaction read is sequential read .

**Test**
Extended the existing UT to cover compaction read case

Pull Request resolved: facebook#11631

Reviewed By: ajkr

Differential Revision: D47681437

Pulled By: hx235

fbshipit-source-id: 78792f64985c4dc44aa8f2a9c41ab3e8bbc0bc90
…book#11267)

Summary:
Plaintable will miss properties.
It should have some behavior like blockbasedtable.
Here is a unit test for reproduce this bug.

```
#include <gflags/gflags.h>
#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"
#include "rocksdb/slice_transform.h"
#include <iostream>
#include <thread>
#include <csignal>
const std::string kKey = "key";

DEFINE_bool(use_plaintable, true, "use plain table");
DEFINE_string(db_path, "/dev/shm/test_zyx_path", "db_path");

rocksdb::DB* db = nullptr;

class NoopTransform : public rocksdb::SliceTransform {
public:
    explicit NoopTransform() {
    }

    virtual const char* Name() const override {
        return "rocksdb.Noop";
    }

    virtual rocksdb::Slice Transform(const rocksdb::Slice& src) const override {
        return src;
    }

    virtual bool InDomain(const rocksdb::Slice& src) const override {
        return true;
    }

    virtual bool InRange(const rocksdb::Slice& dst) const override {
        return true;
    }

    virtual bool SameResultWhenAppended(const rocksdb::Slice& prefix) const override {
        return false;
    }
};

class TestPropertiesCollector : public ::rocksdb::TablePropertiesCollector {
public:
    explicit TestPropertiesCollector() {
    }

private:
    ::rocksdb::Status AddUserKey(const ::rocksdb::Slice& key, const ::rocksdb::Slice& value, ::rocksdb::EntryType type,
                                 ::rocksdb::SequenceNumber seq, uint64_t file_size) override {
        count++;
        return ::rocksdb::Status::OK();
    }

    ::rocksdb::Status Finish(::rocksdb::UserCollectedProperties* properties) override {
        properties->insert({kKey, std::to_string(count)});
        return ::rocksdb::Status::OK();
    }

    ::rocksdb::UserCollectedProperties GetReadableProperties() const override {
        ::rocksdb::UserCollectedProperties properties;
        properties.insert({kKey, std::to_string(count)});
        return properties;
    }

    const char* Name() const override {
        return "TestPropertiesCollector";
    }
    int count = 0;
};

class TestTablePropertiesCollectorFactory : public ::rocksdb::TablePropertiesCollectorFactory {
public:
    explicit TestTablePropertiesCollectorFactory() {
    }

private:
    ::rocksdb::TablePropertiesCollector* CreateTablePropertiesCollector(
            ::rocksdb::TablePropertiesCollectorFactory::Context context) override {
        return new TestPropertiesCollector();
    }

    const char* Name() const override {
        return "test.TablePropertiesCollectorFactory";
    }
};

class TestFlushListener : rocksdb::EventListener {
public:
    const char* Name() const override {
        return "TestFlushListener";
    }
    void OnFlushCompleted(rocksdb::DB* /*db*/, const rocksdb::FlushJobInfo& flush_job_info) override {
        if (flush_job_info.table_properties.user_collected_properties.find(kKey) ==
            flush_job_info.table_properties.user_collected_properties.end()) {
            std::cerr << "OnFlushCompleted: properties not found" << std::endl;
            return;
        }
        std::cerr << "OnFlushCompleted: properties found "
                  << flush_job_info.table_properties.user_collected_properties.at(kKey) << std::endl;
    }
    explicit TestFlushListener() {
    }
};

int main(int argc, char* argv[]) {
    gflags::ParseCommandLineFlags(&argc, &argv, true);
    rocksdb::DBOptions rocksdb_options;
    std::shared_ptr<rocksdb::EventListener> flush_offset;
    rocksdb_options.create_if_missing = true;
    rocksdb_options.create_missing_column_families = true;
    std::shared_ptr<::rocksdb::TablePropertiesCollectorFactory> properties_collector(
            new TestTablePropertiesCollectorFactory());
    rocksdb::ColumnFamilyOptions cfoptions;
    cfoptions.table_properties_collector_factories.emplace_back(properties_collector);
    std::shared_ptr<rocksdb::EventListener> test_cleaner;
    test_cleaner.reset((rocksdb::EventListener*)new TestFlushListener());
    rocksdb_options.listeners.emplace_back(test_cleaner);

    std::vector<rocksdb::ColumnFamilyDescriptor> cf_desc_;
    cf_desc_.emplace_back(rocksdb::kDefaultColumnFamilyName, cfoptions);
    std::vector<rocksdb::ColumnFamilyHandle*> cfhs;
    cfoptions.prefix_extractor.reset(new NoopTransform());
    if (FLAGS_use_plaintable) {
        cfoptions.table_factory.reset(rocksdb::NewPlainTableFactory());
        std::cerr << "use plaintable" << std::endl;
    } else {
        cfoptions.table_factory.reset(rocksdb::NewBlockBasedTableFactory());
        std::cerr << "use blockbasedtable" << std::endl;
    }

    auto s = rocksdb::DB::Open(rocksdb_options, FLAGS_db_path, cf_desc_, &cfhs, &db);
    if (s.ok()) {
        rocksdb::WriteOptions wops;
        wops.disableWAL = true;
        for (int i = 0; i < 1000000; i++) {
            auto status = db->Put(wops, std::to_string(i), std::string(1024, '3'));
            if (!status.ok()) {
                std::cerr << "write fail " << status.getState() << std::endl;
            }
        }
    } else {
        std::cerr << "open rocksdb failed" << s.getState() << std::endl;
    }
    std::this_thread::sleep_for(std::chrono::seconds(1000));
    delete db;
}
```

Pull Request resolved: facebook#11267

Reviewed By: jowlyzhang

Differential Revision: D47689943

Pulled By: hx235

fbshipit-source-id: 585589cc48f8b26c7dd2323fc7ac4a0c3d4df6bb
Summary:
... ahead of dynamic variant.

* Introduce an Unref function for a common pattern. Cases that were previously using std::memory_order_acq_rel we doing so because we were saving the pre-updated value in case it might be used. Now we are explicitly throwing away the pre-updated value so do not need the acquire semantic, just release.
* Introduce a reusable EvictionData struct and TrackAndReleaseEvictedEntry() function.
* Based on a linter suggesting, use const Func& parameter type instead of Func for templated callable parameters.

Pull Request resolved: facebook#11630

Test Plan: existing tests, and performance test with release build of cache_bench. Getting 1-2% difference between before & after from run to run, but inconsistent about which one is faster.

Reviewed By: jowlyzhang

Differential Revision: D47657334

Pulled By: pdillinger

fbshipit-source-id: 5cf2377c0d47a39143b04be6735f98c550e8bdc3
Summary:
Remove obsolete comment.

Support for WriteBatchWithIndex::NewIteratorWithBase when overwrite_key=false is added in facebook#8135, as you can clearly see in the HISTORY.md.

Pull Request resolved: facebook#11636

Reviewed By: jowlyzhang

Differential Revision: D47722955

Pulled By: ajkr

fbshipit-source-id: 4fa44a309d9708e9f4a1530918a9aaf7114c9032
Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

// Start populating Part 2
char* cur = data_.data() + /* part 1 size */ 1;
// Set extended magic of part2
std::copy(kExtendedMagic.begin(), kExtendedMagic.end(), cur);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea to exploit format_version <= 5's odd use of varints plus padding to make part 2 a fixed size. I never would've guessed that old encoding has this advantage

cbi42 and others added 6 commits July 26, 2023 11:31
… universal compaction (facebook#11552)

Summary:
this is stacked on facebook#11550 to further clarify usage of these two options for universal compaction. Similar to FIFO, the two options have the same meaning for universal compaction, which can be confusing to use. For example, for universal compaction, dynamically changing the value of `ttl` has no impact on periodic compactions. Users should dynamically change `periodic_compaction_seconds` instead. From feature matrix (https://fburl.com/daiquery/5s647hwh), there are instances where users set `ttl` to non-zero value and `periodic_compaction_seconds` to 0. For backward compatibility reason, instead of deprecating `ttl`, comments are added to mention that `periodic_compaction_seconds` are preferred. In `SanitizeOptions()`, we update the value of `periodic_compaction_seconds` to take into account value of `ttl`. The logic is documented in relevant option comment.

Pull Request resolved: facebook#11552

Test Plan: * updated existing unit test `DBTestUniversalCompaction2.PeriodicCompactionDefault`

Reviewed By: ajkr

Differential Revision: D47381434

Pulled By: cbi42

fbshipit-source-id: bc41f29f77318bae9a96be84dd89bf5617c7fd57
Summary:
Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval.

This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`.

For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`.

Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it.

Tet plan:
```
./column_family_test --gtest_filter="*RetainUDT*"
./memtable_list_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
```

Pull Request resolved: facebook#11599

Reviewed By: ajkr

Differential Revision: D47561586

Pulled By: jowlyzhang

fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89
…facebook#11623)

Summary:
Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature.

To do this, this PR includes:
1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry.

2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These type of settings and settings change are considered valid:
     a) no user comparator change and no effective `persist_user_defined_timestamp` flag change.
     b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag
         is false.
     c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is
         already false.
3) when an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp.

4) while enabling / disabling UDT feature, timestamp size inconsistency in existing WAL logs are handled to make it compatible with the running user comparator.

Pull Request resolved: facebook#11623

Test Plan:
```
make all check
./db_with_timestamp_basic_test --gtest-filter="*EnableDisableUDT*"
./db_wal_test --gtest_filter="*EnableDisableUDT*"
```

Reviewed By: ltamasi

Differential Revision: D47636862

Pulled By: jowlyzhang

fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322
@facebook-github-bot
Copy link
Contributor

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pdillinger merged this pull request in 7a1b020.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.