Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable IngestExternalFile() in crash test #9357

Closed
wants to merge 4 commits into from

Conversation

ajkr
Copy link
Contributor

@ajkr ajkr commented Jan 4, 2022

Thanks to #9919 and #10051 the known bugs in file ingestion (besides mmap read + file checksum) are fixed. Now we can try again to enable file ingestion in crash test.

Test Plan:

stress file ingestion heavily for an hour: $ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=1000000 --ingest_external_file_one_in=100 --duration=3600 --interval=20 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Jan 6, 2022
Summary:
The LastSequence field in the MANIFEST file is the baseline seqno for a recovered DB. Recovering WAL entries might cause the recovered DB's seqno to advance above this baseline, but the recovered DB will never use a smaller seqno.

Before this PR, we were writing the DB's seqno at the time of LogAndApply() as the LastSequence value. This works in the sense that it is a large enough baseline for the recovered DB that it'll never overwrite any records in existing SST files. At the same time, it's arbitrarily larger than what's needed. This behavior comes from LevelDB, where there was no tracking of largest seqno in an SST file.

Now we know the largest seqno of newly written SST files, so we can write an exact value in LastSequence that actually reflects the largest seqno in any file referred to by the MANIFEST. This is primarily useful for correctness testing with unsynced data loss, where the recovered DB's seqno needs to indicate what records were recovered.

Pull Request resolved: #9305

Test Plan:
- #9338 adds crash-recovery correctness testing coverage for WAL disabled use cases
- #9357 will extend that testing to cover file ingestion
- Added assertion at end of LogAndApply() for `VersionSet::descriptor_last_sequence_` consistency with files
- Manually tested upgrade/downgrade compatibility with a custom crash test that randomly picks between a `db_stress` built with and without this PR (for old code it must run with `-disable_wal=0`)

Reviewed By: riversand963

Differential Revision: D33182770

Pulled By: ajkr

fbshipit-source-id: 0bfafaf685f347cc8cb0e1d62e0186340a738f7d
@ajkr ajkr force-pushed the crashtest-enable-ingestexternalfile branch from 5d9fcf7 to 1bbd77d Compare May 25, 2022 17:53
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@ajkr ajkr marked this pull request as ready for review May 25, 2022 18:25
@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ajkr ajkr requested a review from cbi42 May 25, 2022 18:27
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor Author

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ajkr added a commit to ajkr/rocksdb that referenced this pull request Jun 7, 2022
After facebook#9357 we began seeing the following error attempting to acquire
locks for file ingestion:

```
FATAL: ThreadSanitizer CHECK failed: /home/engshare/third-party2/llvm-fb/12/src/llvm/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40)
```

The command was using default values for `ingest_external_file_width`
(1000) and `log2_keys_per_lock` (2). The expected number of locks needed
to update those keys is (1000 / 2^2) = 250, which is above the 0x40 (64)
limit. This PR reduces the default value of `ingest_external_file_width`
to 100 so the expected number of locks is 25, which is within the limit.
facebook-github-bot pushed a commit that referenced this pull request Jun 7, 2022
Summary:
After #9357 we began seeing the following error attempting to acquire
locks for file ingestion:

```
FATAL: ThreadSanitizer CHECK failed: /home/engshare/third-party2/llvm-fb/12/src/llvm/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40)
```

The command was using default values for `ingest_external_file_width`
(1000) and `log2_keys_per_lock` (2). The expected number of locks needed
to update those keys is then (1000 / 2^2) = 250, which is above the 0x40 (64)
limit. This PR reduces the default value of `ingest_external_file_width`
to 100 so the expected number of locks is 25, which is within the limit.

Pull Request resolved: #10131

Reviewed By: ltamasi

Differential Revision: D36986307

Pulled By: ajkr

fbshipit-source-id: e918cdb2fcc39517d585f1e5fd2539e185ada7c1
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 4, 2024
Summary: I suspect the issue called out in facebook#9357 was fixed in facebook#11328

Test Plan: `make blackbox_crash_test` for hours
facebook-github-bot pushed a commit that referenced this pull request Jan 4, 2024
Summary:
I suspect the issue called out in #9357 was fixed in #11328

Pull Request resolved: #12201

Test Plan: `make blackbox_crash_test` for hours

Reviewed By: ajkr

Differential Revision: D52543075

Pulled By: pdillinger

fbshipit-source-id: b705a6bdb2799a5f51ad2746df2083aa82f360a2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants