Support range deletion tombstones in IngestExternalFile SSTs #3778
Conversation
facebook/rocksdb#3778 adds a DeleteRange method to SstFileWriter and adds support for ingesting SSTs with range deletion tombstones. facebook/rocksdb#3779 adds a virtual Truncate method to Env, which truncates the named file to the specified size. Release note: None
Fixes cockroachdb#16954. Related to cockroachdb#25047. This depends on the following two upstream changes to RocksDB:
- facebook/rocksdb#3778
- facebook/rocksdb#3779

The change introduces a new snapshot strategy called "SST". This strategy streams SST files consisting of all keys in a range from the sender to the receiver. These SST files are then atomically ingested directly into RocksDB. An important property of the strategy is that the amount of memory required by a receiver is constant with respect to the size of a range, instead of linear as it is with the KV_BATCH strategy. This will be critical for increasing the default range size and potentially for increasing the number of concurrent snapshots allowed per node. The strategy also seems to significantly speed up snapshots once ranges are above a certain size (somewhere in the single-digit MBs).

This is a WIP change. Before it can be merged it needs:
- to be cleaned up a bit
- more testing (unit tests, testing knobs, maybe some chaos)
- proper version handling
- heuristic tuning
- decisions on questions like compactions after ingestion

Release note: None
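As context for the second upstream change (facebook/rocksdb#3779), here is a rough sketch of the `Env` hook it describes; the exact signature in the PR may differ slightly:

```cpp
#include <string>
#include "rocksdb/env.h"
#include "rocksdb/status.h"

// Sketch of overriding the virtual Truncate hook from facebook/rocksdb#3779
// in a custom Env wrapper. A real implementation might call ::truncate(2)
// on POSIX; here we simply delegate to the wrapped Env.
class TruncatingEnv : public rocksdb::EnvWrapper {
 public:
  explicit TruncatingEnv(rocksdb::Env* base) : rocksdb::EnvWrapper(base) {}

  rocksdb::Status Truncate(const std::string& fname, size_t size) override {
    // Truncates the named file to the specified size.
    return rocksdb::EnvWrapper::Truncate(fname, size);
  }
};
```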
@anand1976 let me know if you need me to explain anything. As discussed in the commit message, the second commit is needed to get …
@@ -166,6 +164,7 @@ class RangeDelAggregator {
  struct Rep {
    StripeMap stripe_map_;
    PinnedIteratorsManager pinned_iters_mgr_;
    std::list<std::string> pinned_slices_;
After debugging, I believe the corruption can only happen if `UpdateInternalKey` is called or if `block_restart_intervals != 1`, as either of those prevents the map from holding pointers to the full keys inside the range deletion meta-block (which the `PinnedIteratorsManager` should be capable of pinning). `UpdateInternalKey` is only called for range deletion keys as of this PR, and we always set `block_restart_intervals = 1`. I wonder if there's a way to prevent the file ingestion path from using `UpdateInternalKey`, rather than maintaining this list for all range deletion keys (even though it's only needed for ones added through the file ingestion path).
Thanks for looking into this and thanks for adding an assertion to guard against this in 7b65521.
As you brought up, the file ingestion path uses `UpdateInternalKey` because it creates blocks with global sequence numbers. This is used to avoid needing to update every single key in an SST when a sequence number is computed at ingestion time. I see two general approaches to addressing this issue, but I'm not confident about which one would be the better approach:
The first is that we look into addressing the following TODO:
Lines 256 to 257 in aa53579:
// TODO(tec): Investigate updating the seqno in the loaded block
// directly instead of doing a copy and update.
If we updated the sequence number in the loaded block directly in `BlockIter::ParseNextKey` when we see a global sequence number, then we wouldn't ever need to un-pin the keys. This would allow us to avoid making any changes to `RangeDelAggregator`. However, there are a few downsides here, including that this could alter performance characteristics outside of this change. I also don't feel comfortable making a change like that without some guidance, because I expect that it could have far-reaching consequences.
The other approach, mentioned by @anand1976 below, is that we allow unpinned keys in `RangeDelAggregator::AddTombstones` but only perform the copy if a key is unpinned. This has a few nice properties: it is a well-contained change, it shouldn't have any unexpected consequences or performance implications, and it adds a bit more flexibility to `RangeDelAggregator` in case another future user of the class provides an iterator with unpinned keys. This is the direction I'm leaning towards. Please let me know if you agree. Also, if so, could you give me a pointer on how this kind of memory pinning usually takes place in this codebase? The std::list was just for prototyping, but it certainly isn't the best way to do this kind of thing. Is there some kind of arena object that I should be copying the keys into?
Friendly ping. Hoping to make progress on this sometime this week.
In approach #1, it looks like there is no guarantee that the sequence number can be updated in the loaded block directly, since the key may be sharing bytes with the previous key and would have to be copied anyway. So I suggest going with approach #2. @ajkr Do you agree?
As far as pinning is concerned, maybe `rep_->pinned_slices_.emplace_back(input->key().data(), input->key().size())` would be more efficient. Also, `pinned_slices_` could be a std::vector or std::deque instead of a std::list to avoid repeated allocations.
I made this change as suggested. I also had to do the same thing for `input->value()` when `!input->IsValuePinned()`. This is a new case which seems to have been introduced/revealed by fea2b1d.
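For anyone following along, here is a minimal, self-contained sketch of the copy-if-unpinned pattern being discussed. The `Slice` stand-in and the function name are illustrative, not the actual RocksDB code:

```cpp
#include <cstddef>
#include <list>
#include <string>

// Stand-in for rocksdb::Slice: a non-owning pointer/length view.
struct Slice {
  const char* data;
  size_t size;
  explicit Slice(const std::string& s) : data(s.data()), size(s.size()) {}
  Slice(const char* d, size_t n) : data(d), size(n) {}
};

// Only copy a key when the iterator does not guarantee its bytes stay
// valid; `pinned_slices` owns the copies for the aggregator's lifetime.
// A std::list (or std::deque) works here because appending never moves
// existing elements, so previously returned Slices stay valid.
Slice MaybePin(Slice key, bool is_key_pinned,
               std::list<std::string>& pinned_slices) {
  if (is_key_pinned) {
    return key;  // Bytes outlive the iterator; no copy needed.
  }
  pinned_slices.emplace_back(key.data, key.size);
  return Slice(pinned_slices.back());
}
```

The same treatment applies to `input->value()`, which holds a range tombstone's end key.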
Sorry for the delay. Yeah, approach 2 sounds safer to me too. And if we decide to do approach 1 later, you'll get the benefit of avoiding the copy without having to change anything.
I don't have a concrete problem in mind for approach 1. Regarding shared bytes, we don't delta-encode keys in range deletion meta-blocks, so that shouldn't be a problem.
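To make approach 1 concrete: assuming the standard internal-key layout (user key followed by an 8-byte packed sequence-number/type trailer), an in-place update would look roughly like the sketch below. This is illustrative only, not the actual implementation:

```cpp
#include <cstdint>
#include <cstring>

// Internal keys end in an 8-byte little-endian trailer encoding
// (seqno << 8) | value_type. Approach 1 would overwrite that trailer in
// the loaded block's buffer with the file's global sequence number,
// preserving the value type, instead of copying the whole key.
void OverwriteSeqnoInPlace(char* internal_key, size_t key_size,
                           uint64_t global_seqno) {
  char* trailer = internal_key + key_size - 8;
  uint64_t packed;
  std::memcpy(&packed, trailer, sizeof(packed));  // assumes little-endian host
  const uint64_t type = packed & 0xff;            // keep the value type bits
  packed = (global_seqno << 8) | type;
  std::memcpy(trailer, &packed, sizeof(packed));
}
```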
Good catch on the recent `IsValuePinned()` change.
db/range_del_aggregator.cc (Outdated)
@@ -204,7 +204,8 @@ Status RangeDelAggregator::AddTombstones(
      first_iter = false;
    }
    ParsedInternalKey parsed_key;
    if (!ParseInternalKey(input->key(), &parsed_key)) {
    rep_->pinned_slices_.push_back(input->key().ToString());
Following up on the comment by @ajkr below, maybe we should do this only if `input->IsKeyPinned()` is false.
I think this is the best option too, but I provided an alternative in my comment above.
table/block_prefix_index.cc (Outdated)
// TODO DURING REVIEW: without this, ingested sstables without any keys
// other than range deletions hit an assertion in BlockPrefixIndex::Builder.
// I need to figure out why this is.
if (num_blocks == 0) {
Is a prefix meta block created even if there are no keys, only range deletions? That doesn't seem right.
It looks like it is. I was hoping that this was a result of #3906, but the block is definitely created even when there are no keys. I don't actually see any logic that would prevent that. The meta block is written in `HashIndexBuilder::Finish`, which is called from `BlockBasedTableBuilder::Finish` regardless of whether `empty_data_block` is true. Is it possible that this code path has just never been hit before, even though SSTs with only range deletions were already supported?
> Is it possible that this code path has just never been hit before, even though SSTs with only range deletions were already supported?

Actually, I think this might be exactly what's going on. See #3774, which is the same assertion I hit in my test without this patch.
A recent change to testing code created a new case that this change hadn't been tested against before: building a table with only range deletion tombstones and no normal keys while using a two-level index (`PartitionedIndexBuilder`). This is hitting this assertion:

rocksdb/table/index_builder.cc, line 146 in 1f32dc7:
assert(!entries_.empty());

This is making me question whether sstables with only range deletion tombstones and no normal keys are actually supported, and if so, how. I'd appreciate some help getting more context on this situation and how we expect it to behave with `BlockBasedTableBuilder::Finish`. That method currently avoids writing its filter block if the data block is empty, but it writes its index block unconditionally.
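For illustration, the kind of guard under discussion might look like the sketch below. The function and variable names are hypothetical; the actual fix in this PR and in the index builders may differ:

```cpp
#include <cstddef>

// Hypothetical guard: skip building index/prefix blocks entirely when the
// table emitted no data blocks (e.g. an SST containing only range deletion
// tombstones), so assert(!entries_.empty()) is never reached.
bool ShouldBuildIndexBlocks(size_t num_data_blocks) {
  return num_data_blocks > 0;
}
```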
I've rebased this on top of #4016 and #4021. Once those go in, I think this is ready to merge as well. Thanks for all the help @ajkr and @anand1976!
Summary: Add a new table property, rocksdb.num.range-deletions, which tracks the number of range deletions in a block-based table. Range deletions are no longer counted in rocksdb.num.entries; as discovered in PR #3778, there are various code paths that implicitly assume that rocksdb.num.entries counts only true keys, not range deletions. /cc ajkr nvanbenschoten Closes #4016 Differential Revision: D8527575 Pulled By: ajkr fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
Upstream PR: facebook#3778. This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range.
@nvanbenschoten has updated the pull request.
@anand1976 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Nice. The memory leak is gone now.
@nvanbenschoten has updated the pull request.
Yeah, I made what looked like a harmless change and it ended up causing …
Yeah, this flake seems to be hitting other PRs as well, like #4132.
Looks like all tests are passing now!
Yeah, it finally passed after a few re-runs and different failures. I think we're ready to land this now!
@anand1976 is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@anand1976 is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@nvanbenschoten The land is still failing, though the failure is rather cryptic. I think it might be due to conflicts with #4014, which landed yesterday. Can you rebase one more time?
@anand1976 is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Trying to land again using a suggestion from @ajkr to use a slightly different workflow.
@nvanbenschoten My initial guess was right. There is a merge conflict in db/range_del_aggregator.cc. Can you rebase and resolve?
Fixes facebook#3391. This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range.
@anand1976 thanks for figuring that out. Rebased and resolved the conflict.
@nvanbenschoten has updated the pull request.
@anand1976 is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Support range deletion tombstones in IngestExternalFile SSTs (facebook#3778) Summary: Fixes facebook#3391. This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range. Pull Request resolved: facebook#3778 Differential Revision: D8821836 Pulled By: anand1976 fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844
Fixes #3391.

This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range.
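For reference, a minimal sketch of how the new API can be used end to end (error handling reduced to asserts; the file path, database path, and keys are illustrative):

```cpp
#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Build an external SST whose only contents are a range tombstone
  // covering ["key000", "key999") -- the new capability in this PR.
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  assert(writer.Open("/tmp/ingest.sst").ok());
  assert(writer.DeleteRange("key000", "key999").ok());
  assert(writer.Finish().ok());

  // Ingesting the file atomically clears any existing keys in that range.
  rocksdb::DB* db = nullptr;
  assert(rocksdb::DB::Open(options, "/tmp/ingest_demo_db", &db).ok());
  assert(db->IngestExternalFile({"/tmp/ingest.sst"},
                                rocksdb::IngestExternalFileOptions())
             .ok());
  delete db;
  return 0;
}
```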