Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix and generalize framework for filtering range queries, etc. #13005

Closed
wants to merge 6 commits into from

Conversation

pdillinger
Copy link
Contributor

@pdillinger pdillinger commented Sep 10, 2024

Summary: There was a subtle design/contract bug in the previous version of range filtering in experimental.h If someone implemented a key segments extractor with "all or nothing" fixed size segments, that could result in unsafe range filtering. For example, with two segments of width 3:

x = 0x|12 34 56|78 9A 00|
y = 0x|12 34 56||78 9B
z = 0x|12 34 56|78 9C 00|

Segment 1 of y (empty) is out of order with segment 1 of x and z.

I have re-worked the contract to make it clear what does work, and implemented a standard extractor for fixed-size segments, CappedKeySegmentsExtractor. The safe approach for filtering is to consume as much as is available for a segment in the case of a short key.

I have also added support for min-max filtering with reverse byte-wise comparator, which is probably the 2nd most common comparator for RocksDB users (because of MySQL). It might seem that a min-max filter doesn't care about forward or reverse ordering, but it does when trying to determine whether in input range from segment values v1 to v2, where it so happens that v2 is byte-wise less than v1, is an empty forward interval or a non-empty reverse interval. At least in the current setup, we don't have that context.

A new unit test (with some refactoring) tests CappedKeySegmentsExtractor, reverse byte-wise comparator, and the corresponding min-max filter.

I have also (contractually / mathematically) generalized the framework to comparators other than the byte-wise comparator, and made other generalizations to make the extractor limitations more explicitly connected to the particular filters and filtering used--at least in description.

Test Plan: added unit tests as described

@pdillinger pdillinger changed the title WIP: Fix and generalize framework for filtering range queries, etc. WIP/RFC: Fix and generalize framework for filtering range queries, etc. Sep 10, 2024
Summary: There was a subtle design/contract bug in the previous version
of range filtering in experimental.h  If someone implemented a key
segments extractor with "all or nothing" fixed size segments, that could
result in unsafe range filtering. TODO: example

I have re-worked the contract to make it clear what does work, and
implemented a standard extractor for fixed-size segments,
CappedKeySegmentsExtractor. The safe approach for filtering is to
consume as much as is available for a segment in the case of a short
key.

I have also (contractually / mathematically) generalized the framework
to comparators other than the byte-wise comparator, and made other
generalizations to make the extractor limitations more explicitly
connected to the particular filters and filtering used--at least in
description. (TODO: better detection/enforcement?)

Test Plan: added a sizeable unit test for capped extractor
@pdillinger pdillinger marked this pull request as ready for review September 16, 2024 19:17
@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pdillinger pdillinger changed the title WIP/RFC: Fix and generalize framework for filtering range queries, etc. Fix and generalize framework for filtering range queries, etc. Sep 17, 2024
Copy link
Contributor

@jowlyzhang jowlyzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for adding this improvement.

// class filter on the sequence of segments.
//
// GENERALIZING FILTERS (INDIRECT):
// * Point queries can utilize essentially any kind of filter by extracting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the theory, but in practice we don't need this capability to filter tables for point query, right? Since we check table's bounds.

Copy link
Contributor Author

@pdillinger pdillinger Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum and maximum keys on a table essentially provide a min-max filter on the whole key. You can think of this as a one-dimensional filter that represents only a single range within that dimension. Keeping with one dimension, Bloom filters on a prefix or the whole key complement that by greatly narrowing down the set of keys that might be in the SST file. And of course higher-dimensional filters can also operate on point queries. For some workloads, a min-max filter on a segment might be more effective than a Bloom filter, even for point queries, and much more space/memory efficient.

I'll see if I can make this clearer, perhaps with such examples.

//
// Beyond point queries, we generally expect the key comparator to be a
// lexicographic / big endian ordering at a high level, while each segment
// can use an arbitrary comparator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this "can use arbitrary comparator" mean? Since lexicographic ordering is used in building the filter and for checking bounds against the filter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, each segment could be a little-endian 4-byte value, but ordering between segments is still lexicographic. We can make min-max filters aware of other segment comparators, much as we added reverse byte-wise support in this PR.

I'll expand this paragraph.

// * Order-based filters on segments (rather than whole key) can apply to range
// queries (with "whole key" bounds). Specifically, an order-based filter on
// segments i through j and category set s is applicable to a range query from
// lb to ub if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:lb and ub also both need to be in category set s, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to mention that this time! Yes, thanks!

@facebook-github-bot
Copy link
Contributor

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pdillinger merged this pull request in 10984e8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants