Filters on high cardinality dimensions should sometimes use dim index bitset + full scan instead of unioning bitsets of dim values #3878
Sometimes Filters/DimFilters (Like, Regex, Bound, etc.) on dimensions of very high cardinality (thousands of values) end up unioning the bitsets of a significant fraction of the individual dimension values, and the resulting bitset is 50% or more full. So instead of making a union of 100k bitsets when filtering on a 200k-cardinality dimension, we would do better to make a bitset of matching dimension value indexes (like in `DimensionSelectorUtils.makeDictionaryEncodedValueMatcherGeneric()`), scan all rows in the segment, and apply a simple check `matchingDimValuesBitset.get(index)` on each row.
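To make the proposal concrete, here is a minimal standalone sketch of that alternative in plain Java, using `java.util.BitSet` and hypothetical names rather than Druid's actual `ValueMatcher`/`BitmapIndex` APIs. The per-value predicate still runs once per dictionary entry, but the union of 100k row bitmaps is replaced by a single full scan with one bitset probe per row:

```java
import java.util.BitSet;
import java.util.function.Predicate;

public class DictIdScanSketch
{
  public static BitSet filterByFullScan(
      int[] encodedRows,           // dictionary ID of the dimension value for each row
      String[] dictionary,         // dictionary: ID -> value
      Predicate<String> predicate  // the value match, e.g. a Like/Regex/Bound test
  )
  {
    // Pass 1: evaluate the predicate once per dictionary entry (200k checks
    // for a 200k-cardinality dimension), setting a bit per matching dict ID.
    // No row bitmaps are touched, so there is nothing to union.
    BitSet matchingDictIds = new BitSet(dictionary.length);
    for (int id = 0; id < dictionary.length; id++) {
      if (predicate.test(dictionary[id])) {
        matchingDictIds.set(id);
      }
    }

    // Pass 2: full scan of the segment; each row costs one bitset probe,
    // the "matchingDimValuesBitset.get(index)" check from the description.
    BitSet matchingRows = new BitSet(encodedRows.length);
    for (int row = 0; row < encodedRows.length; row++) {
      if (matchingDictIds.get(encodedRows[row])) {
        matchingRows.set(row);
      }
    }
    return matchingRows;
  }
}
```

Whether this beats OR-ing 100k per-value bitmaps then comes down to segment row count versus union cost, which is the trade-off debated in the comments below.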
Comments
In addition to your idea (converting pre-filters to post-filters), one or both of these may also be an improvement. I haven't tested them, though: …
The link is broken, unfortunately.
Do queries with simple …
I doubt it, unless you have a lot of them ORed together. Then, maybe.
I wonder whether …
@egor-ryashin experiments should be run to determine that.
It also depends on whether those filters match anything or not (a non-matching filter doesn't add any union work, just index-search work). So it can be very hard, or impossible, to predict which one is better without doing a lot of the work, and the algorithm is going to need to make some guesses.
I've also been lightly experimenting with this, as an offshoot of #6633, to try to get enough of a feel for things to put together a proposal for a more generic solution than in that PR. But I have been rather sidetracked with some other things and haven't gotten back to it yet, so I can at least share my thoughts so far, in case they are useful to you or someone else wants to run with it before I can pick it back up.

Besides selectivity, my limited experiments lead me to think that the type of filter can also drive whether a filter should be done as a pre- or a post-filter, since not all filter value matching is equal. In particular, for things like bloom filters, which need to compute a hash per value and test each hash against the bloom filter, the match itself is very expensive for high-cardinality dimensions before it even gets to combining the bitmaps. #6633 illustrates that in that case it can be useful to always push to a post-filter if the cardinality is high, even if the filter is highly selective. The hard part would then be figuring out how to encode this somehow so that the threshold can be a function of how expensive the filter is, assuming such classification is possible or useful in practice, and that other filters exhibit patterns similar to what I was observing with bloom filters, which have been my main focus so far.

As a side effect, introducing the idea that filters have some sort of cost value opens up another optimization for evaluating 'and' filters, by giving a sensible mechanism to control the order in which filters are evaluated, so that cheaper filters run first and non-matches potentially shake out faster (see the sketch after this comment). It looks like some effort has gone into selectivity estimation for use with …

I don't really know what this looks like implementation-wise yet; too many unknowns at this point. All the testing I have done so far has been very manual. For my next steps, I was planning on setting up a test harness that would let me manually control whether or not filters should use bitmap indexes, similar to #6633 but maybe as an option on all filters, and then benchmark with parameters to run under a variety of conditions, to find whether there is indeed a 'break even' point and how it varies between filter types, overall dimension cardinality, and filter selectivity.

But like I said, I'm not sure when I'll get to this, so if this interests you, I say have at it, and I'll be more than happy to support with further discussion or review instead in order to see this get done 👍
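As a rough illustration of the cost-ordered 'and' idea above (not a proposal for Druid's actual `Filter` interface), the combination might look like the sketch below. `CostedMatcher` and its per-row cost estimate are hypothetical, and producing that estimate is exactly the open question:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.IntPredicate;

public class OrderedAndSketch
{
  // Hypothetical pairing of a per-row matcher with a rough cost estimate;
  // where the estimate comes from is the open question discussed above.
  record CostedMatcher(IntPredicate matcher, double estimatedCostPerRow) {}

  /** Combine matchers so the cheapest run first and non-matches shake out early. */
  static IntPredicate orderedAnd(CostedMatcher... filters)
  {
    CostedMatcher[] ordered = filters.clone();
    Arrays.sort(ordered, Comparator.comparingDouble(CostedMatcher::estimatedCostPerRow));
    return rowId -> {
      for (CostedMatcher f : ordered) {
        if (!f.matcher().test(rowId)) {
          return false; // short-circuit before costlier filters (e.g. bloom hashing) run
        }
      }
      return true;
    };
  }
}
```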
@gianm @clintropolis it seems to me that you are talking about more general problems. This issue is about a very specific problem: …

There are also less extreme cases where bitmaps generally shouldn't be disabled, but the union should be avoided for particular queries. Imagine that …
Eh, I think the problem I encountered with bloom filters is more or less the same problem you describe, if not even simpler, because it only depends on cardinality (not even selectivity matters), so a threshold-based approach would likely almost always do the right thing if the threshold is set correctly. The filters you describe experience bad performance when they match a large percentage of overall rows, which is why I brought up selectivity in the first place, but I too am hoping that a threshold-based approach would be good enough, or at least better than what there is now.
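If it helps, the threshold idea being converged on here could be as simple as the sketch below. The constants are invented placeholders, and pinning down the real break-even values is precisely what the benchmarking described earlier would do:

```java
public class PreOrPostFilterSketch
{
  // Invented placeholder thresholds; real values would come from benchmarks.
  static final int EXPENSIVE_MATCH_CARDINALITY_THRESHOLD = 100_000;
  static final double MATCHING_FRACTION_THRESHOLD = 0.1;

  enum Strategy { BITMAP_INDEX_UNION, FULL_SCAN_POST_FILTER }

  static Strategy choose(
      boolean expensivePerValueMatch,  // e.g. bloom filter: hash per dictionary value
      int dimCardinality,
      int estimatedMatchingValues      // only relevant in the cheap-match case
  )
  {
    // Expensive matchers get pushed to a scan on cardinality alone,
    // regardless of selectivity (the bloom filter case).
    if (expensivePerValueMatch && dimCardinality > EXPENSIVE_MATCH_CARDINALITY_THRESHOLD) {
      return Strategy.FULL_SCAN_POST_FILTER;
    }
    // Cheap matchers only lose when the union would cover a large
    // fraction of the dictionary (the case this issue describes).
    if ((double) estimatedMatchingValues / dimCardinality > MATCHING_FRACTION_THRESHOLD) {
      return Strategy.FULL_SCAN_POST_FILTER;
    }
    return Strategy.BITMAP_INDEX_UNION;
  }
}
```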
This issue has been marked as stale due to 280 days of inactivity. |
This issue has been closed due to lack of activity. If you think that …