
Filters on high cardinality dimensions should sometimes use dim index bitset + full scan instead of unioning bitsets of dim values #3878

Closed
leventov opened this issue Jan 24, 2017 · 13 comments

Comments

@leventov
Member

leventov commented Jan 24, 2017

Sometimes Filters/DimFilters (Like, Regex, Bound, etc.) on dimensions of very high cardinality (thousands of values) end up unioning the bitsets of a significant fraction of the individual dimension values, and the resulting bitset is 50% or more full.

So instead of making a union of 100k bitsets when filtering on a 200k-cardinality dimension, we would do better to build a bitset of matching dimension value indexes (like in DimensionSelectorUtils.makeDictionaryEncodedValueMatcherGeneric()), scan all rows in the segment, and apply a simple check "matchingDimValuesBitset.get(index)" on each row.
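A minimal sketch of that idea in Java, with a hypothetical currentRowDictionaryId supplier standing in for Druid's dimension selector (the real code would go through DimensionSelector/ValueMatcher, so the names here are illustrative only):

```java
import java.util.BitSet;
import java.util.List;
import java.util.function.IntSupplier;
import java.util.function.Predicate;

// Hypothetical sketch: evaluate the predicate once per dictionary value to build a
// BitSet of matching value ids, then answer per-row matches with a single get(id)
// lookup instead of unioning one bitmap per matching value.
final class DictionaryIdMatcher
{
  private final BitSet matchingIds;
  private final IntSupplier currentRowDictionaryId; // stands in for a dimension selector

  DictionaryIdMatcher(List<String> dictionary, Predicate<String> predicate, IntSupplier currentRowDictionaryId)
  {
    this.matchingIds = new BitSet(dictionary.size());
    for (int id = 0; id < dictionary.size(); id++) {
      if (predicate.test(dictionary.get(id))) {
        matchingIds.set(id);
      }
    }
    this.currentRowDictionaryId = currentRowDictionaryId;
  }

  // Called once per row while the cursor scans the segment.
  boolean matches()
  {
    return matchingIds.get(currentRowDictionaryId.getAsInt());
  }
}
```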

@gianm
Contributor

gianm commented Jan 24, 2017

#3775, #3792, #3848 are tangentially related in that they also consider the tradeoff between index-based filtering and cursor-based filtering.

@gianm
Contributor

gianm commented Jan 24, 2017

In addition to your idea (converting pre-filters to post-filters), one or both of these may also be an improvement. I haven't tested them, though:

  • unioning larger numbers of bitmaps by adding them one by one into a BitSet and then converting the BitSet to an ImmutableBitmap of whatever type (roaring/concise), instead of directly calling bitmapFactory.union(bitmaps). This mode could trigger after unioning some number of bitmaps (see the sketch after this list).
  • allowing Filter.getBitmapIndex to return an uncompressed BitSet-based bitmap if it wants to, and using this ability to "fall back" to handling bitmaps in an uncompressed way when a lot of them are being unioned.
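A rough, hedged sketch of the first idea, using plain org.roaringbitmap.RoaringBitmap rather than Druid's BitmapFactory/ImmutableBitmap wrappers (so the types and method names here are illustrative, not the actual Druid API):

```java
import java.util.BitSet;
import java.util.List;
import org.roaringbitmap.IntIterator;
import org.roaringbitmap.RoaringBitmap;

final class BitSetUnion
{
  // Accumulate many per-value bitmaps into one uncompressed BitSet, then compress
  // back to a RoaringBitmap only once at the end, instead of producing a compressed
  // intermediate result at every union step.
  static RoaringBitmap unionViaBitSet(List<RoaringBitmap> bitmaps, int numRows)
  {
    final BitSet accumulator = new BitSet(numRows);
    for (RoaringBitmap bitmap : bitmaps) {
      final IntIterator it = bitmap.getIntIterator();
      while (it.hasNext()) {
        accumulator.set(it.next());
      }
    }
    return RoaringBitmap.bitmapOf(accumulator.stream().toArray());
  }
}
```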

@egor-ryashin
Contributor

a bitset of matching dimension value indexes (like here)

The link is broken, unfortunately.

@egor-ryashin
Contributor

Sometimes Filters/DimFilters (Like, Regex, Bound, etc.) on dimensions of very high cardinality

Do queries with simple dim = 'VALUE' fall into that category?

@gianm
Contributor

gianm commented Feb 19, 2019

Do queries with simple dim = 'VALUE' fall into that category?

I doubt it, unless you have a lot of them ORed together. Then, maybe.

@egor-ryashin
Contributor

I wonder whether "a lot of them" means 10, 100, or more than 1,000?

@leventov
Member Author

@egor-ryashin experiments should be run to determine that.

@gianm
Contributor

gianm commented Feb 19, 2019

It also depends on whether those filters match anything or not (a non-matching filter doesn't add any union work, just index-search work). So it can be very hard or impossible to predict which one is better without doing a lot of the work, and the algorithm is going to need to make some guesses.

@clintropolis
Member

I've also been lightly experimenting with this, as an offshoot of #6633, to try to get enough of a feel for things to put together a proposal for a more generic solution than in that PR. But I have been rather sidetracked with some other things and haven't gotten back to it yet, so I can at least share my thoughts so far in case they are useful to you, or in case someone else wants to run with it before I can pick it back up.

Besides selectivity, my limited experiments lead me to think that the type of filter can also drive whether a filter should be applied as a pre- or a post-filter, since not all filter value matching is equal. In particular, for things like bloom filters, which need to compute a hash per value and test each hash against the bloom filter, the matching itself is very expensive for high-cardinality dimensions before it even gets to combining the bitmaps. #6633 illustrates that in that case it can be useful to always push to a post-filter if the cardinality is high, even if the filter is highly selective.

The hard part of this, then, would be figuring out how to encode this so that the threshold can be a function of how expensive the filter is, assuming such classification is possible or useful in practice and that other filters exhibit patterns similar to what I was observing with bloom filters, which have been my main focus so far. As a side effect, introducing the idea that filters have some sort of cost value opens up another optimization for evaluating 'and' filters, by giving a sensible mechanism to control the order in which filters are evaluated, so that cheaper filters can be evaluated first and non-matches potentially shake out faster.
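To illustrate the 'and'-filter ordering point, here is a purely hypothetical sketch; estimatedCostPerRow is a made-up hook, not an existing Druid interface:

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical cost hook: if each filter could report a rough per-row matching cost
// (cheap for a simple selector, expensive for a bloom filter), an AND could evaluate
// the cheap ones first so non-matching rows are rejected before the expensive ones run.
interface CostedRowFilter<T> extends Predicate<T>
{
  double estimatedCostPerRow();
}

final class AndFilterOrdering
{
  // Sort once, cheapest first.
  static <T> List<CostedRowFilter<T>> orderByCost(List<CostedRowFilter<T>> andFilters)
  {
    return andFilters.stream()
                     .sorted(Comparator.comparingDouble(CostedRowFilter::estimatedCostPerRow))
                     .collect(Collectors.toList());
  }

  // Short-circuits in cheapest-first order.
  static <T> boolean matchesAll(List<CostedRowFilter<T>> orderedFilters, T row)
  {
    for (CostedRowFilter<T> filter : orderedFilters) {
      if (!filter.test(row)) {
        return false;
      }
    }
    return true;
  }
}
```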

It looks like some effort has gone into selectivity estimation for use with search queries in the past, but my gut tells me many of these estimates are fairly expensive, so I'm unsure of their utility for all query types. I have not verified this, however, nor am I super familiar with these estimators, so it's possible I'm wrong. I was hoping experimentation would shake out whether knowing filter selectivity directly is even necessary. It might be enough to know that certain types of filters (selectors and combinations of selectors) should rarely if ever not be pre-filters, while others should generally always skip bitmaps if the cardinality is high enough, perhaps with the threshold adjusted based on how expensive the filter is. However, I suspect it's not actually that simple either, but it's hard to say without further experiments.

I don't really know what this looks like implementation-wise yet; there are too many unknowns at this point. All the testing I have done so far has been very manual. For my next steps, I was planning on setting up a test harness that would let me manually control whether or not filters use bitmap indexes, similar to #6633 but maybe as an option on all filters, and then benchmark with parameters covering a variety of conditions to find whether there is indeed a 'break even' point and how it varies between filter types, overall dimension cardinality, and filter selectivity. But like I said, I'm not sure when I'll get to this, so if this interests you, I say have at it and I'll be more than happy to support with further discussion or review instead in order to see this get done 👍

@leventov
Member Author

@gianm @clintropolis it seems to me that you are talking about more general problems. This issue is about a very specific problem: the unionDimensionValueBitmaps() calls in BoundFilter.getBitmapResult(), LikeFilter.getBitmapResult() and Filters.matchPredicate(), when they literally union 100k bitsets, each of cardinality 1 (in other words, the dimension values are unique). That is just always a very bad idea. Properly, in such cases bitmap indexes should be disabled for such columns (see #5402), but sometimes they are not disabled, e.g. in old segments produced by Druid versions that didn't have #5402 yet.

There are also less extreme cases where bitmaps generally shouldn't be disabled, but the union should be avoided for particular queries. Imagine that LikeFilter or BoundFilter ends up unioning bitmaps corresponding to 90% (99%? 100%?) of the values of the dimension in the segment. There is little doubt that unioning should be skipped in such a case. The question is where that threshold is. Even if a threshold-based heuristic leads to a suboptimal decision sometimes, it's still better to have it than to have no threshold at all. (All query planners generate suboptimal plans sometimes; that's not a reason not to use them.)
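A tiny sketch of the kind of threshold heuristic being discussed; the 0.9 cutoff is an arbitrary placeholder, since finding the right value is exactly the open question here:

```java
final class UnionThreshold
{
  // Hypothetical cutoff: if the filter would union bitmaps for most of the dimension's
  // values, skip the union and fall back to a cursor-based (post-filter) scan instead.
  private static final double SKIP_UNION_FRACTION = 0.9;

  static boolean shouldSkipBitmapUnion(int matchingValueCount, int dimensionCardinality)
  {
    return (double) matchingValueCount / dimensionCardinality > SKIP_UNION_FRACTION;
  }
}
```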

@clintropolis
Member

Eh, I think the problem I encountered with bloom filters is more or less the same problem you describe, if not even simpler, because it only depends on cardinality (selectivity doesn't even matter), so a threshold-based approach would likely almost always do the right thing if the threshold is set correctly.

The filters you describe experience bad performance when they match a large percentage of the overall rows, which is why I brought up selectivity in the first place, but I too am hoping that a threshold-based approach would be good enough, or at least better than what there is now.

@github-actions

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label May 31, 2023
@github-actions

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@github-actions github-actions bot closed this as not planned Jun 28, 2023