
The default pruning threshold for IN predicate might be too low #14380

Open
UOETianleZhang opened this issue Nov 4, 2024 · 3 comments

@UOETianleZhang

We are trying to use bloom filters to reduce the latency of queries that have a long IN clause (the number of elements in the IN clause is ~50). However, we see that the bloom filters are not taking effect.

After digging into it, we found there is a server config that disables pruning when the number of values in the IN predicate is larger than 10 (the default).

Do we know the reason for setting this default to 10? Applying pruning on a large IN clause will lead to diminishing returns, but even taking that into consideration, 10 looks too conservative to me.
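
For context, here is a minimal sketch of how bloom-filter-based pruning of an IN predicate works and where a threshold like this cuts it off. This is illustrative only, not Pinot's actual code; the class and constant names are made up, and it assumes a per-segment bloom filter such as Guava's:

```java
import com.google.common.hash.BloomFilter;
import java.util.List;

// Illustrative sketch only (not Pinot's implementation): a segment can be skipped
// for an IN predicate when its bloom filter reports that none of the IN values
// can possibly be present. The threshold guard mirrors the behavior described
// above: once the IN list is longer than the limit, pruning is skipped entirely.
public final class InPredicatePruningSketch {

  // Hypothetical constant, matching the default value discussed in this issue.
  private static final int IN_PREDICATE_THRESHOLD = 10;

  /** Returns true if the segment can be safely skipped for this IN predicate. */
  static boolean canPruneSegment(BloomFilter<CharSequence> segmentBloomFilter, List<String> inValues) {
    if (inValues.size() > IN_PREDICATE_THRESHOLD) {
      // Too many values: the per-value probes are assumed not worth it, so don't prune.
      return false;
    }
    for (String value : inValues) {
      if (segmentBloomFilter.mightContain(value)) {
        // At least one value might be in this segment, so it must be scanned.
        return false;
      }
    }
    // No IN value can be present in this segment (bloom filters have no false negatives).
    return true;
  }
}
```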

@jasperjiaguo
Contributor

Yes, in our case we see quite a substantial improvement with bloom filter pruning added to a high-cardinality, dictionary-enabled column (@UOETianleZhang can probably share some anonymized numbers here). This suggests that for a dictionary-enabled column, binary search is slower than hashing (bloom filter). Therefore the gain would probably be more prominent when the number of values in the IN clause is larger, unless we are sure those values exist in every segment we query.
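
For illustration, a minimal sketch of the two membership checks being compared here (these are not Pinot class names; it assumes Guava's BloomFilter):

```java
import com.google.common.hash.BloomFilter;
import java.util.Arrays;

// Illustrative sketch only: for a dictionary-encoded column, checking whether a
// value exists means a binary search over the sorted dictionary (O(log N)
// comparisons, N being the column cardinality), while a bloom filter probe is a
// handful of hashes regardless of cardinality.
public final class MembershipCheckSketch {

  // ~O(log N) string comparisons against the column's sorted dictionary.
  static boolean existsInDictionary(String[] sortedDictionary, String value) {
    return Arrays.binarySearch(sortedDictionary, value) >= 0;
  }

  // Near-constant-time probabilistic check: may return a false positive, never a false negative.
  static boolean mightExistPerBloomFilter(BloomFilter<CharSequence> bloomFilter, String value) {
    return bloomFilter.mightContain(value);
  }
}
```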

@UOETianleZhang
Author

With some benchmarking, we see that increasing the limit greatly improves latency.

  • Column cardinality: 5,573,103
  • Column type: STRING
  • Query pattern: SELECT column FROM table WHERE column IN (var1, var2, ... var41) LIMIT 1

After increasing the pruning limit from 10 to 100, the latency dropped from 20 seconds to 286 milliseconds.

@jasperjiaguo
Contributor

Oh I think the PR is #6776
