Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC Rewrite bloom filter pruning predicate logic in terms of intervals #14

Draft
wants to merge 4 commits into
base: alamb/bloom-filter-simplify
Choose a base branch
from

Conversation

alamb
Copy link
Owner

@alamb alamb commented Oct 19, 2023

Which issue does this PR close?

This is related to apache#7821 and the hope of a high level unification of PruningPredicate and the interval arithmetic (I thought there was a ticket, but I can't find it)

Rationale for this change

Basically DataFusion has two types of range analysis -- the Interval based analysis used in selectivity analysis and the PruningPrediate based rewrite used for pruning row groups in parquet.

apache#7821 adds a BloomFilterPredicate in the style of PruningPredicate. However, this only supports binary expressions (col = <const>) and AND / OR and is entirely separate from statistics based pruning. This means that predicates like col IN (...) and IS NOT NULL may not be completely supported.

I am pretty sure I could express the same analysis in terms of Intervals, which:

  1. Is more general (as we add support for range analysis in predicates, the bloom filtering will support them as well)
  2. Is more concise (less code)
  3. and this would serve as a proof of concept for using the Intervals for the more general PredicatePruning.

What changes are included in this PR?

  1. Express the application of Bloom filters using interval arithmetic rather than partial expression evaluation.

Are these changes tested?

Yes, existing tests

Are there any user-facing changes?

Not really.

fn prune(&self, column_sbbf: &HashMap<String, Sbbf>) -> bool {
// rewrite any <col = literal> exprs to to `false` if we can prove they
// are always false using the bloom filter.
let rewritten = self
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am quite pleased with this logic. What it does is to rewrite the predicate String@0 = Hello_Not_Exists AND 1 = 1 is rewritten to false AND false and the interval analysis determines that expression is always false:

Evaluating. self.predicate_expr: String@0 = Hello_Not_Exists AND 1 = 1
   rewritten false AND 1 = 1
   interval result: [false, false]
lower: IntervalBound { value: Boolean(false), open: false }
upper: IntervalBound { value: Boolean(false), open: false }

However it seems I need to update the interval analysis so that OR is handled as OR gives a "unknown" for the bounds:

Evaluating. self.predicate_expr: String@0 = Hello_Not_Exists OR String@0 = Hello_Not_Exists2
   rewritten false OR false
   interval result: (NULL, NULL)
lower: IntervalBound { value: NULL, open: true }
upper: IntervalBound { value: NULL, open: true }

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added support for OR here: apache#7884

@crepererum
Copy link

I agree w/ you that pruning analysis and interval arithmetic should be implemented ONCE, ideally using an interface that both arrow and datafusion can use.

@alamb
Copy link
Owner Author

alamb commented Oct 20, 2023

I agree w/ you that pruning analysis and interval arithmetic should be implemented ONCE, ideally using an interface that both arrow and datafusion can use.

I filed apache#7887 to track this idea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants