-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC Rewrite bloom filter pruning predicate logic in terms of intervals #14
base: alamb/bloom-filter-simplify
Are you sure you want to change the base?
Conversation
datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs
Outdated
Show resolved
Hide resolved
fn prune(&self, column_sbbf: &HashMap<String, Sbbf>) -> bool { | ||
// rewrite any <col = literal> exprs to to `false` if we can prove they | ||
// are always false using the bloom filter. | ||
let rewritten = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am quite pleased with this logic. What it does is to rewrite the predicate String@0 = Hello_Not_Exists AND 1 = 1
is rewritten to false AND false
and the interval analysis determines that expression is always false
:
Evaluating. self.predicate_expr: String@0 = Hello_Not_Exists AND 1 = 1
rewritten false AND 1 = 1
interval result: [false, false]
lower: IntervalBound { value: Boolean(false), open: false }
upper: IntervalBound { value: Boolean(false), open: false }
However it seems I need to update the interval analysis so that OR
is handled as OR
gives a "unknown" for the bounds:
Evaluating. self.predicate_expr: String@0 = Hello_Not_Exists OR String@0 = Hello_Not_Exists2
rewritten false OR false
interval result: (NULL, NULL)
lower: IntervalBound { value: NULL, open: true }
upper: IntervalBound { value: NULL, open: true }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added support for OR here: apache#7884
I agree w/ you that pruning analysis and interval arithmetic should be implemented ONCE, ideally using an interface that both arrow and datafusion can use. |
I filed apache#7887 to track this idea |
Which issue does this PR close?
This is related to apache#7821 and the hope of a high level unification of PruningPredicate and the interval arithmetic (I thought there was a ticket, but I can't find it)
Rationale for this change
Basically DataFusion has two types of range analysis -- the
Interval
based analysis used in selectivity analysis and thePruningPrediate
based rewrite used for pruning row groups in parquet.apache#7821 adds a
BloomFilterPredicate
in the style ofPruningPredicate
. However, this only supports binary expressions (col = <const>
) andAND
/OR
and is entirely separate from statistics based pruning. This means that predicates likecol IN (...)
andIS NOT NULL
may not be completely supported.I am pretty sure I could express the same analysis in terms of
Intervals
, which:What changes are included in this PR?
Are these changes tested?
Yes, existing tests
Are there any user-facing changes?
Not really.