-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add EXISTS
and IN
subquery rewriting for correlated filters at filter depth 1
#2451
Conversation
EXISTS
and IN
subquery rewriting for correlated filters at filter depth 1
\n TableScan: sq projection=None [a:UInt32, b:UInt32, c:UInt32]\ | ||
\n Projection: #sq_nested.c [c:UInt32]\ | ||
\n TableScan: sq_nested projection=None [a:UInt32, b:UInt32, c:UInt32]"; | ||
\n Semi Join: #sq.a = #sq_nested.c [a:UInt32, b:UInt32, c:UInt32]\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this is a quirk of the implementation. It removes projections below the semi join in some cases (which are inconsequential).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to remove the quirk is to allow for multi-column InSubquery
s.
The predicate pullup rule (next PR) will be able to "commute" a projection and a filter as follows:
Project([col(a)])
Filter(col(b)=(col(t.b))
=>
Filter(col(b)=(col(t.b))
Project([col(a), col(b)])
This way, one can move the Filter to the top of the subquery tree, just like with Exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The predicate pullup rule (next PR) will be able to "commute" a projection and a filter as follows:
This sounds like a subset of the filter "pushdown" rule -- so maybe we can just invoke that again rather than adding a new special case rule
// NOTE: We only pattern match against Projection(Filter(..)). We will have another optimization rule | ||
// which tries to pull up all correlated predicates in an InSubquery into a Projection(Filter(..)) | ||
// at the root node of the InSubquery's subquery. The Projection at the root must have as its expression | ||
// a single Column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be somewhat restrictive. It can be expanded but it needs some work. In particular joins will need to accept expressions rather than just joining blindly on rows.
Edit: Actually, its probably fine to have the following:
Projection(col(a))
Filter(col(b)=col(outer.b))
Projection(alias(col(a).plus(col(c)), a), col(b))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree that re-using a Projection
to calculate expressions is reasonable.
I think there are certain types of OUTER joins where the actual evaluation needs to happen within the join (as the presence of a NULL means something different than the row being filtered)
230e72f
to
066a865
Compare
Thanks @jon-chuang -- I will try and review this carefully tomorrow |
ff90c1a
to
e7f0365
Compare
Sorry I haven't had a chance to review this PR yet @jon-chuang -- I will put it on my queue and hopefully get to it shortly |
I have made it 1/2 way through this PR -- I will finish it up later today. Thanks again for your patience @jon-chuang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nicely done @jon-chuang -- thank you 🙏
cc @andygrove
For other reviewers, here is the PR for equality IN/NOT IN support: #2421
Semi Join: #table_a.a = #table_b.a [a:UInt32, b:UInt32]\ | ||
\n Projection: #table_a.a, #table_a.b [a:UInt32, b:UInt32]\ | ||
\n TableScan: table_a projection=None [a:UInt32, b:UInt32, c:UInt32]\ | ||
\n Filter: #table_b.b > Utf8(\"5\") [a:UInt32, b:UInt32, c:UInt32]\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vaguely remember some corner cases related to these non equality predicates and SEMI joins, but when I tried to come up with some examples of issues with this plan I could not. 👍
let subquery = LogicalPlanBuilder::from(table_b) | ||
.filter( | ||
(col("table_a.c").eq(col("table_b.c"))).and( | ||
(col("table_a.a").eq(col("table_b.a"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest at least one predicate that has the subquery / table_b on the left -- like table_b.a = table_a.
as all these predicates have table_a
on the left
Ok(()) | ||
} | ||
|
||
// We only test not exists for the simplest case since all the other code paths |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I think there might be some subtleties involving NULLs (like if the subquery only has nulls and there is a non equality filter like table_b.a > 5
// returns a new [LogicalPlan] that wraps `plan` in a [LogicalPlan::Filter] with | ||
/// its predicate with all `predicates` ANDed. | ||
pub fn filter_by_all(plan: LogicalPlan, predicates: &[&Expr]) -> LogicalPlan { | ||
if let Some(predicate) = combine_conjunctive(predicates) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
// NOTE: We only pattern match against Projection(Filter(..)). We will have another optimization rule | ||
// which tries to pull up all correlated predicates in an InSubquery into a Projection(Filter(..)) | ||
// at the root node of the InSubquery's subquery. The Projection at the root must have as its expression | ||
// a single Column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree that re-using a Projection
to calculate expressions is reasonable.
I think there are certain types of OUTER joins where the actual evaluation needs to happen within the join (as the presence of a NULL means something different than the row being filtered)
&mut correlated_join_columns, | ||
); | ||
|
||
// Strip the projection away and use its input for the semi/anti-join |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why this stripping is needed? I don't understand it
\n TableScan: sq projection=None [a:UInt32, b:UInt32, c:UInt32]\ | ||
\n Projection: #sq_nested.c [c:UInt32]\ | ||
\n TableScan: sq_nested projection=None [a:UInt32, b:UInt32, c:UInt32]"; | ||
\n Semi Join: #sq.a = #sq_nested.c [a:UInt32, b:UInt32, c:UInt32]\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The predicate pullup rule (next PR) will be able to "commute" a projection and a filter as follows:
This sounds like a subset of the filter "pushdown" rule -- so maybe we can just invoke that again rather than adding a new special case rule
let new_input = subquery_filters.iter().try_fold( | ||
optimized_input, | ||
|outer_plan, &subquery_expr| { | ||
self.rewrite_correlated_subquery_as_join( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
let mut contains_correlated_columns = false; | ||
expr.accept(CorrelatedColumnsVisitor { | ||
outer_schema, | ||
contains_correlated_columns: &mut contains_correlated_columns, | ||
})?; | ||
Ok(contains_correlated_columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it really matters, but this type of pattern could also be expressed like this, to avoid some mut
(assuming contains_correlated_columns
has been changed to bool
):
let mut contains_correlated_columns = false; | |
expr.accept(CorrelatedColumnsVisitor { | |
outer_schema, | |
contains_correlated_columns: &mut contains_correlated_columns, | |
})?; | |
Ok(contains_correlated_columns) | |
let visitor = expr.accept(CorrelatedColumnsVisitor { | |
outer_schema, | |
contains_correlated_columns: false, | |
})?; | |
Ok(visitor.contains_correlated_columns) |
It's a great job!😀❤My work is a little busy. |
@jon-chuang do you have time to work on this PR (looks like it has accumulated some conflicts) and there are some suggestions about tests. If not, I can find some time to do a final polish and merge it in |
.filter( | ||
(col("table_a.c").eq(col("table_b.c"))).and( | ||
(col("table_a.a").eq(col("table_b.a"))) | ||
.and(col("table_a.b").eq(col("table_b.b"))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need test like A or (B and c)
?
Because it can't be split .
@@ -64,105 +233,41 @@ impl OptimizerRule for SubqueryFilterToJoin { | |||
utils::split_conjunction(predicate, &mut filters); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A or (B anc C)
can't be splited.
I'm not sure that it will have a affact.
Marking as draft as it needs some work before merging |
@avantgardnerio noted that #2885 has some non trivial overlap with this PR, if you ever get back to it @jon-chuang |
I believe this PR has been superseded now, so closing. Please feel free to reopen if you plan to keep working on it. |
Which issue does this PR close?
Closes: #2351
Related: #2421
Rationale for this change
We perform a simple pattern match in the LogicalPlan optimizer on
InSubquery(Projection(Filter(..)))
andExists(Filter(..))
, extracting correlated/dependent columns to be used in the semi/anti joinTODO:
detect_correlated_columns
in the remainder expression (needs to be correlation free to derive the benefits of semi/antijoin)? Future: rewrite infallibly as SEMIJOIN + UNION ALL.What changes are included in this PR?
Are there any user-facing changes?
Future Work
For instance
=>