Optimize `where exists` sub-queries into aggregate and join #2813
Conversation
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…evel earlier in tree
```rust
@@ -37,6 +37,8 @@ use std::{convert::TryFrom, fmt, iter::repeat, sync::Arc};

/// Represents a dynamically typed, nullable single value.
/// This is the single-valued counter-part of arrow's `Array`.
/// https://arrow.apache.org/docs/python/api/datatypes.html
/// https://github.com/apache/arrow/blob/master/format/Schema.fbs#L354-L375
```
Sorry, this was built upon #2797. I'll turn this into a draft until that gets merged.
```rust
@@ -483,7 +484,37 @@ fn get_tpch_table_schema(table: &str) -> Schema {
        Field::new("n_comment", DataType::Utf8, false),
    ]),

    _ => unimplemented!(),
    "supplier" => Schema::new(vec![
```
Add missing TPC-H tables to support testing those queries.
```rust
register_tpch_csv(&ctx, "orders").await?;
register_tpch_csv(&ctx, "lineitem").await?;

/*
```
Annotate plan with variable names from optimizer code for cross-correlation.
```rust
// TODO: arbitrary expressions
Expr::Exists { subquery, negated } => {
    if *negated {
        return Ok(plan.clone());
```
In any case of doubt, fall back to skipping optimization, following the "do no harm" rule.
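The "do no harm" fallback can be sketched as a rule that rewrites only the shapes it recognizes and returns everything else untouched. This is a minimal, hypothetical sketch — the `Plan` enum and `optimize` function below are invented for illustration and are not the DataFusion API:

```rust
// Hypothetical mini-optimizer: rewrite only the supported pattern,
// return any other plan exactly as received ("do no harm").
#[derive(Clone, Debug, PartialEq)]
enum Plan {
    Filter(String),
    Exists { negated: bool },
}

fn optimize(plan: &Plan) -> Plan {
    match plan {
        // Only the supported, non-negated EXISTS case is rewritten.
        Plan::Exists { negated: false } => Plan::Filter("rewritten as join".to_string()),
        // Anything unrecognized is left exactly as it was.
        other => other.clone(),
    }
}

fn main() {
    // The negated case is unsupported, so the plan passes through unchanged.
    let unsupported = Plan::Exists { negated: true };
    assert_eq!(optimize(&unsupported), unsupported);
    // The supported case is rewritten.
    assert_eq!(
        optimize(&Plan::Exists { negated: false }),
        Plan::Filter("rewritten as join".to_string())
    );
    println!("ok");
}
```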
```rust
// Only operate if one column is present and the other closed upon from outside scope
let found: Vec<_> = cols.intersection(&fields).map(|it| (*it).clone()).collect();
let closed_upon: Vec<_> = cols.difference(&fields).map(|it| (*it).clone()).collect();
```
We should truly resolve the closed-upon scope here, instead of assuming that a column absent from the present scope must exist somewhere outside it. Queries will fail either way, but this assumption could make the error messages significantly more difficult for users to debug.
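To make the intersection/difference split concrete, here is a self-contained sketch of the partitioning logic. The column names (`l_orderkey`, `o_orderkey`, `l_commitdate`) are borrowed from TPC-H for flavor; the `split_columns` helper is invented for the example:

```rust
use std::collections::HashSet;

// Split the columns referenced by a subquery predicate into those found in
// the subquery's own schema ("local") and those assumed to be closed upon
// from the outer scope. As noted above, the second bucket is optimistic:
// an unknown column might simply not exist anywhere.
fn split_columns(
    referenced: &HashSet<String>,
    schema_fields: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    let found: Vec<String> = referenced.intersection(schema_fields).cloned().collect();
    let closed_upon: Vec<String> = referenced.difference(schema_fields).cloned().collect();
    (found, closed_upon)
}

fn main() {
    let referenced: HashSet<String> =
        ["l_orderkey", "o_orderkey"].iter().map(|s| s.to_string()).collect();
    let fields: HashSet<String> =
        ["l_orderkey", "l_commitdate"].iter().map(|s| s.to_string()).collect();
    let (found, closed_upon) = split_columns(&referenced, &fields);
    // l_orderkey is local; o_orderkey is presumed to come from the outer query.
    assert_eq!(found, vec!["l_orderkey".to_string()]);
    assert_eq!(closed_upon, vec!["o_orderkey".to_string()]);
    println!("found={:?} closed_upon={:?}", found, closed_upon);
}
```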
```rust
let group_expr = vec![Expr::Column(found.as_str().into())];
let aggr_expr: Vec<Expr> = vec![];
let join_keys = (c_col.clone(), f_col.clone());
let right = LogicalPlanBuilder::from((*filter.input).clone())
    .aggregate(group_expr, aggr_expr)?
```
You could just use `distinct` rather than create the aggregate. It is semantically equivalent and results in a simpler logical plan. It will get translated into an aggregate in the physical plan.
```diff
-let group_expr = vec![Expr::Column(found.as_str().into())];
-let aggr_expr: Vec<Expr> = vec![];
-let join_keys = (c_col.clone(), f_col.clone());
-let right = LogicalPlanBuilder::from((*filter.input).clone())
-    .aggregate(group_expr, aggr_expr)?
+let join_keys = (c_col.clone(), f_col.clone());
+let right = LogicalPlanBuilder::from((*filter.input).clone())
+    .distinct()?
```
Actually, I may have misunderstood. I thought this was grouping on all the columns but it looks that is not the case so please disregard this suggestion.
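For readers following the `distinct` vs. aggregate discussion: the reason the two are interchangeable is that a GROUP BY over a key with an empty aggregate list emits exactly one row per key, which is the same key set DISTINCT produces. A minimal sketch, with made-up data and helper names (not DataFusion code):

```rust
use std::collections::BTreeSet;

// DISTINCT over a key column: deduplicate the values.
fn distinct(keys: &[i64]) -> Vec<i64> {
    keys.iter().copied().collect::<BTreeSet<i64>>().into_iter().collect()
}

// GROUP BY the same key with no aggregate expressions: one row per group,
// which is the same set of keys.
fn group_by_no_aggregates(keys: &[i64]) -> Vec<i64> {
    let mut groups = BTreeSet::new();
    for k in keys {
        groups.insert(*k);
    }
    groups.into_iter().collect()
}

fn main() {
    let orderkeys = [1_i64, 2, 2, 3, 1];
    // Both forms yield the deduplicated key set.
    assert_eq!(distinct(&orderkeys), group_by_no_aggregates(&orderkeys));
    assert_eq!(distinct(&orderkeys), vec![1, 2, 3]);
    println!("{:?}", distinct(&orderkeys));
}
```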
Codecov Report
```diff
@@           Coverage Diff            @@
##           master    #2813    +/-   ##
==========================================
- Coverage   85.20%   85.19%   -0.02%
==========================================
  Files         274      276       +2
  Lines       48666    48848     +182
==========================================
+ Hits        41468    41616     +148
- Misses       7198     7232      +34
```
Continue to review full report at Codecov.
```rust
let filter = if let LogicalPlan::Filter(f) = sub_input {
    f
} else {
    return Ok(plan.clone());
};
```
I think it might be more idiomatic to use a `match` for these patterns.
```diff
-let filter = if let LogicalPlan::Filter(f) = sub_input {
-    f
-} else {
-    return Ok(plan.clone());
-};
+let filter = match sub_input {
+    LogicalPlan::Filter(f) => f,
+    _ => return Ok(plan.clone()),
+};
```
```rust
let fields: HashSet<_> = sub_input
    .schema()
    .fields()
    .iter()
    .map(|f| f.name())
    .collect();
```
You should be able to get a hashset of qualified names like this:
```diff
-let fields: HashSet<_> = sub_input
-    .schema()
-    .fields()
-    .iter()
-    .map(|f| f.name())
-    .collect();
+let fields = HashSet::from_iter(sub_input
+    .schema()
+    .field_names());
```
Thanks @avantgardnerio. This looks good overall and the logic is easy to follow. I will review again when #2797 is merged.
I double-checked: I made some adjustments, and I was able to run query 4 with the presently committed code:
This is slow, but matches my postgres results:
The remaining failing queries seem to fall into two categories:
Probably duplicated work with #2421
Closed in favor of #2885
Which issue does this PR close?
Closes #160.
Rationale for this change
In order to evaluate DataFusion as a candidate query engine, users need to be able to run industry-standard benchmarks like TPC-H. Query 4 is a good initial candidate, because it is blocked only by a relatively simple optimization rule to turn `exists` subqueries into `join`s.

This PR includes the minimum necessary changes to get Query 4 passing, but I believe this is a generalizable approach that will work for the remaining queries in the TPC-H suite that are blocked by subquery-related issues.

I wanted to open this PR early to start the conversation, but I intend to either submit subsequent PRs generalizing this approach, or extend this PR until all the TPC-H subquery cases are covered.
What changes are included in this PR?
An optimization rule for decorrelating a narrowly defined set of queries. Those not explicitly covered will remain unaltered.
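The equivalence the rule relies on can be sketched in plain Rust: a correlated `WHERE EXISTS` on a single join key keeps exactly the outer rows whose key appears in the subquery's input, which is what an inner join against the distinct key values produces. The table names (`orders`, `lineitem`) and both helper functions below are invented for this illustration:

```rust
use std::collections::HashSet;

// Baseline semantics: keep each outer row for which the correlated
// subquery would return at least one row.
fn where_exists(orders: &[i64], lineitem_keys: &[i64]) -> Vec<i64> {
    orders
        .iter()
        .copied()
        .filter(|o| lineitem_keys.iter().any(|l| l == o))
        .collect()
}

// Rewritten form: deduplicate the subquery's join key (the aggregate /
// distinct step), then perform an inner join on that key.
fn rewritten_as_join(orders: &[i64], lineitem_keys: &[i64]) -> Vec<i64> {
    let distinct: HashSet<i64> = lineitem_keys.iter().copied().collect();
    orders.iter().copied().filter(|o| distinct.contains(o)).collect()
}

fn main() {
    let orders = [1_i64, 2, 3, 4];
    let lineitem = [2_i64, 2, 4, 5];
    // Both plans return the same rows; the join form avoids re-running
    // the subquery per outer row.
    assert_eq!(where_exists(&orders, &lineitem), rewritten_as_join(&orders, &lineitem));
    assert_eq!(rewritten_as_join(&orders, &lineitem), vec![2, 4]);
    println!("{:?}", rewritten_as_join(&orders, &lineitem));
}
```

The deduplication step matters: without it, a key repeated in `lineitem` would duplicate the matching outer row, which `EXISTS` never does.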
Are there any user-facing changes?
Any single-column `where exists` correlated subquery should now be rewritten into a `join` and work.