Optimize `where exists` sub-queries into aggregate and join #2813
Conversation
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…evel earlier in tree
```rust
@@ -37,6 +37,8 @@ use std::{convert::TryFrom, fmt, iter::repeat, sync::Arc};

/// Represents a dynamically typed, nullable single value.
/// This is the single-valued counter-part of arrow's `Array`.
/// https://arrow.apache.org/docs/python/api/datatypes.html
/// https://github.com/apache/arrow/blob/master/format/Schema.fbs#L354-L375
```
Sorry, this was built upon #2797. I'll turn this into a draft until that gets merged.
```rust
@@ -483,7 +484,37 @@ fn get_tpch_table_schema(table: &str) -> Schema {
        Field::new("n_comment", DataType::Utf8, false),
    ]),

    _ => unimplemented!(),
    "supplier" => Schema::new(vec![
```
Add missing TPC-H tables to support testing those queries.
```rust
register_tpch_csv(&ctx, "orders").await?;
register_tpch_csv(&ctx, "lineitem").await?;

/*
```
Annotate plan with variable names from optimizer code for cross-correlation.
```rust
// TODO: arbitrary expressions
Expr::Exists { subquery, negated } => {
    if *negated {
        return Ok(plan.clone());
```
In any case of doubt, fall back to skipping optimization, following the "do no harm" rule.
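The "do no harm" fallback can be sketched as a rule that rewrites only the shapes it recognizes and returns everything else untouched. This is a minimal, hypothetical sketch — the `Plan` enum and `optimize` function below are invented for illustration and are not the DataFusion API:

```rust
// Hypothetical mini-optimizer: rewrite only the supported pattern,
// return any other plan exactly as received ("do no harm").
#[derive(Clone, Debug, PartialEq)]
enum Plan {
    Filter(String),
    Exists { negated: bool },
}

fn optimize(plan: &Plan) -> Plan {
    match plan {
        // Only the supported, non-negated EXISTS case is rewritten.
        Plan::Exists { negated: false } => Plan::Filter("rewritten as join".to_string()),
        // Anything unrecognized is left exactly as it was.
        other => other.clone(),
    }
}

fn main() {
    // The negated case is unsupported, so the plan passes through unchanged.
    let unsupported = Plan::Exists { negated: true };
    assert_eq!(optimize(&unsupported), unsupported);
    // The supported case is rewritten.
    assert_eq!(
        optimize(&Plan::Exists { negated: false }),
        Plan::Filter("rewritten as join".to_string())
    );
    println!("ok");
}
```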
```rust
// Only operate if one column is present and the other closed upon from outside scope
let found: Vec<_> = cols.intersection(&fields).map(|it| (*it).clone()).collect();
let closed_upon: Vec<_> = cols.difference(&fields).map(|it| (*it).clone()).collect();
```
We should truly resolve the closed-upon scope here, instead of assuming that a column absent from the present scope must exist somewhere outside it. Queries will fail either way, but this assumption could make the error messages significantly more difficult for users to debug.
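To make the intersection/difference split concrete, here is a self-contained sketch of the partitioning logic. The column names (`l_orderkey`, `o_orderkey`, `l_commitdate`) are borrowed from TPC-H for flavor; the `split_columns` helper is invented for the example:

```rust
use std::collections::HashSet;

// Split the columns referenced by a subquery predicate into those found in
// the subquery's own schema ("local") and those assumed to be closed upon
// from the outer scope. As noted above, the second bucket is optimistic:
// an unknown column might simply not exist anywhere.
fn split_columns(
    referenced: &HashSet<String>,
    schema_fields: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    let found: Vec<String> = referenced.intersection(schema_fields).cloned().collect();
    let closed_upon: Vec<String> = referenced.difference(schema_fields).cloned().collect();
    (found, closed_upon)
}

fn main() {
    let referenced: HashSet<String> =
        ["l_orderkey", "o_orderkey"].iter().map(|s| s.to_string()).collect();
    let fields: HashSet<String> =
        ["l_orderkey", "l_commitdate"].iter().map(|s| s.to_string()).collect();
    let (found, closed_upon) = split_columns(&referenced, &fields);
    // l_orderkey is local; o_orderkey is presumed to come from the outer query.
    assert_eq!(found, vec!["l_orderkey".to_string()]);
    assert_eq!(closed_upon, vec!["o_orderkey".to_string()]);
    println!("found={:?} closed_upon={:?}", found, closed_upon);
}
```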
```rust
let group_expr = vec![Expr::Column(found.as_str().into())];
let aggr_expr: Vec<Expr> = vec![];
let join_keys = (c_col.clone(), f_col.clone());
let right = LogicalPlanBuilder::from((*filter.input).clone())
    .aggregate(group_expr, aggr_expr)?
```
You could just use `distinct` rather than create the aggregate. It is semantically equivalent and results in a simpler logical plan. It will get translated into an aggregate in the physical plan.
```diff
-let group_expr = vec![Expr::Column(found.as_str().into())];
-let aggr_expr: Vec<Expr> = vec![];
-let join_keys = (c_col.clone(), f_col.clone());
-let right = LogicalPlanBuilder::from((*filter.input).clone())
-    .aggregate(group_expr, aggr_expr)?
+let join_keys = (c_col.clone(), f_col.clone());
+let right = LogicalPlanBuilder::from((*filter.input).clone())
+    .distinct()?
```
Actually, I may have misunderstood. I thought this was grouping on all the columns but it looks that is not the case so please disregard this suggestion.
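For readers following the `distinct` vs. aggregate discussion: the reason the two are interchangeable is that a GROUP BY over a key with an empty aggregate list emits exactly one row per key, which is the same key set DISTINCT produces. A minimal sketch, with made-up data and helper names (not DataFusion code):

```rust
use std::collections::BTreeSet;

// DISTINCT over a key column: deduplicate the values.
fn distinct(keys: &[i64]) -> Vec<i64> {
    keys.iter().copied().collect::<BTreeSet<i64>>().into_iter().collect()
}

// GROUP BY the same key with no aggregate expressions: one row per group,
// which is the same set of keys.
fn group_by_no_aggregates(keys: &[i64]) -> Vec<i64> {
    let mut groups = BTreeSet::new();
    for k in keys {
        groups.insert(*k);
    }
    groups.into_iter().collect()
}

fn main() {
    let orderkeys = [1_i64, 2, 2, 3, 1];
    // Both forms yield the deduplicated key set.
    assert_eq!(distinct(&orderkeys), group_by_no_aggregates(&orderkeys));
    assert_eq!(distinct(&orderkeys), vec![1, 2, 3]);
    println!("{:?}", distinct(&orderkeys));
}
```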
Codecov Report
```diff
@@           Coverage Diff            @@
##           master    #2813    +/-   ##
==========================================
- Coverage   85.20%   85.19%   -0.02%
==========================================
  Files         274      276       +2
  Lines       48666    48848     +182
==========================================
+ Hits        41468    41616     +148
- Misses       7198     7232      +34
```
Continue to review full report at Codecov.
```rust
let filter = if let LogicalPlan::Filter(f) = sub_input {
    f
} else {
    return Ok(plan.clone());
};
```
I think it might be more idiomatic to use a `match` for these patterns.
```diff
-let filter = if let LogicalPlan::Filter(f) = sub_input {
-    f
-} else {
-    return Ok(plan.clone());
-};
+let filter = match sub_input {
+    LogicalPlan::Filter(f) => f,
+    _ => return Ok(plan.clone()),
+};
```
```rust
let fields: HashSet<_> = sub_input
    .schema()
    .fields()
    .iter()
    .map(|f| f.name())
    .collect();
```
You should be able to get a hashset of qualified names like this:
```diff
-let fields: HashSet<_> = sub_input
-    .schema()
-    .fields()
-    .iter()
-    .map(|f| f.name())
-    .collect();
+let fields = HashSet::from_iter(sub_input
+    .schema()
+    .field_names());
```
Thanks @avantgardnerio. This looks good overall and the logic is easy to follow. I will review again when #2797 is merged.
I double-checked: I made some adjustments, and I was able to run query 4 with the presently committed code:
This is slow, but matches my postgres results:
The remaining failing queries seem to fall into two categories:
Probably duplicated work with #2421
Closed in favor of #2885
Which issue does this PR close?
Closes #160.
Rationale for this change
In order to evaluate DataFusion as a candidate query engine, users need to be able to run industry-standard benchmarks like TPC-H. Query 4 is a good initial candidate, because it is blocked only by a relatively simple optimization rule to turn `exists` subqueries into `join`s.

This PR includes the minimum necessary changes to get Query 4 passing, but I believe this is a generalizable approach that will work for the remaining queries in the TPC-H suite that are blocked by subquery-related issues.

I wanted to open this PR early to start the conversation, but I intend to either submit subsequent PRs generalizing this approach, or extend this PR until all the TPC-H subquery cases are covered.
What changes are included in this PR?
An optimization rule for decorrelating a narrowly defined set of queries. Those not explicitly covered will remain unaltered.
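The equivalence the rule relies on can be sketched in plain Rust: a correlated `WHERE EXISTS` on a single join key keeps exactly the outer rows whose key appears in the subquery's input, which is what an inner join against the distinct key values produces. The table names (`orders`, `lineitem`) and both helper functions below are invented for this illustration:

```rust
use std::collections::HashSet;

// Baseline semantics: keep each outer row for which the correlated
// subquery would return at least one row.
fn where_exists(orders: &[i64], lineitem_keys: &[i64]) -> Vec<i64> {
    orders
        .iter()
        .copied()
        .filter(|o| lineitem_keys.iter().any(|l| l == o))
        .collect()
}

// Rewritten form: deduplicate the subquery's join key (the aggregate /
// distinct step), then perform an inner join on that key.
fn rewritten_as_join(orders: &[i64], lineitem_keys: &[i64]) -> Vec<i64> {
    let distinct: HashSet<i64> = lineitem_keys.iter().copied().collect();
    orders.iter().copied().filter(|o| distinct.contains(o)).collect()
}

fn main() {
    let orders = [1_i64, 2, 3, 4];
    let lineitem = [2_i64, 2, 4, 5];
    // Both plans return the same rows; the join form avoids re-running
    // the subquery per outer row.
    assert_eq!(where_exists(&orders, &lineitem), rewritten_as_join(&orders, &lineitem));
    assert_eq!(rewritten_as_join(&orders, &lineitem), vec![2, 4]);
    println!("{:?}", rewritten_as_join(&orders, &lineitem));
}
```

The deduplication step matters: without it, a key repeated in `lineitem` would duplicate the matching outer row, which `EXISTS` never does.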
Are there any user-facing changes?
Any single-column `where exists` correlated subquery should now be rewritten into a `join` and work.