Support non-tuple expression for exists-subquery to join #5264

ygf11 · 2023-02-13T12:14:35Z

Which issue does this PR close?

Closes #4934.
Closes #4366.

Rationale for this change

This SQL works correctly in DataFusion and is rewritten to a LeftSemi Join.

> explain select * from t1 where exists (select 1 from t2 where t1.t1_id = t2.t2_id);
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: t1.t1_id, t1.t1_name, t1.t1_int                                                                                                                      |
|               |   LeftSemi Join: t1.t1_id = t2.t2_id                                                                                                                             |
|               |     TableScan: t1 projection=[t1_id, t1_name, t1_int]                                                                                                            |
|               |     TableScan: t2 projection=[t2_id]

But the following SQL is not:

> select * from t1 where exists (select 1 from t2 where t1.t1_id + 1 = t2.t2_id * 2);

This SQL should also be rewritten to a LeftSemi Join.
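To illustrate why the second query is also a semi join, here is a minimal sketch in plain Rust that does not use DataFusion at all. The function names and the reduction of t1 and t2 to slices of ids are assumptions made for this example. It compares a row-by-row EXISTS evaluation with a semi join that treats t1_id + 1 and t2_id * 2 as computed join keys.

```rust
use std::collections::HashSet;

/// Naive EXISTS evaluation: for each t1 row, scan t2 for a row that
/// satisfies `t1_id + 1 = t2_id * 2`.
fn exists_nested_loop(t1_ids: &[i64], t2_ids: &[i64]) -> Vec<i64> {
    t1_ids
        .iter()
        .copied()
        .filter(|t1_id| t2_ids.iter().any(|t2_id| t1_id + 1 == t2_id * 2))
        .collect()
}

/// The same query as a left semi join: each side of the equality is a
/// computed join key, so the right-side keys can be hashed once.
fn left_semi_join(t1_ids: &[i64], t2_ids: &[i64]) -> Vec<i64> {
    let right_keys: HashSet<i64> = t2_ids.iter().map(|t2_id| t2_id * 2).collect();
    t1_ids
        .iter()
        .copied()
        .filter(|t1_id| right_keys.contains(&(t1_id + 1)))
        .collect()
}
```

Both functions keep each t1 row at most once and never emit columns from t2, which is the behaviour of a semi join.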

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

ygf11 · 2023-02-14T12:52:20Z

datafusion/optimizer/src/decorrelate_where_exists.rs

+ \n Projection: orders.o_custkey [o_custkey:Int64]\
+ \n TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]\
+ \n Projection: orders.o_custkey [o_custkey:Int64]\
+ \n TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]";


Before this PR, this rule did not keep the projection.
I think keeping the projection makes the plan easier to read, and Spark also keeps it.

ygf11 · 2023-02-14T12:54:36Z

datafusion/optimizer/src/decorrelate_where_exists.rs

 #[test]
 fn exists_subquery_no_cols() -> Result<()> {
 let sq = Arc::new(
 LogicalPlanBuilder::from(scan_tpch_table("orders"))
- .filter(col("customer.c_custkey").eq(col("customer.c_custkey")))?
+ .filter(col("customer.c_custkey").eq(lit(1u32)))?
 .project(vec![col("orders.o_custkey")])?


customer.c_custkey = customer.c_custkey is too specific here, so I changed it to customer.c_custkey = 1.

ygf11 · 2023-02-14T12:58:07Z

benchmarks/expected-plans/q21.txt

+ SubqueryAlias: l3
+ Filter: lineitem.l_receiptdate > lineitem.l_commitdate
+ TableScan: lineitem projection=[l_orderkey, l_suppkey, l_commitdate, l_receiptdate]


The difference is we keep the projection now.

ygf11 · 2023-02-14T12:59:49Z

datafusion/core/tests/sql/joins.rs

@@ -2187,7 +2187,6 @@ async fn left_anti_join() -> Result<()> {
 }

 #[tokio::test]
-#[ignore = "Test ignored, will be enabled after fixing the anti join plan bug"]
 // https://github.com/apache/arrow-datafusion/issues/4366


This test passes now

Linked issue #4366 to this pr.

ygf11 · 2023-02-14T13:00:33Z

Please take a look @alamb @jackwener.

alamb · 2023-02-15T15:05:24Z

I will try and find time to review this PR over the next day or two

alamb

Thank you @ygf11

I went over this code and the plan changes carefully and they look great to me.

The only thing I think needs to be done prior to merging is to double check the handling of distincts. Otherwise really nice work. Thank you 🏆

alamb · 2023-02-16T18:47:51Z

datafusion/optimizer/tests/integration-test.rs

- \n TableScan: test projection=[col_int32, col_uint32, col_utf8]";
+ \n Projection: t2.col_int32, t2.col_uint32\
+ \n SubqueryAlias: t2\
+ \n TableScan: test projection=[col_int32, col_uint32]";


this is a nice improvement too -- the unused column col_utf8 is filtered out

alamb · 2023-02-16T18:49:55Z

datafusion/optimizer/src/utils.rs

+ for expr in subquery_filter_exprs {
+ let cols = expr.to_columns()?;
+ if check_all_column_from_schema(&cols, input_schema.clone()) {
+ subquery_filters.push(expr.clone());


I wonder if you need to clone the expr here. It seems like it is already owned, so it could be used directly here and below.

Though I see this code was just refactored into a different module

ygf11 · 2023-02-17

I think we need the clone here.

subquery_filter_exprs is the result of the split_conjunction function, and its type is Vec<&Expr>.

https://github.com/apache/arrow-datafusion/blob/fed4019d556f4afb3156fd12c21608e08b8d7eb6/datafusion/optimizer/src/utils.rs#L65-L67
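The ownership point can be sketched with a toy expression type. Everything below is hypothetical and only stands in for the real code: the Expr enum is not DataFusion's Expr, and collect_filters and its keep callback are names invented for this example. The shape is the same, though: a splitter that returns borrowed conjuncts, and a caller that needs owned values.

```rust
/// Toy expression type standing in for DataFusion's `Expr` (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    And(Box<Expr>, Box<Expr>),
}

/// Split `a AND b AND c` into its conjuncts, borrowing from the input.
fn split_conjunction(expr: &Expr) -> Vec<&Expr> {
    match expr {
        Expr::And(left, right) => {
            let mut parts = split_conjunction(left);
            parts.extend(split_conjunction(right));
            parts
        }
        other => vec![other],
    }
}

/// Collect owned copies of the conjuncts accepted by `keep`.
fn collect_filters(predicate: &Expr, keep: impl Fn(&Expr) -> bool) -> Vec<Expr> {
    let mut filters: Vec<Expr> = Vec::new();
    for expr in split_conjunction(predicate) {
        if keep(expr) {
            // `expr` is a `&Expr`, so storing it in a `Vec<Expr>` needs a clone.
            filters.push(expr.clone());
        }
    }
    filters
}
```

Avoiding the clone would require the splitter to consume the predicate and return owned conjuncts instead of references.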

alamb · 2023-02-16T19:08:57Z

datafusion/optimizer/src/decorrelate_where_exists.rs


 // join our sub query into the main plan
 let join_type = match query_info.negated {
 true => JoinType::LeftAnti,
 false => JoinType::LeftSemi,
 };
+
+ // TODO: add Distinct if the original plan is a Distinct.


Is this still a todo? I see the code above looking into Distinct children, but the Distinct is not added back.

ygf11 · 2023-02-17

Yes, it is still a todo. Some distinct cases need more checks, for example:

// SELECT t1.t1_id,
//        t1.t1_name,
//        t1.t1_int
// FROM   t1
// WHERE  EXISTS(SELECT DISTINCT t2_int
//               FROM   t2
//               WHERE  t2.t2_id > t1.t1_id);

// if we just add back the `DISTINCT`, the result:
Projection: t1.t1_id, t1.t1_name, t1.t1_int
  LeftSemi Join: Filter: t2.t2_int > t1.t1_int
    TableScan: t1
    Distinct:
      Projection: t2.t2_int
        TableScan: t2

// expected result:
Projection: t1.t1_id, t1.t1_name, t1.t1_int
  LeftSemi Join: Filter: t2.t2_int > t1.t1_int
    TableScan: t1
    Distinct:
      Projection: t2.t2_int, t2.t2_id
        TableScan: t2

If we simply add the Distinct back, t2_id will not be in the projection.

The reason is that we only use the columns from the join filter as the projection items. When there is an outer Distinct, we should also include the columns from the original projection.

I will do this in the following pr if it is ok 🤣.
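As a rough sketch of the follow-up idea, the columns under a retained Distinct could be computed as the original projection plus the subquery columns referenced by the join filter. The function name and the use of plain strings for column names are assumptions made for this illustration, not the code from the follow-up PR.

```rust
/// Columns the subquery side must expose when a DISTINCT is kept:
/// the original projection first, then any join-filter columns not yet listed.
fn distinct_projection(original: &[&str], join_filter_cols: &[&str]) -> Vec<String> {
    let mut columns: Vec<String> = Vec::new();
    for col in original.iter().chain(join_filter_cols.iter()) {
        // Skip columns that are already part of the projection.
        if !columns.iter().any(|c| c.as_str() == *col) {
            columns.push((*col).to_string());
        }
    }
    columns
}
```

For the example above, an original projection of t2_int and a join filter that references t2_id give the projection t2_int, t2_id shown in the expected plan.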

alamb · 2023-02-17

I will do this in the following pr if it is ok 🤣.

Absolutely it is ok! Thank you for all your work on this PR

alamb

Thanks again @ygf11 !

There appears to be a conflict in this PR -- I think once that is fixed we can merge it in.

ygf11 · 2023-02-18T11:46:49Z

Thanks for reviewing @alamb. I fixed the conflict.

ursabot · 2023-02-18T16:22:01Z

Benchmark runs are scheduled for baseline = 5d5b1a0 and contender = 27b15fd. 27b15fd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

mingmwang · 2023-02-20T07:59:11Z

@ygf11
Thanks for taking care of this. I didn't get a chance to provide a fix for in/exists subqueries. I will take a closer look at your PRs.

mingmwang · 2023-02-20T09:41:30Z

@ygf11 @jackwener
I have a feeling that maybe we should combine the two rules DecorrelateWhereExists and DecorrelateWhereIn into one rule. What do you think?

ygf11 · 2023-02-20T12:15:36Z

I have a feeling that maybe we should combine the two rules DecorrelateWhereExists and DecorrelateWhereIn into one rule. What do you think?

Agreed, most of the logic is the same, so we should combine them. I can take this task if nobody else is working on it 😄.

mingmwang · 2023-02-20T14:02:15Z

One more thing: it looks like the SubqueryAlias logic is not consistent between the two rules DecorrelateWhereExists and DecorrelateWhereIn
#4767

I think you can make them consistent in the refactoring work.

        let sql = "SELECT t1_id, t1_name FROM t1 WHERE t1_id IN (SELECT t2_id FROM t2) ORDER BY t1_id";
        let msg = format!("Creating logical plan for '{sql}'");
        let dataframe = ctx.sql(&("explain ".to_owned() + sql)).await.expect(&msg);
        let plan = dataframe.into_optimized_plan()?;
        let expected = vec![
            "Explain [plan_type:Utf8, plan:Utf8]",
            "  Sort: t1.t1_id ASC NULLS LAST [t1_id:UInt32;N, t1_name:Utf8;N]",
            "    Projection: t1.t1_id, t1.t1_name [t1_id:UInt32;N, t1_name:Utf8;N]",
            "      LeftSemi Join: t1.t1_id = __correlated_sq_1.t2_id [t1_id:UInt32;N, t1_name:Utf8;N]",
            "        TableScan: t1 projection=[t1_id, t1_name] [t1_id:UInt32;N, t1_name:Utf8;N]",
            "        SubqueryAlias: __correlated_sq_1 [t2_id:UInt32;N]",
            "          Projection: t2.t2_id AS t2_id [t2_id:UInt32;N]",
            "            TableScan: t2 projection=[t2_id] [t2_id:UInt32;N]",
        ];

        let sql = "SELECT t1_id, t1_name FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t1_id = t2_id and t1_id > 11) ORDER BY t1_id";
        let msg = format!("Creating logical plan for '{sql}'");
        let dataframe = ctx.sql(&("explain ".to_owned() + sql)).await.expect(&msg);
        let plan = dataframe.into_optimized_plan()?;
        let expected = vec![
            "Explain [plan_type:Utf8, plan:Utf8]",
            "  Sort: t1.t1_id ASC NULLS LAST [t1_id:UInt32;N, t1_name:Utf8;N]",
            "    Projection: t1.t1_id, t1.t1_name [t1_id:UInt32;N, t1_name:Utf8;N]",
            "      LeftAnti Join: t1.t1_id = t2.t2_id Filter: t1.t1_id > UInt32(11) [t1_id:UInt32;N, t1_name:Utf8;N]",
            "        TableScan: t1 projection=[t1_id, t1_name] [t1_id:UInt32;N, t1_name:Utf8;N]",
            "        Projection: t2.t2_id [t2_id:UInt32;N]",
            "          TableScan: t2 projection=[t2_id] [t2_id:UInt32;N]",
        ];

mingmwang · 2023-02-20T14:08:09Z

BTW, this PR looks really nice and fixes a couple of issues related to the Exists subquery.
Thank you @ygf11.

jackwener · 2023-02-21T05:02:07Z

@ygf11 @jackwener I have a feeling that maybe we should combine the two rules DecorrelateWhereExists and DecorrelateWhereIn into one rule. What do you think?

I agree.

After unification, changes to the two rules will stay synchronized instead of diverging.

* Support non-tuple expression for exists-subquery to join
* fix tests
* add tests
* add comments
* fix tests
* fix test comment

Support non-tuple expression for exists-subquery to join

3219bf5

github-actions bot added the core (Core DataFusion crate) and optimizer (Optimizer rules) labels Feb 13, 2023

ygf11 added 4 commits February 14, 2023 02:42

fix tests

e4d2c3d

add tests

5e7f8e2

add comments

9bc38cc

fix tests

a2f5b51

ygf11 commented Feb 14, 2023


ygf11 marked this pull request as ready for review February 14, 2023 12:58

ygf11 commented Feb 14, 2023


alamb reviewed Feb 16, 2023


Merge branch upstream/main into where-exists

84e19e4

alamb approved these changes Feb 17, 2023


ygf11 added 2 commits February 18, 2023 06:01

Merge upstream/main and resolve confilicts

a8366ea

fix test comment

f2435aa

alamb merged commit 27b15fd into apache:main Feb 18, 2023

This was referenced Feb 20, 2023

Add back Distinct for where-exists if subquery is a DISTINCT #5344

Closed

Refactor DecorrelateWhereExists and add back Distinct if needs #5345

Merged

ygf11 deleted the where-exists branch February 20, 2023 12:15
