Support non-tuple expression for exists-subquery to join #5264

Merged
merged 8 commits into from
Feb 18, 2023

Conversation

ygf11
Contributor

@ygf11 ygf11 commented Feb 13, 2023

Which issue does this PR close?

Closes #4934.
Closes #4366.

Rationale for this change

This SQL works correctly in DataFusion and is rewritten to a LeftSemi Join.

> explain select * from t1 where exists (select 1 from t2 where t1.t1_id = t2.t2_id);
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: t1.t1_id, t1.t1_name, t1.t1_int                                                                                                                      |
|               |   LeftSemi Join: t1.t1_id = t2.t2_id                                                                                                                             |
|               |     TableScan: t1 projection=[t1_id, t1_name, t1_int]                                                                                                            |
|               |     TableScan: t2 projection=[t2_id]    

But the following SQL is not:

> select * from t1 where exists (select 1 from t2 where t1.t1_id + 1 = t2.t2_id * 2);

Queries like this should also be rewritten to a LeftSemi Join.
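
As a rough illustration of why the rewrite is valid, the sketch below evaluates the second query two ways: row by row as an EXISTS filter, and as a left semi join keyed on the two expressions `t1_id + 1` and `t2_id * 2`. It uses plain Rust with made-up data and hypothetical function names (`exists_filter`, `left_semi_join`); it is not DataFusion code, and it ignores NULL handling.

```rust
use std::collections::HashSet;

/// Row-by-row evaluation of
/// `SELECT t1_id FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t1_id + 1 = t2_id * 2)`.
/// (Hypothetical helper; NULLs are not modelled.)
fn exists_filter(t1: &[i64], t2: &[i64]) -> Vec<i64> {
    t1.iter()
        .copied()
        .filter(|l| t2.iter().any(|r| l + 1 == r * 2))
        .collect()
}

/// The same query expressed as a left semi join on the two key expressions.
fn left_semi_join(t1: &[i64], t2: &[i64]) -> Vec<i64> {
    // Build side: evaluate the right-hand expression once per t2 row.
    let right_keys: HashSet<i64> = t2.iter().map(|r| r * 2).collect();
    // Probe side: each t1 row is emitted at most once if its key has a match.
    t1.iter()
        .copied()
        .filter(|l| right_keys.contains(&(l + 1)))
        .collect()
}

fn main() {
    // Made-up sample data, not taken from the PR's tests.
    let t1 = [1, 2, 3, 5, 7];
    let t2 = [1, 2, 3, 4];
    assert_eq!(exists_filter(&t1, &t2), left_semi_join(&t1, &t2));
    println!("{:?}", left_semi_join(&t1, &t2));
}
```

Both functions keep each `t1` row at most once, which is the property that distinguishes a semi join from an inner join.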

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate optimizer Optimizer rules labels Feb 13, 2023
\n Projection: orders.o_custkey [o_custkey:Int64]\
\n TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]\
\n Projection: orders.o_custkey [o_custkey:Int64]\
\n TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]";
Contributor Author

Before this PR, this rule did not keep the projection.
I think keeping the projection makes the plan easier to read, and Spark also keeps it.

 #[test]
 fn exists_subquery_no_cols() -> Result<()> {
     let sq = Arc::new(
         LogicalPlanBuilder::from(scan_tpch_table("orders"))
-            .filter(col("customer.c_custkey").eq(col("customer.c_custkey")))?
+            .filter(col("customer.c_custkey").eq(lit(1u32)))?
             .project(vec![col("orders.o_custkey")])?
Contributor Author

customer.c_custkey = customer.c_custkey is too specific here, so I changed it to customer.c_custkey = 1.

SubqueryAlias: l3
Filter: lineitem.l_receiptdate > lineitem.l_commitdate
TableScan: lineitem projection=[l_orderkey, l_suppkey, l_commitdate, l_receiptdate]
Contributor Author

The difference is that we now keep the projection.

@ygf11 ygf11 marked this pull request as ready for review February 14, 2023 12:58
@@ -2187,7 +2187,6 @@ async fn left_anti_join() -> Result<()> {
 }

 #[tokio::test]
-#[ignore = "Test ignored, will be enabled after fixing the anti join plan bug"]
 // https://github.com/apache/arrow-datafusion/issues/4366
Contributor Author

This test passes now

Contributor Author

Linked issue #4366 to this pr.

@ygf11
Contributor Author

ygf11 commented Feb 14, 2023

Please take a look @alamb @jackwener.

@alamb
Contributor

alamb commented Feb 15, 2023

I will try and find time to review this PR over the next day or two

Contributor

@alamb alamb left a comment

Thank you @ygf11

I went over this code and the plan changes carefully and they look great to me.

The only thing I think needs to be done prior to merging is to double-check the handling of distincts. Otherwise really nice work. Thank you 🏆

\n TableScan: test projection=[col_int32, col_uint32, col_utf8]";
\n Projection: t2.col_int32, t2.col_uint32\
\n SubqueryAlias: t2\
\n TableScan: test projection=[col_int32, col_uint32]";
Contributor

this is a nice improvement too -- the unused column col_utf8 is filtered out

for expr in subquery_filter_exprs {
    let cols = expr.to_columns()?;
    if check_all_column_from_schema(&cols, input_schema.clone()) {
        subquery_filters.push(expr.clone());
Contributor

I wonder if you need to clone the expr here. It seems like it is already owned, so it could be used directly here and below.

Though I see this code was just refactored into a different module

Contributor Author

I think we need to clone here.

subquery_filter_exprs is the result of the split_conjunction function, and its type is Vec<&Expr>.

https://github.com/apache/arrow-datafusion/blob/fed4019d556f4afb3156fd12c21608e08b8d7eb6/datafusion/optimizer/src/utils.rs#L65-L67
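
To make the borrowing point concrete, here is a minimal sketch with a toy `Expr` enum standing in for DataFusion's. `split_conjunction` below only imitates the shape described above (it returns `Vec<&Expr>`); the `columns` and `partition` helpers and the schema-as-string-slice are assumptions for illustration, not the optimizer's real code.

```rust
/// Toy expression type; a stand-in for DataFusion's `Expr`, not the real one.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Split `a AND b AND c` into `[a, b, c]`, borrowing from the input tree.
fn split_conjunction(expr: &Expr) -> Vec<&Expr> {
    match expr {
        Expr::And(l, r) => {
            let mut out = split_conjunction(l);
            out.extend(split_conjunction(r));
            out
        }
        other => vec![other],
    }
}

/// Collect every column name referenced by `expr` (hypothetical helper).
fn columns(expr: &Expr, out: &mut Vec<String>) {
    match expr {
        Expr::Column(name) => out.push(name.clone()),
        Expr::Literal(_) => {}
        Expr::Eq(l, r) | Expr::And(l, r) => {
            columns(l, out);
            columns(r, out);
        }
    }
}

/// Partition conjuncts into (subquery-only filters, remaining join filters).
fn partition(predicate: &Expr, subquery_cols: &[&str]) -> (Vec<Expr>, Vec<Expr>) {
    let (mut subquery_filters, mut join_filters) = (Vec::new(), Vec::new());
    for expr in split_conjunction(predicate) {
        let mut cols = Vec::new();
        columns(expr, &mut cols);
        // `expr` is `&Expr` here, so storing an owned `Expr` needs `clone()`.
        if cols.iter().all(|c| subquery_cols.contains(&c.as_str())) {
            subquery_filters.push(expr.clone());
        } else {
            join_filters.push(expr.clone());
        }
    }
    (subquery_filters, join_filters)
}

fn main() {
    let col = |n: &str| Box::new(Expr::Column(n.to_string()));
    // customer.c_custkey = orders.o_custkey AND orders.o_orderkey = 1
    let predicate = Expr::And(
        Box::new(Expr::Eq(col("customer.c_custkey"), col("orders.o_custkey"))),
        Box::new(Expr::Eq(col("orders.o_orderkey"), Box::new(Expr::Literal(1)))),
    );
    let (subquery_filters, join_filters) =
        partition(&predicate, &["orders.o_custkey", "orders.o_orderkey"]);
    println!("subquery filters: {subquery_filters:?}");
    println!("join filters: {join_filters:?}");
}
```

Removing either `clone()` fails to compile because a `&Expr` cannot be pushed into a `Vec<Expr>`; avoiding the clone would require a splitting function that consumes the predicate and returns owned expressions.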


// join our sub query into the main plan
let join_type = match query_info.negated {
    true => JoinType::LeftAnti,
    false => JoinType::LeftSemi,
};

// TODO: add Distinct if the original plan is a Distinct.
Contributor

Is this still a TODO? I see the code above looking into Distinct children, but the Distinct is not added back.

Contributor Author

@ygf11 ygf11 Feb 17, 2023

Yes, it is still a TODO; some distinct cases need more checks, for example:

// SELECT t1.t1_id,
//       t1.t1_name,
//       t1.t1_int
// FROM   t1
// WHERE  EXISTS(SELECT DISTINCT t2_int
//              FROM   t2
//              WHERE  t2.t2_id > t1.t1_id); 

// if we just add back the `DISTINCT`, the result:
Projection: t1.t1_id, t1.t1_name, t1.t1_int
  LeftSemi Join:  Filter: t2.t2_id > t1.t1_id
    TableScan: t1
    Distinct:
      Projection: t2.t2_int
         TableScan: t2
 
// expected result:
Projection: t1.t1_id, t1.t1_name, t1.t1_int
  LeftSemi Join:  Filter: t2.t2_id > t1.t1_id
    TableScan: t1
    Distinct:
      Projection: t2.t2_int, t2.t2_id
         TableScan: t2

In the first plan, t2_id is missing from the projection.

The reason is that we only consider the columns from the join filter as the projection items; when there is an outer Distinct, we should also consider the columns from the original projection.

I will do this in the following pr if it is ok 🤣.
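
One possible reading of that follow-up, sketched in plain Rust: the subquery side projects the original projection columns first, then any of its own columns that the join filter still needs. The function name `distinct_projection_columns`, the string-based column names, and the table-prefix check are all assumptions for illustration; this is not how the optimizer rule is implemented.

```rust
/// Columns the subquery side would project under a `Distinct`:
/// the original projection, plus subquery columns used by the join filter.
/// (Hypothetical helper; columns are modelled as qualified name strings.)
fn distinct_projection_columns(
    original_projection: &[&str],
    join_filter_cols: &[&str],
    subquery_table: &str,
) -> Vec<String> {
    let mut out: Vec<String> = original_projection
        .iter()
        .map(|c| (*c).to_string())
        .collect();
    let prefix = format!("{subquery_table}.");
    for col in join_filter_cols {
        // Skip outer-query columns and columns that are already projected.
        if col.starts_with(&prefix) && !out.iter().any(|c| c.as_str() == *col) {
            out.push((*col).to_string());
        }
    }
    out
}

fn main() {
    // EXISTS(SELECT DISTINCT t2_int FROM t2 WHERE t2.t2_id > t1.t1_id)
    let cols = distinct_projection_columns(&["t2.t2_int"], &["t2.t2_id", "t1.t1_id"], "t2");
    println!("Projection: {}", cols.join(", "));
}
```

For the example query this yields the two columns shown in the expected plan, `t2.t2_int` and `t2.t2_id`.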

Contributor

@alamb alamb Feb 17, 2023

> I will do this in the following pr if it is ok 🤣.

Absolutely it is ok! Thank you for all your work on this PR

Contributor

@alamb alamb left a comment

Thanks again @ygf11 !

There appears to be a conflict in this PR -- I think once that is fixed we can merge it in.

@ygf11
Contributor Author

ygf11 commented Feb 18, 2023

Thanks for reviewing @alamb. I fixed the conflict.

@alamb alamb merged commit 27b15fd into apache:main Feb 18, 2023
@ursabot

ursabot commented Feb 18, 2023

Benchmark runs are scheduled for baseline = 5d5b1a0 and contender = 27b15fd. 27b15fd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@mingmwang
Contributor

@ygf11
Thanks for taking care of this. I didn't get a chance to provide a fix for IN/EXISTS subqueries. I will take a closer look at your PRs.

@mingmwang
Contributor

@ygf11 @jackwener
I have a feeling that maybe we should combine the two rules DecorrelateWhereExists and DecorrelateWhereIn into one rule. What do you think?

@ygf11
Contributor Author

ygf11 commented Feb 20, 2023

> I have a feeling that maybe we should combine the two rules DecorrelateWhereExists and DecorrelateWhereIn into one rule. What do you think?

Agreed, most of the logic is the same, so we should combine them. I can take this task if no one else is working on it 😄.

@ygf11 ygf11 deleted the where-exists branch February 20, 2023 12:15
@mingmwang
Contributor

One more thing: it looks like the SubqueryAlias logic is not consistent between the two rules DecorrelateWhereExists and DecorrelateWhereIn.
#4767

I think you can make them consistent in the refactoring work.

        let sql = "SELECT t1_id, t1_name FROM t1 WHERE t1_id IN (SELECT t2_id FROM t2) ORDER BY t1_id";
        let msg = format!("Creating logical plan for '{sql}'");
        let dataframe = ctx.sql(&("explain ".to_owned() + sql)).await.expect(&msg);
        let plan = dataframe.into_optimized_plan()?;
        let expected = vec![
            "Explain [plan_type:Utf8, plan:Utf8]",
            "  Sort: t1.t1_id ASC NULLS LAST [t1_id:UInt32;N, t1_name:Utf8;N]",
            "    Projection: t1.t1_id, t1.t1_name [t1_id:UInt32;N, t1_name:Utf8;N]",
            "      LeftSemi Join: t1.t1_id = __correlated_sq_1.t2_id [t1_id:UInt32;N, t1_name:Utf8;N]",
            "        TableScan: t1 projection=[t1_id, t1_name] [t1_id:UInt32;N, t1_name:Utf8;N]",
            "        SubqueryAlias: __correlated_sq_1 [t2_id:UInt32;N]",
            "          Projection: t2.t2_id AS t2_id [t2_id:UInt32;N]",
            "            TableScan: t2 projection=[t2_id] [t2_id:UInt32;N]",
        ];

        let sql = "SELECT t1_id, t1_name FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t1_id = t2_id and t1_id > 11) ORDER BY t1_id";
        let msg = format!("Creating logical plan for '{sql}'");
        let dataframe = ctx.sql(&("explain ".to_owned() + sql)).await.expect(&msg);
        let plan = dataframe.into_optimized_plan()?;
        let expected = vec![
            "Explain [plan_type:Utf8, plan:Utf8]",
            "  Sort: t1.t1_id ASC NULLS LAST [t1_id:UInt32;N, t1_name:Utf8;N]",
            "    Projection: t1.t1_id, t1.t1_name [t1_id:UInt32;N, t1_name:Utf8;N]",
            "      LeftAnti Join: t1.t1_id = t2.t2_id Filter: t1.t1_id > UInt32(11) [t1_id:UInt32;N, t1_name:Utf8;N]",
            "        TableScan: t1 projection=[t1_id, t1_name] [t1_id:UInt32;N, t1_name:Utf8;N]",
            "        Projection: t2.t2_id [t2_id:UInt32;N]",
            "          TableScan: t2 projection=[t2_id] [t2_id:UInt32;N]",
        ];
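
Apart from the SubqueryAlias, the two expected plans differ in join type: the IN subquery becomes a LeftSemi Join and the NOT EXISTS subquery becomes a LeftAnti Join, which mirrors the `negated` match in this PR. The sketch below models that choice with a toy `JoinType` enum and made-up table contents; `join_type_for` and `join` are hypothetical names, and the real operators work on record batches rather than slices.

```rust
use std::collections::HashSet;

/// Toy join type; a stand-in for DataFusion's `JoinType`, not the real one.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JoinType {
    LeftSemi,
    LeftAnti,
}

/// `NOT EXISTS` (negated) maps to an anti join, `EXISTS` to a semi join.
fn join_type_for(negated: bool) -> JoinType {
    if negated {
        JoinType::LeftAnti
    } else {
        JoinType::LeftSemi
    }
}

/// Equi-join on `left = right` with an extra filter over the left key.
fn join(
    join_type: JoinType,
    left: &[u32],
    right: &[u32],
    join_filter: impl Fn(u32) -> bool,
) -> Vec<u32> {
    let right_keys: HashSet<u32> = right.iter().copied().collect();
    left.iter()
        .copied()
        .filter(|l| {
            // A left row matches only if the key exists AND the filter passes.
            let matched = right_keys.contains(l) && join_filter(*l);
            match join_type {
                JoinType::LeftSemi => matched,
                JoinType::LeftAnti => !matched,
            }
        })
        .collect()
}

fn main() {
    // Made-up table contents, not the data used by the PR's tests.
    let t1 = [11, 22, 33, 44];
    let t2 = [11, 22, 44, 55];
    // WHERE t1_id IN (SELECT t2_id FROM t2)
    println!("{:?}", join(join_type_for(false), &t1, &t2, |_| true));
    // WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t1_id = t2_id AND t1_id > 11)
    println!("{:?}", join(join_type_for(true), &t1, &t2, |id| id > 11));
}
```

Note that in the anti join the row with `t1_id = 11` is kept even though its key exists in `t2`, because the extra filter `t1_id > 11` is part of the match condition rather than a post-filter.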

@mingmwang
Contributor

BTW, this PR looks really nice and fixes a couple of issues related to the EXISTS subquery.
Thank you @ygf11.

@jackwener
Member

jackwener commented Feb 21, 2023

> @ygf11 @jackwener I have a feeling that maybe we should combine the two rules DecorrelateWhereExists and DecorrelateWhereIn into one rule. What do you think?

Agreed.

After unification, changes to the two rules will stay synchronized instead of diverging.

jiangzhx pushed a commit to jiangzhx/arrow-datafusion that referenced this pull request Feb 24, 2023
* Support non-tuple expression for exists-subquery to join

* fix tests

* add tests

* add comments

* fix tests

* fix test comment
Labels
core Core DataFusion crate optimizer Optimizer rules
5 participants