Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for non-column key for equijoin when eliminating cross join to inner join #4443

Merged
merged 12 commits into from
Dec 9, 2022

Conversation

ygf11
Copy link
Contributor

@ygf11 ygf11 commented Nov 30, 2022

Which issue does this PR close?

Closes #4442.
Part of #4389.

Rationale for this change

What changes are included in this PR?

  • Add logics to support non-column join key in EliminateCrossJoin.
  • Fix duplicated column name when join keys contain cast(column as xxx).
  • Add unit tests.

Are these changes tested?

Yes, unit tests will test it.

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner labels Nov 30, 2022
@ygf11 ygf11 changed the title Add support for non-column key for equijoin when eliminating cross join to inner Add support for non-column key for equijoin when eliminating cross join to inner join Nov 30, 2022
@ygf11 ygf11 marked this pull request as ready for review November 30, 2022 15:12
@ygf11
Copy link
Contributor Author

ygf11 commented Dec 1, 2022

cc @alamb @jackwener @mingmwang

@liukun4515
Copy link
Contributor

@ygf11 thanks for your contribution, could you please add some IT in the datafusion/core/tests/sql/joins.rs file with your SQL like

select * from test0 as t0 cross join test1 as t1 where t0.a + 1 = t1.a * 2;
select * from test0 as t0 cross join test1 as t1 where t0.a * 2 = t1.a - 1;

@github-actions github-actions bot added the core Core DataFusion crate label Dec 2, 2022
@ygf11
Copy link
Contributor Author

ygf11 commented Dec 2, 2022

thanks for your contribution, could you please add some IT in the datafusion/core/tests/sql/joins.rs file with your SQL like

@liukun4515 Added.

Comment on lines +2372 to +2373
" Projection: t1.t1_id, t2.t2_id, t1.t1_name [t1_id:UInt32;N, t2_id:UInt32;N, t1_name:Utf8;N]",
" Projection: t1.t1_id, t1.t1_name, t2.t2_id [t1_id:UInt32;N, t1_name:Utf8;N, t2_id:UInt32;N]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these two projections are not collapsed. This is unrelated to this PR. But we need to take a look at the PushDownProjection rule.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @jackwener
maybe @jackwener has fixed that in #4487

Copy link
Member

@jackwener jackwener Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't related with this PR. This projection is generated due to keep the same order.
It should need do more optimization in PushDownProjection.

Recent I am doing a enhancement about PushDownProjection, it should can resolve this.

Comment on lines +1025 to +1031
// The display_name() of cast expression will ignore the cast info, and show the inner expression name.
// If we do not add alais, it will throw same field name error in the schema when adding projection.
// For example:
// input scan : [a, b, c],
// join keys: [cast(a as int)]
//
// then a and cast(a as int) will use the same field name - `a` in projection schema.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the Cast expressions ignore the cast info ? Looks like this is not the expected behavior.
I think no matter the Cast is explicit or implicit, it should display the cast info.

https://github.com/apache/arrow-datafusion/blob/cedb05aedf3cea030bfa8774b8575d8f4806a1c8/datafusion/expr/src/expr.rs#L1108-L1115

@andygrove @liukun4515 @jackwener
How do you think?

Copy link
Member

@jackwener jackwener Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we need to consider to the Cast.

  1. sql can explicitly include cast
  2. after infer cast, it will generate cast.

But I need to time to think carefully about its detail.

@mingmwang
Copy link
Contributor

The PR itself LGTM.

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM generally.
But there may be some problem cooperate with other rules, but not related to this one.I will try to investigate and think about them.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at several of the tests carefully and skimmed the rest of this PR -- I also appreciate @jackwener 's PR.

Thank you very much @ygf11 -- I'll plan to merge this PR tomorrow unless there are any additional comments


// reduce to inner join, t2.t2_id will insert cast.
let sql =
"select t1.t1_id, t2.t2_id, t1.t1_name from t1 cross join t2 where t1.t1_id + 11 = cast(t2.t2_id as BIGINT)";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let expected = vec![
"Filter: t2.c < UInt32(20) [a:UInt32, b:UInt32, c:UInt32, a:UInt32, b:UInt32, c:UInt32]",
" Projection: t1.a, t1.b, t1.c, t2.a, t2.b, t2.c [a:UInt32, b:UInt32, c:UInt32, a:UInt32, b:UInt32, c:UInt32]",
" Inner Join: t1.a + UInt32(100) = t2.a * UInt32(2) [a:UInt32, b:UInt32, c:UInt32, t1.a + UInt32(100):UInt32, a:UInt32, b:UInt32, c:UInt32, t2.a * UInt32(2):UInt32]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very cool

@@ -3115,32 +3101,6 @@ fn extract_join_keys(
Ok(())
}

/// Wrap projection for a plan, if the join keys contains normal expression.
fn wrap_projection_for_join_if_necessary(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is very nice to not do this in the planner anymore

.iter()
.map(|key| {
// The display_name() of cast expression will ignore the cast info, and show the inner expression name.
// If we do not add alais, it will throw same field name error in the schema when adding projection.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good comments

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add sql test in the follow up pr.

select *,t1.t1_id+10 from t1,t2 where t1.t1_id+10=t2.t2_id

Thanks @ygf11

@ygf11
Copy link
Contributor Author

ygf11 commented Dec 9, 2022

Maybe we can add sql test in the follow up pr.

Thanks @liukun4515, I will add this test case in the follow up pr.

@alamb alamb merged commit 097a3de into apache:master Dec 9, 2022
@alamb
Copy link
Contributor

alamb commented Dec 9, 2022

Thanks @ygf11 for the contribution and everyone for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for non-column key for equijoin when eliminating cross join to inner join
5 participants