-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-32220][SQL]SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join result #29035
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…tesian Product join redsult
|
Test build #125314 has finished for PR 29035 at commit
|
|
Test build #125421 has finished for PR 29035 at commit
|
| } | ||
| } | ||
|
|
||
| test("SPARK-32220: Non Cartesian Product Join Result Correct with SHUFFLE_REPLICATE_NL hint") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, is this a correctness issue, @AngersZhuuuu ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I think so. Nice catch, @AngersZhuuuu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, when I try new join hint, I found this result is non-correct.
|
Thank you for reporting and making a PR, @AngersZhuuuu . |
|
cc @cloud-fan |
| } | ||
| } | ||
|
|
||
| def removeCartesianProductJoinHint(hint: Option[HintInfo]): Option[HintInfo] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should pass a correct condition (leftKeys and rightKeys) into CartesianProductExec instead of removing the hint:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
Line 202 in ac6406e
| Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the spark strategy incrrectly removes a#0 = a#2;
== Optimized Logical Plan ==
Sort [a#0 ASC NULLS FIRST], true
+- Join Inner, (a#0 = a#2), leftHint=(strategy=shuffle_replicate_nl)
:- Filter isnotnull(a#0)
: +- Relation[a#0,b#1] parquet
+- Filter isnotnull(a#2)
+- Relation[a#2,b#3] parquet
== Physical Plan ==
*(3) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 200), true, [id=#87]
+- CartesianProduct
:- *(1) Project [a#0, b#1]
: +- *(1) Filter isnotnull(a#0)
: +- *(1) ColumnarToRow
: +- FileScan parquet default.test4[a#0,b#1] Batched: true, DataFilters: ...
+- *(2) Project [a#2, b#3]
+- *(2) Filter isnotnull(a#2)
+- *(2) ColumnarToRow
+- FileScan parquet default.test5[a#2,b#3] Batched: true, DataFilters: ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should pass a correct condition (
leftKeysandrightKeys) intoCartesianProductExecinstead of removing the hint:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
Line 202 in ac6406e
Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition)))
Yea, in default Cartesian Product Join situation, it didn't need condition at all. So in default, it seems don't have condition when build data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu See latest change, it's ok to do like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we don't know the join keys's condition is EqualTo or EaultNullSafe so it's better just not remove it in ExtractEqualJoinKeys
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot use the original condition in logical.Join?
case p @ ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right, hint) =>
p.condition <-- This?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot use the original condition in logical.Join?
case p @ ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right, hint) => p.condition <-- This?
Don't know we can write like this....updated..
|
Test build #125545 has finished for PR 29035 at commit
|
|
Test build #125540 has finished for PR 29035 at commit
|
| def createCartesianProduct() = { | ||
| if (joinType.isInstanceOf[InnerLike]) { | ||
| Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition))) | ||
| Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), p.condition))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch!
|
retest this please |
|
Test build #125571 has finished for PR 29035 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. I tested PySpark/Spark UT locally. Merged to master.
Tests passed in 978 seconds
OK: 2306
Failed: 0
Warnings: 0
Skipped: 1
Thank you all!
|
Hi, @AngersZhuuuu . Could you make a backport PR against |
Yea |
|
Thanks, @AngersZhuuuu . |
|
late LGTM. Thanks, @AngersZhuuuu |
| def createCartesianProduct() = { | ||
| if (joinType.isInstanceOf[InnerLike]) { | ||
| Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition))) | ||
| Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), p.condition))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to write a comment above this line and explain what it is doing.
// instead of using the condition extracted by ExtractEquiJoinKeys, we should use the original join condition, i.e., "p.condition".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to write a comment above this line and explain what it is doing.
// instead of using the condition extracted by ExtractEquiJoinKeys, we should use the original join condition, i.e., "p.condition".
Raise a PR #29084
| // 5. Pick broadcast nested loop join as the final solution. It may OOM but we don't have | ||
| // other choice. | ||
| case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right, hint) => | ||
| case p @ ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right, hint) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid making the similar mistakes, we need to rename condition to a self-descriptive name. "otherConditions"? It is a little bit hard to name it TBH
…ange Non-Cartesian Product join result ### What changes were proposed in this pull request? follow comment #29035 (comment) Explain for pr ### Why are the changes needed? add comment ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes #29084 from AngersZhuuuu/follow-spark-32220. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ot change Non-Cartesian Product join result ### What changes were proposed in this pull request? follow comment #29035 (comment) Explain for pr ### Why are the changes needed? add comment ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes #29093 from AngersZhuuuu/SPARK-32220-3.0-FOLLOWUP. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
In current Join Hint strategies, if we use SHUFFLE_REPLICATE_NL hint, it will directly convert join to Cartesian Product Join and loss join condition making result not correct.
For Example:
Why are the changes needed?
Fix wrong data result
Does this PR introduce any user-facing change?
NO
How was this patch tested?
Added UT