-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-32220][SQL][3.0]SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join result #29070
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32220][SQL][3.0]SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join result #29070
Conversation
…tesian Product join result
|
@dongjoon-hyun FYI, I forgot to change branch to branch-3.0 at first then it create too much label, could you help to remove error labels? |
|
link: #29035 |
maropu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM cc: @cloud-fan @dongjoon-hyun
|
Test build #125651 has finished for PR 29070 at commit
|
|
Retest this please. |
|
Hi, @AngersZhuuuu . The patch itself looks good and I also tested the patch and new test suite. However, Jenkins seems to unable to handle branch change for some reason. Could you make a new clean PR against branch-2.4? Also, GitHub Action is not triggered correctly. |
Hi @dongjoon-hyun , seems branch-2.4 don't have new join hints. so don't need to backport to branch-2.4? |
|
Test build #125660 has finished for PR 29070 at commit
|
Yea, the features have been implemented in the release 3.0. |
|
Oh, it was my typo. I meant to create another PR for Jenkins seems to be recovered here. |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Merged to branch-3.0. Thank you all.
…on-Cartesian Product join result ### What changes were proposed in this pull request? In current Join Hint strategies, if we use SHUFFLE_REPLICATE_NL hint, it will directly convert join to Cartesian Product Join and loss join condition making result not correct. For Example: ``` spark-sql> select * from test4 order by a asc; 1 2 Time taken: 1.063 seconds, Fetched 4 row(s)20/07/08 14:11:25 INFO SparkSQLCLIDriver: Time taken: 1.063 seconds, Fetched 4 row(s) spark-sql>select * from test5 order by a asc 1 2 2 2 Time taken: 1.18 seconds, Fetched 24 row(s)20/07/08 14:13:59 INFO SparkSQLCLIDriver: Time taken: 1.18 seconds, Fetched 24 row(s)spar spark-sql>select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 where test4.a = test5.a order by test4.a asc ; 1 2 1 2 1 2 2 2 Time taken: 0.351 seconds, Fetched 2 row(s) 20/07/08 14:18:16 INFO SparkSQLCLIDriver: Time taken: 0.351 seconds, Fetched 2 row(s) ``` ### Why are the changes needed? Fix wrong data result ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Added UT Closes #29070 from AngersZhuuuu/SPARK-32220-branch-3.0. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…on-Cartesian Product join result ### What changes were proposed in this pull request? In current Join Hint strategies, if we use SHUFFLE_REPLICATE_NL hint, it will directly convert join to Cartesian Product Join and loss join condition making result not correct. For Example: ``` spark-sql> select * from test4 order by a asc; 1 2 Time taken: 1.063 seconds, Fetched 4 row(s)20/07/08 14:11:25 INFO SparkSQLCLIDriver: Time taken: 1.063 seconds, Fetched 4 row(s) spark-sql>select * from test5 order by a asc 1 2 2 2 Time taken: 1.18 seconds, Fetched 24 row(s)20/07/08 14:13:59 INFO SparkSQLCLIDriver: Time taken: 1.18 seconds, Fetched 24 row(s)spar spark-sql>select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 where test4.a = test5.a order by test4.a asc ; 1 2 1 2 1 2 2 2 Time taken: 0.351 seconds, Fetched 2 row(s) 20/07/08 14:18:16 INFO SparkSQLCLIDriver: Time taken: 0.351 seconds, Fetched 2 row(s) ``` ### Why are the changes needed? Fix wrong data result ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Added UT ### jsd addition 1. 2020/07/25 阅读 2. 因为 SHUFFLE_REPLICATE_NL 内部自行定义了 doExecute 方法,直接对 rdd partition 进行操作, 所以如果没有 condition 会出现 bug Closes apache#29070 from AngersZhuuuu/SPARK-32220-branch-3.0. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
In current Join Hint strategies, if we use SHUFFLE_REPLICATE_NL hint, it will directly convert join to Cartesian Product Join and loss join condition making result not correct.
For Example:
Why are the changes needed?
Fix wrong data result
Does this PR introduce any user-facing change?
NO
How was this patch tested?
Added UT