Skip to content

Conversation

@AngersZhuuuu
Copy link
Contributor

What changes were proposed in this pull request?

In current Join Hint strategies, if we use SHUFFLE_REPLICATE_NL hint, it will directly convert join to Cartesian Product Join and loss join condition making result not correct.

For Example:

spark-sql> select * from test4 order by a asc;
1 2
Time taken: 1.063 seconds, Fetched 4 row(s)20/07/08 14:11:25 INFO SparkSQLCLIDriver: Time taken: 1.063 seconds, Fetched 4 row(s)
spark-sql>select * from test5 order by a asc
1 2
2 2
Time taken: 1.18 seconds, Fetched 24 row(s)20/07/08 14:13:59 INFO SparkSQLCLIDriver: Time taken: 1.18 seconds, Fetched 24 row(s)spar
spark-sql>select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 where test4.a = test5.a order by test4.a asc ;
1 2 1 2
1 2 2 2
Time taken: 0.351 seconds, Fetched 2 row(s)
20/07/08 14:18:16 INFO SparkSQLCLIDriver: Time taken: 0.351 seconds, Fetched 2 row(s)

Why are the changes needed?

Fix wrong data result

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Added UT

@AngersZhuuuu AngersZhuuuu changed the base branch from master to branch-3.0 July 10, 2020 23:20
@AngersZhuuuu
Copy link
Contributor Author

@dongjoon-hyun FYI, I forgot to change branch to branch-3.0 at first then it create too much label, could you help to remove error labels?

@maropu
Copy link
Member

maropu commented Jul 10, 2020

link: #29035

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SparkQA
Copy link

SparkQA commented Jul 11, 2020

Test build #125651 has finished for PR 29070 at commit 514e30c.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jul 11, 2020

Hi, @AngersZhuuuu . The patch itself looks good and I also tested the patch and new test suite. However, Jenkins seems to unable to handle branch change for some reason. Could you make a new clean PR against branch-2.4? Also, GitHub Action is not triggered correctly.

@AngersZhuuuu
Copy link
Contributor Author

AngersZhuuuu commented Jul 11, 2020

Hi, @AngersZhuuuu . The patch itself looks good and I also tested the patch and new test suite. However, Jenkins seems to unable to handle branch change for some reason. Could you make a new clean PR against branch-2.4? Also, GitHub Action is not triggered correctly.

Hi @dongjoon-hyun , seems branch-2.4 don't have new join hints. so don't need to backport to branch-2.4?

@SparkQA
Copy link

SparkQA commented Jul 11, 2020

Test build #125660 has finished for PR 29070 at commit 514e30c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jul 11, 2020

Hi @dongjoon-hyun , seems branch-2.4 don't have new join hints. so don't need to backport to branch-2.4?

Yea, the features have been implemented in the release 3.0.

@dongjoon-hyun
Copy link
Member

Oh, it was my typo. I meant to create another PR for branch-3.0 because GitHub Action was not triggered correctly at that time.

Jenkins seems to be recovered here.
#29070 (comment)

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to branch-3.0. Thank you all.

dongjoon-hyun pushed a commit that referenced this pull request Jul 11, 2020
…on-Cartesian Product join result

### What changes were proposed in this pull request?
In current Join Hint strategies, if we use SHUFFLE_REPLICATE_NL hint, it will directly convert join to Cartesian Product Join and loss join condition making result not correct.

For Example:
```
spark-sql> select * from test4 order by a asc;
1 2
Time taken: 1.063 seconds, Fetched 4 row(s)20/07/08 14:11:25 INFO SparkSQLCLIDriver: Time taken: 1.063 seconds, Fetched 4 row(s)
spark-sql>select * from test5 order by a asc
1 2
2 2
Time taken: 1.18 seconds, Fetched 24 row(s)20/07/08 14:13:59 INFO SparkSQLCLIDriver: Time taken: 1.18 seconds, Fetched 24 row(s)spar
spark-sql>select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 where test4.a = test5.a order by test4.a asc ;
1 2 1 2
1 2 2 2
Time taken: 0.351 seconds, Fetched 2 row(s)
20/07/08 14:18:16 INFO SparkSQLCLIDriver: Time taken: 0.351 seconds, Fetched 2 row(s)
```

### Why are the changes needed?
Fix wrong data result

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Added UT

Closes #29070 from AngersZhuuuu/SPARK-32220-branch-3.0.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
hunanjsd pushed a commit to hunanjsd/spark that referenced this pull request Jul 26, 2020
…on-Cartesian Product join result

### What changes were proposed in this pull request?
In current Join Hint strategies, if we use SHUFFLE_REPLICATE_NL hint, it will directly convert join to Cartesian Product Join and loss join condition making result not correct.

For Example:
```
spark-sql> select * from test4 order by a asc;
1 2
Time taken: 1.063 seconds, Fetched 4 row(s)20/07/08 14:11:25 INFO SparkSQLCLIDriver: Time taken: 1.063 seconds, Fetched 4 row(s)
spark-sql>select * from test5 order by a asc
1 2
2 2
Time taken: 1.18 seconds, Fetched 24 row(s)20/07/08 14:13:59 INFO SparkSQLCLIDriver: Time taken: 1.18 seconds, Fetched 24 row(s)spar
spark-sql>select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 where test4.a = test5.a order by test4.a asc ;
1 2 1 2
1 2 2 2
Time taken: 0.351 seconds, Fetched 2 row(s)
20/07/08 14:18:16 INFO SparkSQLCLIDriver: Time taken: 0.351 seconds, Fetched 2 row(s)
```

### Why are the changes needed?
Fix wrong data result

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Added UT

### jsd addition
1. 2020/07/25 阅读
2. 因为 SHUFFLE_REPLICATE_NL 内部自行定义了 doExecute 方法,直接对 rdd partition 进行操作, 所以如果没有 condition 会出现 bug

Closes apache#29070 from AngersZhuuuu/SPARK-32220-branch-3.0.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants