
Conversation

@AngersZhuuuu
Contributor

What changes were proposed in this pull request?

In the current join hint strategies, if we use the SHUFFLE_REPLICATE_NL hint, the planner directly converts the join to a Cartesian product join and loses the join condition, making the result incorrect.

For Example:

spark-sql> select * from test4 order by a asc;
1 2
Time taken: 1.063 seconds, Fetched 4 row(s)
spark-sql> select * from test5 order by a asc;
1 2
2 2
Time taken: 1.18 seconds, Fetched 24 row(s)
spark-sql> select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 where test4.a = test5.a order by test4.a asc;
1 2 1 2
1 2 2 2
Time taken: 0.351 seconds, Fetched 2 row(s)
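The wrong second row above (1 2 2 2 does not satisfy test4.a = test5.a) comes from the dropped condition. A minimal, self-contained sketch illustrates the effect — toy tuples mirroring the test4/test5 rows shown, not Spark internals:

```scala
// Toy illustration of the bug: converting the equi-join to a plain
// cartesian product drops the `test4.a = test5.a` condition, so an
// extra, wrong row pair appears in the result.
object HintBugDemo {
  val test4 = Seq((1, 2))           // rows of test4 as (a, b)
  val test5 = Seq((1, 2), (2, 2))   // rows of test5 as (a, b)

  // What the buggy plan computes: the full product, no join condition.
  val withoutCondition: Seq[((Int, Int), (Int, Int))] =
    for (l <- test4; r <- test5) yield (l, r)

  // What the query asked for: keep only pairs with test4.a = test5.a.
  val withCondition: Seq[((Int, Int), (Int, Int))] =
    withoutCondition.filter { case ((la, _), (ra, _)) => la == ra }
}
```

`withoutCondition` yields both row pairs, matching the wrong two-row output above; `withCondition` keeps only the single matching pair.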

Why are the changes needed?

Fix a wrong query result.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Added UT

@SparkQA

SparkQA commented Jul 8, 2020

Test build #125314 has finished for PR 29035 at commit 27b41e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 9, 2020

Test build #125421 has finished for PR 29035 at commit aeccc25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

test("SPARK-32220: Non Cartesian Product Join Result Correct with SHUFFLE_REPLICATE_NL hint") {
Member

So, is this a correctness issue, @AngersZhuuuu ?

Member

Yea, I think so. Nice catch, @AngersZhuuuu

Contributor Author

Yea, when I tried the new join hints, I found the result was incorrect.

@dongjoon-hyun
Member

Thank you for reporting and making a PR, @AngersZhuuuu .

@dongjoon-hyun
Member

cc @cloud-fan

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-32220][SQL]SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join redsult [SPARK-32220][SQL]SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join result Jul 9, 2020
}
}

def removeCartesianProductJoinHint(hint: Option[HintInfo]): Option[HintInfo] = {
Member

I think we should pass a correct condition (leftKeys and rightKeys) into CartesianProductExec instead of removing the hint:

Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition)))
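The suggestion above can be sketched with a small, self-contained toy — Row and cartesianProduct here are assumed stand-ins for illustration, not Spark's actual CartesianProductExec, which operates on internal rows:

```scala
// Sketch: a cartesian-product operator that applies an optional
// post-join condition over the generated row pairs, rather than
// silently discarding the condition.
case class Row(a: Int, b: Int)

def cartesianProduct(
    left: Seq[Row],
    right: Seq[Row],
    condition: Option[(Row, Row) => Boolean]): Seq[(Row, Row)] = {
  // Build the full product, then filter by the condition if one was kept.
  val product = for (l <- left; r <- right) yield (l, r)
  condition.fold(product)(pred => product.filter(pred.tupled))
}
```

Passing `Some((l, r) => l.a == r.a)` preserves the equi-join semantics; passing `None` reproduces the buggy plan's behavior.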

Member

It seems the Spark strategy incorrectly removes a#0 = a#2:

== Optimized Logical Plan ==
Sort [a#0 ASC NULLS FIRST], true
+- Join Inner, (a#0 = a#2), leftHint=(strategy=shuffle_replicate_nl)
   :- Filter isnotnull(a#0)
   :  +- Relation[a#0,b#1] parquet
   +- Filter isnotnull(a#2)
      +- Relation[a#2,b#3] parquet

== Physical Plan ==
*(3) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 200), true, [id=#87]
   +- CartesianProduct
      :- *(1) Project [a#0, b#1]
      :  +- *(1) Filter isnotnull(a#0)
      :     +- *(1) ColumnarToRow
      :        +- FileScan parquet default.test4[a#0,b#1] Batched: true, DataFilters: ...
      +- *(2) Project [a#2, b#3]
         +- *(2) Filter isnotnull(a#2)
            +- *(2) ColumnarToRow
               +- FileScan parquet default.test5[a#2,b#3] Batched: true, DataFilters: ...

Contributor Author

I think we should pass a correct condition (leftKeys and rightKeys) into CartesianProductExec instead of removing the hint:

Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition)))

Yea, in the default Cartesian product join case no condition is needed at all, so by default no condition is attached when the plan is built.

Contributor Author

@maropu See the latest change, is it OK to do it like this?

Contributor Author

Since we don't know whether the join keys' condition is EqualTo or EqualNullSafe, it's better not to remove it in ExtractEquiJoinKeys.

Member

You cannot use the original condition in logical.Join?

case p @ ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right, hint) =>
  p.condition <-- This?

Contributor Author

You cannot use the original condition in logical.Join?

case p @ ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right, hint) =>
  p.condition <-- This?

Didn't know we could write it like this... updated.
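The `case p @ ExtractEquiJoinKeys(...)` trick relies on Scala's `@` pattern binder, sketched here with toy stand-ins (Join and EquiKeys are assumed names for illustration, not Spark classes):

```scala
// `p @ <pattern>` binds p to the whole matched value while the extractor
// destructures it, so fields the extractor drops stay reachable via p.
case class Join(left: String, right: String, condition: Option[String])

object EquiKeys {
  // Toy extractor standing in for ExtractEquiJoinKeys: it exposes only
  // the two children, not the original join condition.
  def unapply(j: Join): Option[(String, String)] = Some((j.left, j.right))
}

def plan(j: Join): String = j match {
  case p @ EquiKeys(left, right) =>
    // p is the original Join, so p.condition is still available here.
    s"$left JOIN $right ON ${p.condition.getOrElse("TRUE")}"
}
```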

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125545 has finished for PR 29035 at commit d9b4dcd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125540 has finished for PR 29035 at commit a65d332.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

  def createCartesianProduct() = {
    if (joinType.isInstanceOf[InnerLike]) {
-     Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition)))
+     Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), p.condition)))
Contributor

good catch!

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125571 has finished for PR 29035 at commit d9b4dcd.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun left a comment

+1, LGTM. I tested PySpark/Spark UT locally. Merged to master.

Tests passed in 978 seconds
OK:       2306
Failed:   0
Warnings: 0
Skipped:  1

Thank you all!

@dongjoon-hyun
Member

Hi, @AngersZhuuuu . Could you make a backport PR against branch-3.0? There is a conflict.

@AngersZhuuuu
Contributor Author

Hi, @AngersZhuuuu . Could you make a backport PR against branch-3.0? There is a conflict.

Yea

@dongjoon-hyun
Member

Thanks, @AngersZhuuuu .

@maropu
Member

maropu commented Jul 10, 2020

late LGTM. Thanks, @AngersZhuuuu

  def createCartesianProduct() = {
    if (joinType.isInstanceOf[InnerLike]) {
-     Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition)))
+     Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), p.condition)))
Member

We need to write a comment above this line and explain what it is doing.
// instead of using the condition extracted by ExtractEquiJoinKeys, we should use the original join condition, i.e., "p.condition".

Contributor Author

We need to write a comment above this line and explain what it is doing.
// instead of using the condition extracted by ExtractEquiJoinKeys, we should use the original join condition, i.e., "p.condition".

Raised PR #29084.

  // 5. Pick broadcast nested loop join as the final solution. It may OOM but we don't have
  // other choice.
- case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right, hint) =>
+ case p @ ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right, hint) =>
Member

To avoid making similar mistakes, we need to rename "condition" to a self-descriptive name. "otherConditions"? It is a little bit hard to name, TBH.

dongjoon-hyun pushed a commit that referenced this pull request Jul 13, 2020
…ange Non-Cartesian Product join result

### What changes were proposed in this pull request?
follow comment #29035 (comment)
Explain for pr

### Why are the changes needed?
add comment

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #29084 from AngersZhuuuu/follow-spark-32220.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Jul 14, 2020
…ot change Non-Cartesian Product join result

### What changes were proposed in this pull request?
follow comment #29035 (comment)
Explain for pr

### Why are the changes needed?
add comment

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #29093 from AngersZhuuuu/SPARK-32220-3.0-FOLLOWUP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
