[SPARK-25150][SQL] Rewrite condition when deduplicate Join #22318

peter-toth · 2018-09-03T02:06:47Z

What changes were proposed in this pull request?

ResolveReferences rule modifies (deduplicates) conflicting AttributeReferences in left and right side of a Join. It changes the right side to avoid conflicts but it forgot to change the references in the condition of the join. Solution is to rewrite the condition too.

Details:

With this PR, if any of the resolved attributes in a join condition refer to an attribute that is replaced during the deduplication (dedupRight()) then the join condition is modified to reflect the change.
The modification is done regardless whether the reference is on the left or the right side of the condition.

This PR has no effect on unresolved attributes. Those attributes are resolved later by different rules.

This PR helps in those cases where same origin dataframes (b, c) are joined to a different origin one (a):

val a = spark.range(1, 5)
val b = spark.range(10)
val c = b.filter($"id" % 2 === 0)
val r = a.join(b, a("id") === b("id"), "inner").join(c, a("id") === c("id"), "inner")

Here the result of a join b contains an id attribute coming from b. That attribute conflicts with the id attribute of c in the second join so c("id") is deduplicated to let's say X and the join condition a("id") === c("id") needs to be modified to a("id") === X to get correct results.

I believe it is also worth explaining some situations where this PR does't help (but also does no harm). In cases where same origin dataframes are joined to each other:

val b = spark.range(10)
val c = b.filter($"id" % 2 === 0)
val r = b.join(c, b("id") === c("id"), "inner")

In this latter case the deduplication is applied as both b and c have an id attribute (that actually refer to the same resolved attribute (have the same ExprId)). Deduplication results c("id") to be changed to let's say X. So in this case both b("id") and c("id") in the join condition are modified to X by the changes introduced in this PR. The reason why the result will be correct with this PR (and is correct without this PR) is that the current implementation of Spark contains a "hack" in Dataframe that fixes EqualTo and EqualNullSafe type of conditions with identical references on both sides. The hack rewrites these join conditions by re-resolving the attributes with the name of the attribute on both the left and right sides.

Please note that until a resolved attribute doesn't contain some kind of reference to it's dataframe not all cases of join can be handled properly. Such an initiative can be found in #21449.

How was this patch tested?

Added unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

maropu · 2018-09-03T04:20:14Z

Could you describe more in the PR description?; what's the root cause of this issue? How did you solve this by this pr?

maropu · 2018-09-03T04:20:40Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala

+      checkAnswer(r, Row(2, 2, 2) :: Row(4, 4, 4) :: Nil)
+    }
+  }
+


nit: remove this empty line

maropu · 2018-09-03T04:20:53Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala

  }
+
+  test("SPARK-25150: Attribute deduplication handles attributes in join condition properly") {
+    withSQLConf(SQLConf.CROSS_JOINS_ENABLED.key -> "false") {


withSQLConf(SQLConf.CROSS_JOINS_ENABLED.key -> "true") {

actually we can remove this

maropu · 2018-09-03T04:23:58Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala

+      val b = spark.range(10)
+      val c = b.filter($"id" % 2 === 0)
+
+      val r = a.join(b, a("id") === b("id"), "inner").join(c, a("id") === c("id"), "inner")


Do we need a df a for this test? I think a simple test is better.

I think we do need a here.
If we dropped a and the test would become like:

val b = spark.range(1, 5) val c = b.filter($"id" % 2 === 0) val r = b.join(c, b("id") === c("id"), "inner") checkAnswer(r, Row(2, 2) :: Row(4, 4) :: Nil)

then the test would pass even without the fix. This is because we have a special case to handle id = id like conditions in case of EqualTo and EqualNullSafe in Dataset.

My fix comes into play in some other cases where there is an AttributeReference change in the right side of a join due to deduplication.

mgaido91

This change makes sense to me, despite I quite don't get how this didn't come out earlier.

@peter-toth may you please improve the PR description as suggested by @maropu? May you please also check whether this is a regression from 2.2? In that case, it might be a blocker for the next 2.3 release.

cc @cloud-fan @gatorsmile

mgaido91 · 2018-09-03T07:50:47Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+        e: Expression,
+        attributeRewrites: AttributeMap[Attribute]): Expression = e match {
+      case a: Attribute => attributeRewrites.getOrElse(a, a)
+      case _ => e.mapChildren(rewriteJoinCondition(_, attributeRewrites))


nit: may be more clear to do:

case other => e.mapChildren(rewriteJoinCondition(other, attributeRewrites))

sorry, we can't do that, it wouldn't mean the same

oh, sorry, I see now, sorry.

Change-Id: I52ac5aa76e821f5d74cfa108e0b8269665eb081d

maropu · 2018-09-03T12:22:54Z

@ueshin @HyukjinKwon can you trigger tests?

cloud-fan · 2018-09-03T13:18:36Z

ok to test

maropu · 2018-09-03T13:25:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

-        j.copy(right = dedupRight(left, right))
+      case j @ Join(left, right, _, condition) if !j.duplicateResolved =>
+        val (dedupedRight, attributeRewrites) = dedupRight(left, right)
+        val changedCondition = condition.map(rewriteJoinCondition(_, attributeRewrites))


How about this?

val changedCondition = condition.map { _.transform { case attr: Attribute => attributeRewrites.getOrElse(attr, attr) }}

Then, removes rewriteJoinCondition?

thanks! fixed

maropu · 2018-09-03T13:26:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

        case Some((oldRelation, newRelation)) =>
          val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
-          right transformUp {
+          (right transformUp {


(right transformUp { -> val newRight = right transformUp {?

maropu · 2018-09-03T13:27:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

                s.withNewPlan(dedupOuterReferencesInSubquery(s.plan, attributeRewrites))
            }
-          }
+          }, attributeRewrites)


Then, (newRight, attributeRewrites).

maropu · 2018-09-03T13:30:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

           * that resolves these conflicts. Otherwise, the analysis will fail.
           */
-          right
+          (right, AttributeMap.empty[Attribute])


Since I don't want to build an empty object per call, how about defining an empty map in this class field? e.g.,

object ResolveReferences extends Rule[LogicalPlan] { private val emptyAttrMap = new AttributeMap[Attribute](Map.empty)

just commented quite the same meanwhile :) sorry, I saw your comments only now :)

mgaido91 · 2018-09-03T12:39:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

        case Some((oldRelation, newRelation)) =>
          val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output))
-          right transformUp {
+          (right transformUp {


nit: may be cleaner doing something like val newPlan = and then return

(newPlan, attributeRewrites)

mgaido91 · 2018-09-03T12:43:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala

    new AttributeMap(kvs.map(kv => (kv._1.exprId, kv)).toMap)
  }
+
+  def empty[A](): AttributeMap[A] = new AttributeMap(Map.empty)


I think it'd be better if we can avoid creating a new Map every time. In Scala, this is usually handled creating an empty object and returning it on the invocation of empty (eg. refer to the Nil implementation for List). The only problem is that this would require changing AttributeMap[A] to AttributeMap[+A]. I am not sure whether this may break binary compatibility and therefore whether this is an acceptable change or not. In case it is, it would be great to do this I think.

SparkQA · 2018-09-03T15:36:29Z

Test build #95625 has finished for PR 22318 at commit 8e58345.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

peter-toth · 2018-09-03T16:33:20Z

Also added missing if attr.resolved which I think will fix the UT issues.

SparkQA · 2018-09-03T20:11:43Z

Test build #95632 has finished for PR 22318 at commit d6e316a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jaceklaskowski · 2018-09-03T20:24:13Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala

+    val b = spark.range(10)
+    val c = b.filter($"id" % 2 === 0)
+
+    val r = a.join(b, a("id") === b("id"), "inner").join(c, a("id") === c("id"), "inner")


Why is this a simpler a.join(b, "id").join(c, "id")?

That simpler join doesn't hit the issue. It is handled by a different rule ResolveNaturalAndUsingJoin.

peter-toth · 2018-09-04T05:58:12Z

@mgaido91 , 2.2 also suffered from this.

mgaido91

@peter-toth may you please also update the title of the PR to be more precise than "FIx"? Like "Rewrite condition when deduplicate Join" or something similar which explains better what the PR does?

Just left one comment, otherwise LGTM, thanks.

mgaido91 · 2018-09-04T07:53:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

   */
  object ResolveReferences extends Rule[LogicalPlan] {
+
+    private val emptyAttrMap = new AttributeMap[Attribute](Map.empty)


I'd prefer to do what I suggested previously as it would be easier to reuse if it will be needed. In this way, next time we need again an empty AttributeMap we need to create a new one and we'd end up with several of them.

cc @cloud-fan @maropu WDYT?

@mgaido91 , I agree with you and happy to do it, just saw your concerns about binary compatibility.
@cloud-fan @maropu please share your thoughts on this.

its ok to me.

mgaido91 · 2018-09-05T12:54:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala

 * of the name, or the expected nullability).
 */
 object AttributeMap {
+  var empty = new AttributeMap(Map.empty)


var -> val

oh thanks, fixed

mgaido91 · 2018-09-05T12:55:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

   * a logical plan node's children.
   */
  object ResolveReferences extends Rule[LogicalPlan] {
+


unneeded newline

Change-Id: I928e9a56410c00c4fbbd11c8e91b7993a4bc1878

mgaido91 · 2018-09-05T13:45:47Z

LGTM

SparkQA · 2018-09-05T14:31:38Z

Test build #95714 has finished for PR 22318 at commit d56f33c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class AttributeMap[+A](val baseMap: Map[ExprId, (Attribute, A)])

kiszk · 2018-09-05T14:38:28Z

retest this please

SparkQA · 2018-09-05T17:33:48Z

Test build #95720 has finished for PR 22318 at commit 809b8a8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-09-05T18:40:06Z

Test build #95723 has finished for PR 22318 at commit 809b8a8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

This reverts commit 809b8a8.

This reverts commit d56f33c.

SparkQA · 2018-09-06T12:03:30Z

Test build #95750 has finished for PR 22318 at commit 938bd7f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-09-06T12:27:52Z

retest this please

SparkQA · 2018-09-06T16:35:27Z

Test build #95757 has finished for PR 22318 at commit 938bd7f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-09-06T23:44:54Z

LGTM

maropu · 2018-09-10T02:20:32Z

To make sure we have no regression by this change, I checked the Analyzer$ResolveReferences time in TPCDS queries. But, I didn't find actual performance regression.

maropu · 2018-09-10T02:21:06Z

retest this please

cloud-fan · 2018-09-10T03:36:10Z

How does this work? When we have duplicated attributes in the join condition, how can we know which attribute comes from which side?

peter-toth · 2018-09-10T06:20:01Z

@cloud-fan this PR doesn't solve that question.
There are some hacks in Dataset.join to handle EqualTo and EqualNullSafe with duplicated attributes and those hacks are still required with this PR.
But I believe there are other initiatives to solve that long-standing issue like #21449

SparkQA · 2018-09-10T06:20:29Z

Test build #95854 has finished for PR 22318 at commit 938bd7f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-09-10T06:45:20Z

Can you define the scope of this PR? In which case we should change the references in the join condition?

peter-toth · 2018-09-10T13:53:28Z

@cloud-fan , I added some explanation to the description in which cases this PR helps and also where it doesn't.

peter-toth · 2018-09-13T18:34:21Z

@cloud-fan, does the new description defines the scope as you suggested? Is there anything I can add to this PR?

peter-toth · 2018-10-02T12:43:01Z

@cloud-fan could you please help me with this PR and take it one step forward?

viirya · 2018-10-02T15:27:11Z

For a query similar to the one in the PR description:

a.join(b, a("id") === b("id"), "inner").join(c, a("id") === b("id"), "inner"

It is interpreted as a cartesian products now. But after this change, it is as the same as the query:

a.join(b, a("id") === b("id"), "inner").join(c, a("id") === c("id"), "inner")

peter-toth · 2018-10-02T17:02:08Z

Thanks @viirya, your analysis is correct.

Unfortunately an attribute doesn't have a reference to its dataset so I don't think this scenario can be solved easily. I believe the good solution would be something like #21449

But I would argue that my PR still does make sense as a a("id") === b("id") condition is not expected in a join where c is joined in, and actually is very likely a typo.

peter-toth · 2018-10-04T16:17:04Z

Also please consider that currently (and also after this PR) using b and c from the description:

b.join(c, b("id") === b("id"), "inner")

is interpreted as

b.join(c, b("id") === c("id"), "inner")

so my PR just extends this feature to the case with subsequent joins.

@cloud-fan , @viirya what do you think?

peter-toth · 2018-10-11T06:51:38Z

@srowen, I saw your last comment on https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16645484#comment-16645484. I submitted this PR to solve that ticket and I believe the description here explains what is the real issue there.
I would appreciate your thoughts on this PR, unfortunately it got stuck a bit lately. Thanks.

srowen · 2018-10-11T15:22:38Z

Oh I see there was indeed more discussion on this, and it does relate to resolving columns to joined DataFrames. I don't know enough to bless this change, but it seems reasonable. @maropu approves, I think, and so does @mgaido91 ? what about @cloud-fan @viirya ?

It seems like it's behavior that can be improved but isn't strictly a correctness issue? because it's not quite valid to reference columns from dataframes not involved in the join as a join condition, right?

mgaido91 · 2018-10-12T11:57:15Z

@srowen the change seems fine to me as I think this does improve the behavior of attributes deduplication (I think it was a bug not rewriting the join condition with the new references).

The point is that in order to have a really correct and expected behavior in all the conditions we need to keep a reference of the dataset an attribute is from. That is why I created #21449 and I know @cloud-fan mentioned he had a similar PR for the same reason. But that is another story. In general I think both this and something like #21449 is needed in order to handle properly all the cases.

AmplabJenkins · 2018-11-30T02:39:34Z

Can one of the admins verify this patch?

maropu · 2018-12-03T04:33:37Z

@peter-toth we cannot fix the issue of the description without changing the existing behaviour?

peter-toth · 2018-12-03T11:30:59Z

@maropu, I don't think we can. Actually this is how we deal with simpler joins
Do you think changing the behaviour is unacceptable?

maropu · 2018-12-03T12:01:51Z

In the example @viirya described above (#22318 (comment)), I think the interpretation is unclear to most users and I'm fairly concerned that it could be error-prone...

github-actions · 2020-01-07T00:07:40Z

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

[SPARK-25150][SQL] Fix attribute deduplication in join

be7d8e7

maropu reviewed Sep 3, 2018

View reviewed changes

mgaido91 reviewed Sep 3, 2018

View reviewed changes

[SPARK-25150][SQL] Fix review findings

8e58345

Change-Id: I52ac5aa76e821f5d74cfa108e0b8269665eb081d

maropu reviewed Sep 3, 2018

View reviewed changes

mgaido91 reviewed Sep 3, 2018

View reviewed changes

[SPARK-25150][SQL] Fix review findings 2

d6e316a

jaceklaskowski reviewed Sep 3, 2018

View reviewed changes

mgaido91 reviewed Sep 4, 2018

View reviewed changes

peter-toth changed the title ~~[SPARK-25150][SQL] Fix attribute deduplication in join~~ [SPARK-25150][SQL] Rewrite condition when deduplicate Join Sep 4, 2018

[SPARK-25150][SQL] Fix review findings 3

d56f33c

mgaido91 reviewed Sep 5, 2018

View reviewed changes

[SPARK-25150][SQL] Fix review findings 4

809b8a8

Change-Id: I928e9a56410c00c4fbbd11c8e91b7993a4bc1878

peter-toth added 3 commits September 6, 2018 12:03

Revert "[SPARK-25150][SQL] Fix review findings 4"

8da937b

This reverts commit 809b8a8.

Revert "[SPARK-25150][SQL] Fix review findings 3"

3f15aab

This reverts commit d56f33c.

[SPARK-25150][SQL] Fix review findings 5

938bd7f

dongjoon-hyun added the SQL label Jun 14, 2019

github-actions bot added the Stale label Jan 7, 2020

github-actions bot closed this Jan 8, 2020

[SPARK-25150][SQL] Rewrite condition when deduplicate Join #22318

[SPARK-25150][SQL] Rewrite condition when deduplicate Join #22318

Uh oh!

Conversation

peter-toth commented Sep 3, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

maropu commented Sep 3, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

peter-toth Sep 3, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mgaido91 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maropu commented Sep 3, 2018

Uh oh!

cloud-fan commented Sep 3, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mgaido91 Sep 3, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Sep 3, 2018

Uh oh!

peter-toth commented Sep 3, 2018

Uh oh!

SparkQA commented Sep 3, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

peter-toth commented Sep 4, 2018

Uh oh!

mgaido91 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

peter-toth commented Sep 3, 2018 •

edited

Loading

peter-toth Sep 3, 2018 •

edited

Loading

mgaido91 Sep 3, 2018 •

edited

Loading