[SPARK-19385][SQL] During canonicalization, `NOT(...(l, r))` should not expect such cases that l.hashcode > r.hashcode #16719

lw-lin · 2017-01-27T15:19:54Z

What changes were proposed in this pull request?

During canonicalization, NOT(...(l, r)) should not expect such cases that l.hashcode > r.hashcode.

Take the rule case NOT(GreaterThan(l, r)) if l.hashcode > r.hashcode as an example: it can never match, because canonicalization runs bottom-up and has already rewritten the inner GreaterThan(l, r) with its operands swapped whenever l.hashcode > r.hashcode.

This patch consolidates rules like case NOT(GreaterThan(l, r)) if l.hashcode > r.hashcode and case NOT(GreaterThan(l, r)).
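
To illustrate why the guarded rule is unreachable, here is a minimal, self-contained sketch. It is not Spark's code: `Expr`, `Leaf`, `rewrite`, `canonicalize` and the `deadBranchHits` counter are hypothetical names, and the rule shapes only mirror the ones quoted in this thread.

```scala
object CanonicalizeSketch {
  // Hypothetical miniature expression tree; not Spark's Expression API.
  sealed trait Expr
  final case class Leaf(name: String, hash: Int) extends Expr {
    // Pinned hash so the ordering between two leaves is predictable.
    override def hashCode: Int = hash
  }
  final case class GreaterThan(l: Expr, r: Expr) extends Expr
  final case class LessThan(l: Expr, r: Expr) extends Expr
  final case class GreaterThanOrEqual(l: Expr, r: Expr) extends Expr
  final case class LessThanOrEqual(l: Expr, r: Expr) extends Expr
  final case class Not(child: Expr) extends Expr

  // Counts how often the guarded NOT rule fires; expected to stay at 0.
  var deadBranchHits = 0

  // Rewrites a single node whose children are assumed to be canonical already.
  private def rewrite(e: Expr): Expr = e match {
    case GreaterThan(l, r) if l.hashCode > r.hashCode => LessThan(r, l)
    case LessThan(l, r) if l.hashCode > r.hashCode => GreaterThan(r, l)
    case GreaterThanOrEqual(l, r) if l.hashCode > r.hashCode => LessThanOrEqual(r, l)
    case LessThanOrEqual(l, r) if l.hashCode > r.hashCode => GreaterThanOrEqual(r, l)
    // The guarded rule under discussion, kept only to show it is never reached.
    case Not(GreaterThan(l, r)) if l.hashCode > r.hashCode =>
      deadBranchHits += 1
      LessThanOrEqual(l, r)
    case Not(GreaterThan(l, r)) => LessThanOrEqual(l, r)
    case Not(LessThan(l, r)) => GreaterThanOrEqual(l, r)
    case Not(GreaterThanOrEqual(l, r)) => LessThan(l, r)
    case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)
    case other => other
  }

  // Bottom-up: canonicalize the children first, then rewrite the parent.
  def canonicalize(e: Expr): Expr = e match {
    case GreaterThan(l, r) => rewrite(GreaterThan(canonicalize(l), canonicalize(r)))
    case LessThan(l, r) => rewrite(LessThan(canonicalize(l), canonicalize(r)))
    case GreaterThanOrEqual(l, r) => rewrite(GreaterThanOrEqual(canonicalize(l), canonicalize(r)))
    case LessThanOrEqual(l, r) => rewrite(LessThanOrEqual(canonicalize(l), canonicalize(r)))
    case Not(c) => rewrite(Not(canonicalize(c)))
    case leaf: Leaf => leaf
  }
}
```

In this sketch, by the time `rewrite` sees a `Not(GreaterThan(l, r))`, the inner comparison has already been flipped if its left hash was larger, so the guard `l.hashCode > r.hashCode` cannot hold at that point.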

How was this patch tested?

This patch expanded the NOT test case to cover both cases where:

l.hashcode > r.hashcode
l.hashcode < r.hashcode

SparkQA · 2017-01-27T15:24:44Z

Test build #72075 has finished for PR 16719 at commit da0b98c.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

lw-lin · 2017-01-27T16:02:50Z

@cloud-fan @gatorsmile @dongjoon-hyun would you take a look, thanks!

SparkQA · 2017-01-27T18:24:16Z

Test build #72076 has finished for PR 16719 at commit 9c42889.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-01-27T19:38:14Z

Hi, @lw-lin .
Thank you for pinging me. I'll take a look.

dongjoon-hyun · 2017-01-27T19:56:38Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionSetSuite.scala

+  setTest(1, Not(maxHash >= 1), maxHash < 1, Not(Literal(1) <= maxHash), Literal(1) > maxHash)
+  setTest(1, Not(minHash >= 1), minHash < 1, Not(Literal(1) <= minHash), Literal(1) > minHash)
+  setTest(1, Not(maxHash <= 1), maxHash > 1, Not(Literal(1) >= maxHash), Literal(1) < maxHash)
+  setTest(1, Not(minHash <= 1), minHash > 1, Not(Literal(1) >= minHash), Literal(1) < minHash)


These test cases were already handled correctly before this PR. Actually, this PR only simplifies the logic. Am I right?

yea sure they are covered correctly even prior to this patch's changes!

the previous aUpper's hashcode is either greater than or less than 1's hashcode, but it cannot be both, while this change aims to test both cases. That said, I'm quite open to reverting the changes if they are considered unnecessary.

dongjoon-hyun · 2017-01-27T20:04:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala

-    case Not(LessThanOrEqual(l, r)) if l.hashCode() > r.hashCode() => LessThanOrEqual(r, l)
-    case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)
+    case Not(GreaterThan(l, r)) =>
+      assert(l.hashCode() <= r.hashCode())


Can we remove these asserts? It seems to be verified with your test cases now.

thanks! maybe an alternative is to add a comment saying it's guaranteed that l.hashcode <= r.hashcode; otherwise people might wonder at first glance why there is no case Not(GreaterThan(l, r)) if l.hashCode() > r.hashCode().

It should be fine to get rid of assert, as long as we add the code comments and the needed test cases.

dongjoon-hyun · 2017-01-27T20:17:01Z

The original logic was designed to stay safe even if the bottom-up caller code, shown here, were changed.

  lazy val canonicalized: Expression = {
    val canonicalizedChildren = children.map(_.canonicalized)
    Canonicalize.execute(withNewChildren(canonicalizedChildren))
  }

But, I agree that it's safe to simplify that with the new @lw-lin 's test cases.

For the assert statements, I think @cloud-fan and @gatorsmile can give more insightful advice.

For me, LGTM except that. Oh, could you update the title, NOT(l, r)? It looks a little strange.

gatorsmile · 2017-01-29T06:21:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala

    case GreaterThanOrEqual(l, r) if l.hashCode() > r.hashCode() => LessThanOrEqual(r, l)
    case LessThanOrEqual(l, r) if l.hashCode() > r.hashCode() => GreaterThanOrEqual(r, l)

-    case Not(GreaterThan(l, r)) if l.hashCode() > r.hashCode() => GreaterThan(r, l)


This is dead code, because our canonicalization order is bottom-up, right?

uh. Just saw the above comment from @dongjoon-hyun . Thanks!

gatorsmile · 2017-01-29T06:35:08Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionSetSuite.scala

+          // maxHash's hashcode is calculated based on this exprId's hashcode, so we set this
+          // exprId's hashCode to this specific value to make sure maxHash's hashcode is almost
+          // `Int.MaxValue`
+          override def hashCode: Int = 826929706


Why not override def hashCode: Int = Int.MaxValue?

thanks.

the reason is in Canonicalize.scala#ignoreNamesTypes, we're making copies of e (maxHash in this case):

  private def ignoreNamesTypes(e: Expression): Expression = e match {
    case a: AttributeReference =>
      AttributeReference("none", a.dataType.asNullable)(exprId = a.exprId)
    case _ => e
  }

so, even if we override def hashCode: Int = Int.MaxValue on maxHash, it has nothing to do with the copy's hashcode.

then i took a step back -- by defining exprId's hashcode to a specific value (as provided in this patch), we further defined the copied attribute-reference's hashcode.

uh, I did not read the comment carefully. Thanks for the explanation.
You can set it to -1030353449. Then, maxHash.hashCode() will be equal to Int.MaxValue

ah, -1030353449 works great! let me push a commit updating this
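
The point about copies can be shown with a small standalone sketch. Everything in it is hypothetical: `Id`, `Attr`, `ignoreName` and the hash formula `31 * name.hashCode + id.hashCode` are assumptions made for illustration and are not Spark's actual classes or hash computation, so the constants differ from the ones discussed above.

```scala
object HashCopySketch {
  // Hypothetical stand-in for an expression id.
  class Id(val n: Long) {
    override def hashCode: Int = n.hashCode
  }

  // Hypothetical stand-in for an attribute reference.
  class Attr(val name: String, val id: Id) {
    // Assumed formula for illustration only; Spark's real hash differs.
    override def hashCode: Int = 31 * name.hashCode + id.hashCode
  }

  // Mimics the normalization step: build a new attribute that keeps only the id.
  def ignoreName(a: Attr): Attr = new Attr("none", a.id)

  // Overriding hashCode on the attribute itself is lost by the copy,
  // because the copy is a plain Attr built from the fields.
  val pinnedAttr: Attr = new Attr("a", new Id(1L)) {
    override def hashCode: Int = Int.MaxValue
  }

  // Overriding hashCode on the id survives, because the copy reuses the id.
  // The value is solved so that the copy's hash lands exactly on Int.MaxValue.
  val idHash: Int = Int.MaxValue - 31 * "none".hashCode
  val pinnedId: Id = new Id(2L) {
    override def hashCode: Int = idHash
  }
  val attrWithPinnedId: Attr = new Attr("b", pinnedId)
}
```

Under these assumptions, `ignoreName(pinnedAttr).hashCode` no longer equals `Int.MaxValue`, while `ignoreName(attrWithPinnedId).hashCode` does, which is the same idea as choosing a specific exprId hash in the test.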

gatorsmile · 2017-01-29T06:35:38Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionSetSuite.scala

+          // minHash's hashcode is calculated based on this exprId's hashcode, so we set this
+          // exprId's hashCode to this specific value to make sure minHash's hashcode is almost
+          // `Int.MinValue`
+          override def hashCode: Int = 826929707


Why not override def hashCode: Int = Int.MinValue?

To make minHash.hashCode() equal to Int.MinValue, you can set it to 1407330692

updated, thanks!

SparkQA · 2017-01-29T07:48:40Z

Test build #72123 has started for PR 16719 at commit 4c0af3a.

gatorsmile · 2017-01-29T07:58:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala

-    case Not(LessThanOrEqual(l, r)) if l.hashCode() > r.hashCode() => LessThanOrEqual(r, l)
-    case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)
+    case Not(GreaterThan(l, r)) =>
+      assert(l.hashCode() <= r.hashCode())


I think we can remove assert, because the test cases already cover the scenario. You can add a comment to explain.

gatorsmile · 2017-01-29T07:58:54Z

LGTM except one comment. Thanks!

SparkQA · 2017-01-29T10:23:35Z

Test build #72126 has finished for PR 16719 at commit c5fc394.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

lw-lin · 2017-01-29T11:58:03Z

Jenkins retest this please

SparkQA · 2017-01-29T14:25:42Z

Test build #72128 has finished for PR 16719 at commit c5fc394.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-29T20:57:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala


-    case Not(GreaterThan(l, r)) if l.hashCode() > r.hashCode() => GreaterThan(r, l)
+    // Note in the following `NOT` cases, `l.hashCode() <= r.hashCode()` holds. The reason is that
+    // canonicalization is conducted bottom-up -- see [[Expression.canonicalized]].


To the other reviewers: this PR added test cases in ExpressionSetSuite.scala to ensure it. Thus, it is safe to clean up the code.

gatorsmile · 2017-01-29T21:01:39Z

Thanks! Merging to master.

…ot expect such cases that l.hashcode > r.hashcode

## What changes were proposed in this pull request?

During canonicalization, `NOT(...(l, r))` should not expect such cases that `l.hashcode > r.hashcode`.

Take the rule `case NOT(GreaterThan(l, r)) if l.hashcode > r.hashcode` for example, it should never be matched since `GreaterThan(l, r)` itself would be re-written as `GreaterThan(r, l)` given `l.hashcode > r.hashcode` after canonicalization.

This patch consolidates rules like `case NOT(GreaterThan(l, r)) if l.hashcode > r.hashcode` and `case NOT(GreaterThan(l, r))`.

## How was this patch tested?

This patch expanded the `NOT` test case to cover both cases where:

- `l.hashcode > r.hashcode`
- `l.hashcode < r.hashcode`

Author: Liwei Lin <lwlin7@gmail.com>

Closes apache#16719 from lw-lin/canonicalize.

Fix

da0b98c

Fix style

9c42889

dongjoon-hyun reviewed Jan 27, 2017

lw-lin changed the title ~~[SPARK-19385][SQL] During canonicalization, NOT(l, r) should not expect such cases that l.hashcode > r.hashcode~~ [SPARK-19385][SQL] During canonicalization, NOT(...(l, r)) should not expect such cases that l.hashcode > r.hashcode Jan 28, 2017

gatorsmile reviewed Jan 29, 2017

Comments from @gatorsmile

4c0af3a

gatorsmile reviewed Jan 29, 2017

Remove asserts

c5fc394

gatorsmile reviewed Jan 29, 2017

asfgit closed this in ade075a Jan 29, 2017

lw-lin deleted the canonicalize branch March 1, 2017 09:06
