Skip to content

Conversation

@lw-lin
Copy link
Contributor

@lw-lin lw-lin commented Jan 27, 2017

What changes were proposed in this pull request?

During canonicalization, NOT(...(l, r)) should not expect such cases that l.hashcode > r.hashcode.

Take the rule case NOT(GreaterThan(l, r)) if l.hashcode > r.hashcode for example, it should never be matched since GreaterThan(l, r) itself would be re-written as GreaterThan(r, l) given l.hashcode > r.hashcode after canonicalization.

This patch consolidates rules like case NOT(GreaterThan(l, r)) if l.hashcode > r.hashcode and case NOT(GreaterThan(l, r)).

How was this patch tested?

This patch expanded the NOT test case to cover both cases where:

  • l.hashcode > r.hashcode
  • l.hashcode < r.hashcode

@SparkQA
Copy link

SparkQA commented Jan 27, 2017

Test build #72075 has finished for PR 16719 at commit da0b98c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin
Copy link
Contributor Author

lw-lin commented Jan 27, 2017

@cloud-fan @gatorsmile @dongjoon-hyun would you take a look, thanks!

@SparkQA
Copy link

SparkQA commented Jan 27, 2017

Test build #72076 has finished for PR 16719 at commit 9c42889.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Hi, @lw-lin .
Thank you for pining me. I'll take a look.

setTest(1, Not(maxHash >= 1), maxHash < 1, Not(Literal(1) <= maxHash), Literal(1) > maxHash)
setTest(1, Not(minHash >= 1), minHash < 1, Not(Literal(1) <= minHash), Literal(1) > minHash)
setTest(1, Not(maxHash <= 1), maxHash > 1, Not(Literal(1) >= maxHash), Literal(1) < maxHash)
setTest(1, Not(minHash <= 1), minHash > 1, Not(Literal(1) >= minHash), Literal(1) < minHash)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test cases are covered previously correctly. Actually, this PR simplifies the logics only. Am I right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea sure they are covered correctly even prior to this patch's changes!

the previous aUpper'hashcode is either greater than or less than 1's hashcode but can not be both, while this change aims to test both cases -- but I'm quite open to revert the changes if they are considered unnecessary.

case Not(LessThanOrEqual(l, r)) if l.hashCode() > r.hashCode() => LessThanOrEqual(r, l)
case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)
case Not(GreaterThan(l, r)) =>
assert(l.hashCode() <= r.hashCode())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove these asserts? It seems to be verified with your test cases now.

Copy link
Contributor Author

@lw-lin lw-lin Jan 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! maybe an alternative way is to add comments saying it's guaranteed that l.hashcode <= r.hashcode, otherwise people might wonder why there is no case Not(GreaterThan(l, r)) if l.hashCode() > r.hashCode() at their first glance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be fine to get rid of assert, as long as we add the code comments and the needed test cases.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jan 27, 2017

The original logic was designed to be safe for changing the caller bottom-up code, here.

  lazy val canonicalized: Expression = {
    val canonicalizedChildren = children.map(_.canonicalized)
    Canonicalize.execute(withNewChildren(canonicalizedChildren))
  }

But, I agree that it's safe to simplify that with the new @lw-lin 's test cases.

For the assert statements, I think @cloud-fan and @gatorsmile can give more insightful advice.

For me, LGTM except that. Oh, could you update the title, NOT(l, r)? It looks a little strange.

@lw-lin lw-lin changed the title [SPARK-19385][SQL] During canonicalization, NOT(l, r) should not expect such cases that l.hashcode > r.hashcode [SPARK-19385][SQL] During canonicalization, NOT(...(l, r)) should not expect such cases that l.hashcode > r.hashcode Jan 28, 2017
case GreaterThanOrEqual(l, r) if l.hashCode() > r.hashCode() => LessThanOrEqual(r, l)
case LessThanOrEqual(l, r) if l.hashCode() > r.hashCode() => GreaterThanOrEqual(r, l)

case Not(GreaterThan(l, r)) if l.hashCode() > r.hashCode() => GreaterThan(r, l)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a dead code, because our canonicalization order is bottom up, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uh. Just saw the above comment from @dongjoon-hyun . Thanks!

// maxHash's hashcode is calculated based on this exprId's hashcode, so we set this
// exprId's hashCode to this specific value to make sure maxHash's hashcode is almost
// `Int.MaxValue`
override def hashCode: Int = 826929706
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not override def hashCode: Int = Int.MaxValue?

Copy link
Contributor Author

@lw-lin lw-lin Jan 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks.

the reason is in Canonicalize.scala#ignoreNamesTypes, we're making copies of e (maxHash in this case):

  private def ignoreNamesTypes(e: Expression): Expression = e match {
    case a: AttributeReference =>
      AttributeReference("none", a.dataType.asNullable)(exprId = a.exprId)
    case _ => e
  }

so, even if we override def hashCode: Int = Int.MaxValue on maxHash, it has nothing to do with the copy's hashcode.

then i took a step back -- by defining exprId's hashcode to a specific value (as provided in this patch), we further defined the copied attribute-reference's hashcode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uh, I did not read the comment carefully. Thanks for the explanation.
You can set it to -1030353449. Then, maxHash.hashCode() will be equal to Int.MaxValue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, -1030353449 works great! let me push a commit updating this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

// minHash's hashcode is calculated based on this exprId's hashcode, so we set this
// exprId's hashCode to this specific value to make sure minHash's hashcode is almost
// `Int.MinValue`
override def hashCode: Int = 826929707
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not override def hashCode: Int = Int.MinValue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make minHash.hashCode() equal to Int.MinValue, you can set it to 1407330692

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thanks!

@SparkQA
Copy link

SparkQA commented Jan 29, 2017

Test build #72123 has started for PR 16719 at commit 4c0af3a.

case Not(LessThanOrEqual(l, r)) if l.hashCode() > r.hashCode() => LessThanOrEqual(r, l)
case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)
case Not(GreaterThan(l, r)) =>
assert(l.hashCode() <= r.hashCode())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove assert, because the test cases already cover the scenario. You can add a comment to explain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure!

@gatorsmile
Copy link
Member

LGTM except one comment. Thanks!

@SparkQA
Copy link

SparkQA commented Jan 29, 2017

Test build #72126 has finished for PR 16719 at commit c5fc394.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin
Copy link
Contributor Author

lw-lin commented Jan 29, 2017

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Jan 29, 2017

Test build #72128 has finished for PR 16719 at commit c5fc394.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


case Not(GreaterThan(l, r)) if l.hashCode() > r.hashCode() => GreaterThan(r, l)
// Note in the following `NOT` cases, `l.hashCode() <= r.hashCode()` holds. The reason is that
// canonicalization is conducted bottom-up -- see [[Expression.canonicalized]].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the other reviewers, this PR added test cases in ExpressionSetSuite.scala to ensure it. Thus, it is safe to clean the codes.

@gatorsmile
Copy link
Member

Thanks! Merging to master.

@asfgit asfgit closed this in ade075a Jan 29, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…ot expect such cases that l.hashcode > r.hashcode

## What changes were proposed in this pull request?

During canonicalization, `NOT(...(l, r))` should not expect such cases that `l.hashcode > r.hashcode`.

Take the rule `case NOT(GreaterThan(l, r)) if l.hashcode > r.hashcode` for example, it should never be matched since `GreaterThan(l, r)` itself would be re-written as `GreaterThan(r, l)` given `l.hashcode > r.hashcode` after canonicalization.

This patch consolidates rules like `case NOT(GreaterThan(l, r)) if l.hashcode > r.hashcode` and `case NOT(GreaterThan(l, r))`.

## How was this patch tested?

This patch expanded the `NOT` test case to cover both cases where:
- `l.hashcode > r.hashcode`
- `l.hashcode < r.hashcode`

Author: Liwei Lin <lwlin7@gmail.com>

Closes apache#16719 from lw-lin/canonicalize.
@lw-lin lw-lin deleted the canonicalize branch March 1, 2017 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants