@jiangxb1987 jiangxb1987 commented Nov 22, 2016

What changes were proposed in this pull request?

The expression in(empty seq) is invalid in some data sources. Since in(empty seq) is always false, the optimizer should rewrite in(empty seq) to a false literal.
The SQL SELECT * FROM t WHERE a IN () throws a ParseException, which is consistent with Hive, so that behavior does not need to change.

How was this patch tested?

Add new test case in OptimizeInSuite.

hvanhovell commented Nov 22, 2016

SparkQA commented Nov 22, 2016

Test build #68991 has finished for PR 15977 at commit 57dfc23.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

@hvanhovell Thanks for that catch! I'll fix that too.

jiangxb1987 commented Nov 22, 2016

The failed test case does not seem to be related to our change here.

s"$attr IN (${compileValue(value)})"
} else {
// Return false literal when value is empty.
s"false"
Member

Nit: remove interpolation

SparkQA commented Nov 22, 2016

Test build #68998 has finished for PR 15977 at commit b6681de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case StringEndsWith(attr, value) => s"${attr} LIKE '%${value}'"
case StringContains(attr, value) => s"${attr} LIKE '%${value}%'"
case In(attr, value) => s"$attr IN (${compileValue(value)})"
case In(attr, value) =>
Contributor

NIT: It is shorter to put both options in different case statements:

...
case In(attr, value) if value.nonEmpty => s"$attr IN (${compileValue(value)})"
case In(_, _) => "false"
...

case class OptimizeIn(conf: CatalystConf) extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case q: LogicalPlan => q transformExpressionsDown {
case expr @ In(v, list) if list.isEmpty =>
Contributor

What's the current behavior for (null in ())? We want to make sure the implementation of In returns the same result (false here).

Contributor

NULL IN () returns null; unfortunately, NULL NOT IN () also returns null. This rule can cause NULL NOT IN () to evaluate to true instead of null, which is incorrect (a filter only accepts rows for which the predicate evaluates to true).

We either have to drop the rule, or apply it to top level IN expressions only.
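
To make the NOT IN problem concrete, here is a minimal sketch in plain Scala (not Spark code). It assumes SQL booleans can be modelled as `Option[Boolean]`, with `None` standing for NULL; the names `in`, `not`, `keepsRow`, and `inRewritten` are made up for this illustration.

```scala
object ThreeValuedIn {
  // SQL booleans modelled as Option[Boolean]: None stands for NULL (unknown).
  type SqlBool = Option[Boolean]

  // Evaluates `value IN (list)` as described in the thread:
  // a NULL value yields NULL, even when the list is empty.
  def in(value: Option[Int], list: Seq[Option[Int]]): SqlBool =
    value match {
      case None => None
      case Some(v) =>
        if (list.contains(Some(v))) Some(true)
        else if (list.contains(None)) None
        else Some(false)
    }

  // SQL NOT: NULL stays NULL.
  def not(b: SqlBool): SqlBool = b.map(!_)

  // A filter keeps a row only when the predicate is exactly true.
  def keepsRow(b: SqlBool): Boolean = b.contains(true)

  // The rewrite under review: every empty IN list becomes the literal false.
  def inRewritten(value: Option[Int], list: Seq[Option[Int]]): SqlBool =
    if (list.isEmpty) Some(false) else in(value, list)
}
```

With this model, `not(in(None, Seq.empty))` stays NULL and the row is dropped, while `not(inRewritten(None, Seq.empty))` becomes true and the row is kept, which is the discrepancy described above.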

Contributor Author

How about adding a new case to handle a null literal in value, like the following?

      case expr @ In(v @ Literal(null, _), list) =>
        v
        
      case expr @ In(v, list) if list.isEmpty =>
        FalseLiteral

Contributor

You can remove this change. IN() is very efficient in these cases.

if (value.nonEmpty) {
s"$attr IN (${compileValue(value)})"
} else {
// Return false literal when value is empty.
Member

This is dead code, right?

If the optimizer already replaces the empty value with a constant false, how can this branch be reached?

Contributor

We shouldn't rely on the optimizer for correctness.

SparkQA commented Nov 24, 2016

Test build #69113 has started for PR 15977 at commit 9bb1264.

* 1. Removes literal repetitions.
* 2. Replaces [[In (value, seq[Literal])]] with optimized version
* [[InSet (value, HashSet[Literal])]] which is much faster.
* 3. Replaces [[In (value, Seq.empty)]] with false literal.
Member

You need to update this, right?

SparkQA commented Nov 24, 2016

Test build #69118 has started for PR 15977 at commit 2f31e72.

@jiangxb1987
Contributor Author

retest this please.

SparkQA commented Nov 24, 2016

Test build #69120 has finished for PR 15977 at commit 2f31e72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case StringEndsWith(attr, value) => s"${attr} LIKE '%${value}'"
case StringContains(attr, value) => s"${attr} LIKE '%${value}%'"
case In(attr, value) => s"$attr IN (${compileValue(value)})"
case In(null, value) => "NULL"
Member

Dumb question, but does this replacement make sense? Can NULL be a predicate? Something that might otherwise render as SELECT * FROM foo WHERE NULL in () now becomes SELECT * FROM foo WHERE NULL?

Member

@gatorsmile gatorsmile Nov 24, 2016

Yeah. NULL in predicates means UNKNOWN.

FYI,

rxin commented Nov 24, 2016

I believe the correct replacement for "x in ()" is "if (isnull(x)) null else false"
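
A small sketch of that replacement, assuming a toy expression tree rather than Catalyst's real classes: `Lit`, `In`, `IsNull`, `If`, `eval`, and `rewrite` are illustrative names only, and NULL is modelled as `None`.

```scala
object EmptyInRewrite {
  // A toy expression tree; these are illustrative types, not Catalyst classes.
  sealed trait Expr
  case class Lit(value: Option[Any]) extends Expr // None models NULL
  case class In(value: Expr, list: Seq[Expr]) extends Expr
  case class IsNull(child: Expr) extends Expr
  case class If(cond: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr

  // Evaluates an expression to Some(result), or None for NULL.
  def eval(e: Expr): Option[Any] = e match {
    case Lit(v) => v
    case IsNull(c) => Some(eval(c).isEmpty)
    case If(c, t, f) => if (eval(c).contains(true)) eval(t) else eval(f)
    case In(v, list) =>
      eval(v) match {
        case None => None // NULL IN (...) is NULL
        case Some(x) =>
          val items = list.map(eval)
          if (items.contains(Some(x))) Some(true)
          else if (items.contains(None)) None
          else Some(false)
      }
  }

  // Rewrites `x IN ()` to `if (isnull(x)) null else false`; other nodes are unchanged.
  def rewrite(e: Expr): Expr = e match {
    case In(v, list) if list.isEmpty => If(IsNull(v), Lit(None), Lit(Some(false)))
    case other => other
  }
}
```

Under these assumptions, the rewritten expression evaluates to the same result as the original empty IN for both a NULL and a non-NULL input, which is the property the suggested replacement is meant to preserve.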

@gatorsmile
Member

How about adding the following two lines into the test case in PredicateSuite.scala?

    checkEvaluation(In(Literal.create(null, IntegerType), Seq.empty), null)
    checkEvaluation(In(Literal(1), Seq.empty), false)

rxin commented Nov 24, 2016

Also use NonFoldableLiteral to do the test. Make sure you create a null nonfoldable literal to verify the result is null.

jiangxb1987 commented Nov 25, 2016

Thank you for the comments @rxin @srowen @gatorsmile! I'll update that later today!

SparkQA commented Nov 25, 2016

Test build #69150 has finished for PR 15977 at commit 05b5016.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 25, 2016

Test build #69151 has started for PR 15977 at commit 99d4623.

@jiangxb1987
Contributor Author

retest this please.

SparkQA commented Nov 25, 2016

Test build #69154 has finished for PR 15977 at commit 99d4623.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

retest this please. The failure does not seem to be related to our change in this PR.

SparkQA commented Nov 25, 2016

Test build #69161 has finished for PR 15977 at commit 99d4623.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case class OptimizeIn(conf: CatalystConf) extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case q: LogicalPlan => q transformExpressionsDown {
case expr @ In(v, list) if list.isEmpty =>
Contributor

You can remove this change. IN() is very efficient in these cases.

case StringStartsWith(attr, value) => s"${attr} LIKE '${value}%'"
case StringEndsWith(attr, value) => s"${attr} LIKE '%${value}'"
case StringContains(attr, value) => s"${attr} LIKE '%${value}%'"
case In(attr, value) if value.isEmpty => s"IF(${attr} IS NULL, NULL, false)"
Contributor

IF is not part of the SQL standard; use: CASE WHEN $attr IS NULL THEN NULL ELSE FALSE END
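
For illustration, a self-contained sketch of how a filter-to-SQL function could handle the empty list with a standard CASE expression. The `Filter`, `In`, and `EqualTo` types and the `compileValue` and `compileFilter` helpers below are hypothetical stand-ins written for this example, not Spark's own data source classes.

```scala
object JdbcFilterSketch {
  // Hypothetical stand-ins for data source filters; not Spark's own classes.
  sealed trait Filter
  case class In(attr: String, values: Seq[Any]) extends Filter
  case class EqualTo(attr: String, value: Any) extends Filter

  // Renders a single value as a SQL literal (strings are quoted and escaped).
  def compileValue(value: Any): String = value match {
    case s: String => "'" + s.replace("'", "''") + "'"
    case other => other.toString
  }

  def compileFilter(f: Filter): String = f match {
    case EqualTo(attr, value) =>
      s"$attr = ${compileValue(value)}"
    // An empty list cannot be rendered as `attr IN ()`, so emit a
    // standard-SQL expression that is NULL for NULL input and false otherwise.
    case In(attr, values) if values.isEmpty =>
      s"CASE WHEN $attr IS NULL THEN NULL ELSE FALSE END"
    case In(attr, values) =>
      val rendered = values.map(compileValue).mkString(", ")
      s"$attr IN ($rendered)"
  }
}
```

The guarded case comes first so that the general `In` case never sees an empty list, which keeps the generated SQL valid without relying on an earlier optimizer rule.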

SparkQA commented Nov 25, 2016

Test build #69168 has finished for PR 15977 at commit d770934.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM. Merging to master/2.1. Thanks!

asfgit pushed a commit that referenced this pull request Nov 25, 2016
## What changes were proposed in this pull request?

The expression `in(empty seq)` is invalid in some data sources. Since `in(empty seq)` is always false, the optimizer should rewrite `in(empty seq)` to a false literal.
The SQL `SELECT * FROM t WHERE a IN ()` throws a `ParseException`, which is consistent with Hive, so that behavior does not need to change.

## How was this patch tested?
Add new test case in `OptimizeInSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #15977 from jiangxb1987/isin-empty.

(cherry picked from commit e2fb9fd)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
@asfgit asfgit closed this in e2fb9fd Nov 25, 2016
@jiangxb1987 jiangxb1987 deleted the isin-empty branch November 26, 2016 10:14
zzcclp added a commit to zzcclp/spark that referenced this pull request Nov 30, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017