[SPARK-18436][SQL] isin causing SQL syntax error with JDBC #15977

jiangxb1987 · 2016-11-22T10:28:02Z

What changes were proposed in this pull request?

The expression in(empty seq) is invalid in some data sources. Since in(empty seq) is always false, the optimizer should rewrite in(empty seq) to a false literal.
The SQL statement SELECT * FROM t WHERE a IN () throws a ParseException, which is consistent with Hive, so that behavior does not need to change.

How was this patch tested?

Added a new test case in OptimizeInSuite.

jiangxb1987 · 2016-11-22T10:32:56Z

cc @hvanhovell @rxin @srowen @gatorsmile @windpiger

hvanhovell · 2016-11-22T11:05:54Z

Could you also fix the actual problem: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L123

SparkQA · 2016-11-22T12:07:44Z

Test build #68991 has finished for PR 15977 at commit 57dfc23.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2016-11-22T12:21:39Z

@hvanhovell Thanks for that catch! I'll fix that too.

jiangxb1987 · 2016-11-22T12:22:05Z

The failed test case does not seem to be related to our change here.

srowen · 2016-11-22T13:34:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala

+          s"$attr IN (${compileValue(value)})"
+        } else {
+          // Return false literal when value is empty.
+          s"false"


Nit: remove interpolation

SparkQA · 2016-11-22T15:15:59Z

Test build #68998 has finished for PR 15977 at commit b6681de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-11-22T16:20:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala

      case StringEndsWith(attr, value) => s"${attr} LIKE '%${value}'"
      case StringContains(attr, value) => s"${attr} LIKE '%${value}%'"
-      case In(attr, value) => s"$attr IN (${compileValue(value)})"
+      case In(attr, value) =>


NIT: It is shorter to put both options in different case statements:

    ...
    case In(attr, value) if value.nonEmpty => s"$attr IN (${compileValue(value)})"
    case In(_, _) => "false"
    ...

rxin · 2016-11-22T18:20:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

 case class OptimizeIn(conf: CatalystConf) extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
+      case expr @ In(v, list) if list.isEmpty =>


What's the current behavior for (null in ())? We want to make sure the implementation of In returns the same result (False here).

NULL IN () returns null; unfortunately, NULL NOT IN () also returns null. This rule can cause NULL NOT IN () to evaluate to true instead of null, which is incorrect (a filter only accepts rows for which the predicate evaluates to true).

We either have to drop the rule, or apply it to top level IN expressions only.
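To make the concern above concrete, here is a minimal sketch of SQL three-valued logic in plain Scala. It is not Spark code: `SqlBool`, `inEmptyAsFalse`, `inEmptyNullAware` and `keep` are hypothetical names introduced only for illustration, and NULL is modeled as `None`.

```scala
object EmptyInSemantics {
  // SQL three-valued logic modeled with Option[Boolean]; None stands for NULL/UNKNOWN.
  type SqlBool = Option[Boolean]

  // NOT leaves NULL as NULL and negates true/false.
  def not(b: SqlBool): SqlBool = b.map(!_)

  // The rewrite under discussion: treat `x IN ()` as the constant false.
  def inEmptyAsFalse(x: Option[Int]): SqlBool = Some(false)

  // Null-aware semantics: NULL IN () is NULL, any non-null value IN () is false.
  def inEmptyNullAware(x: Option[Int]): SqlBool = x.map(_ => false)

  // A filter keeps a row only when the predicate evaluates to exactly true.
  def keep(p: SqlBool): Boolean = p.contains(true)

  def main(args: Array[String]): Unit = {
    val rows: Seq[Option[Int]] = Seq(Some(1), None)
    // WHERE NOT (x IN ()) under the constant-false rewrite: the NULL row passes the filter.
    println(rows.filter(x => keep(not(inEmptyAsFalse(x)))))
    // Under null-aware semantics the NULL row is rejected.
    println(rows.filter(x => keep(not(inEmptyNullAware(x)))))
  }
}
```

The two filters differ only on the NULL row, which is the case the rule would get wrong once the IN is wrapped in a NOT.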

How about we add a new case to handle null literal in value? Like the following:

    case expr @ In(v @ Literal(null, _), list) => v
    case expr @ In(v, list) if list.isEmpty => FalseLiteral

You can remove this change. IN() is very efficient in these cases.

gatorsmile · 2016-11-22T19:10:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala

+        if (value.nonEmpty) {
+          s"$attr IN (${compileValue(value)})"
+        } else {
+          // Return false literal when value is empty.


This is dead code, right?

If the optimizer already replaces the empty list with a constant false, how can this branch be reached?

We shouldn't rely on the optimizer for correctness.

SparkQA · 2016-11-24T07:27:33Z

Test build #69113 has started for PR 15977 at commit 9bb1264.

gatorsmile · 2016-11-24T07:27:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

 * 1. Removes literal repetitions.
 * 2. Replaces [[In (value, seq[Literal])]] with optimized version
 *    [[InSet (value, HashSet[Literal])]] which is much faster.
+ * 3. Replaces [[In (value, Seq.empty)]] with false literal.


You need to update this, right?

SparkQA · 2016-11-24T07:47:34Z

Test build #69118 has started for PR 15977 at commit 2f31e72.

jiangxb1987 · 2016-11-24T08:19:22Z

retest this please.

SparkQA · 2016-11-24T10:42:34Z

Test build #69120 has finished for PR 15977 at commit 2f31e72.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-11-24T10:45:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala

      case StringEndsWith(attr, value) => s"${attr} LIKE '%${value}'"
      case StringContains(attr, value) => s"${attr} LIKE '%${value}%'"
-      case In(attr, value) => s"$attr IN (${compileValue(value)})"
+      case In(null, value) => "NULL"


Dumb question, but does this replacement make sense? Can NULL be a predicate? Something that might otherwise render as SELECT * FROM foo WHERE NULL IN () now becomes SELECT * FROM foo WHERE NULL?

Yeah. NULL in predicates means UNKNOWN.

FYI,

Our source code for the IN operator:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

Lines 141 to 142 in 84284e8

    if (evaluatedValue == null) {
      null

Oracle Docs of NULL: https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements005.htm

rxin · 2016-11-24T18:50:21Z

I believe the correct replacement for "x in ()" is "if (isnull(x)) null else false"

gatorsmile · 2016-11-24T19:31:13Z

How about adding the following two lines into the test case in PredicateSuite.scala?

    checkEvaluation(In(Literal.create(null, IntegerType), Seq.empty), null)
    checkEvaluation(In(Literal(1), Seq.empty), false)

rxin · 2016-11-24T19:38:04Z

Also use NonFoldableLiteral to do the test. Make sure you create a null nonfoldable literal to verify the result is null.
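As a rough model of what the suggested checks are meant to pin down, the following sketch evaluates IN over nullable integers in plain Scala. It does not use Spark's `In` expression or `checkEvaluation`; `evalIn` is a hypothetical helper, NULL is modeled as `None`, and the handling of a NULL inside the list is assumed to follow standard SQL three-valued logic rather than taken from this thread.

```scala
object InEvalModel {
  // None stands for SQL NULL, both for the tested value and for list elements.
  def evalIn(value: Option[Int], list: Seq[Option[Int]]): Option[Boolean] =
    value match {
      // A NULL value yields NULL regardless of the list, including the empty list.
      case None => None
      // A non-null value that matches an element yields true.
      case Some(v) if list.contains(Some(v)) => Some(true)
      // No match, but the list contains NULL: the result is unknown (assumed SQL semantics).
      case Some(_) if list.contains(None) => None
      // No match and no NULL in the list: false. This covers the empty list.
      case Some(_) => Some(false)
    }
}
```

Under this model, `evalIn(None, Seq.empty)` is `None` and `evalIn(Some(1), Seq.empty)` is `Some(false)`, mirroring the two expectations proposed for PredicateSuite.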

jiangxb1987 · 2016-11-25T04:27:55Z

Thank you for the comments, @rxin @srowen @gatorsmile! I'll update the PR later today.

SparkQA · 2016-11-25T07:39:18Z

Test build #69150 has finished for PR 15977 at commit 05b5016.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-25T07:42:35Z

Test build #69151 has started for PR 15977 at commit 99d4623.

jiangxb1987 · 2016-11-25T08:46:32Z

retest this please.

SparkQA · 2016-11-25T10:43:16Z

Test build #69154 has finished for PR 15977 at commit 99d4623.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2016-11-25T10:49:57Z

retest this please. The failure does not seem to be related to the change in this PR.

SparkQA · 2016-11-25T13:33:18Z

Test build #69161 has finished for PR 15977 at commit 99d4623.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-11-25T13:38:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

 case class OptimizeIn(conf: CatalystConf) extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsDown {
+      case expr @ In(v, list) if list.isEmpty =>


You can remove this change. IN() is very efficient in these cases.

hvanhovell · 2016-11-25T13:41:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala

      case StringStartsWith(attr, value) => s"${attr} LIKE '${value}%'"
      case StringEndsWith(attr, value) => s"${attr} LIKE '%${value}'"
      case StringContains(attr, value) => s"${attr} LIKE '%${value}%'"
+      case In(attr, value) if value.isEmpty => s"IF(${attr} IS NULL, NULL, false)"


IF is not a part of the SQL standard, use: CASE WHEN $attr IS NULL THEN NULL ELSE FALSE END
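A minimal, self-contained sketch of how the two cases could be compiled to a SQL fragment is shown below. The `In` case class and `compileValue` here are simplified, hypothetical stand-ins written for this example; they are not the actual Spark classes or the JDBCRDD implementation.

```scala
object JdbcFilterSketch {
  // Hypothetical, simplified stand-in for a data source IN filter (not Spark's class).
  final case class In(attribute: String, values: Seq[Any])

  // Simplified literal rendering: NULL keyword, quoted strings, everything else via toString.
  def compileValue(value: Any): String = value match {
    case null => "NULL"
    case s: String => "'" + s.replace("'", "''") + "'"
    case other => other.toString
  }

  def compileIn(filter: In): String = filter match {
    // Empty list: emit a standard-SQL expression that is NULL for a NULL attribute
    // and FALSE otherwise, instead of the invalid `attr IN ()`.
    case In(attr, values) if values.isEmpty =>
      s"CASE WHEN $attr IS NULL THEN NULL ELSE FALSE END"
    // Non-empty list: emit a regular IN predicate.
    case In(attr, values) =>
      s"$attr IN (${values.map(compileValue).mkString(", ")})"
  }
}
```

Putting the empty-list guard first keeps each case a single expression, in line with the earlier suggestion to split the options into separate case statements.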

SparkQA · 2016-11-25T19:20:23Z

Test build #69168 has finished for PR 15977 at commit d770934.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-11-25T20:44:09Z

LGTM. Merging to master/2.1. Thanks!

## What changes were proposed in this pull request?

The expression `in(empty seq)` is invalid in some data source. Since `in(empty seq)` is always false, we should generate `in(empty seq)` to false literal in optimizer.
The sql `SELECT * FROM t WHERE a IN ()` throws a `ParseException` which is consistent with Hive, don't need to change that behavior.

## How was this patch tested?

Add new test case in `OptimizeInSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #15977 from jiangxb1987/isin-empty.

(cherry picked from commit e2fb9fd)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

Replace In(value, Seq.empty) with false literal.

57dfc23

handle the case in JDBCRDD.

b6681de

define behavior for In(NULL, list).

9bb1264

update comment.

2f31e72

jiangxb1987 added 2 commits November 25, 2016 15:31

bugfix

05b5016

fix scala style check fail.

99d4623

remove optimizer rule for IN(value, Seq.empty).

d770934

asfgit closed this in e2fb9fd Nov 25, 2016

jiangxb1987 deleted the isin-empty branch November 26, 2016 10:14

zzcclp added a commit to zzcclp/spark that referenced this pull request Nov 30, 2016

[EXT][SPARK-18436][SQL] isin causing SQL syntax error with JDBC apach…

2abb6a4

…e#15977
