-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-18436][SQL] isin causing SQL syntax error with JDBC #15977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Could you also fix the actual problem: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L123 |
|
Test build #68991 has finished for PR 15977 at commit
|
|
@hvanhovell Thanks for that catch! I'll fix that too. |
|
The failed test case seems not related to our change here. |
| s"$attr IN (${compileValue(value)})" | ||
| } else { | ||
| // Return false literal when value is empty. | ||
| s"false" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: remove interpolation
|
Test build #68998 has finished for PR 15977 at commit
|
| case StringEndsWith(attr, value) => s"${attr} LIKE '%${value}'" | ||
| case StringContains(attr, value) => s"${attr} LIKE '%${value}%'" | ||
| case In(attr, value) => s"$attr IN (${compileValue(value)})" | ||
| case In(attr, value) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: It is shorter to put both options in different case statements:
...
case In(attr, values) if value.nonEmpty => s"$attr IN (${compileValue(value)})"
case In(_, _) => "false"
...| case class OptimizeIn(conf: CatalystConf) extends Rule[LogicalPlan] { | ||
| def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
| case q: LogicalPlan => q transformExpressionsDown { | ||
| case expr @ In(v, list) if list.isEmpty => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the current behavior for (null in ()) ? We want to make sure the implementation of In returns the same result (False here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULL IN () returns null, unfortunately NULL NOT IN () also returns null. This rule can cause NOT IN () to evaluate to true instead of null which is illegal (a filter only accepts rows for which the predicate evaluates to true).
We either have to drop the rule, or apply it to top level IN expressions only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we add a new case to handle null literal in value? Like the following:
case expr @ In(v @ Literal(null, _), list) =>
v
case expr @ In(v, list) if list.isEmpty =>
FalseLiteral
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this change. IN() is very efficient in these cases.
| if (value.nonEmpty) { | ||
| s"$attr IN (${compileValue(value)})" | ||
| } else { | ||
| // Return false literal when value is empty. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is dead code, right?
If Optimizer already replace the empty value by a constant false, how to reach this branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't rely on the optimizer for correctness.
|
Test build #69113 has started for PR 15977 at commit |
| * 1. Removes literal repetitions. | ||
| * 2. Replaces [[In (value, seq[Literal])]] with optimized version | ||
| * [[InSet (value, HashSet[Literal])]] which is much faster. | ||
| * 3. Replaces [[In (value, Seq.empty)]] with false literal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to update this, right?
|
Test build #69118 has started for PR 15977 at commit |
|
retest this please. |
|
Test build #69120 has finished for PR 15977 at commit
|
| case StringEndsWith(attr, value) => s"${attr} LIKE '%${value}'" | ||
| case StringContains(attr, value) => s"${attr} LIKE '%${value}%'" | ||
| case In(attr, value) => s"$attr IN (${compileValue(value)})" | ||
| case In(null, value) => "NULL" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dumb question but does this replacement make sense? can NULL be a predicate? Something that might otherwise render as SELECT * FROM foo WHERE NULL in () now becomes SELECT * FROM foo WHERE NULL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. NULL in predicates means UNKNOWN.
FYI,
-
Our source code for the
INoperator:spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
Lines 141 to 142 in 84284e8
if (evaluatedValue == null) { null -
Oracle Docs of
NULL: https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements005.htm
|
I believe the correct replacement for "x in ()" is "if (isnull(x)) null else false" |
|
How about adding the following two lines into the test case in checkEvaluation(In(Literal.create(null, IntegerType), Seq.empty), null)
checkEvaluation(In(Literal(1), Seq.empty), false) |
|
Also use NonFoldableLiteral to do the test. Make sure you create a null nonfoldable literal to verify the result is null. |
|
Thank you for comment @rxin @srowen @gatorsmile! I'll update that later today! |
|
Test build #69150 has finished for PR 15977 at commit
|
|
Test build #69151 has started for PR 15977 at commit |
|
retest this please. |
|
Test build #69154 has finished for PR 15977 at commit
|
|
retest this please. - The failure seems not related to our change in this PR. |
|
Test build #69161 has finished for PR 15977 at commit
|
| case class OptimizeIn(conf: CatalystConf) extends Rule[LogicalPlan] { | ||
| def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
| case q: LogicalPlan => q transformExpressionsDown { | ||
| case expr @ In(v, list) if list.isEmpty => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this change. IN() is very efficient in these cases.
| case StringStartsWith(attr, value) => s"${attr} LIKE '${value}%'" | ||
| case StringEndsWith(attr, value) => s"${attr} LIKE '%${value}'" | ||
| case StringContains(attr, value) => s"${attr} LIKE '%${value}%'" | ||
| case In(attr, value) if value.isEmpty => s"IF(${attr} IS NULL, NULL, false)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IF is not a part of the SQL standard, use: CASE WHEN $attr IS NULL THEN NULL ELSE FALSE END
|
Test build #69168 has finished for PR 15977 at commit
|
|
LGTM. Merging to master/2.1. Thanks! |
## What changes were proposed in this pull request? The expression `in(empty seq)` is invalid in some data source. Since `in(empty seq)` is always false, we should generate `in(empty seq)` to false literal in optimizer. The sql `SELECT * FROM t WHERE a IN ()` throws a `ParseException` which is consistent with Hive, don't need to change that behavior. ## How was this patch tested? Add new test case in `OptimizeInSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15977 from jiangxb1987/isin-empty. (cherry picked from commit e2fb9fd) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
## What changes were proposed in this pull request? The expression `in(empty seq)` is invalid in some data source. Since `in(empty seq)` is always false, we should generate `in(empty seq)` to false literal in optimizer. The sql `SELECT * FROM t WHERE a IN ()` throws a `ParseException` which is consistent with Hive, don't need to change that behavior. ## How was this patch tested? Add new test case in `OptimizeInSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15977 from jiangxb1987/isin-empty. (cherry picked from commit e2fb9fd) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
## What changes were proposed in this pull request? The expression `in(empty seq)` is invalid in some data source. Since `in(empty seq)` is always false, we should generate `in(empty seq)` to false literal in optimizer. The sql `SELECT * FROM t WHERE a IN ()` throws a `ParseException` which is consistent with Hive, don't need to change that behavior. ## How was this patch tested? Add new test case in `OptimizeInSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes apache#15977 from jiangxb1987/isin-empty.
## What changes were proposed in this pull request? The expression `in(empty seq)` is invalid in some data source. Since `in(empty seq)` is always false, we should generate `in(empty seq)` to false literal in optimizer. The sql `SELECT * FROM t WHERE a IN ()` throws a `ParseException` which is consistent with Hive, don't need to change that behavior. ## How was this patch tested? Add new test case in `OptimizeInSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes apache#15977 from jiangxb1987/isin-empty.
What changes were proposed in this pull request?
The expression
in(empty seq)is invalid in some data source. Sincein(empty seq)is always false, we should generatein(empty seq)to false literal in optimizer.The sql
SELECT * FROM t WHERE a IN ()throws aParseExceptionwhich is consistent with Hive, don't need to change that behavior.How was this patch tested?
Add new test case in
OptimizeInSuite.