[SPARK-22413][SQL] Type coercion for IN is not coherent between Literals and subquery #19635

mgaido91 · 2017-11-01T20:50:03Z

What changes were proposed in this pull request?

Type coercion for IN is currently inconsistent between literal lists and subqueries. This PR changes the behavior for literals so that it matches both the subquery case and binary comparisons.

Before the patch, when IN is used with literals, findWiderCommonType determines the type to which the value expression and all the elements of the list are cast. This is inconsistent with the behavior of IN over a subquery, which uses findCommonTypeForBinaryComparison.

The PR changes IN type coercion with literals to match the coercion used with subqueries (which is also the one used in other places, such as simple comparisons).
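To illustrate why the choice of rule matters, here is a minimal sketch of the two strategies applied to an integer value and a string list element. It is a toy model, not Catalyst code: `DType`, `Lit`, `widerType`, `comparisonType`, `castTo` and `in` are all hypothetical names, and the two rule tables are assumptions reduced to the single Int/String case.

```scala
// Toy model only: DType, Lit and both rules are hypothetical stand-ins,
// not Spark's Catalyst classes or its real coercion tables.
object InCoercionSketch {
  sealed trait DType
  case object IntT extends DType
  case object StrT extends DType

  // A literal kept as raw text plus its declared type.
  final case class Lit(text: String, tpe: DType)

  // Assumed "wider common type" rule: Int mixed with String widens to String.
  def widerType(a: DType, b: DType): Option[DType] = (a, b) match {
    case (x, y) if x == y => Some(x)
    case (IntT, StrT) | (StrT, IntT) => Some(StrT)
    case _ => None
  }

  // Assumed "binary comparison" rule: the String side is cast to the other type.
  def comparisonType(a: DType, b: DType): Option[DType] = (a, b) match {
    case (StrT, other) if other != StrT => Some(other)
    case (other, StrT) if other != StrT => Some(other)
    case _ => None
  }

  // Cast a literal to the target type: "01" as Int is 1, 1 as String is "1".
  def castTo(l: Lit, target: DType): Any = target match {
    case IntT => l.text.toInt
    case StrT => if (l.tpe == IntT) l.text.toInt.toString else l.text
  }

  // Evaluate `value IN (list)` after casting both sides to `target`.
  def in(value: Lit, list: Seq[Lit], target: DType): Boolean =
    list.exists(e => castTo(e, target) == castTo(value, target))

  def main(args: Array[String]): Unit = {
    val one = Lit("1", IntT)
    val zeroOne = Lit("01", StrT)
    // Widening picks String, so "1" is compared with "01".
    println(widerType(one.tpe, zeroOne.tpe).map(in(one, Seq(zeroOne), _)))      // Some(false)
    // The comparison rule picks Int, so 1 is compared with 1.
    println(comparisonType(one.tpe, zeroOne.tpe).map(in(one, Seq(zeroOne), _))) // Some(true)
  }
}
```

Under these assumed rules, the same value and list element match or fail to match depending only on which rule picks the target type.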

How was this patch tested?

Added unit tests.


hvanhovell · 2017-11-01T20:54:23Z

@mgaido91 Can you update the PR description and describe exactly what you changed?

hvanhovell · 2017-11-01T20:54:31Z

ok to test

SparkQA · 2017-11-01T22:31:43Z

Test build #83304 has finished for PR 19635 at commit 8fb9c9d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2017-11-02T21:05:37Z

retest this please

SparkQA · 2017-11-03T00:01:50Z

Test build #83355 has finished for PR 19635 at commit 8481b59.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-11-04T07:23:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala


      case i @ In(a, b) if b.exists(_.dataType != a.dataType) =>
-        findWiderCommonType(i.children.map(_.dataType)) match {
+        findWiderCommonType(b.map(_.dataType)).flatMap(listDataType => {


nit: .flatMap { listDataType => , and remove the ) in line 457

mgaido91 · 2017-11-04T10:33:57Z

retest this please

SparkQA · 2017-11-04T12:35:58Z

Test build #83439 has finished for PR 19635 at commit c973bb7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2017-11-04T13:13:24Z

Does anyone have any idea why this failed?

java.io.IOException: Failed to delete: /home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-d50fe63e-d412-4620-bb0d-0ae3cfe5cc9d

I guess it is an infra error.

mgaido91 · 2017-11-04T13:13:31Z

retest this please

SparkQA · 2017-11-04T16:06:57Z

Test build #83445 has finished for PR 19635 at commit c973bb7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-11-04T23:10:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

-        findWiderCommonType(i.children.map(_.dataType)) match {
+        findWiderCommonType(b.map(_.dataType)).flatMap { listDataType =>
+          findCommonTypeForBinaryComparison(listDataType, a.dataType)
+            .orElse(findWiderTypeForTwo(listDataType, a.dataType))


What is the reason we need to call findWiderTypeForTwo?

Before this PR, we always called findWiderCommonType, applied to the value and all the elements of the list. Here, I call findWiderTypeForTwo when findCommonTypeForBinaryComparison fails, to keep the previous behavior in those cases.
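The shape of that fallback can be sketched in isolation. This is only an illustration of the `flatMap`/`orElse` chaining in the diff above: `comparisonRule`, `widerForTwo`, `widerCommon` and `inTargetType` are hypothetical stand-ins, types are plain strings, and the rule tables are made up rather than taken from Spark.

```scala
// Sketch only: the three rule functions are hypothetical stand-ins with
// made-up tables; only the flatMap/orElse shape mirrors the diff above.
object InFallbackSketch {
  type T = String // type names as plain strings, for illustration

  // Stand-in for a comparison rule that only knows string-vs-other pairs.
  def comparisonRule(a: T, b: T): Option[T] = (a, b) match {
    case ("string", o) if o != "string" => Some(o)
    case (o, "string") if o != "string" => Some(o)
    case _ => None
  }

  // Stand-in for pairwise widening along a tiny numeric ladder.
  private val ladder = Seq("int", "long", "double")
  def widerForTwo(a: T, b: T): Option[T] =
    if (a == b) Some(a)
    else if (ladder.contains(a) && ladder.contains(b))
      Some(ladder(math.max(ladder.indexOf(a), ladder.indexOf(b))))
    else None

  // Stand-in for widening over a whole list: fold the pairwise rule.
  def widerCommon(ts: Seq[T]): Option[T] =
    ts.headOption.flatMap { h =>
      ts.tail.foldLeft(Option(h))((acc, t) => acc.flatMap(widerForTwo(_, t)))
    }

  // Same shape as the patched rule: resolve the list type first, then
  // compare it with the value type, trying the comparison rule and
  // falling back to pairwise widening when it yields nothing.
  def inTargetType(value: T, list: Seq[T]): Option[T] =
    widerCommon(list).flatMap { listType =>
      comparisonRule(listType, value).orElse(widerForTwo(listType, value))
    }
}
```

With these assumed tables, `inTargetType("int", Seq("string"))` is resolved by the comparison rule, while `inTargetType("int", Seq("long", "double"))` only gets a result through the widening fallback.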

gatorsmile · 2017-11-04T23:13:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

        val commonTypes = lhs.zip(rhs).flatMap { case (l, r) =>
          findCommonTypeForBinaryComparison(l.dataType, r.dataType)
-            .orElse(findTightestCommonType(l.dataType, r.dataType))
+            .orElse(findWiderTypeForTwo(l.dataType, r.dataType))


Why do we make this change?

To be coherent with what is done when there are literals instead of a subquery.

gatorsmile · 2017-11-04T23:14:17Z

I am starting to worry about the behavior change introduced by this PR.

mgaido91 · 2017-11-05T09:10:04Z

@gatorsmile I see your point. And of course there is a behavioral change. For instance, select 1 in ('01') returned false before the PR and returns true after it.

Nonetheless, I think that the current behavior is not good at all. The problem is not only that IN works differently from = (which actually IS a problem, since there are places in the code like

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala

Line 150 in 572284c

case In(a: AttributeReference, list: Seq[Expression])

where IN is translated to a sequence of equality comparisons). The real issue is that IN behaves differently depending on whether it is used with a subquery or with a list of literals. For instance, please refer to the test I added in this PR. This is very bad, since people may use hardcoded literals for testing and a subquery in their real workload, and the behavior might change between these two scenarios. Currently, a query that works with literals can even throw an exception with a subquery, or the two can simply return different results.

Thus, although introducing a behavior change is generally something we would like to avoid, I do believe the current situation is too bad to leave as it is. I think this change minimizes the behavioral changes while making them coherent, but of course I am open to any kind of discussion about this.

gatorsmile · 2017-11-05T18:12:21Z

Since we are trying to introduce a new behavior, could you try the other systems and see how they behave in this scenario?

mgaido91 · 2017-11-05T18:18:05Z

yes, of course. Which ones should I try? Hive and Oracle?

mgaido91 · 2017-11-05T18:34:55Z

Oracle behaves like Spark after the patch:

select 'a' from dual where 1 in ('01');
// returns 'a'
select 'a' from dual where 1 in (select '01' from dual);
// returns 'a'

mgaido91 · 2017-11-06T10:29:43Z

Hive is interesting. In older versions, it behaves like current Spark. But on its current master branch it behaves like Spark after the patch:

0: jdbc:hive2://localhost:10000> select 'a' where 1 in ('01');
INFO  : Compiling command(queryId=root_20171106045740_254a2d60-ae1f-4851-b304-dfa18551fff2): select 'a' where 1 in ('01')
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_20171106045740_254a2d60-ae1f-4851-b304-dfa18551fff2); Time taken: 5.794 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20171106045740_254a2d60-ae1f-4851-b304-dfa18551fff2): select 'a' where 1 in ('01')
INFO  : Completed executing command(queryId=root_20171106045740_254a2d60-ae1f-4851-b304-dfa18551fff2); Time taken: 0.008 seconds
INFO  : OK
+------+
| _c0  |
+------+
| a    |
+------+
1 row selected (6.321 seconds)
0: jdbc:hive2://localhost:10000> select 'a' where 1 in (select '01' from (select 1) dual);
INFO  : Compiling command(queryId=root_20171106045757_48e04001-bfbd-4557-9dd5-4e97674708ff): select 'a' where 1 in (select '01' from (select 1) dual)
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_20171106045757_48e04001-bfbd-4557-9dd5-4e97674708ff); Time taken: 0.869 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20171106045757_48e04001-bfbd-4557-9dd5-4e97674708ff): select 'a' where 1 in (select '01' from (select 1) dual)
WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
INFO  : Query ID = root_20171106045757_48e04001-bfbd-4557-9dd5-4e97674708ff
INFO  : Total jobs = 1
INFO  : Starting task [Stage-4:MAPREDLOCAL] in serial mode
INFO  : Execution completed successfully
INFO  : MapredLocal task succeeded
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-3:MAPRED] in serial mode
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1509962180830_0001
INFO  : The url to track the job: http://6edb04432864:8088/proxy/application_1509962180830_0001/
INFO  : Starting Job = job_1509962180830_0001, Tracking URL = http://6edb04432864:8088/proxy/application_1509962180830_0001/
INFO  : Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1509962180830_0001
INFO  : Hadoop job information for Stage-3: number of mappers: 0; number of reducers: 0
INFO  : 2017-11-06 04:58:27,891 Stage-3 map = 0%,  reduce = 0%
INFO  : 2017-11-06 04:58:35,774 Stage-3 map = 100%,  reduce = 0%
INFO  : Ended Job = job_1509962180830_0001
INFO  : MapReduce Jobs Launched:
INFO  : Stage-Stage-3:  HDFS Read: 0 HDFS Write: 0 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 0 msec
INFO  : Completed executing command(queryId=root_20171106045757_48e04001-bfbd-4557-9dd5-4e97674708ff); Time taken: 37.367 seconds
INFO  : OK
+------+
| _c0  |
+------+
| a    |
+------+
1 row selected (38.501 seconds)

It looks like it has been fixed but I am not sure about the relevant JIRA: maybe it is HIVE-13957.

@gatorsmile should I check other databases?

mgaido91 · 2017-11-14T15:33:12Z

@gatorsmile any further comment on this?

mgaido91 · 2017-11-27T21:21:29Z

kindly ping @gatorsmile

SparkQA · 2018-01-13T21:25:13Z

Test build #86106 has finished for PR 19635 at commit 2db7cf0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-01-13T21:39:02Z

The failing tests are the ones that were added to pin down Spark's current behavior (they were introduced in anticipation of a Hive-compliant mode).

Therefore, before fixing the tests, I'd like to know which direction the committers have in mind for this: should it be introduced as part of the Hive compliance mode or not? My opinion is that this should also be fixed for Spark's native mode, because the current behavior is inconsistent, which is very confusing for users and, I think, evidently wrong.

any thoughts @gatorsmile @hvanhovell ?
Thanks.

mgaido91 · 2018-07-16T10:48:25Z

@gatorsmile @hvanhovell As there were objections to merging this, before resolving the conflicts I wanted to check what you think about fixing this behavior. Thanks.

wangyum · 2018-07-26T07:37:41Z

The SQL below throws an AnalysisException, but it succeeds on Spark 2.1.x. I hope this can be fixed soon.

CREATE TEMPORARY VIEW t4 AS SELECT * FROM VALUES
  (CAST(1 AS DOUBLE), CAST(2 AS STRING), CAST(3 AS STRING))
AS t1(t4a, t4b, t4c);

CREATE TEMPORARY VIEW t5 AS SELECT * FROM VALUES
  (CAST(1 AS DECIMAL(18, 0)), CAST(2 AS STRING), CAST(3 AS BIGINT))
AS t1(t5a, t5b, t5c);

SELECT * FROM t4
WHERE
(t4a, t4b, t4c) IN (SELECT t5a, t5b, t5c FROM t5);

cc @yucai @seancxmao

mgaido91 · 2018-08-06T11:32:23Z

kindly ping @gatorsmile @hvanhovell

HyukjinKwon · 2019-12-04T02:27:23Z

@mgaido91 can you update this PR when you're available? I think it might be fine given that we're now going ahead for Spark 3

mgaido91 · 2019-12-04T10:16:49Z

sure, I will ASAP @HyukjinKwon , thanks.

SparkQA · 2019-12-04T20:50:18Z

Test build #114879 has finished for PR 19635 at commit a8723c2.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-04T23:04:24Z

Test build #114881 has finished for PR 19635 at commit 7656864.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

github-actions · 2020-03-14T00:14:12Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

[SPARK-22413][SQL] Type coercion for IN is not coherent between Liter…

8fb9c9d

…als and subquery

use widest type instead of tightest

8481b59

wzhfy reviewed Nov 4, 2017


style fix

c973bb7

gatorsmile reviewed Nov 4, 2017


Merge branch 'master' into SPARK-22413

2db7cf0

mgaido91 mentioned this pull request Jul 25, 2018

[SPARK-24916][SQL] Fix type coercion for IN expression with subquery #21871

Closed

mgaido91 mentioned this pull request Nov 16, 2018

[SPARK-26070][SQL] add rule for implicit type coercion for decimal(x,0) #23042

Closed

dongjoon-hyun added the SQL label Jun 14, 2019

mgaido91 mentioned this pull request Nov 14, 2019

[SPARK-29860][SQL] Fix dataType mismatch issue for InSubquery. #26485

Closed

Merge branch 'master' into SPARK-22413

a8723c2

fix build

7656864

github-actions bot added the Stale label Mar 14, 2020

github-actions bot closed this Mar 15, 2020
