@mgaido91
Contributor

@mgaido91 mgaido91 commented Nov 1, 2017

What changes were proposed in this pull request?

Currently, type coercion for IN is inconsistent between a list of literals and a subquery. This PR changes the behavior for the literal case so that it is consistent with the subquery case and with binary comparisons.

Before the patch, when IN is used with literals, we use findWiderCommonType to determine the type to which all the elements of the list and the value attribute of the In operator are cast. This is inconsistent with how In behaves with a subquery, where we use findCommonTypeForBinaryComparison.

The PR changes the type coercion of In with literals to make it consistent with the one used for subqueries (which is also the one used in other places, such as simple comparisons).
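
To make the difference concrete, the two strategies can be sketched on a toy type system. This is only an illustrative model, not Spark's implementation: `InCoercionSketch`, the reduced `DataType` hierarchy, and the simplified bodies of `widerCommonType` and `commonTypeForBinaryComparison` are assumptions standing in for `findWiderCommonType` and `findCommonTypeForBinaryComparison`.

```scala
object InCoercionSketch {
  // Hypothetical, heavily reduced type system (not Spark's DataType).
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType

  // Toy stand-in for findWiderCommonType: a string anywhere in the input
  // promotes the common type to string.
  def widerCommonType(types: Seq[DataType]): Option[DataType] =
    if (types.isEmpty) None
    else if (types.contains(StringType)) Some(StringType)
    else Some(IntType)

  // Toy stand-in for findCommonTypeForBinaryComparison: comparing a string
  // with a non-string picks the non-string type; otherwise no rule applies.
  def commonTypeForBinaryComparison(l: DataType, r: DataType): Option[DataType] =
    (l, r) match {
      case (StringType, other) if other != StringType => Some(other)
      case (other, StringType) if other != StringType => Some(other)
      case _ => None
    }

  // Before the patch (literals): one wider type over the value and the list.
  def inTypeBefore(value: DataType, list: Seq[DataType]): Option[DataType] =
    widerCommonType(value +: list)

  // After the patch (literals, fallback omitted): find the common type of
  // the list, then treat value-vs-list as a binary comparison.
  def inTypeAfter(value: DataType, list: Seq[DataType]): Option[DataType] =
    widerCommonType(list).flatMap(commonTypeForBinaryComparison(_, value))
}
```

In this model, an integer value with a list of string literals resolves to `StringType` under `inTypeBefore` and to `IntType` under `inTypeAfter`, which is the kind of divergence the PR is about.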

How was this patch tested?

Added unit tests.

@hvanhovell
Contributor

@mgaido91 Can you update the PR and describe there exactly what you changed?

@hvanhovell
Contributor

ok to test

@SparkQA

SparkQA commented Nov 1, 2017

Test build #83304 has finished for PR 19635 at commit 8fb9c9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

mgaido91 commented Nov 2, 2017

retest this please

@SparkQA

SparkQA commented Nov 3, 2017

Test build #83355 has finished for PR 19635 at commit 8481b59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


case i @ In(a, b) if b.exists(_.dataType != a.dataType) =>
findWiderCommonType(i.children.map(_.dataType)) match {
findWiderCommonType(b.map(_.dataType)).flatMap(listDataType => {
Contributor

nit: .flatMap { listDataType => , and remove the ) in line 457

@mgaido91
Contributor Author

mgaido91 commented Nov 4, 2017

retest this please

@SparkQA

SparkQA commented Nov 4, 2017

Test build #83439 has finished for PR 19635 at commit c973bb7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

mgaido91 commented Nov 4, 2017

Does anyone have any idea of the reason for the failure?

java.io.IOException: Failed to delete: /home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-d50fe63e-d412-4620-bb0d-0ae3cfe5cc9d

I guess it is an infra error.

@mgaido91
Contributor Author

mgaido91 commented Nov 4, 2017

retest this please

@SparkQA

SparkQA commented Nov 4, 2017

Test build #83445 has finished for PR 19635 at commit c973bb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

findWiderCommonType(i.children.map(_.dataType)) match {
findWiderCommonType(b.map(_.dataType)).flatMap { listDataType =>
findCommonTypeForBinaryComparison(listDataType, a.dataType)
.orElse(findWiderTypeForTwo(listDataType, a.dataType))
Member

What is the reason we need to call findWiderTypeForTwo?

Contributor Author

Before this PR, we always called findWiderCommonType, and it was applied to all the elements of the list together with the value. Here, I call findWiderTypeForTwo when findCommonTypeForBinaryComparison fails, in order to keep the previous behavior in those cases.
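
The fallback described here relies on `Option.orElse`, which evaluates its argument only when the first result is `None`. A minimal sketch of that chaining follows, assuming a toy type system: `FallbackSketch`, `comparisonType`, `widerTypeForTwo`, and `resolve` are hypothetical names with simplified rules, not Spark's actual functions.

```scala
object FallbackSketch {
  // Hypothetical, reduced type system used only for illustration.
  sealed trait DataType
  case object IntType extends DataType
  case object LongType extends DataType
  case object StringType extends DataType

  // Toy comparison rule: only string-vs-non-string pairs are handled,
  // and they resolve to the non-string type.
  def comparisonType(l: DataType, r: DataType): Option[DataType] =
    (l, r) match {
      case (StringType, other) if other != StringType => Some(other)
      case (other, StringType) if other != StringType => Some(other)
      case _ => None
    }

  // Toy widening rule: identical types stay as they are, int and long widen
  // to long, and any remaining pair (which involves a string) becomes string.
  def widerTypeForTwo(l: DataType, r: DataType): Option[DataType] =
    (l, r) match {
      case (a, b) if a == b => Some(a)
      case (IntType, LongType) | (LongType, IntType) => Some(LongType)
      case _ => Some(StringType)
    }

  // The comparison rule wins when it applies; the widening rule is only
  // consulted when the comparison rule returns None.
  def resolve(listType: DataType, valueType: DataType): Option[DataType] =
    comparisonType(listType, valueType)
      .orElse(widerTypeForTwo(listType, valueType))
}
```

With these toy rules, `resolve(StringType, IntType)` is decided by the comparison rule, while `resolve(IntType, LongType)` falls through to the widening rule.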

val commonTypes = lhs.zip(rhs).flatMap { case (l, r) =>
findCommonTypeForBinaryComparison(l.dataType, r.dataType)
.orElse(findTightestCommonType(l.dataType, r.dataType))
.orElse(findWiderTypeForTwo(l.dataType, r.dataType))
Member

Why do we make this change?

Contributor Author

To be coherent with what is done when there are literals instead of a subquery.

@gatorsmile
Member

I am starting to worry about the behavior change introduced by this PR.

@mgaido91
Contributor Author

mgaido91 commented Nov 5, 2017

@gatorsmile I see your point. And of course there is a behavioral change. For instance, select 1 in ('01') returned false before the PR and returns true after it.

Nonetheless, I think that the current behavior is not good at all. The problem is not only that IN works differently from = (which actually IS a problem, since there are places in the code where IN is translated to a sequence of equality comparisons). The real issue is that IN behaves differently depending on whether it is used with a subquery or with a list of literals. For instance, please refer to the test I added in this PR. This is very bad, since people may be using hardcoded literals for testing and a subquery in their real workload, and the behavior might change between these two scenarios. Currently, what works with literals sometimes even throws an exception with subqueries, or the two simply return different results.

Thus, although introducing a behavior change is generally something we would like to avoid, I do believe the current situation is too bad to leave as it is. I think this is the change that minimizes the behavioral changes while making them coherent, but of course I am open to any kind of discussion about this.
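
The `select 1 in ('01')` example can be modeled at the value level. The sketch below is a hedged illustration, not Spark code: `InVersusEqualsSketch` and its three functions are hypothetical, SQL NULL semantics are not modeled, and a string that cannot be parsed as an integer is simply treated as a non-match.

```scala
object InVersusEqualsSketch {
  // Toy model of IN with literals before the patch: the integer value is
  // promoted to a string and compared as text.
  def inAsStrings(value: Int, list: Seq[String]): Boolean =
    list.contains(value.toString)

  // Toy model of a binary comparison: the string side is cast to an integer
  // and the comparison is numeric (unparseable strings never match here).
  def equalsAsInts(value: Int, s: String): Boolean =
    scala.util.Try(s.trim.toInt).toOption.contains(value)

  // IN rewritten as a sequence of equality comparisons joined by OR.
  def inAsEqualities(value: Int, list: Seq[String]): Boolean =
    list.exists(equalsAsInts(value, _))
}
```

In this model, `inAsStrings(1, Seq("01"))` and `inAsEqualities(1, Seq("01"))` disagree, which mirrors why a rewrite of IN into equality comparisons is only safe when both use the same coercion rule.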

@gatorsmile
Member

Since we are trying to introduce a new behavior, could you try the other systems and see how they behave in this scenario?

@mgaido91
Contributor Author

mgaido91 commented Nov 5, 2017

yes, of course. Which ones should I try? Hive and Oracle?

@mgaido91
Contributor Author

mgaido91 commented Nov 5, 2017

Oracle behaves like Spark after the patch:

select 'a' from dual where 1 in ('01');
// returns 'a'
select 'a' from dual where 1 in (select '01' from dual);
// returns 'a'

@mgaido91
Contributor Author

mgaido91 commented Nov 6, 2017

Hive is interesting. In older versions, it behaves like current Spark, but on its current master branch it behaves like Spark after the patch:

0: jdbc:hive2://localhost:10000> select 'a' where 1 in ('01');
INFO  : Compiling command(queryId=root_20171106045740_254a2d60-ae1f-4851-b304-dfa18551fff2): select 'a' where 1 in ('01')
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_20171106045740_254a2d60-ae1f-4851-b304-dfa18551fff2); Time taken: 5.794 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20171106045740_254a2d60-ae1f-4851-b304-dfa18551fff2): select 'a' where 1 in ('01')
INFO  : Completed executing command(queryId=root_20171106045740_254a2d60-ae1f-4851-b304-dfa18551fff2); Time taken: 0.008 seconds
INFO  : OK
+------+
| _c0  |
+------+
| a    |
+------+
1 row selected (6.321 seconds)
0: jdbc:hive2://localhost:10000> select 'a' where 1 in (select '01' from (select 1) dual);
INFO  : Compiling command(queryId=root_20171106045757_48e04001-bfbd-4557-9dd5-4e97674708ff): select 'a' where 1 in (select '01' from (select 1) dual)
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=root_20171106045757_48e04001-bfbd-4557-9dd5-4e97674708ff); Time taken: 0.869 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=root_20171106045757_48e04001-bfbd-4557-9dd5-4e97674708ff): select 'a' where 1 in (select '01' from (select 1) dual)
WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
INFO  : Query ID = root_20171106045757_48e04001-bfbd-4557-9dd5-4e97674708ff
INFO  : Total jobs = 1
INFO  : Starting task [Stage-4:MAPREDLOCAL] in serial mode
INFO  : Execution completed successfully
INFO  : MapredLocal task succeeded
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-3:MAPRED] in serial mode
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1509962180830_0001
INFO  : The url to track the job: http://6edb04432864:8088/proxy/application_1509962180830_0001/
INFO  : Starting Job = job_1509962180830_0001, Tracking URL = http://6edb04432864:8088/proxy/application_1509962180830_0001/
INFO  : Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1509962180830_0001
INFO  : Hadoop job information for Stage-3: number of mappers: 0; number of reducers: 0
INFO  : 2017-11-06 04:58:27,891 Stage-3 map = 0%,  reduce = 0%
INFO  : 2017-11-06 04:58:35,774 Stage-3 map = 100%,  reduce = 0%
INFO  : Ended Job = job_1509962180830_0001
INFO  : MapReduce Jobs Launched:
INFO  : Stage-Stage-3:  HDFS Read: 0 HDFS Write: 0 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 0 msec
INFO  : Completed executing command(queryId=root_20171106045757_48e04001-bfbd-4557-9dd5-4e97674708ff); Time taken: 37.367 seconds
INFO  : OK
+------+
| _c0  |
+------+
| a    |
+------+
1 row selected (38.501 seconds)

It looks like it has been fixed but I am not sure about the relevant JIRA: maybe it is HIVE-13957.

@gatorsmile should I check other databases?

@mgaido91
Contributor Author

@gatorsmile any further comment on this?

@mgaido91
Contributor Author

kindly ping @gatorsmile

@SparkQA

SparkQA commented Jan 13, 2018

Test build #86106 has finished for PR 19635 at commit 2db7cf0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

The failing tests are the ones that were added to document the current behavior of Spark (they were introduced in anticipation of a Hive-compliant mode).

Therefore, before fixing the tests, I'd like to know which direction the committers are considering for this: whether or not we should introduce it as part of the Hive compliance mode. My opinion is that this should also be fixed for Spark native mode, because it causes inconsistent behavior, which is very confusing for a user and, I think, evidently wrong.

Any thoughts, @gatorsmile @hvanhovell?
Thanks.

@mgaido91
Contributor Author

@gatorsmile @hvanhovell As there were objections about whether to have this in or not, before resolving the conflicts I just wanted to check: what do you think about fixing this behavior? Thanks.

@wangyum
Member

wangyum commented Jul 26, 2018

The SQL below throws an AnalysisException, but it succeeds on Spark 2.1.x. I hope this can be fixed soon.

CREATE TEMPORARY VIEW t4 AS SELECT * FROM VALUES
  (CAST(1 AS DOUBLE), CAST(2 AS STRING), CAST(3 AS STRING))
AS t1(t4a, t4b, t4c);

CREATE TEMPORARY VIEW t5 AS SELECT * FROM VALUES
  (CAST(1 AS DECIMAL(18, 0)), CAST(2 AS STRING), CAST(3 AS BIGINT))
AS t1(t5a, t5b, t5c);

SELECT * FROM t4
WHERE
(t4a, t4b, t4c) IN (SELECT t5a, t5b, t5c FROM t5);

cc @yucai @seancxmao

@mgaido91
Contributor Author

mgaido91 commented Aug 6, 2018

kindly ping @gatorsmile @hvanhovell

@HyukjinKwon
Member

@mgaido91 can you update this PR when you're available? I think it might be fine given that we're now going ahead for Spark 3

@mgaido91
Contributor Author

mgaido91 commented Dec 4, 2019

sure, I will ASAP @HyukjinKwon , thanks.

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114879 has finished for PR 19635 at commit a8723c2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114881 has finished for PR 19635 at commit 7656864.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 14, 2020
@github-actions github-actions bot closed this Mar 15, 2020