-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-22413][SQL] Type coercion for IN is not coherent between Literals and subquery #19635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@mgaido91 Can you update the PR and describe there what you exactly changed? |
|
ok to test |
|
Test build #83304 has finished for PR 19635 at commit
|
|
retest this please |
|
Test build #83355 has finished for PR 19635 at commit
|
|
|
||
| case i @ In(a, b) if b.exists(_.dataType != a.dataType) => | ||
| findWiderCommonType(i.children.map(_.dataType)) match { | ||
| findWiderCommonType(b.map(_.dataType)).flatMap(listDataType => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: .flatMap { listDataType => , and remove the ) in line 457
|
retest this please |
|
Test build #83439 has finished for PR 19635 at commit
|
|
has anyone any idea of the reason of the failure? I guess it is an infra error. |
|
retest this please |
|
Test build #83445 has finished for PR 19635 at commit
|
| findWiderCommonType(i.children.map(_.dataType)) match { | ||
| findWiderCommonType(b.map(_.dataType)).flatMap { listDataType => | ||
| findCommonTypeForBinaryComparison(listDataType, a.dataType) | ||
| .orElse(findWiderTypeForTwo(listDataType, a.dataType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason we need to call findWiderTypeForTwo ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before this PR, we were calling always findWiderCommonType and this was applied to all the elements of the list and the value. Here, I am calling findWiderTypeForTwo if findCommonTypeForBinaryComparison fails to have the same previous behavior in those cases.
| val commonTypes = lhs.zip(rhs).flatMap { case (l, r) => | ||
| findCommonTypeForBinaryComparison(l.dataType, r.dataType) | ||
| .orElse(findTightestCommonType(l.dataType, r.dataType)) | ||
| .orElse(findWiderTypeForTwo(l.dataType, r.dataType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we make this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be coherent with what is done when there are literals instead of a subquery.
|
I start worrying about the behavior change introduced by this PR. |
|
@gatorsmile I see you point. And of course there is a behavioral change. For instance, Nonetheless, I think that the current behavior is not good at all. Indeed, the problem is not only that spark/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala Line 150 in 572284c
IN is translated to a sequence of equality comparisons). But the real issue is that IN behaves differently whether it is used with a subquery or with a list of literals. For instance, please refer to the test I added in this PR. This is very bad, since maybe people are using hardcoded literals for testing and a subquery in their real workload and the behavior might change between these two scenarios. Sometimes, currently, what is working with literals is even throwing an exception with subqueries. Or they are simply returning different results.
Thus, I do believe that despite introducing a behavior change is generally something we would like to avoid, here the current situation is too bad to let it as it is. And I think this is the change which minimizes the behavioral changes making them coherent, but of course I am open to any kind of discussion about this. |
|
Since we are trying to introduce a new behavior, could you try the other systems and see how they behave in this scenario? |
|
yes, of course. Which ones should I try? Hive and Oracle? |
|
Oracle behaves like Spark after the patch: |
|
Hive is interesting. In older versions, it behaves like current Spark. But in its current master branch the behavior is like after the patch: It looks like it has been fixed but I am not sure about the relevant JIRA: maybe it is HIVE-13957. @gatorsmile should I check other databases? |
|
@gatorsmile any further comment on this? |
|
kindly ping @gatorsmile |
|
Test build #86106 has finished for PR 19635 at commit
|
|
the tests which are failing are because of the tests added about the current behavior of Spark (which are introduced looking forward to introduce a Hive compliant mode). Therefore, before fixing the tests, I'd like to know which is the directions the committers are thinking about for this. If we should introduce as part of the Hive compliance mode or not. My opinion is that this should be fixed also for Spark native mode because this is causing an inconsistent behavior which is something very confusing for a user and I think is evident to be wrong. any thoughts @gatorsmile @hvanhovell ? |
|
@gatorsmile @hvanhovell As there were objections on having this in or not, before resolving the conflict I just wanted to check what do you think about fixing this behavior. What do you think? Thanks. |
|
The below SQL will throw CREATE TEMPORARY VIEW t4 AS SELECT * FROM VALUES
(CAST(1 AS DOUBLE), CAST(2 AS STRING), CAST(3 AS STRING))
AS t1(t4a, t4b, t4c);
CREATE TEMPORARY VIEW t5 AS SELECT * FROM VALUES
(CAST(1 AS DECIMAL(18, 0)), CAST(2 AS STRING), CAST(3 AS BIGINT))
AS t1(t5a, t5b, t5c);
SELECT * FROM t4
WHERE
(t4a, t4b, t4c) IN (SELECT t5a, t5b, t5c FROM t5);cc @yucai @seancxmao |
|
kindy ping @gatorsmile @hvanhovell |
|
@mgaido91 can you update this PR when you're available? I think it might be fine given that we're now going ahead for Spark 3 |
|
sure, I will ASAP @HyukjinKwon , thanks. |
|
Test build #114879 has finished for PR 19635 at commit
|
|
Test build #114881 has finished for PR 19635 at commit
|
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Now, type coercion for IN is not coherent between Literals and subquery. This PR changes the behavior for the case with literals and makes it coherent with the case of the subquery and also with the binary comparisons.
Before the patch, when IN is used with literals, we are using
findWiderCommonTypeto determine the type to cast all the elements in the list and the value attribute of the In operator. This is not consistent with the behavior In has when there is a subquery, where we are usingfindCommonTypeForBinaryComparison.The PR changes In type coercion with Literals to make it coherent to the one with subqueries (which is also the one used in other places, like simple comparisons).
How was this patch tested?
Added UT