[SPARK-24012][SQL] Union of map and other compatible column #21100
Conversation
ok to test
Test build #89593 has finished for PR 21100 at commit
""" | ||
|SELECT map(1, 2), 'str' | ||
|UNION ALL | ||
|SELECT map(1, 2, 3, NULL), 1""".stripMargin), |
can you give some insight into why it doesn't work? I'd expect Spark to first do type coercion for `map(1, 2, 3, NULL)`, making the result `map<int, nullable int>`; then Union should accept the nullability difference and pass analysis.
`map<int, nullable int>` and `map<int, not nullable int>` are accepted by Union, but `string` and `int` are not.
If the types of even one column cannot be accepted by Union, `TypeCoercion.WidenSetOperationTypes` (TCWSOT) tries to coerce them to a completely identical type. TCWSOT only applies when every column can be coerced; if any column cannot be, it does nothing.
`map<int, nullable int>` and `map<int, not nullable int>` cannot be coerced, so TCWSOT did not run, and therefore `string` and `int` were not coerced either.
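To make the all-or-nothing behavior concrete, here is a toy Python model of the idea (not Spark's actual Scala code; the function names and the simplified type rules are invented for illustration). The point is that one incompatible column pair blocks coercion of every column:

```python
def widen(t1, t2):
    """Toy stand-in for findWiderTypeForTwo: returns a common type or None."""
    if t1 == t2:
        return t1
    numeric = ["int", "long", "double"]
    if t1 in numeric and t2 in numeric:
        # widen to the wider numeric type
        return numeric[max(numeric.index(t1), numeric.index(t2))]
    if t1 == "string" or t2 == "string":
        # toy promotion rule: anything vs string widens to string
        return "string"
    # e.g. two map types: before the fix, Spark gives up here
    return None

def widen_set_operation_types(left, right):
    """Toy stand-in for WidenSetOperationTypes: all columns or none."""
    widened = [widen(a, b) for a, b in zip(left, right)]
    if any(w is None for w in widened):
        return None  # one incompatible column blocks coercion of ALL columns
    return widened

# Two map types that differ only in value nullability have no common type
# in this pre-fix model, so even the coercible string/int pair is left alone:
left = [("map", "int", "int?"), "string"]
right = [("map", "int", "int"), "int"]
print(widen_set_operation_types(left, right))  # None: nothing gets coerced
```

This mirrors why the query in the PR fails: the map columns block TCWSOT entirely, so the `'str'` vs `1` pair is never widened.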
Shall we make `map<int, nullable int>` and `map<int, not nullable int>` coercible?
Of course we can.
Two solutions:
- Cast the two map types to one even when the key types or the value types differ. Then `select map(1, 2) union all select map(1, 'str')` would work.
- Cast the two map types to one only when the key types are the same and the value types are the same. This only solves the problem that `map<t1, nullable t2>` and `map<t1, not nullable t2>` can't be unioned.

Hive doesn't support `select map(1, 2) union all select map(1, 'str')`; should Spark be compatible with Hive?
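The two options above can be contrasted with a small Python sketch (assumed semantics, not Spark code; the names and the single toy promotion rule are invented). Option 1 widens key/value types recursively; option 2 only accepts maps whose key/value types already match:

```python
def widen_primitive(t1, t2):
    """Toy element-level widening: identical types, plus int/string -> string."""
    if t1 == t2:
        return t1
    if {t1, t2} == {"int", "string"}:
        return "string"  # toy promotion rule for illustration
    return None

def widen_map_option1(m1, m2):
    """Option 1: widen keys and values independently; m = (key_type, value_type)."""
    k = widen_primitive(m1[0], m2[0])
    v = widen_primitive(m1[1], m2[1])
    return (k, v) if k is not None and v is not None else None

def widen_map_option2(m1, m2):
    """Option 2: only identical key/value types are compatible."""
    return m1 if m1 == m2 else None

# map<int,int> vs map<int,string>:
print(widen_map_option1(("int", "int"), ("int", "string")))  # ('int', 'string')
print(widen_map_option2(("int", "int"), ("int", "string")))  # None
```

The PR ultimately takes the conservative option 2 (plus merging nullability), matching Hive's behavior of rejecting `map(1, 2)` unioned with `map(1, 'str')`.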
…ot nullable type2> coerce-able
@@ -171,6 +171,15 @@ object TypeCoercion {
      .orElse((t1, t2) match {
        case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) =>
          findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || containsNull2))
        case (MapType(keyType1, valueType1, n1), MapType(keyType2, valueType2, n2))
We have similar logic for struct type in `findTightestCommonType`; I think we should also handle array and map types there.
Hi, I implemented this logic in `findTightestCommonType`; looking forward to further review.
Test build #89626 has finished for PR 21100 at commit
…ndTightestCommonType
@@ -111,6 +111,18 @@ object TypeCoercion {
          val dataType = findTightestCommonType(f1.dataType, f2.dataType).get
          StructField(f1.name, dataType, nullable = f1.nullable || f2.nullable)
        }))
      case (a1 @ ArrayType(et1, containsNull1), a2 @ ArrayType(et2, containsNull2))
we can shorten the names here: `hasNull1`, `hasNull2`
      case (a1 @ ArrayType(et1, containsNull1), a2 @ ArrayType(et2, containsNull2))
          if a1.sameType(a2) =>
        findTightestCommonType(et1, et2).map(ArrayType(_, containsNull1 || containsNull2))
      case (m1 @ MapType(keyType1, valueType1, n1), m2 @ MapType(keyType2, valueType2, n2))
ditto: `kt1`, `vt1`, `hasNull1`
          if m1.sameType(m2) =>
        val keyType = findTightestCommonType(keyType1, keyType2)
        val valueType = findTightestCommonType(valueType1, valueType2)
        if (keyType.isEmpty || valueType.isEmpty) {
We don't need this; it's guaranteed by `m1.sameType(m2)`.
Test build #89696 has finished for PR 21100 at commit
Test build #89700 has finished for PR 21100 at commit
retest this please
@@ -111,6 +111,14 @@ object TypeCoercion {
          val dataType = findTightestCommonType(f1.dataType, f2.dataType).get
          StructField(f1.name, dataType, nullable = f1.nullable || f2.nullable)
        }))
      case (a1 @ ArrayType(et1, hasNull1), a2 @ ArrayType(et2, hasNull2))
          if a1.sameType(a2) =>
after shortening the names, can we merge the `if` into the `case ...` line?
also, we need a blank line between these cases
@@ -35,6 +35,11 @@ FROM (SELECT col AS col
SELECT col
FROM p3) T1) T2;

-- SPARK-24012 Union of map and other compatible columns.
SELECT map(1, 2), 'str'
shall we also add a test for array?
LGTM
LGTM too
Test build #89704 has finished for PR 21100 at commit
@@ -896,6 +896,25 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
  }
}

test("SPARK-24012 Union of map and other compatible columns") {
cc @gatorsmile, what's the policy for end-to-end tests? Shall we add it in both the SQL golden file and `SQLQuerySuite`?
Yes, please add them to `SQLQueryTestSuite`.
Discussed with @gatorsmile: we should put end-to-end tests in a single place, and currently we encourage people to put SQL-related end-to-end tests in the SQL golden files. That is to say, we should remove this test from `SQLQuerySuite`.
In the meanwhile, a bug fix should also have a unit test. For this case, we should add a test case in `TypeCoercionSuite`. @liutang123, if you are not familiar with that test suite, please let us know; we can merge your PR first and add the UT in `TypeCoercionSuite` in a followup.
@cloud-fan, yes, I am not familiar with `TypeCoercionSuite`. To save time, I think this PR can be merged first. Thanks a lot.
OK, please remove this test and it's ready to go.
Test build #89755 has finished for PR 21100 at commit
Test build #89757 has finished for PR 21100 at commit
Test build #89774 has finished for PR 21100 at commit
      case (m1 @ MapType(kt1, vt1, hasNull1), m2 @ MapType(kt2, vt2, hasNull2)) if m1.sameType(m2) =>
        val keyType = findTightestCommonType(kt1, kt2)
        val valueType = findTightestCommonType(vt1, vt2)
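The behavior this case implements can be sketched in a few lines of Python (a hypothetical model for illustration, not the Spark source; `MapType` here is a toy dataclass, not `org.apache.spark.sql.types.MapType`). When two map types match ignoring nullability, the common type keeps the shared key/value types and ORs the `valueContainsNull` flags:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MapType:
    key: str
    value: str
    value_contains_null: bool

def same_type(m1, m2):
    # models DataType.sameType: compare types, ignore nullability flags
    return m1.key == m2.key and m1.value == m2.value

def tightest_common_map(m1, m2):
    """Toy version of the MapType case: merge nullability, never widen types."""
    if same_type(m1, m2):
        return MapType(m1.key, m1.value,
                       m1.value_contains_null or m2.value_contains_null)
    return None  # differing key/value types are still rejected

# map<int, nullable int> vs map<int, not nullable int>:
a = MapType("int", "int", True)
b = MapType("int", "int", False)
print(tightest_common_map(a, b))  # MapType('int', 'int', True)
```

With this in place, the map columns in the PR's query get a common type, so `WidenSetOperationTypes` can proceed to coerce the remaining `string`/`int` pair.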
BTW, I think we should do the same thing in `findWiderTypeForTwo` to cover some corner cases such as decimal or string promotion within keys and values. It seems #21100 (comment) suggested the same thing?
This is something we should figure out: why does `findWiderTypeForTwo` only take care of array type? It seems all complex types should be handled there, especially if it follows Hive's behavior.
Anyway, it's orthogonal to `findTightestCommonType`; they are used by different operators.
Yea, I was just wondering while reading it. However, doesn't that mean we don't do type widening for nested types in the same way? I was thinking we should do the same type widening for nested types too.
I mean, I was thinking we should do that in both places, `findTightestCommonType` and `findWiderTypeForTwo`. Otherwise, the nested types in struct, map, or array won't get, for example, decimal or string promotion.
Sure, it's orthogonal. Yup, I was just wondering. I am okay to leave this out of this PR.
Oops, I misread your comment. Sorry. I was talking about the same thing.
We also need to look into `findTightestCommonType`. Currently we are very conservative and only allow nullability changes for complex types there. We should take a look at other systems and see what they do.
I agree with that, given past discussions. I didn't mean we should change something now; I was just wondering.
## What changes were proposed in this pull request?

A followup of apache#21100.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21154 from cloud-fan/test.
## What changes were proposed in this pull request?

Union of map and other compatible columns results in an `unresolved operator 'Union;` exception.

Reproduction:
`spark-sql> select map(1,2), 'str' union all select map(1,2,3,null), 1`

Output:
```
Error in query: unresolved operator 'Union;;
'Union
:- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
:  +- OneRowRelation$
+- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
   +- OneRowRelation$
```

So, we should cast some of the columns to be compatible when appropriate.

## How was this patch tested?

Added a test (query union of map and other columns) to SQLQueryTestSuite's union.sql.

Author: liutang123 <liutang123@yeah.net>

Closes apache#21100 from liutang123/SPARK-24012.

(cherry picked from commit 64e8408)
What changes were proposed in this pull request?
Union of map and other compatible columns results in an `unresolved operator 'Union;` exception.
Reproduction:
`spark-sql> select map(1,2), 'str' union all select map(1,2,3,null), 1`
Output:
```
Error in query: unresolved operator 'Union;;
'Union
:- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
:  +- OneRowRelation$
+- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
   +- OneRowRelation$
```
So, we should cast some of the columns to be compatible when appropriate.
How was this patch tested?
Added a test (query union of map and other columns) to SQLQueryTestSuite's union.sql.