[SPARK-24012][SQL] Union of map and other compatible column #21100
Conversation
ok to test
Test build #89593 has finished for PR 21100 at commit
""" | ||
|SELECT map(1, 2), 'str' | ||
|UNION ALL | ||
|SELECT map(1, 2, 3, NULL), 1""".stripMargin), |
can you give some insight into why it doesn't work? I'd expect Spark to first do type coercion for `map(1, 2, 3, NULL)`, making the result `map<int, nullable int>`; then Union should accept the nullability difference and pass analysis.
`map<int, nullable int>` and `map<int, not nullable int>` are accepted by Union, but `string` and `int` are not.
If the types of even one column cannot be accepted by Union, `TypeCoercion.WidenSetOperationTypes` (TCWSOT) tries to coerce them to a completely identical type. TCWSOT only applies when every column can be coerced; if any column cannot be, it does nothing.
`map<int, nullable int>` and `map<int, not nullable int>` cannot be coerced, so TCWSOT did not run, and therefore `string` and `int` were not coerced either.
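To make the all-or-nothing behavior concrete, here is a toy Python model of the idea (not Spark's actual Scala code; the function names and the simplified type rules are invented for illustration). The point is that one incompatible column pair blocks coercion of every column:

```python
def widen(t1, t2):
    """Toy stand-in for findWiderTypeForTwo: returns a common type or None."""
    if t1 == t2:
        return t1
    numeric = ["int", "long", "double"]
    if t1 in numeric and t2 in numeric:
        # widen to the wider numeric type
        return numeric[max(numeric.index(t1), numeric.index(t2))]
    if t1 == "string" or t2 == "string":
        # toy promotion rule: anything vs string widens to string
        return "string"
    # e.g. two map types: before the fix, Spark gives up here
    return None

def widen_set_operation_types(left, right):
    """Toy stand-in for WidenSetOperationTypes: all columns or none."""
    widened = [widen(a, b) for a, b in zip(left, right)]
    if any(w is None for w in widened):
        return None  # one incompatible column blocks coercion of ALL columns
    return widened

# Two map types that differ only in value nullability have no common type
# in this pre-fix model, so even the coercible string/int pair is left alone:
left = [("map", "int", "int?"), "string"]
right = [("map", "int", "int"), "int"]
print(widen_set_operation_types(left, right))  # None: nothing gets coerced
```

This mirrors why the query in the PR fails: the map columns block TCWSOT entirely, so the `'str'` vs `1` pair is never widened.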
Shall we make `map<int, nullable int>` and `map<int, not nullable int>` coercible?
Of course we can.
Two solutions:
- Cast the two map types to one even when the key types or the value types differ. Then `select map(1, 2) union all select map(1, 'str')` would work.
- Cast the two map types to one only when the key types are the same and the value types are the same. This only solves the problem that `map<t1, nullable t2>` and `map<t1, not nullable t2>` can't be unioned.

Hive doesn't support `select map(1, 2) union all select map(1, 'str')`; should Spark be compatible with Hive?
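The two options above can be contrasted with a small Python sketch (assumed semantics, not Spark code; the names and the single toy promotion rule are invented). Option 1 widens key/value types recursively; option 2 only accepts maps whose key/value types already match:

```python
def widen_primitive(t1, t2):
    """Toy element-level widening: identical types, plus int/string -> string."""
    if t1 == t2:
        return t1
    if {t1, t2} == {"int", "string"}:
        return "string"  # toy promotion rule for illustration
    return None

def widen_map_option1(m1, m2):
    """Option 1: widen keys and values independently; m = (key_type, value_type)."""
    k = widen_primitive(m1[0], m2[0])
    v = widen_primitive(m1[1], m2[1])
    return (k, v) if k is not None and v is not None else None

def widen_map_option2(m1, m2):
    """Option 2: only identical key/value types are compatible."""
    return m1 if m1 == m2 else None

# map<int,int> vs map<int,string>:
print(widen_map_option1(("int", "int"), ("int", "string")))  # ('int', 'string')
print(widen_map_option2(("int", "int"), ("int", "string")))  # None
```

The PR ultimately takes the conservative option 2 (plus merging nullability), matching Hive's behavior of rejecting `map(1, 2)` unioned with `map(1, 'str')`.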
…ot nullable type2> coerce-able
@@ -171,6 +171,15 @@ object TypeCoercion {
      .orElse((t1, t2) match {
        case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) =>
          findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || containsNull2))
        case (MapType(keyType1, valueType1, n1), MapType(keyType2, valueType2, n2))
We have similar logic for struct type in `findTightestCommonType`; I think we should also handle array and map types there.
Hi, I implemented this logic in `findTightestCommonType`; looking forward to further review.
Test build #89626 has finished for PR 21100 at commit
…ndTightestCommonType
@@ -111,6 +111,18 @@ object TypeCoercion {
          val dataType = findTightestCommonType(f1.dataType, f2.dataType).get
          StructField(f1.name, dataType, nullable = f1.nullable || f2.nullable)
        }))
      case (a1 @ ArrayType(et1, containsNull1), a2 @ ArrayType(et2, containsNull2))
we can shorten the names here: `hasNull1`, `hasNull2`
      case (a1 @ ArrayType(et1, containsNull1), a2 @ ArrayType(et2, containsNull2))
          if a1.sameType(a2) =>
        findTightestCommonType(et1, et2).map(ArrayType(_, containsNull1 || containsNull2))
      case (m1 @ MapType(keyType1, valueType1, n1), m2 @ MapType(keyType2, valueType2, n2))
ditto: `kt1`, `vt1`, `hasNull1`
          if m1.sameType(m2) =>
        val keyType = findTightestCommonType(keyType1, keyType2)
        val valueType = findTightestCommonType(valueType1, valueType2)
        if (keyType.isEmpty || valueType.isEmpty) {
We don't need this; it's guaranteed by `m1.sameType(m2)`.
Test build #89696 has finished for PR 21100 at commit
Test build #89700 has finished for PR 21100 at commit
retest this please
@@ -111,6 +111,14 @@ object TypeCoercion {
          val dataType = findTightestCommonType(f1.dataType, f2.dataType).get
          StructField(f1.name, dataType, nullable = f1.nullable || f2.nullable)
        }))
      case (a1 @ ArrayType(et1, hasNull1), a2 @ ArrayType(et2, hasNull2))
          if a1.sameType(a2) =>
after shortening the names, can we merge the `if` into the `case ...` line?
also, we need a blank line between these cases
@@ -35,6 +35,11 @@ FROM (SELECT col AS col
SELECT col
FROM p3) T1) T2;

-- SPARK-24012 Union of map and other compatible columns.
SELECT map(1, 2), 'str'
shall we also add a test for array?
LGTM
LGTM too
Test build #89704 has finished for PR 21100 at commit
@@ -896,6 +896,25 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
  }
}

test("SPARK-24012 Union of map and other compatible columns") {
cc @gatorsmile, what's the policy for end-to-end tests? Shall we add it in both the SQL golden file and `SQLQuerySuite`?
Yes, please add them to `SQLQueryTestSuite`.
Discussed with @gatorsmile: we should put end-to-end tests in a single place, and currently we encourage people to put SQL-related end-to-end tests in the SQL golden files. That is to say, we should remove this test from `SQLQuerySuite`.
In the meanwhile, a bug fix should also have a unit test. For this case, we should add a test case in `TypeCoercionSuite`. @liutang123, if you are not familiar with that test suite, please let us know; we can merge your PR first and add the UT in `TypeCoercionSuite` in a followup.
@cloud-fan, yes, I am not familiar with `TypeCoercionSuite`. To save time, I think this PR can be merged first. Thanks a lot.
OK, please remove this test and it's ready to go.
Test build #89755 has finished for PR 21100 at commit
Test build #89757 has finished for PR 21100 at commit
Test build #89774 has finished for PR 21100 at commit
      case (m1 @ MapType(kt1, vt1, hasNull1), m2 @ MapType(kt2, vt2, hasNull2)) if m1.sameType(m2) =>
        val keyType = findTightestCommonType(kt1, kt2)
        val valueType = findTightestCommonType(vt1, vt2)
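The behavior this case implements can be sketched in a few lines of Python (a hypothetical model for illustration, not the Spark source; `MapType` here is a toy dataclass, not `org.apache.spark.sql.types.MapType`). When two map types match ignoring nullability, the common type keeps the shared key/value types and ORs the `valueContainsNull` flags:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MapType:
    key: str
    value: str
    value_contains_null: bool

def same_type(m1, m2):
    # models DataType.sameType: compare types, ignore nullability flags
    return m1.key == m2.key and m1.value == m2.value

def tightest_common_map(m1, m2):
    """Toy version of the MapType case: merge nullability, never widen types."""
    if same_type(m1, m2):
        return MapType(m1.key, m1.value,
                       m1.value_contains_null or m2.value_contains_null)
    return None  # differing key/value types are still rejected

# map<int, nullable int> vs map<int, not nullable int>:
a = MapType("int", "int", True)
b = MapType("int", "int", False)
print(tightest_common_map(a, b))  # MapType('int', 'int', True)
```

With this in place, the map columns in the PR's query get a common type, so `WidenSetOperationTypes` can proceed to coerce the remaining `string`/`int` pair.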
BTW, I think we should do the same thing in `findWiderTypeForTwo` to cover some corner cases such as decimal or string promotion within keys and values. It seems #21100 (comment) suggested the same thing?
This is something we should figure out: why does `findWiderTypeForTwo` only take care of array type? It seems all complex types should be handled there, especially if it follows Hive's behavior.
Anyway, it's orthogonal to `findTightestCommonType`; they are used by different operators.
Yea, I was just wondering while reading it. However, doesn't that mean we don't do type widening for nested types in the same way? I was thinking we should do the same type widening for nested types too.
I mean, I was thinking we should do that in both places, `findTightestCommonType` and `findWiderTypeForTwo`. Otherwise, the nested types in struct, map, or array won't get, for example, decimal or string promotion.
Sure, it's orthogonal. Yup, I was just wondering. I am okay to leave this out of this PR.
Oops, I misread your comment. Sorry. I was talking about the same thing.
We also need to look into `findTightestCommonType`. Currently we are very conservative and only allow nullability changes for complex types there. We should take a look at other systems and see what they do.
I agree with that, given past discussions. I didn't mean we should change something now; I was just wondering.
## What changes were proposed in this pull request?

A followup of apache#21100.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21154 from cloud-fan/test.
## What changes were proposed in this pull request?

Union of map and other compatible columns results in an `unresolved operator 'Union;` exception.

Reproduction:
`spark-sql> select map(1,2), 'str' union all select map(1,2,3,null), 1`

Output:
```
Error in query: unresolved operator 'Union;;
'Union
:- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
:  +- OneRowRelation$
+- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
   +- OneRowRelation$
```

So, we should cast some of the columns to be compatible when appropriate.

## How was this patch tested?

Added a test (query union of map and other columns) to SQLQueryTestSuite's union.sql.

Author: liutang123 <liutang123@yeah.net>

Closes apache#21100 from liutang123/SPARK-24012.

(cherry picked from commit 64e8408)
What changes were proposed in this pull request?
Union of map and other compatible columns results in an `unresolved operator 'Union;` exception.
Reproduction:
`spark-sql> select map(1,2), 'str' union all select map(1,2,3,null), 1`
Output:
```
Error in query: unresolved operator 'Union;;
'Union
:- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
:  +- OneRowRelation$
+- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
   +- OneRowRelation$
```
So, we should cast some of the columns to be compatible when appropriate.
How was this patch tested?
Added a test (query union of map and other columns) to SQLQueryTestSuite's union.sql.