[SPARK-30301][SQL] Fix wrong results when datetimes as fields of complex types #26942

yaooqinn · 2019-12-19T03:34:25Z

What changes were proposed in this pull request?

When date and timestamp values are fields of arrays, maps, etc, we convert them to hive string using toString. This makes the result wrong before the default transition ’1582-10-15‘.

https://bugs.openjdk.java.net/browse/JDK-8061577?focusedCommentId=13566712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13566712

cases to reproduce:

+-- !query 47
+select array(cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15', null)
+-- !query 47 schema
+struct<array(CAST(1582-10-13 AS DATE), DATE '1582-10-14', DATE '1582-10-15', CAST(NULL AS DATE)):array<date>>
+-- !query 47 output
+[1582-10-03,1582-10-04,1582-10-15,null]
+
+
+-- !query 48
+select cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15'
+-- !query 48 schema
+struct<CAST(1582-10-13 AS DATE):date,DATE '1582-10-14':date,DATE '1582-10-15':date>
+-- !query 48 output
+1582-10-13     1582-10-14      1582-10-15

other refencences
h2database/h2database#831

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

yes, complex types containing datetimes in spark-sql script and thrift server can result same as self-contained spark app or spark-shell script

How was this patch tested?

add uts

…g results wrong

yaooqinn · 2019-12-19T03:35:09Z

cc: @maropu @HyukjinKwon @cloud-fan @gengliangwang @dongjoon-hyun, thanks for reviewing in advance.

SparkQA · 2019-12-19T04:39:10Z

Test build #115534 has finished for PR 26942 at commit 87fc4ab.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-12-19T04:54:16Z

sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala

+    case (d: Date, DateType) =>
+      dateFormatter.format(DateTimeUtils.fromJavaDate(d))
+    case (t: Timestamp, TimestampType) =>
+      DateTimeUtils.timestampToString(timestampFormatter, DateTimeUtils.fromJavaTimestamp(t))


it's the same as timestampFormatter.format(DateTimeUtils.fromJavaTimestamp(t))

cloud-fan · 2019-12-19T04:55:49Z

is it possible to unify toHiveStructString and toHiveString

yaooqinn · 2019-12-19T06:02:58Z

~~The current unify logic seems not to allow udt with '"' prefix or postfix.~~

SparkQA · 2019-12-19T06:41:30Z

Test build #115537 has finished for PR 26942 at commit 7e1c437.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-12-19T07:13:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala

-    case (other, _ : UserDefinedType[_]) => other.toString
-    case (other, tpe) if primitiveTypes.contains(tpe) => other.toString
-  }
+      case (null, _) => "NULL"


nit: 2 space identation.

sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out

SparkQA · 2019-12-19T07:56:02Z

Test build #115541 has finished for PR 26942 at commit 5b8bec3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-12-19T08:18:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala

+    case (n, _: NumericType) => n.toString
+    case (s: String, StringType) => "\"" + s + "\""
    case (interval: CalendarInterval, CalendarIntervalType) =>
      SQLConf.get.intervalOutputStyle match {


not related to this PR, but we should also simplify it. We can output SQL standard format if ansi mode is enabled, and output multi-unit format otherwise.

cc @gengliangwang @xuanyuanking

If we are doing so, I suggest we support SQL standard input first

cloud-fan · 2019-12-19T08:20:52Z

retest this please

SparkQA · 2019-12-19T09:13:18Z

Test build #115550 has finished for PR 26942 at commit bb77e3c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-19T10:17:18Z

Test build #115546 has finished for PR 26942 at commit bb77e3c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-19T14:20:36Z

Test build #115569 has finished for PR 26942 at commit 1d4a9bf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-19T16:22:41Z

Test build #115577 has finished for PR 26942 at commit e4bcc8c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-19T20:00:13Z

Test build #115579 has finished for PR 26942 at commit 32383e6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-20T06:39:09Z

Test build #115593 has finished for PR 26942 at commit 8310be0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-12-20T09:33:55Z

sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala

-      }.toSeq.sorted.mkString("{", ",", "}")
-    case (null, _) => "NULL"
+  def toHiveString(a: (Any, DataType), nested: Boolean = false): String = a match {
+    case (null, _) => if (nested) "null" else "NULL"


is this weird behavior inherited from Hive?

yes, there are many hive compatibility unit tests there

cloud-fan · 2019-12-20T11:22:05Z

thanks, merging to master!

juliuszsompolski · 2020-03-24T20:02:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala

-        case ISO_8601 => toIso8601String(interval)
-        case MULTI_UNITS => toMultiUnitsString(interval)
-      }
-    case (other, tpe) if primitiveTypes contains tpe => other.toString


@yaooqinn @cloud-fan
for ArrayType(ByteType), this was returning something like "[B@ba4f370" (by calling toString on the Array[Byte]), after this change it returns the String created via new String(bin, StandardCharsets.UTF_8)
Which of these is the intended output?
[B@ba4f370 is useless, while the string made from binary at least contains the contents of it, so it seems that after this change it at least makes more sense...

juliuszsompolski · 2020-03-24T20:20:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala

    case (t: Timestamp, TimestampType) =>
-      DateTimeUtils.timestampToString(timestampFormatter, DateTimeUtils.fromJavaTimestamp(t))
+      timestampFormatter.format(DateTimeUtils.fromJavaTimestamp(t))
    case (bin: Array[Byte], BinaryType) => new String(bin, StandardCharsets.UTF_8)


On the other hand, if the nested-struct result is to be parsed back e.g. as json, the arbitrary bytes from the String created from binary may mess it up...
Maybe returning if (nested) "<BINARY_BLOB>" else new String(bin, StandardCharsets.UTF_8) would make sense here?

This is a good point. Maybe we can follow df.show

case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")

The binary will be displayed like [a0b87...], which will not conflict with json strings.

Hm, but IIRC Hive results string from binary, and this is toHiveString. Does Hive return <BINARY_BLOB>?

It reminds me of #25480 and #25379

hive> create table bt(k binary); OK Time taken: 0.637 seconds hive> insert into bt values ("spark"), ("hello"), ("3"), ("."), ("0"); Query ID = root_20200325055555_1c227d63-47dd-4899-a879-0b6a98269908 Total jobs = 3 Launching Job 1 out of 3 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_1585115622286_0001, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1585115622286_0001/ Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1585115622286_0001 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 2020-03-25 05:57:03,557 Stage-1 map = 0%, reduce = 0% 2020-03-25 05:57:09,766 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.86 sec MapReduce Total cumulative CPU time: 1 seconds 860 msec Ended Job = job_1585115622286_0001 Stage-4 is selected by condition resolver. Stage-3 is filtered out by condition resolver. Stage-5 is filtered out by condition resolver. Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/bt/.hive-staging_hive_2020-03-25_05-56-54_007_226069574359185383-1/-ext-10000 Loading data to table default.bt Table default.bt stats: [numFiles=1, numRows=5, totalSize=33, rawDataSize=28] MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Cumulative CPU: 1.86 sec HDFS Read: 3080 HDFS Write: 99 SUCCESS Total MapReduce CPU Time Spent: 1 seconds 860 msec OK Time taken: 17.185 seconds hive> select k, array(k) from bt; OK spark [spark] hello [hello] 3 [3] . [.] 0 [0]

Anything I am missing here?

array(k) seems the array type of byte

ah i see. Then seems nothing needs to be changed. cc @juliuszsompolski

0: jdbc:hive2://localhost:10000> desc bt_arr; INFO : Compiling command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb): desc bt_arr INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:col_name, type:string, comment:from deserializer), FieldSchema(name:data_type, type:string, comment:from deserializer), FieldSchema(name:comment, type:string, comment:from deserializer)], properties:null) INFO : Completed compiling command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb); Time taken: 0.056 seconds INFO : Concurrency mode is disabled, not creating a lock manager INFO : Executing command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb): desc bt_arr INFO : Starting task [Stage-0:DDL] in serial mode INFO : Completed executing command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb); Time taken: 0.02 seconds INFO : OK +-----------+----------------+----------+--+ | col_name | data_type | comment | +-----------+----------------+----------+--+ | _c0 | array<binary> | | +-----------+----------------+----------+--+ 1 row selected (0.158 seconds) 0: jdbc:hive2://localhost:10000> select * from bt_arr; INFO : Compiling command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4): select * from bt_arr INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:bt_arr._c0, type:array<binary>, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4); Time taken: 0.053 seconds INFO : Concurrency mode is disabled, not creating a lock manager INFO : Executing command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4): select * from bt_arr INFO : Completed executing command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4); Time taken: 0.001 seconds INFO : OK +-------------+--+ | bt_arr._c0 | +-------------+--+ | [spark] | | [hello] | | [3] | | [.] | | [0] | +-------------+--+

I believe the new behaviour is more sensible than the previous - returning just [B@ba4f370 was useless; returning the String made from Binary blob at least returns some content... I know some systems try to parse back these results as JSON to be able to explore the nested data - e.g. I think PowerBI does that with what thriftserver returns... I fear that if the binary has some strange contents, and is unquoted, it will break that JSON parsing... But I haven't tested it.
I don't expect anyone using this kind of type combination and running into issues in practice...

juliuszsompolski · 2020-03-25T09:40:47Z

BTW: there's also a difference in Decimals:
Nested in toHiveStructString it used to just do decimal.toString, while now it always does HiveDecimal.create(decimal).toString.
It results in "...[-240162003],"col2":[0.00000000000000000007649305546030259135],"col3":{-100050828:..." did not equal "...[-240162003],"col2":[7.649305546030259135E-20],"col3":{-100050828:..." in some tests I run.
I also believe it doesn't really matter; likely the new behaviour is better.

…lex types When date and timestamp values are fields of arrays, maps, etc, we convert them to hive string using `toString`. This makes the result wrong before the default transition ’1582-10-15‘. https://bugs.openjdk.java.net/browse/JDK-8061577?focusedCommentId=13566712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13566712 cases to reproduce: ```sql +-- !query 47 +select array(cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15', null) +-- !query 47 schema +struct<array(CAST(1582-10-13 AS DATE), DATE '1582-10-14', DATE '1582-10-15', CAST(NULL AS DATE)):array<date>> +-- !query 47 output +[1582-10-03,1582-10-04,1582-10-15,null] + + +-- !query 48 +select cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15' +-- !query 48 schema +struct<CAST(1582-10-13 AS DATE):date,DATE '1582-10-14':date,DATE '1582-10-15':date> +-- !query 48 output +1582-10-13 1582-10-14 1582-10-15 ``` other refencences h2database/h2database#831 bug fix yes, complex types containing datetimes in `spark-sql `script and thrift server can result same as self-contained spark app or `spark-shell` script add uts Closes apache#26942 from yaooqinn/SPARK-30301. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

[SPARK-30301][SQL] Datetimes as fields of complex types to hive strin…

87fc4ab

…g results wrong

cloud-fan reviewed Dec 19, 2019

View reviewed changes

unify

7e1c437

nit

5b8bec3

cloud-fan reviewed Dec 19, 2019

View reviewed changes

sql/core/src/test/resources/sql-tests/results/csv-functions.sql.out Show resolved Hide resolved

fix tests

bb77e3c

cloud-fan reviewed Dec 19, 2019

View reviewed changes

cloud-fan approved these changes Dec 19, 2019

View reviewed changes

fix tests

1d4a9bf

fix hive compatibilty

e4bcc8c

regen all g files

32383e6

dongjoon-hyun added the SQL label Dec 19, 2019

fix examples

8310be0

HyukjinKwon approved these changes Dec 20, 2019

View reviewed changes

cloud-fan reviewed Dec 20, 2019

View reviewed changes

cloud-fan closed this in 12249fc Dec 20, 2019

juliuszsompolski reviewed Mar 24, 2020

View reviewed changes

[SPARK-30301][SQL] Fix wrong results when datetimes as fields of complex types #26942

[SPARK-30301][SQL] Fix wrong results when datetimes as fields of complex types #26942

Uh oh!

Conversation

yaooqinn commented Dec 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

yaooqinn commented Dec 19, 2019

Uh oh!

SparkQA commented Dec 19, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented Dec 19, 2019

Uh oh!

yaooqinn commented Dec 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SparkQA commented Dec 19, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

SparkQA commented Dec 19, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented Dec 19, 2019

Uh oh!

SparkQA commented Dec 19, 2019

Uh oh!

SparkQA commented Dec 19, 2019

Uh oh!

SparkQA commented Dec 19, 2019

Uh oh!

SparkQA commented Dec 19, 2019

Uh oh!

SparkQA commented Dec 19, 2019

Uh oh!

SparkQA commented Dec 20, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented Dec 20, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yaooqinn Mar 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juliuszsompolski commented Mar 25, 2020

yaooqinn commented Dec 19, 2019 •

edited

Loading

yaooqinn commented Dec 19, 2019 •

edited

Loading

yaooqinn Mar 25, 2020 •

edited

Loading