Skip to content

Conversation

@yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Dec 19, 2019

What changes were proposed in this pull request?

When date and timestamp values are fields of arrays, maps, etc, we convert them to hive string using toString. This makes the result wrong before the default transition ’1582-10-15‘.

https://bugs.openjdk.java.net/browse/JDK-8061577?focusedCommentId=13566712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13566712

cases to reproduce:

+-- !query 47
+select array(cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15', null)
+-- !query 47 schema
+struct<array(CAST(1582-10-13 AS DATE), DATE '1582-10-14', DATE '1582-10-15', CAST(NULL AS DATE)):array<date>>
+-- !query 47 output
+[1582-10-03,1582-10-04,1582-10-15,null]
+
+
+-- !query 48
+select cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15'
+-- !query 48 schema
+struct<CAST(1582-10-13 AS DATE):date,DATE '1582-10-14':date,DATE '1582-10-15':date>
+-- !query 48 output
+1582-10-13     1582-10-14      1582-10-15

other refencences
h2database/h2database#831

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

yes, complex types containing datetimes in spark-sql script and thrift server can result same as self-contained spark app or spark-shell script

How was this patch tested?

add uts

@yaooqinn
Copy link
Member Author

cc: @maropu @HyukjinKwon @cloud-fan @gengliangwang @dongjoon-hyun, thanks for reviewing in advance.

@SparkQA
Copy link

SparkQA commented Dec 19, 2019

Test build #115534 has finished for PR 26942 at commit 87fc4ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case (d: Date, DateType) =>
dateFormatter.format(DateTimeUtils.fromJavaDate(d))
case (t: Timestamp, TimestampType) =>
DateTimeUtils.timestampToString(timestampFormatter, DateTimeUtils.fromJavaTimestamp(t))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the same as timestampFormatter.format(DateTimeUtils.fromJavaTimestamp(t))

@cloud-fan
Copy link
Contributor

is it possible to unify toHiveStructString and toHiveString

@yaooqinn
Copy link
Member Author

yaooqinn commented Dec 19, 2019

The current unify logic seems not to allow udt with '"' prefix or postfix.

@SparkQA
Copy link

SparkQA commented Dec 19, 2019

Test build #115537 has finished for PR 26942 at commit 7e1c437.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case (other, _ : UserDefinedType[_]) => other.toString
case (other, tpe) if primitiveTypes.contains(tpe) => other.toString
}
case (null, _) => "NULL"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 2 space identation.

@SparkQA
Copy link

SparkQA commented Dec 19, 2019

Test build #115541 has finished for PR 26942 at commit 5b8bec3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case (n, _: NumericType) => n.toString
case (s: String, StringType) => "\"" + s + "\""
case (interval: CalendarInterval, CalendarIntervalType) =>
SQLConf.get.intervalOutputStyle match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this PR, but we should also simplify it. We can output SQL standard format if ansi mode is enabled, and output multi-unit format otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are doing so, I suggest we support SQL standard input first

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Dec 19, 2019

Test build #115550 has finished for PR 26942 at commit bb77e3c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 19, 2019

Test build #115546 has finished for PR 26942 at commit bb77e3c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 19, 2019

Test build #115569 has finished for PR 26942 at commit 1d4a9bf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 19, 2019

Test build #115577 has finished for PR 26942 at commit e4bcc8c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 19, 2019

Test build #115579 has finished for PR 26942 at commit 32383e6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 20, 2019

Test build #115593 has finished for PR 26942 at commit 8310be0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}.toSeq.sorted.mkString("{", ",", "}")
case (null, _) => "NULL"
def toHiveString(a: (Any, DataType), nested: Boolean = false): String = a match {
case (null, _) => if (nested) "null" else "NULL"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this weird behavior inherited from Hive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, there are many hive compatibility unit tests there

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 12249fc Dec 20, 2019
case ISO_8601 => toIso8601String(interval)
case MULTI_UNITS => toMultiUnitsString(interval)
}
case (other, tpe) if primitiveTypes contains tpe => other.toString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yaooqinn @cloud-fan
for ArrayType(ByteType), this was returning something like "[B@ba4f370" (by calling toString on the Array[Byte]), after this change it returns the String created via new String(bin, StandardCharsets.UTF_8)
Which of these is the intended output?
[B@ba4f370 is useless, while the string made from binary at least contains the contents of it, so it seems that after this change it at least makes more sense...

case (t: Timestamp, TimestampType) =>
DateTimeUtils.timestampToString(timestampFormatter, DateTimeUtils.fromJavaTimestamp(t))
timestampFormatter.format(DateTimeUtils.fromJavaTimestamp(t))
case (bin: Array[Byte], BinaryType) => new String(bin, StandardCharsets.UTF_8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, if the nested-struct result is to be parsed back e.g. as json, the arbitrary bytes from the String created from binary may mess it up...
Maybe returning if (nested) "<BINARY_BLOB>" else new String(bin, StandardCharsets.UTF_8) would make sense here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. Maybe we can follow df.show

case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")

The binary will be displayed like [a0b87...], which will not conflict with json strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, but IIRC Hive results string from binary, and this is toHiveString. Does Hive return <BINARY_BLOB>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It reminds me of #25480 and #25379

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hive> create table bt(k binary);
OK
Time taken: 0.637 seconds
hive> insert into bt values ("spark"), ("hello"), ("3"), ("."), ("0");
Query ID = root_20200325055555_1c227d63-47dd-4899-a879-0b6a98269908
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1585115622286_0001, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1585115622286_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1585115622286_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-03-25 05:57:03,557 Stage-1 map = 0%,  reduce = 0%
2020-03-25 05:57:09,766 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.86 sec
MapReduce Total cumulative CPU time: 1 seconds 860 msec
Ended Job = job_1585115622286_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/bt/.hive-staging_hive_2020-03-25_05-56-54_007_226069574359185383-1/-ext-10000
Loading data to table default.bt
Table default.bt stats: [numFiles=1, numRows=5, totalSize=33, rawDataSize=28]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.86 sec   HDFS Read: 3080 HDFS Write: 99 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 860 msec
OK
Time taken: 17.185 seconds
hive> select k, array(k) from bt;
OK
spark	[spark]
hello	[hello]
3	[3]
.	[.]
0	[0]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything I am missing here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array(k) seems the array type of byte

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah i see. Then seems nothing needs to be changed. cc @juliuszsompolski

Copy link
Member Author

@yaooqinn yaooqinn Mar 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0: jdbc:hive2://localhost:10000> desc bt_arr;
INFO  : Compiling command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb): desc bt_arr
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:col_name, type:string, comment:from deserializer), FieldSchema(name:data_type, type:string, comment:from deserializer), FieldSchema(name:comment, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb); Time taken: 0.056 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb): desc bt_arr
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb); Time taken: 0.02 seconds
INFO  : OK
+-----------+----------------+----------+--+
| col_name  |   data_type    | comment  |
+-----------+----------------+----------+--+
| _c0       | array<binary>  |          |
+-----------+----------------+----------+--+
1 row selected (0.158 seconds)
0: jdbc:hive2://localhost:10000> select * from bt_arr;
INFO  : Compiling command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4): select * from bt_arr
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:bt_arr._c0, type:array<binary>, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4); Time taken: 0.053 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4): select * from bt_arr
INFO  : Completed executing command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4); Time taken: 0.001 seconds
INFO  : OK
+-------------+--+
| bt_arr._c0  |
+-------------+--+
| [spark]     |
| [hello]     |
| [3]         |
| [.]         |
| [0]         |
+-------------+--+

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the new behaviour is more sensible than the previous - returning just [B@ba4f370 was useless; returning the String made from Binary blob at least returns some content... I know some systems try to parse back these results as JSON to be able to explore the nested data - e.g. I think PowerBI does that with what thriftserver returns... I fear that if the binary has some strange contents, and is unquoted, it will break that JSON parsing... But I haven't tested it.
I don't expect anyone using this kind of type combination and running into issues in practice...

@juliuszsompolski
Copy link
Contributor

BTW: there's also a difference in Decimals:
Nested in toHiveStructString it used to just do decimal.toString, while now it always does HiveDecimal.create(decimal).toString.
It results in "...[-240162003],"col2":[0.00000000000000000007649305546030259135],"col3":{-100050828:..." did not equal "...[-240162003],"col2":[7.649305546030259135E-20],"col3":{-100050828:..." in some tests I run.
I also believe it doesn't really matter; likely the new behaviour is better.

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 24, 2020
…lex types

When date and timestamp values are fields of arrays, maps, etc, we convert them to hive string using `toString`. This makes the result wrong before the default transition ’1582-10-15‘.

https://bugs.openjdk.java.net/browse/JDK-8061577?focusedCommentId=13566712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13566712

cases to reproduce:

```sql
+-- !query 47
+select array(cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15', null)
+-- !query 47 schema
+struct<array(CAST(1582-10-13 AS DATE), DATE '1582-10-14', DATE '1582-10-15', CAST(NULL AS DATE)):array<date>>
+-- !query 47 output
+[1582-10-03,1582-10-04,1582-10-15,null]
+
+
+-- !query 48
+select cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15'
+-- !query 48 schema
+struct<CAST(1582-10-13 AS DATE):date,DATE '1582-10-14':date,DATE '1582-10-15':date>
+-- !query 48 output
+1582-10-13     1582-10-14      1582-10-15
```

other refencences
h2database/h2database#831

bug fix

yes, complex types containing datetimes in `spark-sql `script and thrift server can result same as self-contained spark app or `spark-shell` script

add uts

Closes apache#26942 from yaooqinn/SPARK-30301.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants