-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-30301][SQL] Fix wrong results when datetimes as fields of complex types #26942
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc: @maropu @HyukjinKwon @cloud-fan @gengliangwang @dongjoon-hyun, thanks for reviewing in advance. |
|
Test build #115534 has finished for PR 26942 at commit
|
| case (d: Date, DateType) => | ||
| dateFormatter.format(DateTimeUtils.fromJavaDate(d)) | ||
| case (t: Timestamp, TimestampType) => | ||
| DateTimeUtils.timestampToString(timestampFormatter, DateTimeUtils.fromJavaTimestamp(t)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the same as timestampFormatter.format(DateTimeUtils.fromJavaTimestamp(t))
|
is it possible to unify |
|
|
|
Test build #115537 has finished for PR 26942 at commit
|
| case (other, _ : UserDefinedType[_]) => other.toString | ||
| case (other, tpe) if primitiveTypes.contains(tpe) => other.toString | ||
| } | ||
| case (null, _) => "NULL" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 2 space identation.
|
Test build #115541 has finished for PR 26942 at commit
|
| case (n, _: NumericType) => n.toString | ||
| case (s: String, StringType) => "\"" + s + "\"" | ||
| case (interval: CalendarInterval, CalendarIntervalType) => | ||
| SQLConf.get.intervalOutputStyle match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not related to this PR, but we should also simplify it. We can output SQL standard format if ansi mode is enabled, and output multi-unit format otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are doing so, I suggest we support SQL standard input first
|
retest this please |
|
Test build #115550 has finished for PR 26942 at commit
|
|
Test build #115546 has finished for PR 26942 at commit
|
|
Test build #115569 has finished for PR 26942 at commit
|
|
Test build #115577 has finished for PR 26942 at commit
|
|
Test build #115579 has finished for PR 26942 at commit
|
|
Test build #115593 has finished for PR 26942 at commit
|
| }.toSeq.sorted.mkString("{", ",", "}") | ||
| case (null, _) => "NULL" | ||
| def toHiveString(a: (Any, DataType), nested: Boolean = false): String = a match { | ||
| case (null, _) => if (nested) "null" else "NULL" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this weird behavior inherited from Hive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, there are many hive compatibility unit tests there
|
thanks, merging to master! |
| case ISO_8601 => toIso8601String(interval) | ||
| case MULTI_UNITS => toMultiUnitsString(interval) | ||
| } | ||
| case (other, tpe) if primitiveTypes contains tpe => other.toString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yaooqinn @cloud-fan
for ArrayType(ByteType), this was returning something like "[B@ba4f370" (by calling toString on the Array[Byte]), after this change it returns the String created via new String(bin, StandardCharsets.UTF_8)
Which of these is the intended output?
[B@ba4f370 is useless, while the string made from binary at least contains the contents of it, so it seems that after this change it at least makes more sense...
| case (t: Timestamp, TimestampType) => | ||
| DateTimeUtils.timestampToString(timestampFormatter, DateTimeUtils.fromJavaTimestamp(t)) | ||
| timestampFormatter.format(DateTimeUtils.fromJavaTimestamp(t)) | ||
| case (bin: Array[Byte], BinaryType) => new String(bin, StandardCharsets.UTF_8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, if the nested-struct result is to be parsed back e.g. as json, the arbitrary bytes from the String created from binary may mess it up...
Maybe returning if (nested) "<BINARY_BLOB>" else new String(bin, StandardCharsets.UTF_8) would make sense here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. Maybe we can follow df.show
case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
The binary will be displayed like [a0b87...], which will not conflict with json strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, but IIRC Hive results string from binary, and this is toHiveString. Does Hive return <BINARY_BLOB>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hive> create table bt(k binary);
OK
Time taken: 0.637 seconds
hive> insert into bt values ("spark"), ("hello"), ("3"), ("."), ("0");
Query ID = root_20200325055555_1c227d63-47dd-4899-a879-0b6a98269908
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1585115622286_0001, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1585115622286_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1585115622286_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-03-25 05:57:03,557 Stage-1 map = 0%, reduce = 0%
2020-03-25 05:57:09,766 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.86 sec
MapReduce Total cumulative CPU time: 1 seconds 860 msec
Ended Job = job_1585115622286_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/bt/.hive-staging_hive_2020-03-25_05-56-54_007_226069574359185383-1/-ext-10000
Loading data to table default.bt
Table default.bt stats: [numFiles=1, numRows=5, totalSize=33, rawDataSize=28]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.86 sec HDFS Read: 3080 HDFS Write: 99 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 860 msec
OK
Time taken: 17.185 seconds
hive> select k, array(k) from bt;
OK
spark [spark]
hello [hello]
3 [3]
. [.]
0 [0]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything I am missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
array(k) seems the array type of byte
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see. Then seems nothing needs to be changed. cc @juliuszsompolski
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0: jdbc:hive2://localhost:10000> desc bt_arr;
INFO : Compiling command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb): desc bt_arr
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:col_name, type:string, comment:from deserializer), FieldSchema(name:data_type, type:string, comment:from deserializer), FieldSchema(name:comment, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb); Time taken: 0.056 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb): desc bt_arr
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20200325062323_10f71e38-fe8d-4094-aa87-5361df066edb); Time taken: 0.02 seconds
INFO : OK
+-----------+----------------+----------+--+
| col_name | data_type | comment |
+-----------+----------------+----------+--+
| _c0 | array<binary> | |
+-----------+----------------+----------+--+
1 row selected (0.158 seconds)
0: jdbc:hive2://localhost:10000> select * from bt_arr;
INFO : Compiling command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4): select * from bt_arr
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:bt_arr._c0, type:array<binary>, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4); Time taken: 0.053 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4): select * from bt_arr
INFO : Completed executing command(queryId=hive_20200325062323_edfa2e37-a8da-481b-97e9-f4ed9a37b9a4); Time taken: 0.001 seconds
INFO : OK
+-------------+--+
| bt_arr._c0 |
+-------------+--+
| [spark] |
| [hello] |
| [3] |
| [.] |
| [0] |
+-------------+--+There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the new behaviour is more sensible than the previous - returning just [B@ba4f370 was useless; returning the String made from Binary blob at least returns some content... I know some systems try to parse back these results as JSON to be able to explore the nested data - e.g. I think PowerBI does that with what thriftserver returns... I fear that if the binary has some strange contents, and is unquoted, it will break that JSON parsing... But I haven't tested it.
I don't expect anyone using this kind of type combination and running into issues in practice...
|
BTW: there's also a difference in Decimals: |
…lex types When date and timestamp values are fields of arrays, maps, etc, we convert them to hive string using `toString`. This makes the result wrong before the default transition ’1582-10-15‘. https://bugs.openjdk.java.net/browse/JDK-8061577?focusedCommentId=13566712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13566712 cases to reproduce: ```sql +-- !query 47 +select array(cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15', null) +-- !query 47 schema +struct<array(CAST(1582-10-13 AS DATE), DATE '1582-10-14', DATE '1582-10-15', CAST(NULL AS DATE)):array<date>> +-- !query 47 output +[1582-10-03,1582-10-04,1582-10-15,null] + + +-- !query 48 +select cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15' +-- !query 48 schema +struct<CAST(1582-10-13 AS DATE):date,DATE '1582-10-14':date,DATE '1582-10-15':date> +-- !query 48 output +1582-10-13 1582-10-14 1582-10-15 ``` other refencences h2database/h2database#831 bug fix yes, complex types containing datetimes in `spark-sql `script and thrift server can result same as self-contained spark app or `spark-shell` script add uts Closes apache#26942 from yaooqinn/SPARK-30301. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
When date and timestamp values are fields of arrays, maps, etc, we convert them to hive string using
toString. This makes the result wrong before the default transition ’1582-10-15‘.https://bugs.openjdk.java.net/browse/JDK-8061577?focusedCommentId=13566712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13566712
cases to reproduce:
other refencences
h2database/h2database#831
Why are the changes needed?
bug fix
Does this PR introduce any user-facing change?
yes, complex types containing datetimes in
spark-sqlscript and thrift server can result same as self-contained spark app orspark-shellscriptHow was this patch tested?
add uts