[SPARK-17545] [SQL] Handle additional time offset formats of ISO 8601 #15147

nbeyer · 2016-09-19T02:29:52Z

What changes were proposed in this pull request?

Allows flexibility in handling additional ISO 8601 time offset variants. The current parsing of ISO 8601 is exclusive to W3C's datetime note (and XML Schema datetime). This change will allow offset to be handled as "HH:MM", "HHMM" and "HH".

This is one suggested approach to handling these variants. The other suggestions are to switch back to SimpleDateFormat and utilize the 'X' pattern flag. Another suggestion is to wait for a future release of commons-lang. Either of these suggestions can be implemented later over top of this suggestion.

How was this patch tested?

The patch was tested by running all existing unit tests and by augmenting the existing tests with additional assertions.

AmplabJenkins · 2016-09-19T02:32:15Z

Can one of the admins verify this patch?

HyukjinKwon · 2016-09-19T02:55:48Z

To continue the discussion of JIRA, I think the issue you faced is to read those in CSV, right?

Whether it is intended or not in FastDateFormat, the default pattern "yyyy-MM-dd'T'HH:mm:ss.SSSZZ" in both CSVOptions and JSONOptions seems covering this (this patch will only affect both JSON and CSV).

it seems I can't reproduce the issue you met in the JIRA. Have you tried the problematic codes in the master branch?

That would not ran in Spark 2.0 but we made a change.

So the codes below

val path = "/tmp/timestamps"
val textDf = Seq(
  "time",
  "2015-07-20T15:09:23.736-0500",
  "2015-07-20T15:10:51.687-0500",
  "2015-11-21T23:15:01.499-0600").toDF()
textDf.coalesce(1).write.text(path)
val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(path)
df.show()
df.printSchema()

seems working fine

+--------------------+
|                time|
+--------------------+
|2015-07-20 13:09:...|
|2015-07-20 13:10:...|
|2015-11-21 21:15:...|
+--------------------+

root
 |-- time: timestamp (nullable = true)

HyukjinKwon · 2016-09-19T02:57:42Z

cc @srowen who was in the JIRA too.

HyukjinKwon · 2016-09-19T03:00:40Z

FYI, we backported this change to branch 2.0 too. So this will be fixed from 2.0.1 #14799.

nbeyer · 2016-09-19T14:23:21Z

@HyukjinKwon I'll have to try out the 2.0.1/master changes. Is this DateTimeUtils method no longer used anywhere then? The other scenario where I ran into trouble was using 'cast' on Column class.

If the changes you mention fix all places where StringType containing ISO8601 variants can be massaged to TimestampType, then that may take care of everything. However, if there is still use of this DateTimeUtils, then I'd like to push forward with this PR as a suggestion.

HyukjinKwon · 2016-09-19T21:09:53Z

@nbeyer I manually checked the method call before and it seems that is currently only called when reading/inferring CSV/JSON (but I'd like you confirm this to double check in case).

nbeyer · 2016-09-20T04:27:09Z

@HyukjinKwon Based on my further reading of the code, I'd like to suggest that add a deprecation to the stringToTime method and then update the stringToTimestamp method, specifically here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L356 to handle the "no colon" case. It is the stringToTimestamp method that is used by the 'cast'.

HyukjinKwon · 2016-09-20T05:00:31Z

@nbeyer Thanks for your investigation. I think that sounds reasonable though I think it might be arguable because adding more cases virtually means more time and computation to parse/infer schema (specifically in case of TimestampType in CSV for inference) and there is an option to specify the dateformat; however, it'd make sense that users don't really want to specify any option when they want to read time data. I'd follow committer's lead.

(Personally, I tend to agree with the same rule with casting operation which looks consistent.)

Anyway, this might be not related with this JIRA anymore. How about creating another one describing current state and suggestion maybe?

HyukjinKwon · 2016-09-20T05:48:02Z

I mean the problem in the JIRA is not reproduced in the master branch and therefore I believe we need another JIRA to describe the support for other time formats as the same one as casting operation as you suggested.

nbeyer · 2016-09-21T00:05:25Z

@HyukjinKwon Yeah, I agree, this PR doesn't really apply anymore. I'll work up a different JIRA and PR.

HyukjinKwon · 2016-09-21T00:27:19Z

@nbeyer Thanks for bearing with me.

Handle additional time offset formats of ISO 8601

db2e1f0

nbeyer closed this Sep 21, 2016

HyukjinKwon mentioned this pull request Mar 31, 2017

[SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead of ZZ (FastDateFormat specific) in CSV/JSON timeformat options #17489

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-17545] [SQL] Handle additional time offset formats of ISO 8601 #15147

[SPARK-17545] [SQL] Handle additional time offset formats of ISO 8601 #15147

Uh oh!

nbeyer commented Sep 19, 2016

Uh oh!

AmplabJenkins commented Sep 19, 2016

Uh oh!

HyukjinKwon commented Sep 19, 2016 •

edited

Loading

Uh oh!

HyukjinKwon commented Sep 19, 2016

Uh oh!

HyukjinKwon commented Sep 19, 2016 •

edited

Loading

Uh oh!

nbeyer commented Sep 19, 2016

Uh oh!

HyukjinKwon commented Sep 19, 2016

Uh oh!

nbeyer commented Sep 20, 2016

Uh oh!

HyukjinKwon commented Sep 20, 2016 •

edited

Loading

Uh oh!

HyukjinKwon commented Sep 20, 2016

Uh oh!

nbeyer commented Sep 21, 2016

Uh oh!

HyukjinKwon commented Sep 21, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[SPARK-17545] [SQL] Handle additional time offset formats of ISO 8601 #15147

[SPARK-17545] [SQL] Handle additional time offset formats of ISO 8601 #15147

Uh oh!

Conversation

nbeyer commented Sep 19, 2016

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

AmplabJenkins commented Sep 19, 2016

Uh oh!

HyukjinKwon commented Sep 19, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HyukjinKwon commented Sep 19, 2016

Uh oh!

HyukjinKwon commented Sep 19, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nbeyer commented Sep 19, 2016

Uh oh!

HyukjinKwon commented Sep 19, 2016

Uh oh!

nbeyer commented Sep 20, 2016

Uh oh!

HyukjinKwon commented Sep 20, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HyukjinKwon commented Sep 20, 2016

Uh oh!

nbeyer commented Sep 21, 2016

Uh oh!

HyukjinKwon commented Sep 21, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HyukjinKwon commented Sep 19, 2016 •

edited

Loading

HyukjinKwon commented Sep 19, 2016 •

edited

Loading

HyukjinKwon commented Sep 20, 2016 •

edited

Loading