Skip to content

Conversation

@nbeyer
Copy link

@nbeyer nbeyer commented Sep 19, 2016

What changes were proposed in this pull request?

Allows flexibility in handling additional ISO 8601 time offset variants. The current parsing of ISO 8601 is exclusive to W3C's datetime note (and XML Schema datetime). This change will allow offset to be handled as "HH:MM", "HHMM" and "HH".

This is one suggested approach to handling these variants. The other suggestions are to switch back to SimpleDateFormat and utilize the 'X' pattern flag. Another suggestion is to wait for a future release of commons-lang. Either of these suggestions can be implemented later over top of this suggestion.

How was this patch tested?

The patch was tested by running all existing unit tests and by augmenting the existing tests with additional assertions.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Sep 19, 2016

To continue the discussion of JIRA, I think the issue you faced is to read those in CSV, right?

Whether it is intended or not in FastDateFormat, the default pattern "yyyy-MM-dd'T'HH:mm:ss.SSSZZ" in both CSVOptions and JSONOptions seems covering this (this patch will only affect both JSON and CSV).

it seems I can't reproduce the issue you met in the JIRA. Have you tried the problematic codes in the master branch?

That would not ran in Spark 2.0 but we made a change.

So the codes below

val path = "/tmp/timestamps"
val textDf = Seq(
  "time",
  "2015-07-20T15:09:23.736-0500",
  "2015-07-20T15:10:51.687-0500",
  "2015-11-21T23:15:01.499-0600").toDF()
textDf.coalesce(1).write.text(path)
val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(path)
df.show()
df.printSchema()

seems working fine

+--------------------+
|                time|
+--------------------+
|2015-07-20 13:09:...|
|2015-07-20 13:10:...|
|2015-11-21 21:15:...|
+--------------------+

root
 |-- time: timestamp (nullable = true)

@HyukjinKwon
Copy link
Member

cc @srowen who was in the JIRA too.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Sep 19, 2016

FYI, we backported this change to branch 2.0 too. So this will be fixed from 2.0.1 #14799.

@nbeyer
Copy link
Author

nbeyer commented Sep 19, 2016

@HyukjinKwon I'll have to try out the 2.0.1/master changes. Is this DateTimeUtils method no longer used anywhere then? The other scenario where I ran into trouble was using 'cast' on Column class.

If the changes you mention fix all places where StringType containing ISO8601 variants can be massaged to TimestampType, then that may take care of everything. However, if there is still use of this DateTimeUtils, then I'd like to push forward with this PR as a suggestion.

@HyukjinKwon
Copy link
Member

@nbeyer I manually checked the method call before and it seems that is currently only called when reading/inferring CSV/JSON (but I'd like you confirm this to double check in case).

@nbeyer
Copy link
Author

nbeyer commented Sep 20, 2016

@HyukjinKwon Based on my further reading of the code, I'd like to suggest that add a deprecation to the stringToTime method and then update the stringToTimestamp method, specifically here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L356 to handle the "no colon" case. It is the stringToTimestamp method that is used by the 'cast'.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Sep 20, 2016

@nbeyer Thanks for your investigation. I think that sounds reasonable though I think it might be arguable because adding more cases virtually means more time and computation to parse/infer schema (specifically in case of TimestampType in CSV for inference) and there is an option to specify the dateformat; however, it'd make sense that users don't really want to specify any option when they want to read time data. I'd follow committer's lead.

(Personally, I tend to agree with the same rule with casting operation which looks consistent.)

Anyway, this might be not related with this JIRA anymore. How about creating another one describing current state and suggestion maybe?

@HyukjinKwon
Copy link
Member

I mean the problem in the JIRA is not reproduced in the master branch and therefore I believe we need another JIRA to describe the support for other time formats as the same one as casting operation as you suggested.

@nbeyer
Copy link
Author

nbeyer commented Sep 21, 2016

@HyukjinKwon Yeah, I agree, this PR doesn't really apply anymore. I'll work up a different JIRA and PR.

@nbeyer nbeyer closed this Sep 21, 2016
@HyukjinKwon
Copy link
Member

@nbeyer Thanks for bearing with me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants