-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17545] [SQL] Handle additional time offset formats of ISO 8601 #15147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
To continue the discussion of JIRA, I think the issue you faced is to read those in CSV, right? Whether it is intended or not in it seems I can't reproduce the issue you met in the JIRA. Have you tried the problematic codes in the master branch? That would not ran in Spark 2.0 but we made a change. So the codes below val path = "/tmp/timestamps"
val textDf = Seq(
"time",
"2015-07-20T15:09:23.736-0500",
"2015-07-20T15:10:51.687-0500",
"2015-11-21T23:15:01.499-0600").toDF()
textDf.coalesce(1).write.text(path)
val df = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load(path)
df.show()
df.printSchema()seems working fine |
|
cc @srowen who was in the JIRA too. |
|
FYI, we backported this change to branch 2.0 too. So this will be fixed from 2.0.1 #14799. |
|
@HyukjinKwon I'll have to try out the 2.0.1/master changes. Is this DateTimeUtils method no longer used anywhere then? The other scenario where I ran into trouble was using 'cast' on Column class. If the changes you mention fix all places where StringType containing ISO8601 variants can be massaged to TimestampType, then that may take care of everything. However, if there is still use of this DateTimeUtils, then I'd like to push forward with this PR as a suggestion. |
|
@nbeyer I manually checked the method call before and it seems that is currently only called when reading/inferring CSV/JSON (but I'd like you confirm this to double check in case). |
|
@HyukjinKwon Based on my further reading of the code, I'd like to suggest that add a deprecation to the stringToTime method and then update the stringToTimestamp method, specifically here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L356 to handle the "no colon" case. It is the stringToTimestamp method that is used by the 'cast'. |
|
@nbeyer Thanks for your investigation. I think that sounds reasonable though I think it might be arguable because adding more cases virtually means more time and computation to parse/infer schema (specifically in case of (Personally, I tend to agree with the same rule with casting operation which looks consistent.) Anyway, this might be not related with this JIRA anymore. How about creating another one describing current state and suggestion maybe? |
|
I mean the problem in the JIRA is not reproduced in the master branch and therefore I believe we need another JIRA to describe the support for other time formats as the same one as casting operation as you suggested. |
|
@HyukjinKwon Yeah, I agree, this PR doesn't really apply anymore. I'll work up a different JIRA and PR. |
|
@nbeyer Thanks for bearing with me. |
What changes were proposed in this pull request?
Allows flexibility in handling additional ISO 8601 time offset variants. The current parsing of ISO 8601 is exclusive to W3C's datetime note (and XML Schema datetime). This change will allow offset to be handled as "HH:MM", "HHMM" and "HH".
This is one suggested approach to handling these variants. The other suggestions are to switch back to SimpleDateFormat and utilize the 'X' pattern flag. Another suggestion is to wait for a future release of commons-lang. Either of these suggestions can be implemented later over top of this suggestion.
How was this patch tested?
The patch was tested by running all existing unit tests and by augmenting the existing tests with additional assertions.