[SPARK-16216][SQL] CSV data source supports custom date format for writing and timezone option for both reading and writing #13912
Conversation
case dt: DataType =>
  (row: InternalRow, ordinal: Int) =>
    row.get(ordinal, dt).toString
The schema is being validated here, so it does not have to handle the exception cases here again.
This has a big problem though -- the times are now wrong, because there is no timezone.

Thank you for your comment @srowen! Another reason is about

Hm, that's a bug, isn't it? Any time is ambiguous without a timezone. These things can make sense for local display but not for output of data.

I understand the problem without a timezone but.. hm.. to be honest, I have seen a lot of cases of reading and writing times in CSV files without a timezone (e.g. by Microsoft Excel).

Anyway, do you think I should close this and deal with the JSON one? I can't decide.

I suppose that if it's configurable, and not the default, then you're letting people do what they want at least, including specifying a format with a timezone. I think it's virtually always asking for trouble to not store a timezone, but hey. If Spark anywhere forces you to write data as a human-readable format without one, that seems like a problem.

Test build #61258 has finished for PR 13912 at commit
private def rowToString(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
Please use values.indices and then map or foreach to make it more functional (and hopefully readable). With the change, you'll link values to indices (not relying on row.numFields being used twice).
BTW, do we need values to be initialized first before adding elements? I'd vote for foldLeft as a better alternative, e.g.

def appendRowValue(arr: Array[String], i: Int) = {
  // do the if here
  arr
}
(0 until row.numFields).foldLeft(Array[String]()) { case (arr, i) => appendRowValue(arr, i) }
Using functional transformation is generally slower due to virtual function calls. This part is executed a lot and such overhead will become really heavy. See https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex
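For what it's worth, a minimal sketch of the imperative style the guide recommends on hot paths; `valueToString` here is a hypothetical stand-in for the per-type conversion this PR builds:

```scala
// Sketch only: indexed while-loop traversal, avoiding per-element closures
// on a hot path. `valueToString` is a placeholder for the real conversion.
private def rowToString(row: InternalRow): Seq[String] = {
  val values = new Array[String](row.numFields)
  var i = 0
  while (i < row.numFields) {
    values(i) = if (row.isNullAt(i)) "" else valueToString(row, i)
    i += 1
  }
  values
}
```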
@srowen Hm.. I just wonder if I should correct the JSON data source to be consistent with this, meaning the below. Currently, the JSON data source writes both as:

// TimestampType
1970-01-01 11:46:40.0

// DateType
1970-01-01
@rxin, Could you please take a look?

Test build #61280 has finished for PR 13912 at commit

Test build #61444 has finished for PR 13912 at commit

Test build #61443 has finished for PR 13912 at commit

I will update this if it is decided to be worth adding.
@srowen @rxin Would love to see this get merged as this has been a pain point for us. Not a fan of timezoneless dates as an engineer, but the need to pass through or write timezoneless dates to CSVs has been a necessary task for a variety of reasons in the past. We use Spark to parse large amounts of daily financial asset data with pre-agreed-upon date conventions. Most of the dates we deal with don't belong to a timezone and are treated more like indices with a maximum granularity of a day. The CSV reports we produce for our customers never contain a timezone as a result, and almost all dates are passed through from the original timezoneless dates. This is not just a common pattern with financial data, but also common practice in large companies I've worked at in the past. A large tech company I've worked at in the past used the convention of MST date and datetime for all internal systems (databases included).
which means trying to parse times and date by
``java.sql.Timestamp.valueOf()`` and ``java.sql.Date.valueOf()``.
:param timezone: defines the timezone to be used for both date type and timestamp type.
    If a timezone is specified in the data, this will load them after
I don't get what this is saying -- if an input string specifies a time zone then that is the time zone to use to parse it. That's it right?
Yes, it will use the timezone specified in the input. Since Date and Timestamp do not keep timezone information, it calculates the difference between the timezone specified in the input and the one in the option, but the standard is the one specified in the option.
I meant to say, for example, if timezone is set to GMT, all the Date and Timestamp values read are in the GMT timezone after calculating the difference. So..
If the CSV data is as below:
27/08/2015 00:00 PDT
27/08/2015 01:00 PDT
27/08/2015 02:00 PDT
If this is read as below:
val df = spark.read
  .format("csv")
  .option("timezone", "GMT")
  .option("dateFormat", "dd/MM/yyyy HH:mm z")
  .load("path")
df.show()

it will become as below in the dataframe (the difference between GMT and PDT is 7 hours); these below will be timestamp objects but I just represented them as strings.
+-----------------------+
| _1|
+-----------------------+
|2015-08-27 07:00:00.000|
|2015-08-27 08:00:00.000|
|2015-08-27 09:00:00.000|
+-----------------------+
EDITED: I added an example for writing as well just to be clear.
Then, if you write this as below:
df.write
  .format("csv")
  .option("header", "true")
  .option("timezone", "GMT")
  .option("dateFormat", "dd/MM/yyyy HH:mm z")
  .save("path")

the results will be
27/08/2015 07:00 GMT
27/08/2015 08:00 GMT
27/08/2015 09:00 GMT
If this behaviour looks fine, I will just change the documentation to be more clear..
In the example above, the input specifies a timezone and that must be used to interpret it. You say "it becomes in the dataframe" but what it becomes is a timestamp, which is unambiguous and has no timezone. Timezone matters when converting back to a string for display, but, your example only shows the parameter used on reading, and shows no timezone in the output. I am not sure that this is the intended behavior?
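To illustrate that point with plain JDK classes (not Spark code): the Timestamp itself is one unambiguous instant, and a timezone only comes into play when rendering it back to a string.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone

// The Timestamp stores only an instant (epoch millis); no timezone is kept.
val ts = new Timestamp(1440658800000L)  // 2015-08-27 00:00 PDT == 07:00 GMT

val fmt = new SimpleDateFormat("dd/MM/yyyy HH:mm z")
fmt.setTimeZone(TimeZone.getTimeZone("GMT"))
println(fmt.format(ts))  // 27/08/2015 07:00 GMT

fmt.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"))
println(fmt.format(ts))  // 27/08/2015 00:00 PDT
```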
I thought it loses the timezone information after being loaded into Spark. I mean, Timestamp and Date instances don't have timezone information in them. The timezone specified in the input is being used in the example...
I am sorry, I think I didn't understand clearly. Do you mind if I ask what you expect before being read, after being read (in the dataframe), and after being written, please?
(I just added an example.)
I see. Thank you for the detailed explanation. I should correct comments and the behaviour.
It seems it is the default behaviour of SimpleDateFormat. I will look into this deeper and will fix it or leave some more comments tomorrow. Thanks again!
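If it helps, here is what I believe is the SimpleDateFormat default behaviour in question: when the pattern contains `z`, the zone written in the input string is honoured while parsing, but formatting uses whatever zone the formatter itself is configured with.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

val fmt = new SimpleDateFormat("dd/MM/yyyy HH:mm z")
// Parsing honours the zone written in the input string.
val parsed = fmt.parse("27/08/2015 00:00 PDT")

// Formatting uses the formatter's own zone, set explicitly here.
fmt.setTimeZone(TimeZone.getTimeZone("GMT"))
println(fmt.format(parsed))  // 27/08/2015 07:00 GMT
```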
Why not just have timezone as part of the dateFormat, so users can specify it?
Ah, I see - you want to control the zone this gets converted to.
I will clean up the PR description and all those soon with a better proposal. Thanks!
This needs a rebase.
case TimestampType if params.dateFormat != null =>
  (row: InternalRow, ordinal: Int) =>
    params.dateFormat.format(DateTimeUtils.toJavaTimestamp(row.getLong(ordinal)))
@srowen Maybe you are looking for this case. I had to do it like this to avoid per-record type dispatch.
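In other words, something like the sketch below (simplified; `params` stands for the CSV options object used in the diff above): the match on the data type happens once per column and yields a closure, so each record only invokes the pre-built function.

```scala
// Sketch of the one-dispatch-per-column pattern, not the exact PR code.
private def makeValueWriter(dataType: DataType): (InternalRow, Int) => String =
  dataType match {
    case TimestampType if params.dateFormat != null =>
      (row: InternalRow, ordinal: Int) =>
        params.dateFormat.format(DateTimeUtils.toJavaTimestamp(row.getLong(ordinal)))
    case dt: DataType =>
      (row: InternalRow, ordinal: Int) => row.get(ordinal, dt).toString
  }

// Built once from the schema, then reused for every row:
// val valueWriters = schema.fields.map(f => makeValueWriter(f.dataType))
```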
This is formatting to a string, rather than parsing from a string right?
Oh, sorry. This is not part of this PR. This is:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala, lines 280 to 291 in e1dc853:

case _: TimestampType if options.dateFormat != null =>
  // This one will lose microseconds parts.
  // See https://issues.apache.org/jira/browse/SPARK-10681.
  options.dateFormat.parse(datum).getTime * 1000L
case _: TimestampType =>
  // This one will lose microseconds parts.
  // See https://issues.apache.org/jira/browse/SPARK-10681.
  DateTimeUtils.stringToTime(datum).getTime * 1000L
case _: DateType if options.dateFormat != null =>
  DateTimeUtils.millisToDays(options.dateFormat.parse(datum).getTime)
case _: DateType =>
  DateTimeUtils.millisToDays(DateTimeUtils.stringToTime(datum).getTime)
This was merged before, here, but timezone was not considered.
So, this PR adds the support for:
- `timezone` for both reading and writing
- `dateFormat` for writing
I am closing this first. I will open a new one only for the default format later today. I will cc you all as soon as I open it. Thanks!
What changes were proposed in this pull request?
This PR corrects CSV data sources in order to write `DateType` and `TimestampType` correctly, just like the JSON data source. Currently this writes the data as below:

This PR adds the support to write dates and timestamps as a formatted string as below:

- With the `dateFormat` option (e.g. `dd/MM/yyyy HH:mm`)
- With `dateFormat` (e.g. `dd/MM/yyyy HH:mm z`) and the `timezone` option (e.g. `GMT`)
- With the `dateFormat` option (e.g. `yyyy/MM/dd`)
- With `dateFormat` (e.g. `yyyy/MM/dd z`) and the `timezone` option (e.g. `GMT`)

This PR also adds the support for parsing the timezone appropriately when reading, so a round trip of reading and writing timestamps and dates with the options above works.
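As a rough usage sketch of what this description proposes (the `timezone` option name follows this PR's proposal, so it may differ from what eventually lands; paths and formats are illustrative):

```scala
// Write timestamps/dates with a custom format and an explicit timezone.
df.write
  .format("csv")
  .option("dateFormat", "dd/MM/yyyy HH:mm z")
  .option("timezone", "GMT")
  .save("out/csv")

// Reading back with the same options should round-trip the values.
val readBack = spark.read
  .format("csv")
  .option("dateFormat", "dd/MM/yyyy HH:mm z")
  .option("timezone", "GMT")
  .load("out/csv")
```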
How was this patch tested?
Unit tests in `CsvSuite.scala`.