[SPARK-16216][SQL] CSV data source supports custom date format for writing and timezone option for both reading and writing #13912
Conversation
case dt: DataType =>
  (row: InternalRow, ordinal: Int) =>
    row.get(ordinal, dt).toString
The schema is being validated here, so it does not have to handle the exception cases here again.
This has a big problem though -- the times are now wrong, because there is no timezone.

Thank you for your comment @srowen! Another reason is about

Hm, that's a bug, isn't it? Any time is ambiguous without a timezone. These things can make sense for local display but not for output of data.

I understand the problem without a timezone but.. hm.. to be honest, I have seen a lot of cases of reading and writing times in CSV files without a timezone (e.g. by Microsoft Excel).

Anyway, do you think I should close this and deal with the JSON one? I can't decide.

I suppose that if it's configurable, and not the default, then you're letting people do what they want at least, including specifying a format with a timezone. I think it's virtually always asking for trouble to not store a timezone, but hey. If Spark anywhere forces you to write data as a human-readable format without one, that seems like a problem.

Test build #61258 has finished for PR 13912 at commit
private def rowToString(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
Please use values.indices and then map or foreach to make it more functional (and hopefully readable). With the change, you'll link values to indices (not relying on row.numFields being used twice).
BTW, do we need values to be initialized first before adding elements? I'd vote for foldLeft as a better alternative, e.g.

def appendRowValue(arr: Array[String], i: Int) = {
  // do the if here
  arr
}
(0 until row.numFields).foldLeft(Array[String]()) { case (arr, i) => appendRowValue(arr, i) }
Using functional transformation is generally slower due to virtual function calls. This part is executed a lot and such overhead will become really heavy. See https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex
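For what it's worth, a minimal sketch of the imperative style the guide recommends on hot paths; `valueToString` here is a hypothetical stand-in for the per-type conversion this PR builds:

```scala
// Sketch only: indexed while-loop traversal, avoiding per-element closures
// on a hot path. `valueToString` is a placeholder for the real conversion.
private def rowToString(row: InternalRow): Seq[String] = {
  val values = new Array[String](row.numFields)
  var i = 0
  while (i < row.numFields) {
    values(i) = if (row.isNullAt(i)) "" else valueToString(row, i)
    i += 1
  }
  values
}
```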
@srowen Hm.. I just wonder if I should correct the JSON data source to be consistent with this, meaning the below. Currently, the JSON data source writes both as:

// TimestampType
1970-01-01 11:46:40.0

// DateType
1970-01-01
@rxin, Could you please take a look?

Test build #61280 has finished for PR 13912 at commit

Test build #61444 has finished for PR 13912 at commit

Test build #61443 has finished for PR 13912 at commit

I will update this if it is decided to be worth adding.
@srowen @rxin Would love to see this get merged as this has been a pain point for us. Not a fan of timezoneless dates as an engineer, but the need to pass through or write timezoneless dates to CSVs has been a necessary task for a variety of reasons in the past. We use Spark to parse large amounts of daily financial asset data with pre-agreed-upon date conventions. Most of the dates we deal with don't belong to a timezone and are treated more like indices with a maximum granularity of a day. The CSV reports we produce for our customers never contain a timezone as a result, and almost all dates are passed through from the original timezoneless dates. This is not just a common pattern with financial data, but also common practice in large companies I've worked at in the past. A large tech company I've worked at in the past used the convention of MST date and datetime for all internal systems (databases included).
which means trying to parse times and date by
``java.sql.Timestamp.valueOf()`` and ``java.sql.Date.valueOf()``.
:param timezone: defines the timezone to be used for both date type and timestamp type.
    If a timezone is specified in the data, this will load them after
I don't get what this is saying -- if an input string specifies a time zone then that is the time zone to use to parse it. That's it right?
Yes, it will use the timezone specified in the input. Since Date and Timestamp do not keep timezone information, it calculates the difference between the timezone specified in the input and the one in the option, but the standard is the one specified in the option.
I meant to say, for example, if timezone is set to GMT, all the Date and Timestamp values read are in the GMT timezone after calculating the difference. So..
If the CSV data is as below:
27/08/2015 00:00 PDT
27/08/2015 01:00 PDT
27/08/2015 02:00 PDT
If this is read as below:
val df = spark.read
  .format("csv")
  .option("timezone", "GMT")
  .option("dateFormat", "dd/MM/yyyy HH:mm z")
  .load("path")
df.show()

it will become as below in the dataframe (the difference between GMT and PDT is 7 hours); these below will be timestamp objects but I just represented them as strings.
+-----------------------+
| _1|
+-----------------------+
|2015-08-27 07:00:00.000|
|2015-08-27 08:00:00.000|
|2015-08-27 09:00:00.000|
+-----------------------+
EDITED: I added an example for writing as well just to be clear.
Then, if you write this as below:
df.write
  .format("csv")
  .option("header", "true")
  .option("timezone", "GMT")
  .option("dateFormat", "dd/MM/yyyy HH:mm z")
  .save("path")

the results will be
27/08/2015 07:00 GMT
27/08/2015 08:00 GMT
27/08/2015 09:00 GMT
If this behaviour looks fine, I will just change the documentation to be more clear..
In the example above, the input specifies a timezone and that must be used to interpret it. You say "it becomes in the dataframe" but what it becomes is a timestamp, which is unambiguous and has no timezone. Timezone matters when converting back to a string for display, but, your example only shows the parameter used on reading, and shows no timezone in the output. I am not sure that this is the intended behavior?
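To illustrate that point with plain JDK classes (not Spark code): the Timestamp itself is one unambiguous instant, and a timezone only comes into play when rendering it back to a string.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone

// The Timestamp stores only an instant (epoch millis); no timezone is kept.
val ts = new Timestamp(1440658800000L)  // 2015-08-27 00:00 PDT == 07:00 GMT

val fmt = new SimpleDateFormat("dd/MM/yyyy HH:mm z")
fmt.setTimeZone(TimeZone.getTimeZone("GMT"))
println(fmt.format(ts))  // 27/08/2015 07:00 GMT

fmt.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"))
println(fmt.format(ts))  // 27/08/2015 00:00 PDT
```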
I thought it loses the timezone information after being loaded into Spark. I mean, Timestamp and Date instances don't have timezone information in them. The timezone specified in the input is being used in the example...
I am sorry, I think I didn't understand clearly. Do you mind if I ask what you expect before being read, after being read (in the dataframe), and after being written, please?
(I just added an example.)
I see. Thank you for the detailed explanation. I should correct comments and the behaviour.
It seems it is the default behaviour of SimpleDateFormat. I will look into this deeper and will fix it or leave some more comments tomorrow. Thanks again!
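If it helps, here is what I believe is the SimpleDateFormat default behaviour in question: when the pattern contains `z`, the zone written in the input string is honoured while parsing, but formatting uses whatever zone the formatter itself is configured with.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

val fmt = new SimpleDateFormat("dd/MM/yyyy HH:mm z")
// Parsing honours the zone written in the input string.
val parsed = fmt.parse("27/08/2015 00:00 PDT")

// Formatting uses the formatter's own zone, set explicitly here.
fmt.setTimeZone(TimeZone.getTimeZone("GMT"))
println(fmt.format(parsed))  // 27/08/2015 07:00 GMT
```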
Why not just have timezone as part of the dateFormat, so users can specify it?
Ah, I see - you want to control the zone this gets converted to.
I will clean up the PR description and all those soon with a better proposal. Thanks!
This needs a rebase.
case TimestampType if params.dateFormat != null =>
  (row: InternalRow, ordinal: Int) =>
    params.dateFormat.format(DateTimeUtils.toJavaTimestamp(row.getLong(ordinal)))
@srowen Maybe you are looking for this case. I had to do it like this to avoid per-record type dispatch.
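In other words, something like the sketch below (simplified; `params` stands for the CSV options object used in the diff above): the match on the data type happens once per column and yields a closure, so each record only invokes the pre-built function.

```scala
// Sketch of the one-dispatch-per-column pattern, not the exact PR code.
private def makeValueWriter(dataType: DataType): (InternalRow, Int) => String =
  dataType match {
    case TimestampType if params.dateFormat != null =>
      (row: InternalRow, ordinal: Int) =>
        params.dateFormat.format(DateTimeUtils.toJavaTimestamp(row.getLong(ordinal)))
    case dt: DataType =>
      (row: InternalRow, ordinal: Int) => row.get(ordinal, dt).toString
  }

// Built once from the schema, then reused for every row:
// val valueWriters = schema.fields.map(f => makeValueWriter(f.dataType))
```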
This is formatting to a string, rather than parsing from a string right?
Oh, sorry. This is not part of this PR. This is:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala, lines 280 to 291 in e1dc853:

case _: TimestampType if options.dateFormat != null =>
  // This one will lose microseconds parts.
  // See https://issues.apache.org/jira/browse/SPARK-10681.
  options.dateFormat.parse(datum).getTime * 1000L
case _: TimestampType =>
  // This one will lose microseconds parts.
  // See https://issues.apache.org/jira/browse/SPARK-10681.
  DateTimeUtils.stringToTime(datum).getTime * 1000L
case _: DateType if options.dateFormat != null =>
  DateTimeUtils.millisToDays(options.dateFormat.parse(datum).getTime)
case _: DateType =>
  DateTimeUtils.millisToDays(DateTimeUtils.stringToTime(datum).getTime)
This was merged before, here, but timezone was not considered.
So, this PR adds the support for:
- `timezone` for both reading and writing
- `dateFormat` for writing
I am closing this first. I will open a new one only for the default format later today. I will cc you all as soon as I open it. Thanks!
What changes were proposed in this pull request?
This PR corrects CSV data sources in order to write `DateType` and `TimestampType` correctly, just like the JSON data source. Currently this writes the data as below:

This PR adds the support to write dates and timestamps as a formatted string as below:

- With the `dateFormat` option (e.g. `dd/MM/yyyy HH:mm`)
- With `dateFormat` (e.g. `dd/MM/yyyy HH:mm z`) and the `timezone` option (e.g. `GMT`)
- With the `dateFormat` option (e.g. `yyyy/MM/dd`)
- With `dateFormat` (e.g. `yyyy/MM/dd z`) and the `timezone` option (e.g. `GMT`)

This PR also adds the support for parsing the timezone appropriately when reading, so a round trip of reading and writing timestamps and dates with the options above works.
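As a rough usage sketch of what this description proposes (the `timezone` option name follows this PR's proposal, so it may differ from what eventually lands; paths and formats are illustrative):

```scala
// Write timestamps/dates with a custom format and an explicit timezone.
df.write
  .format("csv")
  .option("dateFormat", "dd/MM/yyyy HH:mm z")
  .option("timezone", "GMT")
  .save("out/csv")

// Reading back with the same options should round-trip the values.
val readBack = spark.read
  .format("csv")
  .option("dateFormat", "dd/MM/yyyy HH:mm z")
  .option("timezone", "GMT")
  .load("out/csv")
```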
How was this patch tested?
Unit tests in `CsvSuite.scala`.