Conversation

@MaxGekk (Member) commented Nov 18, 2019

What changes were proposed in this pull request?

In the PR, I propose to use the format() method of FastDateFormat, which accepts an instance of the Calendar type. This allows adjusting the MILLISECOND field of the calendar directly before formatting. I added a new format() method to DateTimeUtils.TimestampParser. This method splits the input timestamp into a part truncated to seconds and the seconds fraction. The calendar is initialized from the first part in the usual way, and the fraction is converted to a form that FastDateFormat can format correctly, up to microsecond precision.

I refactored MicrosCalendar by passing the number of digits in the fraction pattern as a parameter to the default constructor, because it is used by both the existing getMicros() and the new setMicros(). setMicros() sets the seconds fraction in the calendar's MILLISECOND field directly before formatting.

This PR supports seconds-fraction patterns from S up to SSSSSS. If the pattern has more than 6 S letters, the first 6 digits reflect the milliseconds and microseconds of the input timestamp, and the remaining digits are set to 0.
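
To make the mechanism concrete, here is a condensed sketch of the approach (class and method names are simplified from the actual patch; this is not the exact Spark code). The buggy path left only the millisecond part (e.g. 123) in the calendar, which FastDateFormat then zero-padded to 000123; the fix scales the full microsecond fraction to the number of S letters in the pattern before handing the calendar to FastDateFormat:
```scala
import java.util.{Calendar, GregorianCalendar, Locale, TimeZone}
import org.apache.commons.lang3.time.FastDateFormat

// Pattern-aware calendar: the protected `fields` array inherited from
// java.util.Calendar is written directly, so that FastDateFormat prints the
// MILLISECOND field zero-padded to the number of `S` letters in the pattern.
class MicrosCalendar(tz: TimeZone, digitsInFraction: Int)
    extends GregorianCalendar(tz, Locale.US) {
  // Scales the seconds fraction (in microseconds) to `digitsInFraction` digits.
  def setMicros(micros: Long): Unit = {
    val d = micros * math.pow(10, digitsInFraction).toLong
    fields(Calendar.MILLISECOND) = (d / 1000000L).toInt
  }
}

def formatTimestamp(micros: Long, pattern: String, tz: TimeZone): String = {
  val fastFormat = FastDateFormat.getInstance(pattern, tz, Locale.US)
  val cal = new MicrosCalendar(tz, pattern.count(_ == 'S'))
  // Initialize the calendar from the part truncated to seconds...
  val seconds = Math.floorDiv(micros, 1000000L)
  cal.setTimeInMillis(Math.multiplyExact(seconds, 1000L))
  // ...then inject the scaled seconds fraction directly before formatting.
  cal.setMicros(Math.floorMod(micros, 1000000L))
  fastFormat.format(cal)
}
```
With this sketch, a fraction of 123456 microseconds formatted with yyyy-MM-dd HH:mm:ss.SSSSSS comes out as 123456 instead of 000123, because the MILLISECOND field already carries the scaled value.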

Why are the changes needed?

This fixes a bug where timestamps were incorrectly formatted at microsecond precision. For example:

```scala
Seq(java.sql.Timestamp.valueOf("2019-11-18 11:56:00.123456")).toDF("t")
  .select(to_json(struct($"t"), Map("timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSSSS")).as("json"))
  .show(false)
+----------------------------------+
|json                              |
+----------------------------------+
|{"t":"2019-11-18 11:56:00.000123"}|
+----------------------------------+
```

Does this PR introduce any user-facing change?

Yes. The example above outputs:

```scala
+----------------------------------+
|json                              |
+----------------------------------+
|{"t":"2019-11-18 11:56:00.123456"}|
+----------------------------------+
```

How was this patch tested?

  • New tests in DateTimeUtilsSuite for formatting with patterns from S to SSSSSS
  • A test for to_json() in JsonFunctionsSuite
  • A roundtrip test that writes a timestamp to a CSV file and reads it back (sketched below)
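
A minimal sketch of the roundtrip idea (the output path, pattern, and active SparkSession named `spark` are illustrative assumptions, not the patch's actual test code):
```scala
import java.sql.Timestamp
import spark.implicits._

val fmt = "yyyy-MM-dd HH:mm:ss.SSSSSS"
val in = Seq(Timestamp.valueOf("2019-11-18 11:56:00.123456")).toDF("t")
in.write.option("timestampFormat", fmt).csv("/tmp/ts_roundtrip")

val out = spark.read
  .schema("t TIMESTAMP")               // DDL-style schema string
  .option("timestampFormat", fmt)
  .csv("/tmp/ts_roundtrip")

// The microsecond fraction must survive the write + read cycle.
assert(out.collect().toSeq == in.collect().toSeq)
```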

@MaxGekk (Member, Author) commented Nov 18, 2019

@MaxGekk MaxGekk changed the title [WIP][SPARK-29949][SQL][2.4] Fix formatting of timestamps by JSON/CSV datasources [SPARK-29949][SQL][2.4] Fix formatting of timestamps by JSON/CSV datasources Nov 18, 2019
@SparkQA commented Nov 18, 2019

Test build #114029 has finished for PR 26582 at commit 5aa576a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 18, 2019

Test build #114030 has finished for PR 26582 at commit d71a2f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

thanks, merging to 2.4!

cloud-fan pushed a commit that referenced this pull request Nov 19, 2019
[SPARK-29949][SQL][2.4] Fix formatting of timestamps by JSON/CSV datasources

Closes #26582 from MaxGekk/micros-format-2.4.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan closed this Nov 19, 2019
cloud-fan pushed a commit that referenced this pull request Feb 12, 2020
… legacy date/timestamp formatters

### What changes were proposed in this pull request?
In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`:
- `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting with microsecond precision. The code was borrowed from Spark 2.4, see #26507 & #26582.
- `LegacySimpleTimestampFormatter` - uses `SimpleDateFormat` and supports the `lenient` mode. When the `lenient` parameter is set to `false`, the parser becomes much stricter in checking its input, as the sketch below illustrates.
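
As an illustration of the `lenient` flag (plain `java.text` API, not the new formatter classes themselves):
```scala
import java.text.SimpleDateFormat
import java.util.Locale

val df = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
df.setLenient(false)

df.parse("2020-02-12")    // parses fine
// With lenient = false, out-of-range fields are rejected instead of being
// silently rolled over, so this would throw java.text.ParseException:
// df.parse("2020-02-30")
```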

### Why are the changes needed?
Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:
- `DateTimeFormat` in the CSV/JSON datasources.
- `SimpleDateFormat` in the JDBC datasource and in partition value parsing.
- `SimpleDateFormat` in strict mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`.

### Does this PR introduce any user-facing change?
This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe the behavior of Spark 2.4.
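
A quick usage sketch (assuming an active `SparkSession` named `spark`; the flag name is taken from the description above):
```scala
// Opt back into the Spark 2.4 date/timestamp parsing and formatting behavior.
spark.conf.set("spark.sql.legacy.timeParser.enabled", "true")
```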

### How was this patch tested?
- Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`.
- Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`.

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Feb 12, 2020
… legacy date/timestamp formatters

(cherry picked from commit c198620)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
… legacy date/timestamp formatters
@MaxGekk deleted the micros-format-2.4 branch June 5, 2020 19:41