Conversation

@MaxGekk (Member) commented Nov 18, 2019

What changes were proposed in this pull request?

In the PR, I propose to use the format() method of FastDateFormat, which accepts an instance of the Calendar type. This allows adjusting the MILLISECOND field of the calendar directly before formatting. I added a new format() method to DateTimeUtils.TimestampParser. This method splits the input timestamp into a part truncated to seconds and the seconds fraction. The calendar is initialized from the first part in the usual way, and the fraction is converted to a form that FastDateFormat can format correctly, up to microsecond precision.

I refactored MicrosCalendar by passing the number of digits in the fraction pattern as a parameter to the default constructor, because it is used by both the existing getMicros() and the new setMicros(). setMicros() sets the seconds fraction in the calendar's MILLISECOND field directly before formatting.

This PR supports seconds-fraction patterns from S up to SSSSSS. If the pattern has more than 6 S letters, the first 6 digits reflect the milliseconds and microseconds of the input timestamp, and the remaining digits are set to 0.
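
To make the mechanism concrete, here is a condensed sketch of the approach (class and method names are simplified from the actual patch; this is not the exact Spark code). The buggy path left only the millisecond part (e.g. 123) in the calendar, which FastDateFormat then zero-padded to 000123; the fix scales the full microsecond fraction to the number of S letters in the pattern before handing the calendar to FastDateFormat:
```scala
import java.util.{Calendar, GregorianCalendar, Locale, TimeZone}
import org.apache.commons.lang3.time.FastDateFormat

// Pattern-aware calendar: the protected `fields` array inherited from
// java.util.Calendar is written directly, so that FastDateFormat prints the
// MILLISECOND field zero-padded to the number of `S` letters in the pattern.
class MicrosCalendar(tz: TimeZone, digitsInFraction: Int)
    extends GregorianCalendar(tz, Locale.US) {
  // Scales the seconds fraction (in microseconds) to `digitsInFraction` digits.
  def setMicros(micros: Long): Unit = {
    val d = micros * math.pow(10, digitsInFraction).toLong
    fields(Calendar.MILLISECOND) = (d / 1000000L).toInt
  }
}

def formatTimestamp(micros: Long, pattern: String, tz: TimeZone): String = {
  val fastFormat = FastDateFormat.getInstance(pattern, tz, Locale.US)
  val cal = new MicrosCalendar(tz, pattern.count(_ == 'S'))
  // Initialize the calendar from the part truncated to seconds...
  val seconds = Math.floorDiv(micros, 1000000L)
  cal.setTimeInMillis(Math.multiplyExact(seconds, 1000L))
  // ...then inject the scaled seconds fraction directly before formatting.
  cal.setMicros(Math.floorMod(micros, 1000000L))
  fastFormat.format(cal)
}
```
With this sketch, a fraction of 123456 microseconds formatted with yyyy-MM-dd HH:mm:ss.SSSSSS comes out as 123456 instead of 000123, because the MILLISECOND field already carries the scaled value.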

Why are the changes needed?

This fixes a bug where timestamps were incorrectly formatted at microsecond precision. For example:

```scala
Seq(java.sql.Timestamp.valueOf("2019-11-18 11:56:00.123456")).toDF("t")
  .select(to_json(struct($"t"), Map("timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSSSS")).as("json"))
  .show(false)
+----------------------------------+
|json                              |
+----------------------------------+
|{"t":"2019-11-18 11:56:00.000123"}|
+----------------------------------+
```

Does this PR introduce any user-facing change?

Yes. The example above outputs:

```scala
+----------------------------------+
|json                              |
+----------------------------------+
|{"t":"2019-11-18 11:56:00.123456"}|
+----------------------------------+
```

How was this patch tested?

  • New tests in DateTimeUtilsSuite for formatting with patterns from S to SSSSSS
  • A test for to_json() in JsonFunctionsSuite
  • A roundtrip test that writes a timestamp to a CSV file and reads it back (sketched below)
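
A minimal sketch of the roundtrip idea (the output path, pattern, and active SparkSession named `spark` are illustrative assumptions, not the patch's actual test code):
```scala
import java.sql.Timestamp
import spark.implicits._

val fmt = "yyyy-MM-dd HH:mm:ss.SSSSSS"
val in = Seq(Timestamp.valueOf("2019-11-18 11:56:00.123456")).toDF("t")
in.write.option("timestampFormat", fmt).csv("/tmp/ts_roundtrip")

val out = spark.read
  .schema("t TIMESTAMP")               // DDL-style schema string
  .option("timestampFormat", fmt)
  .csv("/tmp/ts_roundtrip")

// The microsecond fraction must survive the write + read cycle.
assert(out.collect().toSeq == in.collect().toSeq)
```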

@MaxGekk (Member, Author) commented Nov 18, 2019

@MaxGekk MaxGekk changed the title [WIP][SPARK-29949][SQL][2.4] Fix formatting of timestamps by JSON/CSV datasources [SPARK-29949][SQL][2.4] Fix formatting of timestamps by JSON/CSV datasources Nov 18, 2019
@SparkQA commented Nov 18, 2019

Test build #114029 has finished for PR 26582 at commit 5aa576a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 18, 2019

Test build #114030 has finished for PR 26582 at commit d71a2f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

thanks, merging to 2.4!

cloud-fan pushed a commit that referenced this pull request Nov 19, 2019
[SPARK-29949][SQL][2.4] Fix formatting of timestamps by JSON/CSV datasources

Closes #26582 from MaxGekk/micros-format-2.4.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan closed this Nov 19, 2019
cloud-fan pushed a commit that referenced this pull request Feb 12, 2020
… legacy date/timestamp formatters

### What changes were proposed in this pull request?
In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`:
- `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting with microsecond precision. The code was borrowed from Spark 2.4, see #26507 & #26582.
- `LegacySimpleTimestampFormatter` - uses `SimpleDateFormat` and supports the `lenient` mode. When the `lenient` parameter is set to `false`, the parser becomes much stricter in checking its input, as the sketch below illustrates.
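
As an illustration of the `lenient` flag (plain `java.text` API, not the new formatter classes themselves):
```scala
import java.text.SimpleDateFormat
import java.util.Locale

val df = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
df.setLenient(false)

df.parse("2020-02-12")    // parses fine
// With lenient = false, out-of-range fields are rejected instead of being
// silently rolled over, so this would throw java.text.ParseException:
// df.parse("2020-02-30")
```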

### Why are the changes needed?
Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:
- `DateTimeFormat` in the CSV/JSON datasources.
- `SimpleDateFormat` in the JDBC datasource and in partition value parsing.
- `SimpleDateFormat` in strict mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`.

### Does this PR introduce any user-facing change?
This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe the behavior of Spark 2.4.
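
A quick usage sketch (assuming an active `SparkSession` named `spark`; the flag name is taken from the description above):
```scala
// Opt back into the Spark 2.4 date/timestamp parsing and formatting behavior.
spark.conf.set("spark.sql.legacy.timeParser.enabled", "true")
```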

### How was this patch tested?
- Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`.
- Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`.

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Feb 12, 2020
… legacy date/timestamp formatters

(cherry picked from commit c198620)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
… legacy date/timestamp formatters
@MaxGekk deleted the micros-format-2.4 branch June 5, 2020 19:41