[SPARK-39731][SQL] Fix issue in CSV data sources when parsing dates in "yyyyMMdd" format with CORRECTED time parser policy #6190

@amahussein

Context

What changes were proposed in the Spark pull request?

  • This PR attempts to address a correctness issue by introducing a new configuration option,
    enableDateTimeParsingFallback, which enables or disables the backward-compatible parsing.
  • By default, Spark falls back to the backward-compatible behavior only if the parser policy is
    legacy or no custom pattern was set.

Why are the changes needed in Spark?

This PR fixes a correctness issue when reading a CSV or a JSON file with dates in "yyyyMMdd" format:

name,mydate
1,2020011
2,20201203

or

{"date": "2020011"}
{"date": "20201203"}

The invalid date is parsed because of the much more lenient parsing in DateTimeUtils.stringToDate;
the method treats 2020011 as a full year:

+----+--------------+
|name|mydate        |
+----+--------------+
|1   |+2020011-01-01|
|2   |2020-12-03    |
+----+--------------+

A similar result would be observed with JSON.
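
To make the failure mode concrete, here is a minimal Python sketch of the two parsing styles. It is not Spark's code: parse_strict_yyyymmdd and parse_lenient are hypothetical stand-ins, and the lenient grammar (a 4 to 7 digit year with optional -m[m] and -d[d] parts) is an assumption about how a stringToDate-style parser behaves. The point is only that a 7-digit string fails the fixed-width pattern but is accepted as a bare year by the lenient parser.

```python
import calendar
import re
from typing import Optional, Tuple

YMD = Tuple[int, int, int]
_DAYS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def _valid(year: int, month: int, day: int) -> bool:
    """Check month and day ranges, including leap-year February."""
    if not 1 <= month <= 12:
        return False
    limit = 29 if month == 2 and calendar.isleap(year) else _DAYS[month - 1]
    return 1 <= day <= limit


def parse_strict_yyyymmdd(text: str) -> Optional[YMD]:
    """Fixed-width parse: exactly 4 year digits, 2 month digits, 2 day digits."""
    m = re.fullmatch(r"([0-9]{4})([0-9]{2})([0-9]{2})", text)
    if m is None:
        return None
    y, mo, d = (int(g) for g in m.groups())
    return (y, mo, d) if _valid(y, mo, d) else None


def parse_lenient(text: str) -> Optional[YMD]:
    """Assumed lenient grammar: yyyy[yyy][-m[m][-d[d]]]; missing parts default to 1."""
    m = re.fullmatch(r"([0-9]{4,7})(?:-([0-9]{1,2})(?:-([0-9]{1,2}))?)?", text.strip())
    if m is None:
        return None
    y = int(m.group(1))
    mo = int(m.group(2) or 1)
    d = int(m.group(3) or 1)
    return (y, mo, d) if _valid(y, mo, d) else None


def render(ymd: YMD) -> str:
    """Render as yyyy-MM-dd, prefixing '+' when the year has more than 4 digits."""
    y, mo, d = ymd
    sign = "+" if y > 9999 else ""
    return f"{sign}{y:04d}-{mo:02d}-{d:02d}"


# Try the pattern first, then the lenient parser, as in the buggy behavior.
for raw in ("2020011", "20201203"):
    parsed = parse_strict_yyyymmdd(raw) or parse_lenient(raw)
    print(raw, "->", render(parsed) if parsed else None)
# 2020011 -> +2020011-01-01
# 20201203 -> 2020-12-03
```

Note that Python's own datetime.strptime is deliberately not used for the strict path here, since it accepts one-digit month and day fields and would hide the problem.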

Does this PR introduce any user-facing change?

  • A new configuration option, enableDateTimeParsingFallback, has been added to control whether
    the code falls back to the backward-compatible behavior of parsing dates and timestamps in
    CSV and JSON data sources.
  • If the config is enabled and the date cannot be parsed, we will fall back to DateTimeUtils.stringToDate.
  • If the config is enabled and the timestamp cannot be parsed, DateTimeUtils.stringToTimestamp will be used.
  • Otherwise, depending on the parser policy and a custom pattern, the value will be parsed as null.
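
The control flow described in the bullets above can be sketched as follows. This is a simplified illustration, not Spark's implementation: parse_with_fallback, pattern_parser, and lenient_parser are hypothetical names, the two parsers are toy stand-ins, and the sketch returns None whenever the fallback is disabled, ignoring the parser-policy and custom-pattern nuances mentioned in the last bullet.

```python
import re
from typing import Callable, Optional

Parser = Callable[[str], Optional[str]]


def parse_with_fallback(
    value: str, primary: Parser, fallback: Parser, enable_fallback: bool
) -> Optional[str]:
    """Use the pattern-based parser; consult the fallback only when enabled."""
    parsed = primary(value)
    if parsed is not None:
        return parsed
    if enable_fallback:
        return fallback(value)
    return None  # simplified: the value is read as null


def pattern_parser(value: str) -> Optional[str]:
    """Toy stand-in for a "yyyyMMdd" pattern: exactly eight digits."""
    m = re.fullmatch(r"([0-9]{4})([0-9]{2})([0-9]{2})", value)
    return "-".join(m.groups()) if m else None


def lenient_parser(value: str) -> Optional[str]:
    """Toy stand-in for the legacy parser: treats 4 to 7 digits as a bare year."""
    if re.fullmatch(r"[0-9]{4,7}", value) is None:
        return None
    sign = "+" if len(value) > 4 else ""
    return f"{sign}{value}-01-01"


print(parse_with_fallback("2020011", pattern_parser, lenient_parser, True))
print(parse_with_fallback("2020011", pattern_parser, lenient_parser, False))
```

With the fallback enabled, the malformed value goes through the lenient parser and comes back as a far-future year; with it disabled, the same value comes back as None.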

Why might it affect RAPIDS?

Impact on Testing?

Yes.

  • Cover the new behavior in unit tests (UTs).
  • Integration tests that use timestampFormat and dateFormat need modifications.

Requires Doc update?

No.

Labels

  • P0: Must have for release
  • audit_3.4.0: Audit related tasks for 3.4.0
  • bug: Something isn't working
