Context
- Spark-3.4 PR: apache/spark@a930445502
- Others:
  - RAPIDS audit: [FEA] [SPARK-35780][SQL] Support DATE/TIMESTAMP literals across the full range #3406
  - Spark PR-32959, which introduced the bug.
  - [BUG] Support years with up to 7 digits when casting from String to Date in Spark 3.2 #3382
  - [BUG] ParseDateTime should not support special dates with Spark 3.2 #3383
What changes were proposed in the Spark pull request?
- This PR addresses the correctness issue by introducing a new configuration option,
enableDateTimeParsingFallback, which allows enabling or disabling the backward-compatible parsing.
- By default, Spark falls back to the backward-compatible behavior only if the parser policy is legacy
and no custom pattern was set.
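A minimal sketch of how that default is triggered from the user side, assuming a local SparkSession and an illustrative input file dates.csv:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Legacy parser policy and no custom dateFormat: per the PR description,
// the backward-compatible fallback applies by default in this case.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
val df = spark.read
  .option("header", "true")
  .schema("name STRING, mydate DATE")
  .csv("dates.csv") // no dateFormat set, so the fallback parser may be used
```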
Why are the changes needed in Spark?
This PR fixes a correctness issue when reading a CSV or a JSON file with dates in the "yyyyMMdd" format:

```
name,mydate
1,2020011
2,20201203
```

or

```
{"date": "2020011"}
{"date": "20201203"}
```
The invalid date is parsed because of the much more lenient parsing in DateTimeUtils.stringToDate;
the method treats 2020011 as a full year:
```
+----+--------------+
|name|mydate        |
+----+--------------+
|1   |+2020011-01-01|
|2   |2020-12-03    |
+----+--------------+
```
A similar result would be observed with JSON.
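A short repro sketch of the lenient behavior, reusing the spark session from the sketch above; the inline dataset stands in for the CSV file:

```scala
import spark.implicits._

// Inline stand-in for the CSV file shown above
val csvLines = Seq("name,mydate", "1,2020011", "2,20201203").toDS()

val df = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyyMMdd")
  .schema("name STRING, mydate DATE")
  .csv(csvLines)

df.show(false)
// Before this PR (or with the fallback enabled), row 1 shows +2020011-01-01
// because DateTimeUtils.stringToDate accepts "2020011" as a bare year.
```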
Does this PR introduce any user-facing change?
- A new configuration option enableDateTimeParsingFallback has been added to control whether or not
the code falls back to the backward-compatible behavior of parsing dates and timestamps in
CSV and JSON data sources.
- If the config is enabled and the date cannot be parsed, we will fall back to DateTimeUtils.stringToDate.
- If the config is enabled and the timestamp cannot be parsed, DateTimeUtils.stringToTimestamp will be used.
- Otherwise, depending on the parser policy and a custom pattern, the value will be parsed as null.
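For illustration, a sketch of opting out of the fallback on a single read; the option name comes from the PR, while the file path is illustrative:

```scala
// Disable the backward-compatible fallback for this read; the malformed
// date "2020011" should then come back as null instead of +2020011-01-01.
val strict = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyyMMdd")
  .option("enableDateTimeParsingFallback", "false")
  .schema("name STRING, mydate DATE")
  .csv("dates.csv")
```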
Why might it affect RAPIDS?
- The Spark change alters the policy for how parsing is done; this needs to propagate to the plugin.
- This is a new flag added to the CSV options.
- We need to handle the new flag to know which parsing method is expected; see the sketch after this list.
- A few bugs were previously addressed when the behavior change was introduced in Spark 3.2. However, Spark now adds a config to restore the old behavior, so we probably need to revisit those fixes.
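A hypothetical sketch of the decision the plugin would have to replicate, distilled from the PR description; the function name and the option plumbing are illustrative, not the plugin's actual API:

```scala
// Hypothetical helper mirroring the default described in the PR: an explicit
// option wins; otherwise fall back only under the legacy parser policy when
// no custom pattern is set.
def dateTimeFallbackEnabled(
    options: Map[String, String],
    legacyTimeParserPolicy: Boolean): Boolean = {
  options.get("enableDateTimeParsingFallback") match {
    case Some(v) => v.toBoolean
    case None =>
      legacyTimeParserPolicy &&
        !options.contains("dateFormat") && !options.contains("timestampFormat")
  }
}
```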
Impact on Testing?
Yes.
- Cover the new behavior in unit tests; a sketch follows this list.
- Modify the integration tests that use timestampFormat and dateFormat.
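A minimal test sketch of the matrix worth covering, assuming a spark session and an illustrative dates.csv containing the sample rows above:

```scala
// Exercise both values of the new option against the malformed date row.
Seq(true, false).foreach { fallback =>
  val df = spark.read
    .option("header", "true")
    .option("dateFormat", "yyyyMMdd")
    .option("enableDateTimeParsingFallback", fallback.toString)
    .schema("name STRING, mydate DATE")
    .csv("dates.csv")
  val isNull = df.filter("name = '1'").select("mydate").head.isNullAt(0)
  // The malformed date should parse to a value only when the fallback is on
  assert(isNull == !fallback)
}
```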
Requires Doc update?
No.