-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-40474][SQL] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps #37933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
0394030
f4fadf7
813ac74
5c2dde8
0d2be1d
df56946
4bc480d
f6ed29f
93b6422
6942f2b
b4a6f1d
1502618
4767ae7
1f57098
e9150ec
a07e432
255aea3
533c487
be4c86f
c7225b1
9e87d6e
af66b83
812fa65
5288eb0
a2f0b80
00a8661
16e187c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -59,6 +59,13 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { | |
| ExprUtils.getDecimalParser(options.locale) | ||
| } | ||
|
|
||
| // Date formats that could be parsed in DefaultTimestampFormatter | ||
| // Reference: DateTimeUtils.parseTimestampString | ||
| // Used to determine inferring a column with mixture of dates and timestamps as TimestampType or | ||
| // StringType when no timestamp format is specified (the lenient timestamp formatter will be used) | ||
| private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set( | ||
| "yyyy-MM-dd", "yyyy-M-d", "yyyy-M-dd", "yyyy-MM-d", "yyyy-MM", "yyyy-M", "yyyy") | ||
HyukjinKwon marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| /** | ||
| * Similar to the JSON schema inference | ||
| * 1. Infer type of each row | ||
|
|
@@ -123,10 +130,8 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { | |
| case LongType => tryParseLong(field) | ||
| case _: DecimalType => tryParseDecimal(field) | ||
| case DoubleType => tryParseDouble(field) | ||
| case DateType => tryParseDateTime(field) | ||
| case TimestampNTZType if options.prefersDate => tryParseDateTime(field) | ||
| case DateType => tryParseDate(field) | ||
| case TimestampNTZType => tryParseTimestampNTZ(field) | ||
| case TimestampType if options.prefersDate => tryParseDateTime(field) | ||
| case TimestampType => tryParseTimestamp(field) | ||
| case BooleanType => tryParseBoolean(field) | ||
| case StringType => StringType | ||
|
|
@@ -179,13 +184,13 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { | |
| if ((allCatch opt field.toDouble).isDefined || isInfOrNan(field)) { | ||
| DoubleType | ||
| } else if (options.prefersDate) { | ||
| tryParseDateTime(field) | ||
| tryParseDate(field) | ||
| } else { | ||
| tryParseTimestampNTZ(field) | ||
| } | ||
| } | ||
|
|
||
| private def tryParseDateTime(field: String): DataType = { | ||
| private def tryParseDate(field: String): DataType = { | ||
| if ((allCatch opt dateFormatter.parse(field)).isDefined) { | ||
| DateType | ||
| } else { | ||
|
|
@@ -233,7 +238,40 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { | |
| * is compatible with both input data types. | ||
| */ | ||
| private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { | ||
| TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) | ||
| (t1, t2) match { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this match be in
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What result does
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Never mind, I checked the code, the resulting type will be TimestampType or TimestampNTZ. |
||
| case (DateType, TimestampType) | (DateType, TimestampNTZType) | | ||
| (TimestampNTZType, DateType) | (TimestampType, DateType) => | ||
| // For a column containing a mixture of dates and timestamps, infer it as timestamp type | ||
| // if its dates can be inferred as timestamp type, otherwise infer it as StringType. | ||
| // This only happens when the timestamp pattern is not specified, as the default timestamp | ||
| // parser is very lenient and can parse date string as well. | ||
| val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern) | ||
| t1 match { | ||
| case DateType if canParseDateAsTimestamp(dateFormat, t2) => | ||
| Some(t2) | ||
| case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) => | ||
| Some(t1) | ||
| case _ => Some(StringType) | ||
| } | ||
| case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Return true if strings of given date format can be parsed as timestamps | ||
| * 1. If user provides timestamp format, we will parse strings as timestamps using | ||
| * Iso8601TimestampFormatter (with strict timestamp parsing). Any date string can not be parsed | ||
| * as timestamp type in this case | ||
| * 2. Otherwise, we will use DefaultTimestampFormatter to parse strings as timestamps, which | ||
| * is more lenient and can parse strings of some date formats as timestamps. | ||
| */ | ||
| private def canParseDateAsTimestamp(dateFormat: String, tsType: DataType): Boolean = { | ||
| if ((tsType.isInstanceOf[TimestampType] && options.timestampFormatInRead.isEmpty) || | ||
| (tsType.isInstanceOf[TimestampNTZType] && options.timestampNTZFormatInRead.isEmpty)) { | ||
| LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS.contains(dateFormat) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we really need to cover these corner cases? We can just say that we can only parse date as timestamp if neither timestamp pattern nor date pattern is specified.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a behavior change in terms of Spark 3.3 branch, where a column with mixed dates and timestamps could be inferred as timestamp type if possible when no timestamp pattern specified. |
||
| } else { | ||
| false | ||
| } | ||
| } | ||
|
|
||
| /** | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't quite get this part, would you be able to elaborate on why we need to keep a set of such formats?
More of a thought experiment, I don't think this actually happens in practice: cast allows years longer than 4 digits, is it something that needs to be supported here? For more context, https://issues.apache.org/jira/browse/SPARK-39731.
Also, will this work? My understanding is that we will not, only yyyy-MM-dd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this set to determine inferring a column with mixture of dates and timestamps as
TimestampTypeorStringTypewhen no timestamp format is specified (the lenient timestamp formatter will be used)Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand your question on this case.
But speaking in the context of this PR, because
timestampFormatis specified, a column with a mix of dates and timestamps will be inferred asStringType.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's some interesting formats, I am not sure if we need to take care of them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on why we need LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS? I understand how it is used. Also, I am not a supporter of hardcoding date/time formats here.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When timestamp format is not specified, the desired behavior is that a column with mix of dates and timestamps could be inferred as timestamp type if the lenient timestamp formatter can parse the date strings under the column as well.
To achieve that without bringing other performance concern, we want to simply check if the date format could be supported by the lenient timestamp formatter. Does that make sense?