[SPARK-40819][SQL] Timestamp nanos behaviour regression #38312
Conversation
    } else {
      TimestampNTZType
    }
    case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>
Ideally, this case would be merged with the case above, but that would require TimestampType and TimestampNTZType to support nanos, which is a bigger change.
This case deserves a comment that nanos are not supported as TimestampType but read as LongType, without any timezone awareness.
Supporting nanos as TimestampType in the future looks like a breaking change then (Spark 4.x?). Or another TimestampNSType, analogous to TimestampNTZType, could be introduced.
Comment has been added.
I agree, full support is a breaking change. My aim for now is to address the regression rather than introduce new functionality/support, as that would require further validation and testing of other components (e.g. time-based functions). However, it is definitely something that should be considered in future development.
Hmm, how is this sufficient? We should also handle the long (nanos) accordingly in the Parquet reader, is that right? E.g. ParquetVectorUpdaterFactory.
@sunchao so for a type such as

    message root {
      required int64 _1 (TIMESTAMP(NANOS,true));
    }

the descriptor.getPrimitiveType().getPrimitiveTypeName() is INT64; as the sparkType is LongType it returns new LongUpdater().
Therefore no change is required from this perspective, as it continues to be handled in the same way as prior to #31776.
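For illustration, a minimal hedged sketch of the dispatch being described (the helper name and return values are hypothetical; the real logic lives in ParquetVectorUpdaterFactory):

```scala
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.spark.sql.types.{DataType, LongType}

// Hypothetical helper mirroring the dispatch described above: the Parquet physical type
// plus the resolved Spark type drive the updater choice, so INT64 + LongType stays on the
// plain long-reading path regardless of the TIMESTAMP(NANOS, ...) annotation.
def updaterKindFor(physical: PrimitiveTypeName, sparkType: DataType): String =
  (physical, sparkType) match {
    case (PrimitiveTypeName.INT64, LongType) => "LongUpdater" // nanos-as-long lands here
    case (p, t) => s"some other updater for ($p, $t)"
  }
```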
I see. It's a bit weird, though, that we read nanos as long but not as timestamp. I'm not sure whether this should be considered a regression, or whether the previous behavior before 3.2 was merely unintended.
I agree it's weird, and if it is unintended behaviour, as far as I know it has been unintended since Spark 2.3, so quite a long time (I have not checked versions before 2.3). However, I'll let you take the call on whether it should be considered a regression.
I assume the change to support nanos will be a breaking change; I am completely for the support of nanos, but it would be nice if the original behaviour still existed until that support is developed and released.
Yes, the earlier existing behaviour was useful and should be restored until nanos are properly supported as a typed nano timestamp, unless a workaround can be found that restores the earlier behaviour.
Does providing a schema (spark.read.schema(...)) with long type override the parquet timestamp type from the parquet file? Would that be a workaround?
@EnricoMi it is possible to use spark.read.schema(..) as a workaround; however, you lose functionality like mergeSchema, which automatically handles schema evolution, and you potentially need to know the entire schema up front if all columns are required. For other consumers/users, especially in the exploratory analysis space, it also requires a better understanding of the underlying data structure before they can use it, and this gets more difficult when the file is extremely wide.
I can imagine developers building other ways around the nuisance, which seems a bit crazy considering the functionality already exists.
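For concreteness, a minimal sketch of that workaround as discussed above, assuming a hypothetical path and column name:

```scala
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Declare the TIMESTAMP(NANOS,true) column up front as a raw long.
val explicitSchema = StructType(Seq(
  StructField("event_time_nanos", LongType, nullable = true)))

// Supplying the schema skips Parquet schema inference (and with it the illegal-type error),
// at the cost of mergeSchema/schema evolution and of knowing the full schema in advance.
val df = spark.read.schema(explicitSchema).parquet("/data/events.parquet")
```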
So this is not a valid workaround. @sunchao what is your feeling about restoring the earlier useful behaviour?
Makes sense. I'm OK with it.
@sunchao @HyukjinKwon @LuciferYang this is a regression introduced by #31776 since 3.2.0
Can one of the admins verify this patch?
sunchao left a comment
+1. cc @cloud-fan @sadikovi too.
      TimestampNTZType
    }
    // SPARK-40819: NANOS are not supported as a Timestamp, convert to LongType without
    // timezone awareness to address behaviour regression introduced by SPARK-34661
Can we do a truncation and still read it as timestamp type?
I do not think this is a good idea, as the precision would be lost, and precision is extremely important for high-frequency time series.
I haven't verified, but end users/developers would still be able to .cast(Timestamp), which I believe would truncate the timestamp; allowing developers to make that decision makes more sense than forcing the loss of precision.
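A hedged sketch of that caller-side decision, assuming the nanos column was read back as LongType (the column names are hypothetical; note a plain long-to-timestamp cast treats the value as seconds, so the scaling has to be explicit):

```scala
import org.apache.spark.sql.functions.expr

// Scale epoch-nanos down to epoch-micros with integer division ("div" keeps the arithmetic
// in longs; a floating-point division would lose precision at epoch-nanosecond magnitudes),
// then build a timestamp. Dropping the sub-microsecond digits is a visible, deliberate step.
val withTs = df.withColumn("event_ts", expr("timestamp_micros(event_time_nanos div 1000)"))
```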
Yes, the mere purpose of this exercise is to get access to the nano precision.
Shall we make it a general strategy? If we don't recognize the logical parquet type, just ignore it and read it as the underlying physical type. cc @sadikovi One problem I can see is bad backward compatibility: once we support a new logical type in the future, we must make a breaking change, as the reader will return a different data type. Personally, I think requiring a user-specified schema for unsupported logical types makes sense. It's error-prone if we allow schema merging during schema inference for unsupported logical types. For example, it doesn't make sense to merge …
Meaning once we make Spark 3.x return longs here, supporting nano Timestamps must be moved to Spark 4.x? But with a configuration flag like …
    }

    class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
      testSchemaInference[Tuple1[Long]](
Could this case pass before Spark 3.2? My impression is that Parquet 1.10.1, used by Spark 3.1, does not support the nanos type, does it?
This particular case doesn't pass, and neither do similar tests for TIMESTAMP(MILLIS,true) etc. (https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala#L2234), due to `No enum constant org.apache.parquet.schema.OriginalType.TIMESTAMP` when trying to convert the message.
Then is this really a regression?
So I've been looking further into it: the message is different between 1.10.1 and 1.12.3, meaning the test would need to be different.

In 1.10.1 the message is

    message schema {
      required int64 attribute;
    }

whereas in 1.12.3 the message is the same as in the unit test:

    message schema {
      required int64 attribute (TIMESTAMP(NANOS,true));
    }

So in Spark 3.1.0 you end up hitting this block, which returns a LongType: https://github.com/apache/spark/blob/branch-3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L146
whereas since 3.2 you hit https://github.com/apache/spark/blob/branch-3.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L174 because a case for TimeUnit.NANOS is not covered.
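In other words, the shape of the restored behaviour looks roughly like the following — a paraphrased sketch of the converter's INT64 handling with this PR's extra case and the legacy flag added later in the thread, not the verbatim Spark source:

```scala
import org.apache.parquet.schema.LogicalTypeAnnotation
import org.apache.parquet.schema.LogicalTypeAnnotation.{TimeUnit, TimestampLogicalTypeAnnotation}
import org.apache.spark.sql.types.{DataType, LongType}

def convertInt64(annotation: LogicalTypeAnnotation, nanosAsLong: Boolean): DataType =
  annotation match {
    // SPARK-40819: NANOS are not supported as a Timestamp; fall back to LongType without
    // timezone awareness, restoring the pre-3.2 result.
    case ts: TimestampLogicalTypeAnnotation
        if ts.getUnit == TimeUnit.NANOS && nanosAsLong =>
      LongType
    case other =>
      throw new IllegalArgumentException(s"Illegal Parquet type: INT64 ($other)")
  }
```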
@cloud-fan moving Parquet from 1.10.1 to 1.12.3 introduced this regression where Spark 3.1 returned LongType and Spark 3.2 fails on illegal type.
@cloud-fan @LuciferYang any update/response regarding this?

@cloud-fan where do we stand with this? Is this a regression? How do we proceed?
@awdavidson I would like to understand the use case a bit better. Was the parquet file written by an earlier Spark (version < 3.2), and does the error come when that parquet file is read back with a later Spark? If yes, this is clearly a regression. Still, in this case can you please show us how we can reproduce it manually (a small example code for write/read)? If it was written by another tool, can we get an example parquet file with sample data where the old version works and the new version fails?
@attilapiros the parquet file is written by another process. Spark uses this data to run aggregations and analysis over different time horizons where the nanosecond precision is required. Currently, when using earlier Spark versions (< 3.2), the nanosecond timestamps are read back as longs. Whilst I understand timestamps with nanosecond precision are not fully supported, this change in behaviour will prevent users from migrating to the latest Spark version.

Update: an example parquet file can be found here: https://github.com/awdavidson/file-upload/blob/main/users.parquet

Update: an additional test case has been added using the example parquet file.
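For reference, a minimal read-side repro along these lines, using the sample file linked above and the birthday column from the test below:

```scala
// The sample file's schema contains: required int64 birthday (TIMESTAMP(NANOS,true))
val df = spark.read.parquet("users.parquet")

// Spark < 3.2: birthday is inferred as LongType and reads fine.
// Spark 3.2.0+ (before this fix): schema conversion fails with
//   org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
df.select("birthday").show(1)
```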
    }

    test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
      val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")
Can we reuse SQLTestUtils.testFile (spark/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala, lines 462 to 467 at f01a8db)?

    /**
     * Returns full path to the given file in the resource folder
     */
    protected def testFile(fileName: String): String = {
      Thread.currentThread().getContextClassLoader.getResource(fileName).toString
    }
👍 updated
    val data = spark.read.parquet(testDataPath.toString).select("birthday")

    assert(data.schema.fields.head.dataType == LongType)
    assert(data.take(1).head.getAs[Long](0) == 1668537129000000000L)
Shall we sort the read dataframe to guarantee we compare the first row (unless all rows have the same timestamp)?
All rows have the same timestamp
Can we have true nanosecond values, to be sure they are not zeroed by the loader but loaded as in the file, e.g. 1665532812012345678L?
      }
    }

    test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
nit: other tests in this file follow the naming convention "SPARK-XXXXX: ..."

    - test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
    + test("SPARK-40819: ability to read parquet file with TIMESTAMP(NANOS, true)") {
EnricoMi left a comment
LGTM!
EnricoMi left a comment
Resolving conflicts isn't easy, but code should at least compile locally ;-)
    inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get) {
    nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {
    - inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get) {
    - nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {
    + inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get,
    + nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {
    inferTimestampNTZ = inferTimestampNTZ)
    nanosAsLong = nanosAsLong)
    - inferTimestampNTZ = inferTimestampNTZ)
    - nanosAsLong = nanosAsLong)
    + inferTimestampNTZ = inferTimestampNTZ,
    + nanosAsLong = nanosAsLong)
    inferTimestampNTZ = inferTimestampNTZ)
    nanosAsLong = nanosAsLong)
    - inferTimestampNTZ = inferTimestampNTZ)
    - nanosAsLong = nanosAsLong)
    + inferTimestampNTZ = inferTimestampNTZ,
    + nanosAsLong = nanosAsLong)
### What changes were proposed in this pull request?
Handle `TimeUnit.NANOS` for parquet `Timestamps`, addressing a regression in behaviour since 3.2.

### Why are the changes needed?
Since version 3.2, reading parquet files that contain attributes with type `TIMESTAMP(NANOS,true)` is not possible, as ParquetSchemaConverter returns
```
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
```
https://issues.apache.org/jira/browse/SPARK-34661 introduced a change matching on the `LogicalTypeAnnotation`, which only covers Timestamp cases for `TimeUnit.MILLIS` and `TimeUnit.MICROS`, meaning `TimeUnit.NANOS` would return `illegalType()`.
Prior to 3.2 the matching used the `originalType`, which for `TIMESTAMP(NANOS,true)` returns `null` and therefore resulted in a `LongType`. The change proposed is to consider `TimeUnit.NANOS` and return `LongType`, making the behaviour the same as before.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit test covering this scenario. Internally deployed to read parquet files that contain `TIMESTAMP(NANOS,true)`.

Closes #38312 from awdavidson/ts-nanos-fix.

Lead-authored-by: alfreddavidson <alfie.davidson9@gmail.com>
Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Co-authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ceccda0)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Merged to master and branch-3.4. It has a conflict w/ branch-3.3 and branch-3.2. Would you mind creating backporting PRs?

Sure, I'll do this asap
As per HyukjinKwon's request on #38312, backport of the fix into 3.3 (same change and description as the merged commit above). Closes #39904 from awdavidson/ts-nanos-fix-3.3. Lead-authored-by: alfreddavidson <alfie.davidson9@gmail.com> Co-authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
As per HyukjinKwon's request on #38312, backport of the fix into 3.2 (same change and description as the merged commit above). Closes #39905 from awdavidson/ts-nanos-fix-3.2. Authored-by: alfreddavidson <alfie.davidson9@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
    val LEGACY_PARQUET_NANOS_AS_LONG = buildConf("spark.sql.legacy.parquet.nanosAsLong")
      .internal()
      .doc("When true, the Parquet's nanos precision timestamps are converted to SQL long values.")
      .version("3.2.3")
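A hedged usage sketch of the flag above, assuming it is set like any other SQL conf on the session (the path is hypothetical):

```scala
// Opt back into the pre-3.2 behaviour: TIMESTAMP(NANOS,*) columns are inferred as LongType
// instead of failing schema conversion.
spark.conf.set("spark.sql.legacy.parquet.nanosAsLong", "true")
val nanosDf = spark.read.parquet("/data/with-nanos.parquet")
```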
@awdavidson I realised that we already released Spark 3.2.3.
Can you make a PR to fix this to 3.2.4?
Sure
…onfiguration

As requested by HyukjinKwon in #38312. NB: this change needs to be backported.

### What changes were proposed in this pull request?
Update the version set for the "spark.sql.legacy.parquet.nanosAsLong" configuration in SQLConf. This update is required because the previous PR set the version to `3.2.3`, which has already been released. Updating to `3.2.4` will correctly reflect when this configuration element was added.

### Why are the changes needed?
Correctness, and to complete SPARK-40819.

### Does this PR introduce _any_ user-facing change?
No, this is merely so this configuration element has the correct version.

### How was this patch tested?
N/A

Closes #39943 from awdavidson/SPARK-40819_sql-conf.

Authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Can I use the option `spark.sql.legacy.parquet.nanosAsLong`?

This should be available in 3.2.4, 3.3.2, 3.4.0 and above.
What changes were proposed in this pull request?

Handle `TimeUnit.NANOS` for parquet `Timestamps`, addressing a regression in behaviour since 3.2.

Why are the changes needed?

Since version 3.2, reading parquet files that contain attributes with type `TIMESTAMP(NANOS,true)` is not possible, as ParquetSchemaConverter returns `Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))`.

https://issues.apache.org/jira/browse/SPARK-34661 introduced a change matching on the `LogicalTypeAnnotation`, which only covers Timestamp cases for `TimeUnit.MILLIS` and `TimeUnit.MICROS`, meaning `TimeUnit.NANOS` would return `illegalType()`.

Prior to 3.2 the matching used the `originalType`, which for `TIMESTAMP(NANOS,true)` returns `null` and therefore resulted in a `LongType`. The change proposed is to consider `TimeUnit.NANOS` and return `LongType`, making the behaviour the same as before.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit test covering this scenario.
Internally deployed to read parquet files that contain `TIMESTAMP(NANOS,true)`