Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-40819][SQL][3.3] Timestamp nanos behaviour regression #39904

Closed
wants to merge 2 commits into from

Conversation

awdavidson
Copy link
Contributor

As per @HyukjinKwon request on #38312 to backport fix into 3.3

What changes were proposed in this pull request?

Handle TimeUnit.NANOS for parquet Timestamps addressing a regression in behaviour since 3.2

Why are the changes needed?

Since version 3.2 reading parquet files that contain attributes with type TIMESTAMP(NANOS,true) is not possible as ParquetSchemaConverter returns

Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))

https://issues.apache.org/jira/browse/SPARK-34661 introduced a change matching on the LogicalTypeAnnotation which only covers Timestamp cases for TimeUnit.MILLIS and TimeUnit.MICROS meaning TimeUnit.NANOS would return illegalType()

Prior to 3.2 the matching used the originalType which for TIMESTAMP(NANOS,true) return null and therefore resulted to a LongType, the change proposed is too consider TimeUnit.NANOS and return LongType making behaviour the same as before.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit test covering this scenario.
Internally deployed to read parquet files that contain TIMESTAMP(NANOS,true)

@github-actions github-actions bot added the SQL label Feb 6, 2023
@HyukjinKwon HyukjinKwon changed the title [SPARK-40819][SQL] Backport Timestamp nanos regression to 3.3 [SPARK-40819][SQL][3.3] Backport Timestamp nanos regression to 3.3 Feb 6, 2023
@awdavidson awdavidson marked this pull request as ready for review February 6, 2023 20:14
@HyukjinKwon
Copy link
Member

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CIs).

cc @viirya since he is the release manager of Apache Spark 3.3.2.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-40819][SQL][3.3] Backport Timestamp nanos regression to 3.3 [SPARK-40819][SQL][3.3] Timestamp nanos behaviour regression Feb 7, 2023
Comment on lines +3462 to +3467
val LEGACY_PARQUET_NANOS_AS_LONG = buildConf("spark.sql.legacy.parquet.nanosAsLong")
.internal()
.doc("When true, the Parquet's nanos precision timestamps are converted to SQL long values.")
.version("3.2.3")
.booleanConf
.createWithDefault(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this is worth documenting it in migration guide?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look at adding to the guide

@HyukjinKwon
Copy link
Member

Merged to branch-3.3.

HyukjinKwon pushed a commit that referenced this pull request Feb 8, 2023
As per HyukjinKwon request on #38312 to backport fix into 3.3
### What changes were proposed in this pull request?

Handle `TimeUnit.NANOS` for parquet `Timestamps` addressing a regression in behaviour since 3.2

### Why are the changes needed?

Since version 3.2 reading parquet files that contain attributes with type `TIMESTAMP(NANOS,true)` is not possible as ParquetSchemaConverter returns
```
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
```
https://issues.apache.org/jira/browse/SPARK-34661 introduced a change matching on the `LogicalTypeAnnotation` which only covers Timestamp cases for `TimeUnit.MILLIS` and `TimeUnit.MICROS` meaning `TimeUnit.NANOS` would return `illegalType()`

Prior to 3.2 the matching used the `originalType` which for `TIMESTAMP(NANOS,true)` return `null` and therefore resulted to a `LongType`, the change proposed is too consider `TimeUnit.NANOS` and return `LongType` making behaviour the same as before.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit test covering this scenario.
Internally deployed to read parquet files that contain `TIMESTAMP(NANOS,true)`

Closes #39904 from awdavidson/ts-nanos-fix-3.3.

Lead-authored-by: alfreddavidson <alfie.davidson9@gmail.com>
Co-authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon closed this Feb 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants