
Conversation

@awdavidson
Contributor

What changes were proposed in this pull request?

Handle TimeUnit.NANOS for parquet Timestamps addressing a regression in behaviour since 3.2

Why are the changes needed?

Since version 3.2, reading parquet files that contain attributes with type TIMESTAMP(NANOS,true) is not possible, as ParquetSchemaConverter returns

Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))

https://issues.apache.org/jira/browse/SPARK-34661 introduced a change matching on the LogicalTypeAnnotation, which only covers Timestamp cases for TimeUnit.MILLIS and TimeUnit.MICROS, meaning TimeUnit.NANOS falls through to illegalType().

Prior to 3.2 the matching used the originalType, which for TIMESTAMP(NANOS,true) returns null and therefore resulted in a LongType. The proposed change is to consider TimeUnit.NANOS and return LongType, making the behaviour the same as before.
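For illustration, a minimal Scala sketch of the shape of the fix in ParquetSchemaConverter (simplified: the merged change also gates the fallback behind a config flag added later in review, the TimestampNTZ handling is elided, and the method name here is invented):

```scala
import org.apache.parquet.schema.LogicalTypeAnnotation.{TimeUnit, TimestampLogicalTypeAnnotation}
import org.apache.spark.sql.types.{DataType, LongType, TimestampType}

// Sketch: map an INT64 timestamp annotation to a Spark type the way the fix does.
def int64TimestampToSparkType(timestamp: TimestampLogicalTypeAnnotation): DataType =
  timestamp.getUnit match {
    case TimeUnit.MILLIS | TimeUnit.MICROS =>
      TimestampType // TimestampNTZType when not adjusted to UTC, elided here
    // SPARK-40819: nanos cannot be represented by TimestampType without losing
    // precision, so fall back to LongType (raw nanos), as Spark did before 3.2.
    case TimeUnit.NANOS => LongType
  }
```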

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit test covering this scenario.
Internally deployed to read parquet files that contain TIMESTAMP(NANOS,true)

@github-actions github-actions bot added the SQL label Oct 19, 2022
} else {
TimestampNTZType
}
case timestamp: TimestampLogicalTypeAnnotation if timestamp.getUnit == TimeUnit.NANOS =>
Contributor

Ideally, this case would be merged with the case above, but that would require TimestampType and TimestampNTZType to support nanos, which is a bigger change.

This case deserves a comment that nanos are not supported as TimestampType but as LongType, without any timezone awareness.

Supporting nanos as TimestampType in the future then looks like a breaking change (Spark 4.x?). Or another TimestampNSType, like TimestampNTZType, could be introduced.

Contributor Author

Comment has been added.

I agree, full support is a breaking change. My aim for now is to address the regression rather than introduce new functionality/support, as that would require further validation and testing of other components (e.g. time-based functions). However, it is definitely something that should be considered in future development.

Member

Hmm, how is this sufficient? We should also handle the long (nanos) accordingly in the Parquet reader, is that right? E.g. ParquetVectorUpdaterFactory.

Contributor Author

@sunchao so for a type such as

message root {
  required int64 _1 (TIMESTAMP(NANOS,true));
}

the descriptor.getPrimitiveType().getPrimitiveTypeName() is INT64; as the sparkType is LongType, it returns new LongUpdater();

https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L105

Therefore no change is required from this perspective, as it continues to be handled in the same way as prior to #31776.
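As a rough Scala paraphrase of that dispatch (the real code is Java in ParquetVectorUpdaterFactory; this predicate is illustrative only):

```scala
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.spark.sql.types.{DataType, LongType}

// Once the schema converter maps TIMESTAMP(NANOS,true) to LongType, the reader
// sees (INT64, LongType), picks the plain long updater, and copies the raw
// 64-bit values through unchanged.
def usesPlainLongUpdater(physical: PrimitiveTypeName, sparkType: DataType): Boolean =
  physical == PrimitiveTypeName.INT64 && sparkType == LongType
```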

Member

I see. It's a bit weird though that we read nanos as long, but not timestamp. I'm not sure whether this should be considered a regression, or whether the previous behavior before 3.2 was merely unintended.

Contributor Author

I agree it's weird, and if it is unintended behaviour, from what I know it has been unintended since Spark 2.3, so quite a long time (I have not checked versions before 2.3). However, I will let you take the call on whether it should be considered a regression.

I assume the change to support nanos will be a breaking change; I am completely for supporting nanos, but it would be nice if the original behaviour still existed until that support is developed and released.

Contributor

Yes, the earlier behaviour was useful and should be restored until nanos are properly supported as a typed nano timestamp, unless a workaround can be found that restores the earlier behaviour.

Does providing a schema (spark.read.schema(...)) with a long type override the parquet timestamp type from the parquet file? Would that be a workaround?
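For concreteness, the workaround being asked about would look roughly like this (a sketch: the path is illustrative, `spark` is an existing SparkSession, and the column name follows the example message above):

```scala
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Force the TIMESTAMP(NANOS,true) column to be read as a plain long by
// supplying the schema explicitly, bypassing schema inference.
val nanosSchema = StructType(Seq(StructField("_1", LongType, nullable = false)))
val df = spark.read.schema(nanosSchema).parquet("/data/events.parquet")
```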

Contributor Author

@EnricoMi so it is possible to use spark.read.schema(..) as a workaround; however, you end up losing functionality like mergeSchema, which automatically handles schema evolution, and you potentially need to know the entire schema up front if all columns are required. Also, for other consumers/users, especially in the exploratory analysis space, it requires a better understanding of the underlying data structure before they can use it, and this gets more difficult when the file is extremely wide.

I can imagine developers creating other ways to avoid the nuisance, which seems a bit crazy considering the functionality already exists.

Contributor

So this is not a valid workaround. @sunchao how do you feel about restoring the earlier, useful behaviour?

Member

Makes sense. I'm OK with it.

@EnricoMi
Contributor

@sunchao @HyukjinKwon @LuciferYang this is a regression introduced by #31776 since 3.2.0

@awdavidson awdavidson changed the title [SPARK-40819][SQL] address timestamp nanos behaviour regression [SPARK-40819][SQL] Timestamp nanos behaviour regression Oct 20, 2022
@awdavidson awdavidson closed this Oct 20, 2022
@awdavidson awdavidson reopened this Oct 20, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@sunchao sunchao left a comment
Member

+1. cc @cloud-fan @sadikovi too.

TimestampNTZType
}
// SPARK-40819: NANOS are not supported as a Timestamp, convert to LongType without
// timezone awareness to address behaviour regression introduced by SPARK-34661
Contributor

Can we do a truncation and still read it as timestamp type?

Contributor Author

I do not think this is a good idea, as the precision will be lost, which is extremely important for high-frequency time series.

I haven't verified, but end users/developers would still be able to .cast(Timestamp), which I believe would truncate the timestamp; allowing developers to make that decision makes more sense than forcing the loss of precision.
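For illustration, one way a consumer could derive a truncated timestamp themselves from the raw nanos (a hedged sketch: a direct CAST of a long to timestamp treats the value as seconds, so the nanos are rescaled to microseconds first; `df` and column `_1` follow the examples above, and the `timestamp_micros` SQL function is assumed available as in Spark 3.1+):

```scala
import org.apache.spark.sql.functions.expr

// Keep the raw nanos column for full precision; derive a microsecond-truncated
// timestamp only where one is actually needed.
val withTs = df.withColumn(
  "event_ts",
  expr("timestamp_micros(_1 div 1000)") // integer division drops sub-microsecond digits
)
```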

@EnricoMi EnricoMi Oct 24, 2022
Contributor

Yes, the whole purpose of this exercise is to get access to the nano precision.

@cloud-fan
Contributor

cloud-fan commented Oct 24, 2022

Shall we make it a general strategy? If we don't recognize the logical parquet type, just ignore it and read it as the underlying physical type. cc @sadikovi

One problem I can see is bad backward compatibility. Once we support a new logical type in the future, we must make a breaking change as the reader will return a different data type.

Personally, I think requiring user-specified schema for unsupported logical types makes sense. It's error-prone if we allow schema merging during schema inference for unsupported logical types. For example, it doesn't make sense to merge TIMESTAMP(NANOS) and INT32 to INT64.

@EnricoMi
Contributor

One problem I can see is bad backward compatibility.

Meaning once we make Spark 3.x return longs here, supporting nano Timestamps must be moved to Spark 4.x?

But with a configuration flag like spark.sql.parquet.timestampNTZ.enabled, the user can control the break and Spark 3.x is free to support this in the future.

}

class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
testSchemaInference[Tuple1[Long]](
Contributor

Could this case pass before Spark 3.2? In my impression Parquet 1.10.1, used by Spark 3.1, does not support the nanos type, does it?

Contributor Author

Contributor

Then is this really a regression?

@awdavidson awdavidson Oct 25, 2022
Contributor Author

So I've been looking further into it; it's because the message is different between 1.10.1 and 1.12.3, meaning the test would need to be different.

In 1.10.1 the message is

message schema {
  required int64 attribute;
}

whereas in 1.12.3 the message is the same as in the unit test:

message schema {
  required int64 attribute (TIMESTAMP(NANOS,true));
}

So in Spark 3.1.0 you end up hitting this block, which returns a LongType: https://github.com/apache/spark/blob/branch-3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L146

whereas since 3.2 you hit https://github.com/apache/spark/blob/branch-3.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L174 because a case for TimeUnit.NANOS is not covered.
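A sketch of the pre-3.2 shape being described (simplified, method name invented; TIMESTAMP(NANOS,true) has no OriginalType equivalent, so the converter sees null):

```scala
import org.apache.parquet.schema.OriginalType
import org.apache.spark.sql.types.{DataType, LongType, TimestampType}

// Pre-3.2: the INT64 branch matched on originalType; a null annotation fell
// through to LongType, which is exactly where nanos timestamps landed.
def int64ToSparkTypePre32(originalType: OriginalType): DataType = originalType match {
  case OriginalType.TIMESTAMP_MILLIS | OriginalType.TIMESTAMP_MICROS => TimestampType
  case null => LongType // TIMESTAMP(NANOS,true) lands here
  // (other INT64 annotations elided)
}
```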

Contributor

@cloud-fan moving Parquet from 1.10.1 to 1.12.3 introduced this regression: Spark 3.1 returned LongType, while Spark 3.2 fails on an illegal type.

@awdavidson
Contributor Author

@cloud-fan @LuciferYang any update/response regarding this?

@EnricoMi
Contributor

EnricoMi commented Nov 7, 2022

@cloud-fan where do we stand with this? Is this a regression? How do we proceed?

@attilapiros
Contributor

@awdavidson I would like to understand the use case a bit better. Was the parquet file written by an earlier Spark (version < 3.2), and does the error come when that parquet file is read back with a later Spark? If yes, this is clearly a regression. Still, in this case can you please show us how to reproduce it manually (a small example code for write/read)?

If it was written by another tool, can we get an example parquet file with sample data where the old version works and the new version fails?

@awdavidson
Contributor Author

awdavidson commented Nov 15, 2022

@awdavidson I would like to understand the use case a bit better. Was the parquet file written by an earlier Spark (version < 3.2), and does the error come when that parquet file is read back with a later Spark? If yes, this is clearly a regression. Still, in this case can you please show us how to reproduce it manually (a small example code for write/read)?

If it was written by another tool, can we get an example parquet file with sample data where the old version works and the new version fails?

@attilapiros so the parquet file is written by another process. Spark uses this data to run aggregations and analysis over different time horizons where the nanosecond precision is required. Currently, when using earlier Spark versions (< 3.2), the TIMESTAMP(NANOS, true) in the parquet schema is automatically converted to a LongType; however, since the move from parquet 1.10.1 to 1.12.3 and the changes to ParquetSchemaConverter, an illegalType() is thrown. As soon as I have access this evening I will provide an example parquet file.

Whilst I understand timestamps with nanosecond precision are not fully supported, this change in behaviour will prevent users from migrating to the latest Spark version.

Update: an example parquet file can be found here: https://github.com/awdavidson/file-upload/blob/main/users.parquet
I am also going to look at adding additional tests with this file for any future changes

Update: Additional test case has been added using the example parquet file

}

test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
val testDataPath = getClass.getResource("/test-data/timestamp-nanos.parquet")
Contributor

Can we reuse

/**
 * Returns full path to the given file in the resource folder
 */
protected def testFile(fileName: String): String = {
  Thread.currentThread().getContextClassLoader.getResource(fileName).toString
}

Contributor Author

👍 updated

val data = spark.read.parquet(testDataPath.toString).select("birthday")

assert(data.schema.fields.head.dataType == LongType)
assert(data.take(1).head.getAs[Long](0) == 1668537129000000000L)
Contributor

Shall we sort the read dataframe to guarantee we compare the first row (unless all rows have the same timestamp)?

Contributor Author

All rows have the same timestamp

Contributor

Can we have true nanosecond values, to be sure they are not zeroed by the loader but loaded as in the file, e.g. 1665532812012345678L?
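Concretely, the strengthened assertion would then pin a value with non-zero sub-microsecond digits (a sketch using the reviewer's example literal):

```scala
// A loader that zeroed or truncated nanos would now fail this check.
assert(data.take(1).head.getAs[Long](0) == 1665532812012345678L)
```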

}
}

test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
Contributor

nit: other tests in this file follow the naming convention "SPARK-XXXXX: ..."

Suggested change
- test("SPARK-40819 - ability to read parquet file with TIMESTAMP(NANOS, true)") {
+ test("SPARK-40819: ability to read parquet file with TIMESTAMP(NANOS, true)") {

@EnricoMi EnricoMi left a comment
Contributor

LGTM!

@EnricoMi EnricoMi left a comment
Contributor

Resolving conflicts isn't easy, but code should at least compile locally ;-)

Comment on lines 58 to 59
inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get) {
nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {
Contributor

Suggested change
- inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get) {
- nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {
+ inferTimestampNTZ: Boolean = SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.defaultValue.get,
+ nanosAsLong: Boolean = SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.defaultValue.get) {

Comment on lines 469 to 470
inferTimestampNTZ = inferTimestampNTZ)
nanosAsLong = nanosAsLong)
Contributor

Suggested change
- inferTimestampNTZ = inferTimestampNTZ)
- nanosAsLong = nanosAsLong)
+ inferTimestampNTZ = inferTimestampNTZ,
+ nanosAsLong = nanosAsLong)

Comment on lines 77 to 78
inferTimestampNTZ = inferTimestampNTZ)
nanosAsLong = nanosAsLong)
Contributor

Suggested change
- inferTimestampNTZ = inferTimestampNTZ)
- nanosAsLong = nanosAsLong)
+ inferTimestampNTZ = inferTimestampNTZ,
+ nanosAsLong = nanosAsLong)

HyukjinKwon pushed a commit that referenced this pull request Feb 6, 2023
Closes #38312 from awdavidson/ts-nanos-fix.

Lead-authored-by: alfreddavidson <alfie.davidson9@gmail.com>
Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Co-authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ceccda0)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon
Member

Merged to master and branch-3.4.

It has a conflict with branch-3.3 and branch-3.2. Would you mind creating backport PRs?

@awdavidson
Contributor Author

Merged to master and branch-3.4.

It has a conflict with branch-3.3 and branch-3.2. Would you mind creating backport PRs?

Sure, I’ll do this asap

HyukjinKwon pushed a commit that referenced this pull request Feb 8, 2023
As per HyukjinKwon's request on #38312 to backport the fix into 3.3

Closes #39904 from awdavidson/ts-nanos-fix-3.3.

Lead-authored-by: alfreddavidson <alfie.davidson9@gmail.com>
Co-authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Feb 8, 2023
As per HyukjinKwon's request on #38312 to backport the fix into 3.2

Closes #39905 from awdavidson/ts-nanos-fix-3.2.

Authored-by: alfreddavidson <alfie.davidson9@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
val LEGACY_PARQUET_NANOS_AS_LONG = buildConf("spark.sql.legacy.parquet.nanosAsLong")
  .internal()
  .doc("When true, the Parquet's nanos precision timestamps are converted to SQL long values.")
  .version("3.2.3")
Member

@awdavidson I realised that we already released Spark 3.2.3.

Can you make a PR to fix this to 3.2.4?

Contributor Author

Sure

HyukjinKwon pushed a commit that referenced this pull request Feb 9, 2023
…onfiguration

As requested by HyukjinKwon in #38312. NB: This change needs to be backported

### What changes were proposed in this pull request?

Update the version set for the "spark.sql.legacy.parquet.nanosAsLong" configuration in SQLConf. This update is required because the previous PR set the version to `3.2.3`, which has already been released. Updating to version `3.2.4` will correctly reflect when this configuration element was added.

### Why are the changes needed?

Correctness and to complete SPARK-40819

### Does this PR introduce _any_ user-facing change?

No, this is merely so this configuration element has the correct version

### How was this patch tested?

N/A

Closes #39943 from awdavidson/SPARK-40819_sql-conf.

Authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Feb 9, 2023
…onfiguration
(cherry picked from commit 409c661)
HyukjinKwon pushed a commit that referenced this pull request Feb 9, 2023
…onfiguration
(cherry picked from commit 409c661)
HyukjinKwon pushed a commit that referenced this pull request Feb 9, 2023
…onfiguration
(cherry picked from commit 409c661)
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
As per HyukjinKwon's request on apache#38312 to backport the fix into 3.2

Closes apache#39905 from awdavidson/ts-nanos-fix-3.2.

Authored-by: alfreddavidson <alfie.davidson9@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…onfiguration

Closes apache#39943 from awdavidson/SPARK-40819_sql-conf.

Authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 409c661)
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023

Closes apache#38312 from awdavidson/ts-nanos-fix.

Lead-authored-by: alfreddavidson <alfie.davidson9@gmail.com>
Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Co-authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ceccda0)
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…onfiguration

Closes apache#39943 from awdavidson/SPARK-40819_sql-conf.

Authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 409c661)
@wenijinew

Can I use the option spark.sql.legacy.parquet.nanosAsLong in any version newer than 3.2.4, or only in version 3.2.4?
I am using 3.3.1, and I set the option as below. It doesn't work.

SparkSession.builder.appName("app").config("spark.sql.legacy.parquet.nanosAsLong", "true").getOrCreate()

@EnricoMi
Contributor

This should be available in 3.2.4, 3.3.2, 3.4.0 and above.
