Bug fixes and improved support for Parquet TIMESTAMP #4801
Conversation
The Python changes LGTM.
```java
        return LogicalTypeAnnotation.timeType(true, LogicalTypeAnnotation.TimeUnit.MICROS);
    case TIMESTAMP_MILLIS:
        // TODO(deephaven-core#976) Assuming that time is adjusted to UTC
        return LogicalTypeAnnotation.timestampType(true, LogicalTypeAnnotation.TimeUnit.MILLIS);
        // Converted type doesn't have isAdjustedToUTC parameter, so use the information from logical type
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an ugly hack that I had to do because of the code inside `ParquetFileReader::buildChildren`, which looks like this:

```java
if (schemaElement.isSetConverted_type()) {
    LogicalTypeAnnotation originalType =
            getLogicalTypeAnnotation(schemaElement.converted_type, schemaElement);
    LogicalTypeAnnotation newOriginalType = schemaElement.isSetLogicalType()
            && getLogicalTypeAnnotation(schemaElement.logicalType) != null
                    ? getLogicalTypeAnnotation(schemaElement.logicalType)
                    : null;
    if (!originalType.equals(newOriginalType)) {
        ((Types.Builder) childBuilder).as(originalType);
    }
}
```
Mainly, we are trying to deduce the type of the column from `schemaElement.converted_type` and `schemaElement.logicalType`, and in case of a mismatch, we go with the type deduced from `schemaElement.converted_type`.
The problem is that `schemaElement.converted_type` doesn't carry any information about adjustment to UTC. So to deduce the type correctly, I had to read `isAdjustedToUTC` from the logical type.
Ideally, I would have liked to change the above code to prefer the type deduced from `schemaElement.logicalType` over `schemaElement.converted_type`, because `converted_type` is deprecated and superseded by `logicalType` (based on the Javadoc for `converted_type`). But that would be a much bigger change and would impact all Parquet types, so I added this hack, which only affects the TIMESTAMP type.
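A minimal stand-alone sketch of the fallback just described — reading the UTC-adjustment flag from the logical type because the converted type cannot express it. The names here (`TimestampDeduction`, `deduceTimestamp`, the string-based converted-type parameter) are illustrative only, not the actual Deephaven or Parquet API:

```java
// Illustrative sketch: deduce a TIMESTAMP annotation from a converted type,
// borrowing isAdjustedToUTC from the logical type when one is present.
public class TimestampDeduction {
    public record Deduced(boolean isAdjustedToUTC, String unit) {}

    public static Deduced deduceTimestamp(String convertedType, Boolean logicalIsAdjustedToUTC) {
        // converted_type alone carries no UTC-adjustment flag, so fall back to
        // assuming UTC (true) only when no logical type is available.
        final boolean utc = logicalIsAdjustedToUTC != null ? logicalIsAdjustedToUTC : true;
        final String unit = convertedType.equals("TIMESTAMP_MILLIS") ? "MILLIS" : "MICROS";
        return new Deduced(utc, unit);
    }

    public static void main(String[] args) {
        // Logical type says not adjusted to UTC: honor it.
        System.out.println(deduceTimestamp("TIMESTAMP_MILLIS", false));
        // No logical type: keep the old assumption of UTC.
        System.out.println(deduceTimestamp("TIMESTAMP_MICROS", null));
    }
}
```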
What would be the side effect of switching to use the logical type?
I have added a test commit here and started the nightly jobs. Let me see how this works.
The nightly jobs run fine.
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/transfer/DateTransfer.java
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ParquetFileReader.java
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/TypeInfos.java
```java
@@ -981,6 +982,21 @@ public static long epochNanos(@Nullable final ZonedDateTime dateTime) {
        return safeComputeNanos(dateTime.toEpochSecond(), dateTime.getNano());
    }

    /**
```
In case my comments seem strong: this library is very exposed to users, so it needs to be carefully curated. Functions should only get added when there is a compelling reason.
- I am not a fan of having hard-coded methods for any specific time zone.
- The methods should accept a time zone as an input.
- If methods are added for `LocalDateTime`, `LocalDateTime` signatures should be added to all relevant methods.
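For illustration, a zone-parameterized variant of such a method could look like the following. This is a hypothetical sketch, not the actual `DateTimeUtils` API: the class name, the `epochNanos(LocalDateTime, ZoneId)` signature, and the unchecked multiplication (the real code reportedly uses a `safeComputeNanos` helper) are all assumptions:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

public class EpochNanosSketch {
    /**
     * Converts a wall-clock LocalDateTime to epoch nanoseconds, with the time
     * zone supplied by the caller rather than hard-coded into the method.
     * Note: no overflow protection here; this is a sketch only.
     */
    public static long epochNanos(LocalDateTime dateTime, ZoneId zone) {
        return dateTime.atZone(zone).toEpochSecond() * 1_000_000_000L
                + dateTime.getNano();
    }

    public static void main(String[] args) {
        final LocalDateTime ldt = LocalDateTime.of(2023, 11, 15, 12, 30);
        // The same wall-clock value maps to different instants per zone.
        System.out.println(epochNanos(ldt, ZoneId.of("UTC")));
        System.out.println(epochNanos(ldt, ZoneId.of("America/New_York")));
    }
}
```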
As part of #4421, we started throwing an exception on reading Parquet TIMESTAMP fields with `isAdjustedToUTC` set as false. After this change, `java.time.LocalDateTime` columns will be written as Parquet TIMESTAMP fields with `isAdjustedToUTC=false`. Earlier they were written as binary data with a codec.

Also, this PR fixes the bugs introduced in #4775 and #4755 that can lead to a `ClassCastException` in some cases on reading Parquet DATE and TIME columns.

Related to #976