Support vectorized reading int96 timestamps in imported data #6962
@rdblue @aokolnychyi @gustavoatt If you have time, please take a look, thanks
@nastra can you take a look at this?
@yabola I'll try to review it this week
looks mostly good to me
private static final long UNIX_EPOCH_JULIAN = 2_440_588L;

public static long extractTimestampInt96(ByteBuffer buffer) {
can we adjust Spark's version of this in TimestampInt96Reader to reuse TimestampUtil?
I replaced all the INT96 timestamp related places, please take a look~
this feels like something that could live in ParquetUtil, instead of creating a new class?
ValuesAsBytesReader valuesReader,
int typeWidth,
byte[] byteArray) {
ByteBuffer buffer = valuesReader.getBuffer(12);
maybe add a comment explaining what the 12 means here (8 bytes = time of day in nanos, 4 bytes = Julian day)
done
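To make the meaning of the 12 concrete, here is a minimal sketch of how one INT96 value splits into its two fields. It assumes the usual Parquet INT96 timestamp layout (little-endian, nanos of day first, Julian day last); the class name Int96Layout and its methods are hypothetical and are not part of this PR or of Iceberg.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not Iceberg code.
final class Int96Layout {
  // 8 bytes (time of day in nanos) + 4 bytes (Julian day) = 12 bytes per value.
  static final int INT96_BYTES = 12;

  private Int96Layout() {}

  // Bytes 0..7, little-endian: nanoseconds elapsed since midnight.
  static long timeOfDayNanos(ByteBuffer value) {
    // duplicate() leaves the caller's position and byte order untouched.
    return value.duplicate().order(ByteOrder.LITTLE_ENDIAN).getLong(value.position());
  }

  // Bytes 8..11, little-endian: Julian day number.
  static int julianDay(ByteBuffer value) {
    return value.duplicate().order(ByteOrder.LITTLE_ENDIAN).getInt(value.position() + 8);
  }

  public static void main(String[] args) {
    // Build one value by hand: 1000 ns past midnight on Julian day 2440588.
    ByteBuffer value = ByteBuffer.allocate(INT96_BYTES).order(ByteOrder.LITTLE_ENDIAN);
    value.putLong(1_000L).putInt(2_440_588);
    value.flip();
    System.out.println(timeOfDayNanos(value) + " ns on Julian day " + julianDay(value));
  }
}
```

Both accessors use absolute reads on a duplicate, so the sketch does not advance the buffer; a real reader that consumes values sequentially would likely use relative reads instead.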
@@ -1763,6 +1771,71 @@ public void testAllManifestTableSnapshotFiltering() throws Exception {
}
}

@Test
public void testTableWithInt96Timestamp() throws IOException {
try {
do we need the try-catch here? I think it would work fine without it
There is only a finally block here; it does the cleanup, and I followed what the other unit tests do.
Looks like we always have a parquet_table that might need to be dropped. We could add that to @After and do DROP TABLE IF EXISTS
if (!catalog.tableExists(currentIdentifier)) {
return;
}
dropTable(currentIdentifier);
nit: whitespace missing after }
Added a space
seems like the newline is not added? Overall we encourage newline after any logical blocks like if, for, while, try, etc.
@aokolnychyi could you review this one please?
I'll try this week but can't promise. cc @flyrain
@nastra @aokolnychyi @rdblue @flyrain Hi~ I have updated my PR, if you have time, please take a look, thank you
@nastra emmm, looks like there are no more comments, could you help take a look again?
sorry for the lack of reviews, I will take a look during the weekend
return TimeUnit.DAYS.toMicros(julianDay - UNIX_EPOCH_JULIAN)
+ TimeUnit.NANOSECONDS.toMicros(timeOfDayNanos);
return TimestampUtil.extractTimestampInt96(byteBuffer);
thanks for the refactoring!
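For readers following the refactored expression, a small worked sketch of the conversion itself may help. It reuses the UNIX_EPOCH_JULIAN constant and the TimeUnit arithmetic quoted in the diff; the class name Int96ToMicros and the method toMicros are hypothetical, and this is not the actual body of TimestampUtil.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the arithmetic shown in the diff; not Iceberg code.
final class Int96ToMicros {
  // Julian day number of 1970-01-01, as in the constant quoted in this PR.
  static final long UNIX_EPOCH_JULIAN = 2_440_588L;

  private Int96ToMicros() {}

  // Microseconds since the Unix epoch for a (Julian day, nanos of day) pair.
  static long toMicros(long julianDay, long timeOfDayNanos) {
    return TimeUnit.DAYS.toMicros(julianDay - UNIX_EPOCH_JULIAN)
        + TimeUnit.NANOSECONDS.toMicros(timeOfDayNanos);
  }

  public static void main(String[] args) {
    // The epoch day at midnight.
    System.out.println(toMicros(2_440_588L, 0L));
    // One day later plus 1500 ns; the sub-microsecond remainder is truncated.
    System.out.println(toMicros(2_440_589L, 1_500L));
    // One day before the epoch yields a negative value.
    System.out.println(toMicros(2_440_587L, 0L));
  }
}
```

Note that TimeUnit.NANOSECONDS.toMicros truncates, so precision finer than a microsecond in the INT96 value is dropped by this conversion.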
private static final long UNIX_EPOCH_JULIAN = 2_440_588L;

public static long extractTimestampInt96(ByteBuffer buffer) {
nit: add javadoc for public util method.
@jackye1995 Thank you for your review, I have updated my PR, please take a look~
It seems that the PR check is not triggered automatically and needs to be triggered manually...
Running CI
Mostly looks good to me! @nastra could you take another look?
Overall LGTM! Thank you for the contribution. To provide some additional feedback, I did some further testing on reading Iceberg tables migrated from Delta Lake tables that contain an INT96 timestamp column. Based on my observations, the vectorized reading appears to be functioning perfectly.
BTW, I think this PR can close #4200.
.load(loadLocation(tableIdentifier))
.select("tmp_col")
.collectAsList();
Assert.assertEquals("Rows must match", expected, actual);
Assert.assertEquals("Rows must match", expected, actual);
Assertions.assertThat(actual).containsExactlyInAnyOrderElementsOf(expected);
How about using org.assertj.core.api.Assertions here? I think AssertJ can provide a more organized error message (with a new line for each element of the list) when actual and expected differ. Also, based on the discussion here #7160 (comment), we may want to move from JUnit 4 to JUnit 5 + AssertJ in the future.
Done. Thank you very much for testing on Delta!
Looks good to me!
Thanks for the work! And thanks for the review @nastra @JonasJ-ap
Add vectorized read support for Parquet INT96 timestamps (fix comment; vectorized reads are enabled by default after that PR).
Previously, only non-vectorized reading was supported (see #1184).
This is needed so that Parquet files written by Spark with INT96 timestamps can be read by Iceberg without having to rewrite them. This is especially useful for migrations from Spark or Delta Lake.