Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support vectorized reading int96 timestamps in imported data #6962

Merged
merged 12 commits into from
Apr 11, 2023

Conversation

yabola
Copy link
Contributor

@yabola yabola commented Mar 1, 2023

Add vectorized read support for Parquet INT96 timestamps ( fix comment and vectorized reads are enabled by default after that PR).
Before there is only non-vectorized reading support(see #1184).
This is needed so that parquet files written by Spark, that used INT96 timestamps, are able to be read by Iceberg without having to rewrite these files. This is specially useful for migrations from spark or delta lake.

@yabola yabola changed the title Support vectorized read for timestampInt96 Support vectorized read for Int96 Mar 1, 2023
@yabola yabola changed the title Support vectorized read for Int96 Support vectorized read for timestamp Mar 1, 2023
@yabola yabola changed the title Support vectorized read for timestamp Support vectorized reading int96 timestamps in imported data Mar 1, 2023
@yabola yabola changed the title Support vectorized reading int96 timestamps in imported data Support vectorized reading int96 timestamps Mar 1, 2023
@yabola yabola changed the title Support vectorized reading int96 timestamps Support vectorized reading int96 timestamps in imported data Mar 1, 2023
@yabola
Copy link
Contributor Author

yabola commented Mar 1, 2023

@rdblue @aokolnychyi @gustavoatt If you have time, please take a look, thanks

@rdblue
Copy link
Contributor

rdblue commented Mar 3, 2023

@nastra can you take a look at this?

@nastra
Copy link
Contributor

nastra commented Mar 7, 2023

@yabola I'll try to review it this week

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks mostly good to me


private static final long UNIX_EPOCH_JULIAN = 2_440_588L;

public static long extractTimestampInt96(ByteBuffer buffer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we adjust Sparks version of this in TimestampInt96Reader to reuse TimestampUtil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced all timestampint96 related places, please take a look~

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels like something that could live in ParquetUtil, instead of creating a new class?

ValuesAsBytesReader valuesReader,
int typeWidth,
byte[] byteArray) {
ByteBuffer buffer = valuesReader.getBuffer(12);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment what the 12 means here (8 bytes = time of day in nanos / 4 bytes = julian day)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -1763,6 +1771,71 @@ public void testAllManifestTableSnapshotFiltering() throws Exception {
}
}

@Test
public void testTableWithInt96Timestamp() throws IOException {
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the try-catch here? I think it would work fine without it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only the finally method here, it does a cleanup operation and I imitate what other UT wrote

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we always have a parquet_table that might need to be dropped. We could add that to @After and do DROP TABLE IF EXISTS

if (!catalog.tableExists(currentIdentifier)) {
return;
}
dropTable(currentIdentifier);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: whitespace missing after }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a space

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like the newline is not added? Overall we encourage newline after any logical blocks like if, for, while, try, etc.

@nastra nastra requested a review from aokolnychyi March 13, 2023 13:58
@nastra
Copy link
Contributor

nastra commented Mar 13, 2023

@aokolnychyi could you review this one please?

@aokolnychyi
Copy link
Contributor

I'll try this week but can't promise.

cc @flyrain

@yabola
Copy link
Contributor Author

yabola commented Mar 20, 2023

@nastra @aokolnychyi @rdblue @flyrain Hi~ I have updated my PR, if you have time, please take a look, thank you

@yabola
Copy link
Contributor Author

yabola commented Mar 29, 2023

@nastra emmm, looks like there are no more comments, could you help take a look again?

@jackye1995
Copy link
Contributor

sorry for the lack of reviews, I will take a look during the weekend

@@ -1763,6 +1771,71 @@ public void testAllManifestTableSnapshotFiltering() throws Exception {
}
}

@Test
public void testTableWithInt96Timestamp() throws IOException {
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we always have a parquet_table that might need to be dropped. We could add that to @After and do DROP TABLE IF EXISTS


private static final long UNIX_EPOCH_JULIAN = 2_440_588L;

public static long extractTimestampInt96(ByteBuffer buffer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels like something that could live in ParquetUtil, instead of creating a new class?


return TimeUnit.DAYS.toMicros(julianDay - UNIX_EPOCH_JULIAN)
+ TimeUnit.NANOSECONDS.toMicros(timeOfDayNanos);
return TimestampUtil.extractTimestampInt96(byteBuffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the refactoring!

if (!catalog.tableExists(currentIdentifier)) {
return;
}
dropTable(currentIdentifier);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like the newline is not added? Overall we encourage newline after any logical blocks like if, for, while, try, etc.


private static final long UNIX_EPOCH_JULIAN = 2_440_588L;

public static long extractTimestampInt96(ByteBuffer buffer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add javadoc for public util method.

@yabola
Copy link
Contributor Author

yabola commented Apr 3, 2023

@jackye1995 Thank you for your review, I have updated my PR, please take a look~

@yabola
Copy link
Contributor Author

yabola commented Apr 8, 2023

It seems that the PR check is not triggered automatically and needs to be triggered manually...

@jackye1995
Copy link
Contributor

Running CI

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me! @nastra could you take another look?

@jackye1995 jackye1995 requested a review from nastra April 9, 2023 04:14
Copy link
Contributor

@JonasJ-ap JonasJ-ap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! Thank you for the contribution. To provide some additional feedback, I conducted some further testing on the reading of iceberg tables after migration from Delta Lake tables that contain an INT96 timestamp column. Based on my observations, the vectorized reading appears to be functioning perfectly.

BTW, I think this PR can close #4200.

.load(loadLocation(tableIdentifier))
.select("tmp_col")
.collectAsList();
Assert.assertEquals("Rows must match", expected, actual);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Assert.assertEquals("Rows must match", expected, actual);
Assertions.assertThat(actual).containsExactlyInAnyOrderElementsOf(expected);

How about using org.assertj.core.api.Assertions here? I think AsserJ can provide a more organized error message (with new line for each element of the list) when actual and expected differ. Also, based on the discussion here #7160 (comment), we may want to move from Junit4 to Junit5 + AsserJ in the future.

Copy link
Contributor Author

@yabola yabola Apr 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. Thank you very much for testing on Delta !

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@jackye1995
Copy link
Contributor

Thanks for the work! And thanks for the review @nastra @JonasJ-ap

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants