Support vectorized reading int96 timestamps in imported data #6962

yabola · 2023-03-01T00:03:19Z

Add vectorized read support for Parquet INT96 timestamps (fix comment; vectorized reads are enabled by default after that PR).
Previously, only non-vectorized reading was supported (see #1184).
This is needed so that Parquet files written by Spark with INT96 timestamps can be read by Iceberg without having to rewrite them. This is especially useful for migrations from Spark or Delta Lake.

yabola · 2023-03-01T02:18:31Z

@rdblue @aokolnychyi @gustavoatt If you have time, please take a look, thanks

rdblue · 2023-03-03T22:09:28Z

@nastra can you take a look at this?

nastra · 2023-03-07T08:32:17Z

@yabola I'll try to review it this week

nastra

looks mostly good to me

nastra · 2023-03-13T13:34:48Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/TimestampUtil.java

+
+  private static final long UNIX_EPOCH_JULIAN = 2_440_588L;
+
+  public static long extractTimestampInt96(ByteBuffer buffer) {


can we adjust Spark's version of this in TimestampInt96Reader to reuse TimestampUtil?

I replaced all the INT96 timestamp related places, please take a look~

this feels like something that could live in ParquetUtil, instead of creating a new class?

nastra · 2023-03-13T13:46:20Z

...java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java

+        ValuesAsBytesReader valuesReader,
+        int typeWidth,
+        byte[] byteArray) {
+      ByteBuffer buffer = valuesReader.getBuffer(12);


maybe add a comment explaining what the 12 means here (8 bytes = time of day in nanos / 4 bytes = julian day)
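
For readers following the thread, the 12-byte layout mentioned in this comment can be sketched in plain Java. This is a minimal illustration, not the PR's actual code: the class name `Int96TimestampSketch` and the method name `int96ToMicros` are hypothetical, and the sketch assumes the little-endian layout that Spark writes (8 bytes of nanoseconds within the day, followed by a 4-byte Julian day). The constant and the conversion arithmetic mirror the lines quoted elsewhere in this review.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class; not part of the Iceberg code base.
public class Int96TimestampSketch {
  // Julian day number of 1970-01-01, as quoted in the diff under review.
  private static final long UNIX_EPOCH_JULIAN = 2_440_588L;

  // Converts one 12-byte INT96 value to microseconds since the Unix epoch.
  // Assumed layout: 8 bytes time-of-day nanos, then 4 bytes Julian day, little-endian.
  public static long int96ToMicros(ByteBuffer buffer) {
    // duplicate() leaves the caller's position untouched; byte order must be set again.
    ByteBuffer le = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    long timeOfDayNanos = le.getLong();
    int julianDay = le.getInt();
    return TimeUnit.DAYS.toMicros(julianDay - UNIX_EPOCH_JULIAN)
        + TimeUnit.NANOSECONDS.toMicros(timeOfDayNanos);
  }

  public static void main(String[] args) {
    ByteBuffer value = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
    value.putLong(1_000L);   // 1000 ns after midnight
    value.putInt(2_440_589); // one day after the Unix epoch
    value.flip();
    System.out.println(int96ToMicros(value)); // prints 86400000001
  }
}
```

Nanosecond precision below one microsecond is truncated by `TimeUnit.NANOSECONDS.toMicros`, which matches the microsecond resolution of the value returned in the quoted code.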

nastra · 2023-03-13T13:51:49Z

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java

@@ -1763,6 +1771,71 @@ public void testAllManifestTableSnapshotFiltering() throws Exception {
    }
  }

+  @Test
+  public void testTableWithInt96Timestamp() throws IOException {
+    try {


do we need the try-catch here? I think it would work fine without it

There is only a finally block here; it does the cleanup, and I followed what the other unit tests do

Looks like we always have a parquet_table that might need to be dropped. We could add that to @After and do DROP TABLE IF EXISTS

nastra · 2023-03-13T13:53:55Z

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceHiveTables.java

+    if (!catalog.tableExists(currentIdentifier)) {
+      return;
+    }
+    dropTable(currentIdentifier);


nit: whitespace missing after }

Added a space

seems like the newline is not added? Overall we encourage newline after any logical blocks like if, for, while, try, etc.

nastra · 2023-03-13T13:59:25Z

@aokolnychyi could you review this one please?

aokolnychyi · 2023-03-14T21:06:18Z

I'll try this week but can't promise.

cc @flyrain

yabola · 2023-03-20T14:10:46Z

@nastra @aokolnychyi @rdblue @flyrain Hi~ I have updated my PR, if you have time, please take a look, thank you

yabola · 2023-03-29T13:15:41Z

@nastra emmm, looks like there are no more comments, could you help take a look again?

jackye1995 · 2023-03-31T04:44:42Z

sorry for the lack of reviews, I will take a look during the weekend

jackye1995 · 2023-04-03T05:55:30Z

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java

@@ -1763,6 +1771,71 @@ public void testAllManifestTableSnapshotFiltering() throws Exception {
    }
  }

+  @Test
+  public void testTableWithInt96Timestamp() throws IOException {
+    try {


Looks like we always have a parquet_table that might need to be dropped. We could add that to @After and do DROP TABLE IF EXISTS

jackye1995 · 2023-04-03T06:02:34Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/TimestampUtil.java

+
+  private static final long UNIX_EPOCH_JULIAN = 2_440_588L;
+
+  public static long extractTimestampInt96(ByteBuffer buffer) {


this feels like something that could live in ParquetUtil, instead of creating a new class?

jackye1995 · 2023-04-03T06:07:08Z

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java

-
-      return TimeUnit.DAYS.toMicros(julianDay - UNIX_EPOCH_JULIAN)
-          + TimeUnit.NANOSECONDS.toMicros(timeOfDayNanos);
+      return TimestampUtil.extractTimestampInt96(byteBuffer);


thanks for the refactoring!

jackye1995 · 2023-04-03T06:08:05Z

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceHiveTables.java

+    if (!catalog.tableExists(currentIdentifier)) {
+      return;
+    }
+    dropTable(currentIdentifier);


seems like the newline is not added? Overall we encourage newline after any logical blocks like if, for, while, try, etc.

jackye1995 · 2023-04-03T06:10:12Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/TimestampUtil.java

+
+  private static final long UNIX_EPOCH_JULIAN = 2_440_588L;
+
+  public static long extractTimestampInt96(ByteBuffer buffer) {


nit: add javadoc for public util method.

yabola · 2023-04-03T16:14:20Z

@jackye1995 Thank you for your review, I have updated my PR, please take a look~

yabola · 2023-04-08T09:57:53Z

It seems that the PR check is not triggered automatically and needs to be triggered manually...

jackye1995 · 2023-04-08T14:50:04Z

Running CI

jackye1995

Mostly looks good to me! @nastra could you take another look?

JonasJ-ap

Overall LGTM! Thank you for the contribution. As additional feedback, I did some further testing on reading Iceberg tables migrated from Delta Lake tables that contain an INT96 timestamp column. From what I observed, vectorized reading works as expected.

BTW, I think this PR can close #4200.

JonasJ-ap · 2023-04-09T05:56:17Z

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java

+                .load(loadLocation(tableIdentifier))
+                .select("tmp_col")
+                .collectAsList();
+        Assert.assertEquals("Rows must match", expected, actual);


Suggested change

-        Assert.assertEquals("Rows must match", expected, actual);
+        Assertions.assertThat(actual).containsExactlyInAnyOrderElementsOf(expected);

How about using org.assertj.core.api.Assertions here? I think AssertJ can provide a more organized error message (with a new line for each element of the list) when actual and expected differ. Also, based on the discussion here #7160 (comment), we may want to move from JUnit 4 to JUnit 5 + AssertJ in the future.

Done. Thank you very much for testing on Delta!

jackye1995

Looks good to me!

jackye1995 · 2023-04-11T14:31:51Z

Thanks for the work! And thanks for the review @nastra @JonasJ-ap

github-actions bot added arrow spark labels Mar 1, 2023

yabola changed the title ~~Support vectorized read for timestampInt96~~ Support vectorized read for Int96 Mar 1, 2023

yabola changed the title ~~Support vectorized read for Int96~~ Support vectorized read for timestamp Mar 1, 2023

yabola changed the title ~~Support vectorized read for timestamp~~ Support vectorized reading int96 timestamps in imported data Mar 1, 2023

yabola changed the title ~~Support vectorized reading int96 timestamps in imported data~~ Support vectorized reading int96 timestamps Mar 1, 2023

yabola changed the title ~~Support vectorized reading int96 timestamps~~ Support vectorized reading int96 timestamps in imported data Mar 1, 2023

yabola force-pushed the newice branch from d0f53a1 to c1d8391 Compare March 1, 2023 05:45

nastra self-requested a review March 1, 2023 07:15

nastra reviewed Mar 13, 2023

nastra requested a review from aokolnychyi March 13, 2023 13:58

yabola added 3 commits March 14, 2023 22:52

Support vectorized read for timestampInt96

4cbc007

Support vectorized reading int96 timestamps in imported data

ef7a744

fix comments

2695f25

yabola force-pushed the newice branch from c1d8391 to 2695f25 Compare March 14, 2023 15:14

code style

11f0a71

import util

ab008f6

jackye1995 reviewed Apr 3, 2023

update comments

d7a0745

github-actions bot added the parquet label Apr 3, 2023

update comments

5191bcd

code style

d0f0fb5

yabola added 2 commits April 9, 2023 00:00

fix spark 2.4 & 3.1

5d38dc7

fix import error in spark3.2

b2f75ba

jackye1995 reviewed Apr 9, 2023

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java (outdated, resolved)

jackye1995 reviewed Apr 9, 2023

...java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java (outdated, resolved)

jackye1995 reviewed Apr 9, 2023

fix comments

a92561f

jackye1995 requested a review from nastra April 9, 2023 04:14

jackye1995 reviewed Apr 9, 2023

...java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java (outdated, resolved)

JonasJ-ap reviewed Apr 9, 2023

fix comments

80f5063

jackye1995 approved these changes Apr 9, 2023

JonasJ-ap approved these changes Apr 9, 2023

nastra approved these changes Apr 11, 2023

jackye1995 merged commit 0123862 into apache:master Apr 11, 2023

ericlgoodman pushed a commit to ericlgoodman/iceberg that referenced this pull request Apr 12, 2023

Arrow: Support vectorized read of INT96 timestamp in imported data (apache#6962)

a0ff2f3

manisin pushed a commit to Snowflake-Labs/iceberg that referenced this pull request May 9, 2023

Arrow: Support vectorized read of INT96 timestamp in imported data (apache#6962)

d26ff22

marcinsbd mentioned this pull request Jul 5, 2023

Support timestamp type in Iceberg migrate procedure trinodb/trino#17391

Merged

nastra mentioned this pull request Oct 30, 2023

Long overflow when Iceberg reading INT96 timestamp column from Spark parquet table #8949

Closed
