Parquet: Use native getRowIndexOffset support instead of calculating it #11520
@szehon-ho @flyrain can you please review?
@@ -28,5 +28,14 @@ public interface ParquetValueReader<T> {

  List<TripleIterator<?>> columns();

  /**
   * @deprecated since 1.6.0, will be removed in 1.7.0; use setPageSource(PageReadStore) instead.
-   * @deprecated since 1.6.0, will be removed in 1.7.0; use setPageSource(PageReadStore) instead.
+   * @deprecated since 1.8.0, will be removed in 1.9.0 or 2.0.0; use setPageSource(PageReadStore) instead.
In my change, I put down 1.9.0. If there is no 1.9.0, and the methods are removed in 2.0.0 instead, I don't think that would be a problem. On the other hand, we don't want to give the impression that the methods could still exist in a 1.9.0 and might be removed in 2.0.0 instead.
@@ -43,10 +43,25 @@ public interface VectorizedReader<T> {
   * @param pages row group information for all the columns
   * @param metadata map of {@link ColumnPath} -> {@link ColumnChunkMetaData} for the row group
   * @param rowPosition the row group's row offset in the parquet file
   * @deprecated since 1.6.0, will be removed in 1.7.0; use setRowGroupInfo(PageReadStore,
-   * @deprecated since 1.6.0, will be removed in 1.7.0; use setRowGroupInfo(PageReadStore,
+   * @deprecated since 1.8.0, will be removed in 1.9.0 or 2.0.0; use setRowGroupInfo(PageReadStore,
@Fokko thanks for reviewing!
  public void setRowGroupInfo(
      PageReadStore pageStore, Map<ColumnPath, ColumnChunkMetaData> metaData) {
    super.setRowGroupInfo(pageStore, metaData);
    this.rowStartPosInBatch = pageStore.getRowIndexOffset().orElse(0L);
If pageStore.getRowIndexOffset() is empty, does that mean getRowIndexOffset() returns a negative value? Should we throw an exception instead of defaulting to 0?
That is a good question.
As I understand it, the PageReadStore implementation (ColumnChunkPageReadStore) is normally constructed with the rowIndexOffset, but if the offset is not available then it is constructed with -1 for the rowIndexOffset. PageReadStore::getRowIndexOffset() will not return a negative value; it will return Optional.empty() in that case.
I suppose we can throw an IllegalArgumentException in such a situation, instead of setting rowStartPosInBatch to 0.
@flyrain do you have an opinion on this?
Is there someone who knows Parquet well who can confirm that in normal operation, PageReadStore::getRowIndexOffset() should not return Optional.empty()?
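To make the two options in this thread concrete, here is a minimal sketch of the behaviour described above. It does not use the Parquet classes: `FakePageStore` is a hypothetical stand-in that assumes, per the comment above, that an unavailable offset is stored as -1 and surfaced as `Optional.empty()`. The method names `startOrZero` and `startOrThrow` and the exception message are likewise made up for illustration.

```java
import java.util.Optional;

public class RowIndexOffsetSketch {

  /** Hypothetical stand-in for a page store; NOT the Parquet ColumnChunkPageReadStore. */
  public static final class FakePageStore {
    // Assumption from the discussion: -1 means "offset not available".
    private final long rowIndexOffset;

    public FakePageStore(long rowIndexOffset) {
      this.rowIndexOffset = rowIndexOffset;
    }

    /** Never exposes the negative sentinel; callers see Optional.empty() instead. */
    public Optional<Long> getRowIndexOffset() {
      return rowIndexOffset < 0 ? Optional.empty() : Optional.of(rowIndexOffset);
    }
  }

  /** First version in the PR: silently falls back to 0 when the offset is missing. */
  public static long startOrZero(FakePageStore store) {
    return store.getRowIndexOffset().orElse(0L);
  }

  /** Alternative raised in review: fail loudly when the offset is missing. */
  public static long startOrThrow(FakePageStore store) {
    return store
        .getRowIndexOffset()
        .orElseThrow(
            () -> new IllegalArgumentException("Page store does not provide a row index offset"));
  }

  public static void main(String[] args) {
    System.out.println(startOrZero(new FakePageStore(128L)));
    System.out.println(startOrZero(new FakePageStore(-1L)));
    System.out.println(startOrThrow(new FakePageStore(128L)));
  }
}
```

The fallback variant cannot distinguish "the row group really starts at row 0" from "the offset is unknown", which is the concern behind the suggestion to throw.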
@huaxingao @Fokko I have updated the PR; please review again.
        .getRowIndexOffset()
        .orElseThrow(
            () ->
                new IllegalArgumentException(
nit: Is there a better Exception than IllegalArgumentException? Is IllegalStateException a bit better?
I can see why you might consider IllegalStateException. I do think IllegalArgumentException is appropriate, because the PageReadStore is an argument to the method being called, and the problem is with the PageReadStore. I think IllegalStateException is typically used to indicate an internal inconsistency in the module.
Consider if we use Guava's Preconditions to check a condition here. The condition would be that source.getRowIndexOffset().isPresent(). The checkArgument methods throw IllegalArgumentException and "Ensures the truth of an expression involving one or more parameters to the calling method." The checkState methods throw IllegalStateException and "Ensures the truth of an expression involving the state of the calling instance, but not involving any parameters to the calling method." (my emphasis)
Of course, that just expresses the opinions of the authors of Guava. There are others who might argue that IllegalStateException is appropriate here, or neither IllegalStateException nor IllegalArgumentException. It really doesn't matter too much. I can just throw RuntimeException if you do not agree with IllegalArgumentException.
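The distinction drawn above can be sketched as follows. To keep the example self-contained, `checkArgument` and `checkState` here are simplified stand-ins that only mirror which exception Guava's `Preconditions` methods throw; they are not Guava itself. `PageSource`, `Reader`, and the messages are hypothetical names, not the Iceberg or Parquet types.

```java
import java.util.Optional;

public class PreconditionChoiceSketch {

  // Simplified stand-in for Guava's Preconditions.checkArgument (assumption: same exception type).
  static void checkArgument(boolean expression, String message) {
    if (!expression) {
      throw new IllegalArgumentException(message);
    }
  }

  // Simplified stand-in for Guava's Preconditions.checkState (assumption: same exception type).
  static void checkState(boolean expression, String message) {
    if (!expression) {
      throw new IllegalStateException(message);
    }
  }

  /** Hypothetical stand-in for the argument passed to the reader. */
  public interface PageSource {
    Optional<Long> getRowIndexOffset();
  }

  /** Hypothetical reader showing both kinds of check side by side. */
  public static final class Reader {
    private boolean closed = false;
    private long rowStart = 0L;

    public void setPageSource(PageSource source) {
      // Condition about this instance only: a state check.
      checkState(!closed, "Reader is already closed");
      // Condition about the parameter: an argument check.
      checkArgument(
          source.getRowIndexOffset().isPresent(), "Page source has no row index offset");
      this.rowStart = source.getRowIndexOffset().get();
    }

    public long rowStart() {
      return rowStart;
    }

    public void close() {
      closed = true;
    }
  }

  public static void main(String[] args) {
    Reader reader = new Reader();
    reader.setPageSource(() -> Optional.of(42L));
    System.out.println(reader.rowStart());
  }
}
```

Under this reading, a missing offset is a defect in the object handed to the method, so the argument check is the natural fit; a closed reader is a defect in the receiver, so the state check applies there.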
LGTM
@Fokko can you help merge this if you have no further feedback?
LGTM. Thanks @wypoon for picking it up!
Hi @Fokko, do you have any further feedback?
@Fokko @flyrain Should we merge this PR? I am waiting for it to be merged so I can clean up my temporary code.
I will merge it if there is no new comment by EOD.
Very small comment, lgtm otherwise
@@ -28,5 +28,14 @@ public interface ParquetValueReader<T> {

  List<TripleIterator<?>> columns();

  /**
   * @deprecated since 1.8.0, will be removed in 1.9.0; use setPageSource(PageReadStore) instead.
Nit: typically we use a link?
Done.
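For readers unfamiliar with the convention behind this nit, the sketch below shows a deprecation note that points at the replacement with a Javadoc `{@link}` rather than plain text. The interface and type names (`ValueReader`, `PageStore`) are hypothetical stand-ins, and having the deprecated overload delegate to the new one is an assumption made for illustration, not necessarily what the PR does.

```java
public class DeprecationLinkSketch {

  /** Hypothetical stand-in for a page store that knows its own row offset. */
  public interface PageStore {
    long rowIndexOffset();
  }

  /** Hypothetical reader interface with an old and a new overload. */
  public interface ValueReader {

    /**
     * Sets the page source; the explicit row position is ignored in this sketch.
     *
     * @deprecated since 1.8.0, will be removed in 1.9.0; use {@link #setPageSource(PageStore)}
     *     instead.
     */
    @Deprecated
    default void setPageSource(PageStore store, long rowPosition) {
      // Assumption for illustration: the store already carries the offset, so delegate.
      setPageSource(store);
    }

    /** Sets the page source; the row offset is read from the store itself. */
    void setPageSource(PageStore store);
  }

  public static void main(String[] args) {
    long[] seen = new long[1];
    ValueReader reader = store -> seen[0] = store.rowIndexOffset();
    reader.setPageSource(() -> 42L);
    System.out.println(seen[0]);
  }
}
```

With the `{@link}` form, generated Javadoc and IDEs can navigate straight to the replacement method, and a rename of the target surfaces as a broken link rather than stale text.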
0a7f821 to 661b264
Thanks @wypoon for working on it. Thanks @huaxingao @Fokko @szehon-ho for the review.
There are two Iceberg PRs that "broke" NesQuEIT:
* apache/iceberg#11478 caused `testRewriteManifests` to fail due to the changed outcome of the `rewrite_manifests` procedure
* apache/iceberg#11520 caused a class-path issue w/ Scala 2.13
Workaround for apache/iceberg#11520 that caused a class-path issue w/ Scala 2.13