Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: Use native getRowIndexOffset support instead of calculating it #11520

Merged
merged 8 commits into from
Nov 21, 2024

Conversation

wypoon
Copy link
Contributor

@wypoon wypoon commented Nov 12, 2024

No description provided.

@wypoon wypoon changed the title Parquet use get row index offset Parquet: Use native getRowIndexOffset support instead of calculating it Nov 12, 2024
@wypoon
Copy link
Contributor Author

wypoon commented Nov 12, 2024

@szehon-ho @flyrain can you please review?
cc @huaxingao

@@ -28,5 +28,14 @@ public interface ParquetValueReader<T> {

List<TripleIterator<?>> columns();

/**
* @deprecated since 1.6.0, will be removed in 1.7.0; use setPageSource(PageReadStore) instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @deprecated since 1.6.0, will be removed in 1.7.0; use setPageSource(PageReadStore) instead.
* @deprecated since 1.8.0, will be removed in 1.9.0 or 2.0.0; use setPageSource(PageReadStore) instead.

Copy link
Contributor Author

@wypoon wypoon Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my change, I put down 1.9.0. If there is no 1.9.0, and the methods are removed in 2.0.0 instead, I don't think that would be a problem. On the other hand, we don't want to give the impression that the methods could still exist in a 1.9.0 and might be removed in 2.0.0 instead.

@@ -43,10 +43,25 @@ public interface VectorizedReader<T> {
* @param pages row group information for all the columns
* @param metadata map of {@link ColumnPath} -&gt; {@link ColumnChunkMetaData} for the row group
* @param rowPosition the row group's row offset in the parquet file
* @deprecated since 1.6.0, will be removed in 1.7.0; use setRowGroupInfo(PageReadStore,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @deprecated since 1.6.0, will be removed in 1.7.0; use setRowGroupInfo(PageReadStore,
* @deprecated since 1.8.0, will be removed in 1.9.0 or 2.0.0; use setRowGroupInfo(PageReadStore,

@wypoon
Copy link
Contributor Author

wypoon commented Nov 12, 2024

@Fokko thanks for reviewing!
I'd actually updated the deprecation comments locally but failed to commit the change before I pushed.

public void setRowGroupInfo(
PageReadStore pageStore, Map<ColumnPath, ColumnChunkMetaData> metaData) {
super.setRowGroupInfo(pageStore, metaData);
this.rowStartPosInBatch = pageStore.getRowIndexOffset().orElse(0L);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if pageStore.getRowIndexOffset() is empty, does it mean getRowIndexOffset() returns a negative value? Shall we throw Exception instead of default it to 0?

Copy link
Contributor Author

@wypoon wypoon Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good question.
As I understand it, the PageReadStore implementation (ColumnChunkPageReadStore) is normally constructed with the rowIndexOffset, but if the offset is not available then it is constructed with -1 for the rowIndexOffset. PageReadStore::getRowIndexOffset() will not return a negative value; it will return Optional.empty() in that case.
I suppose we can throw an IllegalArgumentException instead in such a situation, instead of setting rowStartPosInBatch to 0.
@flyrain do you have an opinion on this?
Is there someone who knows Parquet well who can confirm that in normal operation, PageReadStore::getRowIndexOffset() should not return Optional.empty()?

@wypoon
Copy link
Contributor Author

wypoon commented Nov 14, 2024

@huaxingao @Fokko I have updated the PR; please review again.

.getRowIndexOffset()
.orElseThrow(
() ->
new IllegalArgumentException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is there a better Exception than IllegalArgumentException? Is IllegalStateException a bit better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why you might consider IllegalStateException. I do think IllegalArgumentException is appropriate, because the PageReadStore is an argument to the method being called, and the problem is with the PageReadStore. I think IllegalStateException is typically used to indicate an internal inconsistency in the module.
Consider if we use Guava's Preconditions to check a condition here. The condition would be that source.getRowIndexOffset().isPresent(). The checkArgument methods throw IllegalArgumentException and "Ensures the truth of an expression involving one or more parameters to the calling method." The checkState methods throw IllegalStateException and "Ensures the truth of an expression involving the state of the calling instance, but not involving any parameters to the calling method." (my emphasis)
Of course, that just expresses the opinions of the authors of Guava. There are others who might argue that IllegalStateException is appropriate here, or neither IllegalStateException nor IllegalArgumentException. It really doesn't matter too much. I can just throw RuntimeException if you do not agree with IllegalArgumentException.

Copy link
Contributor

@huaxingao huaxingao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wypoon
Copy link
Contributor Author

wypoon commented Nov 18, 2024

@Fokko can you help merge this if you have no further feedback?

Copy link
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @wypoon for picking it up!

@flyrain
Copy link
Contributor

flyrain commented Nov 19, 2024

Hi @Fokko, do you have any further feedback?

@huaxingao
Copy link
Contributor

@Fokko @flyrain Should we merge this PR? I am waiting for it to be merged so I can clean up my temporary code

@flyrain
Copy link
Contributor

flyrain commented Nov 20, 2024

I will merge it if there is no new comment by EOD.

Copy link
Collaborator

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very small comment, lgtm otherwise

@@ -28,5 +28,14 @@ public interface ParquetValueReader<T> {

List<TripleIterator<?>> columns();

/**
* @deprecated since 1.8.0, will be removed in 1.9.0; use setPageSource(PageReadStore) instead.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: typically we use a link?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@wypoon wypoon force-pushed the parquet_use_getRowIndexOffset branch from 0a7f821 to 661b264 Compare November 21, 2024 00:25
@flyrain flyrain merged commit 93a0633 into apache:main Nov 21, 2024
49 checks passed
@flyrain
Copy link
Contributor

flyrain commented Nov 21, 2024

Thanks @wypoon for working on it. Thanks @huaxingao @Fokko @szehon-ho for the review.

nastra pushed a commit to nastra/iceberg that referenced this pull request Nov 21, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
snazy added a commit to snazy/nessie that referenced this pull request Jan 10, 2025
There are two Iceberg PRs that "broke" NesQuEIT:

* apache/iceberg#11478 caused `testRewriteManifests` to fail due to the changed outcome of the `rewrite_manifests` procedure
* apache/iceberg#11520 caused a class-path issue w/ Scala 2.13
snazy added a commit to snazy/query-engine-integration-tests that referenced this pull request Jan 10, 2025
Workaround for apache/iceberg#11520 that caused a class-path issue w/ Scala 2.13
snazy added a commit to projectnessie/nessie that referenced this pull request Jan 10, 2025
There are two Iceberg PRs that "broke" NesQuEIT:

* apache/iceberg#11478 caused `testRewriteManifests` to fail due to the changed outcome of the `rewrite_manifests` procedure
* apache/iceberg#11520 caused a class-path issue w/ Scala 2.13
snazy added a commit to projectnessie/query-engine-integration-tests that referenced this pull request Jan 13, 2025
Workaround for apache/iceberg#11520 that caused a class-path issue w/ Scala 2.13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants