GH-40592: [C++][Parquet] Implement SizeStatistics #40594

Open. wgtmac wants to merge 11 commits into main.

Member

wgtmac commented Mar 16, 2024

Rationale for this change

Parquet format 2.10.0 introduced SizeStatistics, and parquet-mr has already implemented it: apache/parquet-java#1177. Now it is time for parquet-cpp to pick up the ball.

What changes are included in this PR?

Implement reading and writing size statistics for parquet-cpp.
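
For readers unfamiliar with the feature: the Parquet format's SizeStatistics carries three optional fields, namely unencoded_byte_array_data_bytes, repetition_level_histogram, and definition_level_histogram. The sketch below shows roughly how a level histogram could be accumulated. `SizeStatsSketch` and `LevelHistogram` are hypothetical names chosen for illustration and are not the classes added by this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical mirror of the three SizeStatistics fields in the Parquet format.
// This is an illustration only, not the parquet-cpp type from this PR.
struct SizeStatsSketch {
  std::optional<int64_t> unencoded_byte_array_data_bytes;
  std::vector<int64_t> repetition_level_histogram;
  std::vector<int64_t> definition_level_histogram;
};

// Count how often each level in [0, max_level] occurs.
// Levels outside that range are ignored in this sketch.
std::vector<int64_t> LevelHistogram(const std::vector<int16_t>& levels,
                                    int16_t max_level) {
  std::vector<int64_t> histogram(static_cast<std::size_t>(max_level) + 1, 0);
  for (int16_t level : levels) {
    if (level >= 0 && level <= max_level) {
      histogram[static_cast<std::size_t>(level)] += 1;
    }
  }
  return histogram;
}
```

A histogram built this way has one slot per possible level, so a column with a maximum definition level of 2 yields three counters.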

Are these changes tested?

Yes, a bunch of test cases have been added.

Are there any user-facing changes?

Yes, now parquet users are able to read and write size statistics.

github-actions bot added the "awaiting changes" label and removed the "awaiting review" label on Mar 17, 2024
github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Mar 19, 2024
wgtmac marked this pull request as ready for review on April 5, 2024, 15:39
github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label on Apr 10, 2024
Contributor

emkornfield left a comment

A few high level questions/suggestions.

github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label on Aug 6, 2024
Contributor

emkornfield left a comment

LGTM. I'm OK with this as long as @pitrou is. Thank you for driving this.

cpp/src/parquet/column_page.h (outdated, resolved)
cpp/src/parquet/column_writer.cc (outdated, resolved)
cpp/src/parquet/size_statistics.cc (outdated, resolved)
cpp/src/parquet/size_statistics.cc (outdated, resolved)

Member Author

wgtmac commented Aug 7, 2024

@emkornfield @mapleFU Thanks for the feedback! I haven't addressed all comments from @pitrou yet. Will let you know once ready for review again.

Member Author

wgtmac commented Nov 12, 2024

@pitrou @emkornfield @mapleFU Gentle ping :)

  page_statistics_->Update(*referenced_dictionary, /*update_counts=*/false);
}
if (page_size_stats_builder_) {
  page_size_stats_builder_->AddValues(*referenced_dictionary);
Member

So we are dictionary-decoding the entire array just to run basic statistics? This seems incredibly wasteful.

Member Author

I dislike this approach as much as you do. It seems it has already been used for collecting page statistics for a long time. Do you think it is better to fix it in a separate PR, or to do it in this one?

Member

Either would be fine with me, so whatever is more practical to you :-)

Member Author

I found the history for the current approach: https://issues.apache.org/jira/browse/ARROW-12513. It pays that cost in order to collect precise min/max stats from a dictionary-encoded Arrow array. The same reasoning applies to size stats.
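
To make the trade-off in this thread concrete: precise min/max statistics for a dictionary-encoded array depend only on the dictionary entries that the indices actually reference, so one option is to mark the referenced entries and scan just those rather than materializing every row. The following is a minimal sketch under that assumption. `DictArraySketch` and `ReferencedMinMax` are hypothetical names, negative indices standing for nulls is an assumption of this sketch, and none of this is the code in column_writer.cc.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded column chunk.
struct DictArraySketch {
  std::vector<int64_t> dictionary;  // unique values
  std::vector<int32_t> indices;     // one per row; negative means null (assumption)
};

// Returns (min, max) over the dictionary entries referenced by at least one
// row, or std::nullopt when no entry is referenced.
std::optional<std::pair<int64_t, int64_t>> ReferencedMinMax(
    const DictArraySketch& array) {
  // Pass 1: flag each dictionary slot that some row points to.
  std::vector<bool> referenced(array.dictionary.size(), false);
  for (int32_t index : array.indices) {
    // Nulls and out-of-range indices are skipped in this sketch.
    if (index >= 0 &&
        static_cast<std::size_t>(index) < array.dictionary.size()) {
      referenced[static_cast<std::size_t>(index)] = true;
    }
  }
  // Pass 2: scan only the flagged dictionary entries.
  std::optional<std::pair<int64_t, int64_t>> result;
  for (std::size_t i = 0; i < array.dictionary.size(); ++i) {
    if (!referenced[i]) continue;
    const int64_t value = array.dictionary[i];
    if (!result) {
      result = std::make_pair(value, value);
    } else {
      result->first = std::min(result->first, value);
      result->second = std::max(result->second, value);
    }
  }
  return result;
}
```

A per-row quantity such as an unencoded byte count would need a reference count per dictionary entry rather than a flag, since each entry contributes once for every row that points to it.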

@emkornfield may provide more context.

Member

pitrou left a comment

I think this is generally ill-designed and would deserve a rethink to avoid glaring inefficiencies. -1 from me on this PR.

(see the comments I posted above for more focussed complaints)

Member Author

wgtmac commented Nov 17, 2024

I have adopted your suggestion to make the encoders responsible for counting unencoded data bytes, and I have greatly simplified the SizeStatistics interface. Please take another look. Thanks! @pitrou
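
As an illustration of what "encoders count unencoded data bytes" can look like, here is a minimal sketch of a PLAIN-style BYTE_ARRAY encoder that keeps a running total as values are appended. In the Parquet format, unencoded_byte_array_data_bytes excludes the bytes used to store each value's length, so only the payload size is added. The class and method names (`PlainByteArrayEncoderSketch`, `TakeUnencodedDataBytes`) are hypothetical and are not claimed to match the interface in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical PLAIN-style BYTE_ARRAY encoder that tracks unencoded payload
// bytes while encoding. An illustration only, not the parquet-cpp encoder.
class PlainByteArrayEncoderSketch {
 public:
  void Put(std::string_view value) {
    const uint32_t length = static_cast<uint32_t>(value.size());
    // PLAIN encoding: 4-byte little-endian length prefix, then the raw bytes.
    for (int shift = 0; shift < 32; shift += 8) {
      buffer_.push_back(static_cast<uint8_t>((length >> shift) & 0xFF));
    }
    buffer_.insert(buffer_.end(), value.begin(), value.end());
    // Count the payload only; the length prefix is excluded.
    unencoded_data_bytes_ += static_cast<int64_t>(value.size());
  }

  // Return the running total and reset it, e.g. once per page.
  int64_t TakeUnencodedDataBytes() {
    const int64_t total = unencoded_data_bytes_;
    unencoded_data_bytes_ = 0;
    return total;
  }

  std::size_t encoded_size() const { return buffer_.size(); }

 private:
  std::vector<uint8_t> buffer_;
  int64_t unencoded_data_bytes_ = 0;
};
```

Because the counter is updated in the same loop that already touches every value, the writer does not need a separate pass over the data to obtain the size statistic.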
