GH-40592: [C++][Parquet] Implement SizeStatistics #40594
A few high level questions/suggestions.
LGTM. I'm OK with this as long as @pitrou is. Thank you for driving this.
@emkornfield @mapleFU Thanks for the feedback! I haven't addressed all comments from @pitrou yet. Will let you know once it is ready for review again.
(branch updated from 0449426 to a83ed41)
@pitrou @emkornfield @mapleFU Gentle ping :)
cpp/src/parquet/column_writer.cc
      page_statistics_->Update(*referenced_dictionary, /*update_counts=*/false);
    }
    if (page_size_stats_builder_) {
      page_size_stats_builder_->AddValues(*referenced_dictionary);
So we are dictionary-decoding the entire array just to run basic statistics? This seems incredibly wasteful.
I dislike this approach as much as you do. It seems it has already been used for collecting page statistics for a long time. Do you think it is better to fix it in a separate PR, or to do it all in this one?
Either would be fine with me, so whatever is more practical to you :-)
I found the history of the current approach: https://issues.apache.org/jira/browse/ARROW-12513. It pays to collect precise min/max stats from a dictionary-encoded Arrow array. The same reasoning applies to size stats.
@emkornfield may provide more context.
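For readers following the thread, the trade-off can be sketched without any Arrow types. Precise min/max over a dictionary-encoded column only needs the dictionary entries that the indices actually reference, so the column never has to be decoded value by value. This is a minimal sketch: `ReferencedMinMax` and the plain `std::vector` inputs are hypothetical stand-ins, not the Arrow or parquet-cpp API, and the indices are assumed to be valid and non-null.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Arrow): returns {min, max} over the
// dictionary entries that are referenced by at least one index, or
// std::nullopt when nothing is referenced.
// Assumption: every index is in [0, dictionary.size()).
std::optional<std::pair<std::string, std::string>> ReferencedMinMax(
    const std::vector<std::string>& dictionary,
    const std::vector<int32_t>& indices) {
  // Mark each dictionary slot that is used at least once.
  std::vector<bool> referenced(dictionary.size(), false);
  for (int32_t idx : indices) {
    referenced[static_cast<std::size_t>(idx)] = true;
  }

  // Scan the dictionary once, skipping unreferenced entries.
  std::optional<std::pair<std::string, std::string>> result;
  for (std::size_t i = 0; i < dictionary.size(); ++i) {
    if (!referenced[i]) continue;
    const std::string& value = dictionary[i];
    if (!result) {
      result.emplace(value, value);
      continue;
    }
    if (value < result->first) result->first = value;
    if (value > result->second) result->second = value;
  }
  return result;
}
```

With a dictionary of `{"zebra", "apple", "mango", "aardvark"}` and indices `{1, 2, 2, 1}`, only "apple" and "mango" contribute, so unreferenced entries do not widen the reported range. The cost is proportional to the number of indices plus the dictionary size, not to the decoded data size.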
I think this is generally ill-designed and would deserve a rethink to avoid glaring inefficiencies. -1 from me on this PR.
(see the comments I posted above for more focussed complaints)
I have adopted your suggestion to let the encoders take responsibility for counting unencoded data bytes, and greatly simplified the SizeStatistics interface. Please take another look. Thanks! @pitrou
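To make the revised design concrete, here is a minimal sketch of an encoder that tracks unencoded byte-array bytes as a side effect of encoding, so no second pass over the values is needed. The class `SketchByteArrayEncoder` and its method names are hypothetical and are not the parquet-cpp classes. The layout follows PLAIN encoding for BYTE_ARRAY (a 4-byte little-endian length followed by the bytes), and the counter excludes the length prefixes, matching how the format defines `unencoded_byte_array_data_bytes`.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical PLAIN-style BYTE_ARRAY encoder; an illustration only.
class SketchByteArrayEncoder {
 public:
  // Appends one value as <4-byte little-endian length><bytes> and
  // accumulates the raw value size while the value is already in hand.
  void Put(std::string_view value) {
    const uint32_t len = static_cast<uint32_t>(value.size());
    for (int shift = 0; shift < 32; shift += 8) {
      buffer_.push_back(static_cast<uint8_t>((len >> shift) & 0xFF));
    }
    buffer_.insert(buffer_.end(), value.begin(), value.end());
    // Length prefixes are deliberately not counted.
    unencoded_bytes_ += static_cast<int64_t>(value.size());
  }

  // Returns the bytes counted since the last call and resets the counter,
  // e.g. once per data page.
  int64_t TakeUnencodedDataBytes() {
    const int64_t bytes = unencoded_bytes_;
    unencoded_bytes_ = 0;
    return bytes;
  }

  const std::vector<uint8_t>& buffer() const { return buffer_; }

 private:
  std::vector<uint8_t> buffer_;
  int64_t unencoded_bytes_ = 0;
};
```

A page writer in this sketch would call `TakeUnencodedDataBytes()` when it closes a page and store the result in that page's size statistics. Because the count is taken at `Put` time, it works the same way regardless of how the values are later encoded.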
Rationale for this change
Parquet format 2.10.0 introduced SizeStatistics, and parquet-mr has already implemented it: apache/parquet-java#1177. Now it is time for parquet-cpp to pick up the ball.
What changes are included in this PR?
Implement reading and writing size statistics for parquet-cpp.
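As a rough picture of what gets read and written, the sketch below mirrors the three SizeStatistics fields from the Parquet format definition and shows how a level histogram is built: entry `i` holds the number of values whose level equals `i`, so a histogram has `max_level + 1` entries. The struct and function names (`SketchSizeStatistics`, `LevelHistogram`) are hypothetical and are not the API added by this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical mirror of the SizeStatistics fields in the Parquet format.
struct SketchSizeStatistics {
  // Only meaningful for BYTE_ARRAY columns, hence optional.
  std::optional<int64_t> unencoded_byte_array_data_bytes;
  std::vector<int64_t> repetition_level_histogram;
  std::vector<int64_t> definition_level_histogram;
};

// Counts how many times each level in [0, max_level] occurs.
// Assumption: every entry of `levels` is within [0, max_level].
std::vector<int64_t> LevelHistogram(const std::vector<int16_t>& levels,
                                    int16_t max_level) {
  std::vector<int64_t> histogram(static_cast<std::size_t>(max_level) + 1, 0);
  for (int16_t level : levels) {
    ++histogram[static_cast<std::size_t>(level)];
  }
  return histogram;
}
```

For example, definition levels `{0, 2, 2, 1, 2}` with a maximum level of 2 give the histogram `{1, 1, 3}`. A reader can use such counts to estimate how many values are null or nested without decoding the page.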
Are these changes tested?
Yes, a bunch of test cases have been added.
Are there any user-facing changes?
Yes, now parquet users are able to read and write size statistics.