set sum of uncompressed column size as row group size for parquet files #3531
Conversation
parquet/src/file/writer.rs (Outdated)

```diff
 let row_group_metadata = RowGroupMetaData::builder(self.descr.clone())
     .set_column_metadata(column_chunks)
-    .set_total_byte_size(self.total_bytes_written as i64)
+    .set_total_byte_size(total_bytes_size)
```
I am not sure this is the right way. One alternative is to accumulate the uncompressed size into `self.total_bytes_written` in the `on_close` closure, i.e. `*total_bytes_written += r.metadata.uncompressed_size();`.
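The alternative above amounts to keeping a running total of each closed column's uncompressed size alongside the compressed bytes actually written. A minimal self-contained sketch of that accumulation pattern (the `ColumnCloseResult` and `RowGroupWriter` types here are simplified stand-ins for the real parquet crate types):

```rust
// Simplified stand-ins; the real ColumnChunkMetaData exposes both
// compressed_size() and uncompressed_size() accessors.
struct ColumnCloseResult {
    compressed_size: i64,
    uncompressed_size: i64,
}

struct RowGroupWriter {
    total_bytes_written: u64,      // compressed bytes actually written to the file
    total_uncompressed_bytes: i64, // what RowGroup.total_byte_size should report
}

impl RowGroupWriter {
    // Called once per column as it is closed, mirroring the on_close
    // closure discussed above.
    fn on_close(&mut self, r: &ColumnCloseResult) {
        self.total_bytes_written += r.compressed_size as u64;
        // Accumulate the uncompressed size separately so total_byte_size
        // reflects the spec's definition rather than the compressed size.
        self.total_uncompressed_bytes += r.uncompressed_size;
    }
}

fn main() {
    let mut w = RowGroupWriter { total_bytes_written: 0, total_uncompressed_bytes: 0 };
    w.on_close(&ColumnCloseResult { compressed_size: 100, uncompressed_size: 400 });
    w.on_close(&ColumnCloseResult { compressed_size: 50, uncompressed_size: 200 });
    println!("{} {}", w.total_bytes_written, w.total_uncompressed_bytes);
}
```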
Would it be possible to get a test of this, mainly so we don't accidentally break it in future? Perhaps as part of the layout integration test? I'd be happy to bash something out if you prefer.
I have added a test, but it seems to pass on both the old and new versions; it would be great if you could help here. I have also changed the implementation to track only the written columns (as the total written bytes does), so if any columns are skipped we still write the correct size.
Compression isn't enabled by default; perhaps you need to enable it for that test? Otherwise I can take a look tomorrow.
I have added compression tests to the boolean and int writer tests. These tests fail on current master.
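The reason compression must be enabled for the test to catch the bug: without compression the compressed and uncompressed sizes coincide, so setting `total_byte_size` to either value produces the same result. A toy model illustrating this (the run-length "compressor" here is a hypothetical stand-in, not the crate's actual codecs):

```rust
// Returns (compressed_size, uncompressed_size) for a byte slice.
// With compress = false the two are identical, so a test comparing
// total_byte_size against the uncompressed sum cannot fail.
fn sizes(data: &[u8], compress: bool) -> (usize, usize) {
    let uncompressed = data.len();
    let compressed = if compress {
        // Stand-in "compression": trivial run-length encoding,
        // emitting one (count, value) pair per run.
        let mut out = 0usize;
        let mut i = 0;
        while i < data.len() {
            let mut j = i + 1;
            while j < data.len() && data[j] == data[i] {
                j += 1;
            }
            out += 2;
            i = j;
        }
        out
    } else {
        uncompressed
    };
    (compressed, uncompressed)
}

fn main() {
    let data = vec![1u8; 1000];
    let (c_off, u_off) = sizes(&data, false);
    let (c_on, u_on) = sizes(&data, true);
    // Uncompressed: sizes agree, so the old (buggy) code looks correct.
    // Compressed: the old code would under-report total_byte_size.
    println!("{} {} {} {}", c_off, u_off, c_on, u_on);
}
```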
LGTM, thank you. It looks like some unrelated files made it into this PR; could you possibly remove them and I can then get this one in 😄
Edit: I took the liberty of pushing a commit to remove them
Benchmark runs are scheduled for baseline = ec18146 and contender = c906fbf. c906fbf is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #3530
Rationale for this change
Fixes the row group `total_byte_size` in the Parquet metadata. It is currently set to the compressed size, but per the spec (https://github.com/apache/parquet-format/blob/613a1cf4475c662457a0fd81a894ce4709799e3b/src/main/thrift/parquet.thrift#L814) it should be the uncompressed size.
What changes are included in this PR?
Update the Parquet row group `total_byte_size` to the sum of the uncompressed column sizes.
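In essence, the change computes `total_byte_size` by summing the per-column uncompressed sizes from the column metadata, rather than the compressed bytes written. A sketch under simplified types (the real accessor is `ColumnChunkMetaData::uncompressed_size()`; `ColumnMeta` here is a hypothetical stand-in):

```rust
// Hypothetical simplified column metadata.
#[allow(dead_code)]
struct ColumnMeta {
    compressed: i64,
    uncompressed: i64,
}

// Per the parquet-format spec, RowGroup.total_byte_size is the
// "total byte size of all the uncompressed column data in this row group".
fn total_byte_size(columns: &[ColumnMeta]) -> i64 {
    columns.iter().map(|c| c.uncompressed).sum()
}

fn main() {
    let cols = vec![
        ColumnMeta { compressed: 10, uncompressed: 40 },
        ColumnMeta { compressed: 5, uncompressed: 20 },
    ];
    println!("{}", total_byte_size(&cols));
}
```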
Are there any user-facing changes?
The row group `total_byte_size` of written Parquet files will change to correctly reflect the sum of the uncompressed column sizes.