Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set sum of uncompressed column size as row group size for parquet files #3531

Merged
merged 4 commits into from
Jan 17, 2023

Conversation

sidred
Copy link
Contributor

@sidred sidred commented Jan 14, 2023

Which issue does this PR close?

Closes #3530

Rationale for this change

Fix the row group total_byte_size of the parquet format. This is currently set to the compressed size. As per the spec https://github.com/apache/parquet-format/blob/613a1cf4475c662457a0fd81a894ce4709799e3b/src/main/thrift/parquet.thrift#L814
this should be the uncompressed size

What changes are included in this PR?

update the parquet file row group to sum of uncompressed column sizes

Are there any user-facing changes?

The row group size of parquet file will change and correctly reflect the sum of the uncompressed column sizes

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 14, 2023
let row_group_metadata = RowGroupMetaData::builder(self.descr.clone())
.set_column_metadata(column_chunks)
.set_total_byte_size(self.total_bytes_written as i64)
.set_total_byte_size(total_bytes_size)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is the right way. One alternative is to set the self.total_bytes_written to *total_bytes_written += r.metadata.uncompressed_size(); in the on_close closure

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to get a test of this, mainly so we don't accidentally break it in future. Perhaps as part of the layout integration test? I'd be happy to bash something out if you prefer?

@sidred
Copy link
Contributor Author

sidred commented Jan 15, 2023

I have added a test but seems to pass both old and new versions. It will be great if you could help here.

Also i have changed the implementation to track the written columns only (as the total written bytes does). So if any columns are skipped, we still write the correct size.

@tustvold
Copy link
Contributor

It will be great if you could help here

Compression isn't enabled by default, perhaps you need to enable it for that test? Otherwise I can take a look tomorrow

@sidred
Copy link
Contributor Author

sidred commented Jan 17, 2023

I have added compression tests to boolean and int writer tests. These tests fail on current master.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thank you, it looks like some unrelated files made it into this PR, could you possibly remove them and I can then get this one in 😄

Edit: I took the liberty of pushing a commit to remove them

@tustvold tustvold merged commit c906fbf into apache:master Jan 17, 2023
@ursabot
Copy link

ursabot commented Jan 17, 2023

Benchmark runs are scheduled for baseline = ec18146 and contender = c906fbf. c906fbf is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect row group total_byte_size written to parquet file
3 participants