DeltaBitPackEncoder Pads Miniblock BitWidths With Arbitrary Values #1416

Closed
tustvold opened this issue Mar 10, 2022 · 0 comments · Fixed by #1418
Labels
bug parquet Changes to the parquet crate

Comments

@tustvold
Contributor

Describe the bug

The code at https://github.com/apache/arrow-rs/blob/master/parquet/src/encodings/encoding.rs#L577 skips over the miniblock bit widths, then goes back and writes a bit width only for the miniblocks that contain at least one value. The bit widths of the empty miniblocks are left as whatever bytes happen to be in the encoder's buffer.
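
To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is not the arrow-rs code: the function `write_bit_widths_skipping_empty`, the constant `MINIBLOCKS_PER_BLOCK` and its value of 4, and the idea of a reused `scratch` array are all assumptions made for illustration.

```rust
// Hypothetical number of miniblocks per block, chosen only for this sketch.
const MINIBLOCKS_PER_BLOCK: usize = 4;

/// Simplified model of the pattern: the bit-width slots are "reserved" by
/// skipping over them, and only the slots for non-empty miniblocks are
/// written afterwards. `scratch` stands in for the encoder's reused buffer.
fn write_bit_widths_skipping_empty(
    scratch: &mut [u8; MINIBLOCKS_PER_BLOCK],
    used_widths: &[u8],
) -> [u8; MINIBLOCKS_PER_BLOCK] {
    // `zip` stops at the shorter input, so slots beyond `used_widths`
    // keep whatever bytes were already in `scratch`.
    for (slot, &width) in scratch.iter_mut().zip(used_widths) {
        *slot = width;
    }
    *scratch
}
```

In this model, calling the function with a scratch array previously filled with `0xAA` and widths `[3, 5]` yields `[3, 5, 0xAA, 0xAA]`, while a zeroed scratch array yields `[3, 5, 0, 0]`: the same input produces different bytes depending on earlier buffer contents.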

To Reproduce

This is one of the underlying bugs behind apache/datafusion#1976.

Expected behavior

Whilst the specification technically allows arbitrary padding, it seems like a good idea to avoid non-deterministic output where possible.
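
One way to get deterministic padding, sketched here under stated assumptions, is to give every bit-width slot a fixed value before filling in the slots for non-empty miniblocks. The function `write_bit_widths_zero_padded`, the constant `MINIBLOCKS_PER_BLOCK`, and the choice of zero as the padding value are hypothetical and are not taken from the actual fix.

```rust
// Hypothetical number of miniblocks per block, chosen only for this sketch.
const MINIBLOCKS_PER_BLOCK: usize = 4;

/// Simplified model of deterministic padding: every slot is reset to a
/// fixed value (zero here, as an assumption) before the bit widths of
/// non-empty miniblocks are written.
fn write_bit_widths_zero_padded(
    scratch: &mut [u8; MINIBLOCKS_PER_BLOCK],
    used_widths: &[u8],
) -> [u8; MINIBLOCKS_PER_BLOCK] {
    // Clear stale bytes so empty miniblocks always get the same width.
    scratch.fill(0);
    for (slot, &width) in scratch.iter_mut().zip(used_widths) {
        *slot = width;
    }
    *scratch
}
```

With this approach the output depends only on `used_widths`, so encoding the same values twice gives byte-identical results regardless of what the buffer held before.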

@tustvold tustvold added the bug label Mar 10, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 10, 2022
Ignore non-zero padded bit widths in DeltaBitPackDecoder (apache#1417)
alamb pushed a commit that referenced this issue Mar 14, 2022
* Consistent DeltaBitPackEncoder bit width padding (#1416)

Ignore non-zero padded bit widths in DeltaBitPackDecoder (#1417)

* chore: review feedback

* Add test of DeltaBitPackDecoder padding

* Revert formatting
@alamb alamb added the parquet Changes to the parquet crate label Mar 17, 2022