Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuzz test different parquet encodings #1156

Merged
merged 1 commit into from
Jan 12, 2022

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Relates to #1053

Rationale for this change

Improved test coverage

What changes are included in this PR?

#1110 extended the fuzz tests to support different array types, nulls and multiple page row groups. This builds on this to also test different encodings

Are there any user-facing changes?

No, this PR only adds more tests

let encodings = &[
Encoding::PLAIN,
Encoding::RLE_DICTIONARY,
//Encoding::DELTA_LENGTH_BYTE_ARRAY,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to not be supported at the moment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/tustvold/arrow-rs/blob/fuzz-different-encoding/parquet/src/arrow/arrow_array_reader.rs#L543

Looks like that came in with the original array reader in #384 from @yordan-pavlov

Do you think it would be a good exercise to support DELTA_LENGTH_BYTE_ARRAY in the reader? If so, I can file a ticket and see if anyone else is interested in picking it up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1082 will add support for it

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 11, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

ConvertedType::NONE,
None,
&converter,
&[Encoding::PLAIN, Encoding::RLE_DICTIONARY],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason not to include DELTA_BINARY_PACKED here that that encoding is not supported for FixedLengthByteArrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was at least my interpretation of https://github.com/apache/parquet-format/blob/master/Encodings.md

FIXED_LEN_BYTE_ARRAY only seems support PLAIN and dictionary encoding

let encodings = &[
Encoding::PLAIN,
Encoding::RLE_DICTIONARY,
//Encoding::DELTA_LENGTH_BYTE_ARRAY,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/tustvold/arrow-rs/blob/fuzz-different-encoding/parquet/src/arrow/arrow_array_reader.rs#L543

Looks like that came in with the original array reader in #384 from @yordan-pavlov

Do you think it would be a good exercise to support DELTA_LENGTH_BYTE_ARRAY in the reader? If so, I can file a ticket and see if anyone else is interested in picking it up

max_data_page_size: 1024 * 1024,
max_dict_page_size: 1024 * 1024,
writer_version: WriterVersion::PARQUET_1_0,
..Default::default()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked that these are the same values as in impl Default for TestOptions 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants