Extends parquet fuzz tests to also test nulls, dictionaries and row groups with multiple pages (#1053) #1110
Conversation
/// Total number of batches to attempt to read.
/// `record_batch_size` * `num_iterations` should be greater
/// than `num_rows` to ensure the data can be read back completely
num_iterations: usize,
This didn't seem to serve a purpose, as it was always set such that all the data would be read, so I removed it.
I agree that it is redundant when `record_batch_size` is provided (which means the data is not all read in one big chunk, but is read in `record_batch_size` chunks).
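A minimal sketch of the alternative discussed above, assuming the reader behaves like any iterator of `RecordBatch` results (for example a `ParquetRecordBatchReader` created with a given `record_batch_size`); `read_all` and `expected_rows` are hypothetical names, not code from this PR. Instead of sizing a fixed `num_iterations`, the loop simply drains the reader and asserts on the total row count:

```rust
use arrow::record_batch::RecordBatch;

// Hypothetical helper: read every batch the reader yields and check that the
// total number of rows matches what was written, with no fixed iteration count.
fn read_all<R, E>(reader: R, expected_rows: usize)
where
    R: Iterator<Item = Result<RecordBatch, E>>,
    E: std::fmt::Debug,
{
    let mut total_rows = 0;
    for batch in reader {
        let batch = batch.expect("error reading batch");
        total_rows += batch.num_rows();
    }
    assert_eq!(total_rows, expected_rows);
}
```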
Codecov Report
@@            Coverage Diff             @@
##           master    #1110      +/-   ##
==========================================
+ Coverage   82.55%   82.56%   +0.01%
==========================================
  Files         169      169
  Lines       50456    50535      +79
==========================================
+ Hits        41655    41726      +71
- Misses       8801     8809       +8
Continue to review full report at Codecov.
Thanks to @yordan-pavlov's work on #1130 this now passes on master 🎉
LGTM
Nice work @tustvold
Which issue does this PR close?
Closes #1053.
Rationale for this change
See ticket
What changes are included in this PR?
This extends the parquet fuzz tests to also test nulls, dictionaries, and row groups with multiple pages.
Currently this runs into what appears to be a bug in the null handling for ArrowArrayReader. This is likely the same as in apache/datafusion#1441 - I have temporarily switched back to ComplexObjectArrayReader to get the test to pass, and will look into a fix prior to marking this ready for review. This has since been fixed by #1130.
Are there any user-facing changes?
No, this only adds tests
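For context, a hedged sketch of the kind of writer configuration such a fuzz test can use to force the layouts described above: tiny data pages so each row group spans multiple pages, small row groups, dictionary encoding, and optional values with nulls. The names `fuzz_writer_properties` and `opt_values` and the specific limits are assumptions for illustration, not the PR's code, and the `WriterProperties` builder method names may differ slightly between parquet crate versions.

```rust
use parquet::file::properties::WriterProperties;

// Hypothetical helper: writer settings that force several small data pages per
// row group, several row groups per file, and dictionary-encoded columns.
fn fuzz_writer_properties() -> WriterProperties {
    WriterProperties::builder()
        .set_data_pagesize_limit(256) // tiny pages => multiple pages per row group
        .set_write_batch_size(64) // flush values in small batches
        .set_max_row_group_size(512) // several row groups per file
        .set_dictionary_enabled(true) // exercise the dictionary-encoded read path
        .build()
}

// Hypothetical helper: optional values with roughly 50% nulls so the readers'
// definition-level (null) handling is exercised during the roundtrip.
fn opt_values(n: usize) -> Vec<Option<i32>> {
    (0..n)
        .map(|i| if i % 2 == 0 { Some(i as i32) } else { None })
        .collect()
}
```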