Extends parquet fuzz tests to also test nulls, dictionaries and row groups with multiple pages (#1053) #1110
Conversation
/// Total number of batches to attempt to read.
/// `record_batch_size` * `num_iterations` should be greater
/// than `num_rows` to ensure the data can be read back completely
num_iterations: usize,
This didn't seem to serve a purpose, as it was always set such that all the data would be read, so I removed it.
I agree that it is redundant when `record_batch_size` is provided (which means the data is not all read in one big chunk, but is read in `record_batch_size` chunks).
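A minimal sketch of the alternative discussed above, assuming the reader behaves like any iterator of `RecordBatch` results (for example a `ParquetRecordBatchReader` created with a given `record_batch_size`); `read_all` and `expected_rows` are hypothetical names, not code from this PR. Instead of sizing a fixed `num_iterations`, the loop simply drains the reader and asserts on the total row count:

```rust
use arrow::record_batch::RecordBatch;

// Hypothetical helper: read every batch the reader yields and check that the
// total number of rows matches what was written, with no fixed iteration count.
fn read_all<R, E>(reader: R, expected_rows: usize)
where
    R: Iterator<Item = Result<RecordBatch, E>>,
    E: std::fmt::Debug,
{
    let mut total_rows = 0;
    for batch in reader {
        let batch = batch.expect("error reading batch");
        total_rows += batch.num_rows();
    }
    assert_eq!(total_rows, expected_rows);
}
```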
Codecov Report
@@            Coverage Diff             @@
##           master    #1110      +/-   ##
==========================================
+ Coverage   82.55%   82.56%   +0.01%
==========================================
  Files         169      169
  Lines       50456    50535      +79
==========================================
+ Hits        41655    41726      +71
- Misses       8801     8809       +8
Continue to review full report at Codecov.
Thanks to @yordan-pavlov's work on #1130 this now passes on master 🎉
LGTM
Nice work @tustvold
Which issue does this PR close?
Closes #1053.
Rationale for this change
See ticket
What changes are included in this PR?
This extends the parquet fuzz tests to also test nulls, dictionaries, and row groups with multiple pages.
Currently this runs into what appears to be a bug in the null handling for ArrowArrayReader. This is likely the same as in apache/datafusion#1441 - I have temporarily switched back to ComplexObjectArrayReader to get the test to pass, and will look into a fix prior to marking this ready for review. This has since been fixed by #1130.
Are there any user-facing changes?
No, this only adds tests
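For context, a hedged sketch of the kind of writer configuration such a fuzz test can use to force the layouts described above: tiny data pages so each row group spans multiple pages, small row groups, dictionary encoding, and optional values with nulls. The names `fuzz_writer_properties` and `opt_values` and the specific limits are assumptions for illustration, not the PR's code, and the `WriterProperties` builder method names may differ slightly between parquet crate versions.

```rust
use parquet::file::properties::WriterProperties;

// Hypothetical helper: writer settings that force several small data pages per
// row group, several row groups per file, and dictionary-encoded columns.
fn fuzz_writer_properties() -> WriterProperties {
    WriterProperties::builder()
        .set_data_pagesize_limit(256) // tiny pages => multiple pages per row group
        .set_write_batch_size(64) // flush values in small batches
        .set_max_row_group_size(512) // several row groups per file
        .set_dictionary_enabled(true) // exercise the dictionary-encoded read path
        .build()
}

// Hypothetical helper: optional values with roughly 50% nulls so the readers'
// definition-level (null) handling is exercised during the roundtrip.
fn opt_values(n: usize) -> Vec<Option<i32>> {
    (0..n)
        .map(|i| if i % 2 == 0 { Some(i as i32) } else { None })
        .collect()
}
```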