-
Notifications
You must be signed in to change notification settings - Fork 875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet Reader/writer for fixed-size list arrays #4267
Parquet Reader/writer for fixed-size list arrays #4267
Conversation
The implementation still needs tests. The implementation uses a new `write_fixed_size_list` method instead of `write_list`. This is done to avoid the overhead of needlessly calculating list offsets.
The implementation still needs tests.
Fixed bugs in implementation found via tests.
Fixed bugs in implementation found via tests.
Writer now emits the correct definition levels for empty lists. Added empty list unit test.
Reader now handles empty list definition levels correctly. Added empty list unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this, I've left some comments. I think the level encoding is currently not quite correct, as it doesn't correctly delimit the row starts.
I think the test coverage of this is pretty good, but I think the following would highlight where the current implementation is not quite right (or show that it is and I am mistaken 😄 ):
- A test reading a RecordBatch containing both a FixedSizeList column and a primitive column, this will test the record delimiting logic works correctly - as written I think it will error with length mismatch as it won't read enough data from the FixedSizeList child
- A test reading a parquet file with skip_arrow_metadata, so it will infer the list as a regular List - this can then be used to verify that the data was encoded correctly - and would read correctly with a reader that doesn't understand FixedSizeList
child_data_builder.extend_nulls(self.fixed_size); | ||
|
||
if let Some(validity) = validity.as_mut() { | ||
// Valid if empty list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an empty list should be an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was unclear to me whether the Arrow spec allows a fixed-size list of length zero. It seemed like it's supported, since pyarrow will read and write zero length lists and a zero length fixed-size list encodes to the same def/rep levels as an empty dynamically sized list.
However, I'd be happy to change the implementation to emit an error if that behavior isn't desired!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah this is a good point, a zero-sized fixed size list is certainly a peculiar construction, but not technically invalid I don't think
// Mark the start of each list in the child array | ||
for i in 0..num_items { | ||
let idx = start + i * fixed_size; | ||
rep_levels[idx] = ctx.rep_level - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to be correct, only the first element should be ctx.rep_level - 1
, the rest should be ctx.rep_level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but my understanding is that the child elements' write methods should handle writing the leaf repetition levels So the fixed-size list writer should only need to set the repetition level to encode the start of each row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah, I misread this the first time. I believe the logic isn't quite correct where you have a repeated child within the FixedSizeList.
In particular it is making the erroneous assumption that there will be fixed_size
leaves.
I wonder if we could make write_list
have the signature
fn write_list<I, O: OffsetSizeTrait>(
&mut self,
ranges: I,
nulls: Option<&NullBuffer>,
values: &dyn Array,
) where I: IntoIterator<Item=(O, O)>
And use this to allow sharing the same logic 🤔
let mut start_idx = None; | ||
for idx in range.clone() { | ||
if nulls.is_valid(idx) { | ||
// Start a run of valid rows if not already inside of one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think coercing the rows like this is correct, as the repetition of each row must be different
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the repetition level inside each row be handled by the child elements? My implementation seems to match the same logic used by write_struct
, but maybe I'm missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are quite correct, it has been a while since I've had to touch this code 😄
|
||
let expected_level = LevelInfo { | ||
def_levels: Some(vec![0, 0, 3, 3]), | ||
rep_levels: Some(vec![0, 0, 0, 1]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rep_levels: Some(vec![0, 0, 0, 1]), | |
rep_levels: Some(vec![0, 1, 0, 1]), |
As written it is not correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that the original is correct since the levels are generated by builder.write(&a, 1..4)
.
So the levels should correspond to the 1st, 2nd, and 3rd rows:
- def
0
, rep0
(row 1, null) - def
0
, rep0
(row 2, null) - def
3
, rep0
(row 3, child 0) - def
3
, rep1
(row 3, child 1)
Is that not right?
@tustvold thanks for the feedback! I implemented most of your corrections, and I added the two tests you suggested. Following those changes, I believe the rep/def level writer logic is correct as-is, but it's possible I'm missing something. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this, I think this is mostly good to go. I don't think it correctly encodes levels for FixedSizeList with repeated children, but I'm also happy for that to be fixed as a follow on PR if you would prefer. Stellar work on this 💪
// Mark the start of each list in the child array | ||
for i in 0..num_items { | ||
let idx = start + i * fixed_size; | ||
rep_levels[idx] = ctx.rep_level - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah, I misread this the first time. I believe the logic isn't quite correct where you have a repeated child within the FixedSizeList.
In particular it is making the erroneous assumption that there will be fixed_size
leaves.
I wonder if we could make write_list
have the signature
fn write_list<I, O: OffsetSizeTrait>(
&mut self,
ranges: I,
nulls: Option<&NullBuffer>,
values: &dyn Array,
) where I: IntoIterator<Item=(O, O)>
And use this to allow sharing the same logic 🤔
let mut start_idx = None; | ||
for idx in range.clone() { | ||
if nulls.is_valid(idx) { | ||
// Start a run of valid rows if not already inside of one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are quite correct, it has been a while since I've had to touch this code 😄
child_data_builder.extend_nulls(self.fixed_size); | ||
|
||
if let Some(validity) = validity.as_mut() { | ||
// Valid if empty list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah this is a good point, a zero-sized fixed size list is certainly a peculiar construction, but not technically invalid I don't think
Converted the check to return an error instead of panicking.
Writer now correctly handles child arrays with variable row length. Added new unit test to verify the new behavior is correct.
Test verifies that reader handles child arrays with variable length correctly.
@tustvold I've finally understood the encoding logic error you were pointing out 😅. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thank you
* Initial implementation for writing fixed-size lists to Parquet. The implementation still needs tests. The implementation uses a new `write_fixed_size_list` method instead of `write_list`. This is done to avoid the overhead of needlessly calculating list offsets. * Initial implementation for reading fixed-size lists from Parquet. The implementation still needs tests. * Added tests for fixed-size list writer. Fixed bugs in implementation found via tests. * Added tests for fixed-size list reader. Fixed bugs in implementation found via tests. * Added correct behavior for writing empty fixed-length lists. Writer now emits the correct definition levels for empty lists. Added empty list unit test. * Added correct behavior for reading empty fixed-length lists. Reader now handles empty list definition levels correctly. Added empty list unit test. * Fixed linter warnings. * Added license header to fixed_size_list_array.rs * Added fixed-size list reader tests from PR review. * Added fixed-size reader row length sanity checks. * Simplified fixed-size list case in LevelInfoBuilder constructor. * Removed dynamic dispatch inside fixed-length list writer. * Expanded list of structs test for fixed-size list writer. * Reverted expected levels in fixed-size list writer test. * Fixed linter warnings. * Updated list size check in fixed-size list reader. Converted the check to return an error instead of panicking. * Small tweak to row length check in fixed-size list reader. * Fixed bug in fixed-size list level encoding. Writer now correctly handles child arrays with variable row length. Added new unit test to verify the new behavior is correct. * Added fixed-size list reader test. Test verifies that reader handles child arrays with variable length correctly.
Which issue does this PR close?
Closes #4214
What changes are included in this PR?
Reader and writer both include unit tests.
I also sanity-checked the reader and writer against
pyarrow
(although I wasn't able to verify writing lists with nulls against pyarrow due to this outstanding issue).Are there any user-facing changes?
None other than the added type support