Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet Reader/writer for fixed-size list arrays #4267

Conversation

sydduckworth
Copy link

Which issue does this PR close?

Closes #4214

What changes are included in this PR?

  • Parquet writer support for fixed-size lists
  • Parquet reader support for fixed-size lists

Reader and writer both include unit tests.
I also sanity-checked the reader and writer against pyarrow (although I wasn't able to verify writing lists with nulls against pyarrow due to this outstanding issue).

Are there any user-facing changes?

None other than the added type support

Dexter Duckworth added 7 commits May 22, 2023 14:07
The implementation still needs tests.
The implementation uses a new `write_fixed_size_list` method instead of `write_list`.
This is done to avoid the overhead of needlessly calculating list offsets.
Fixed bugs in implementation found via tests.
Fixed bugs in implementation found via tests.
Writer now emits the correct definition levels for empty lists.
Added empty list unit test.
Reader now handles empty list definition levels correctly.
Added empty list unit test.
@github-actions github-actions bot added the parquet Changes to the parquet crate label May 23, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this, I've left some comments. I think the level encoding is currently not quite correct, as it doesn't correctly delimit the row starts.

I think the test coverage of this is pretty good, but I think the following would highlight where the current implementation is not quite right (or show that it is and I am mistaken 😄 ):

  • A test reading a RecordBatch containing both a FixedSizeList column and a primitive column, this will test the record delimiting logic works correctly - as written I think it will error with length mismatch as it won't read enough data from the FixedSizeList child
  • A test reading a parquet file with skip_arrow_metadata, so it will infer the list as a regular List - this can then be used to verify that the data was encoded correctly - and would read correctly with a reader that doesn't understand FixedSizeList

child_data_builder.extend_nulls(self.fixed_size);

if let Some(validity) = validity.as_mut() {
// Valid if empty list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an empty list should be an error?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was unclear to me whether the Arrow spec allows a fixed-size list of length zero. It seemed like it's supported, since pyarrow will read and write zero length lists and a zero length fixed-size list encodes to the same def/rep levels as an empty dynamically sized list.

However, I'd be happy to change the implementation to emit an error if that behavior isn't desired!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah this is a good point, a zero-sized fixed size list is certainly a peculiar construction, but not technically invalid I don't think

// Mark the start of each list in the child array
for i in 0..num_items {
let idx = start + i * fixed_size;
rep_levels[idx] = ctx.rep_level - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be correct, only the first element should be ctx.rep_level - 1, the rest should be ctx.rep_level

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but my understanding is that the child elements' write methods should handle writing the leaf repetition levels So the fixed-size list writer should only need to set the repetition level to encode the start of each row.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah, I misread this the first time. I believe the logic isn't quite correct where you have a repeated child within the FixedSizeList.

In particular it is making the erroneous assumption that there will be fixed_size leaves.

I wonder if we could make write_list have the signature

fn write_list<I, O: OffsetSizeTrait>(
        &mut self,
        ranges: I,
        nulls: Option<&NullBuffer>,
        values: &dyn Array,
    ) where I: IntoIterator<Item=(O, O)>

And use this to allow sharing the same logic 🤔

let mut start_idx = None;
for idx in range.clone() {
if nulls.is_valid(idx) {
// Start a run of valid rows if not already inside of one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think coercing the rows like this is correct, as the repetition of each row must be different

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the repetition level inside each row be handled by the child elements? My implementation seems to match the same logic used by write_struct, but maybe I'm missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are quite correct, it has been a while since I've had to touch this code 😄


let expected_level = LevelInfo {
def_levels: Some(vec![0, 0, 3, 3]),
rep_levels: Some(vec![0, 0, 0, 1]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rep_levels: Some(vec![0, 0, 0, 1]),
rep_levels: Some(vec![0, 1, 0, 1]),

As written it is not correct

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the original is correct since the levels are generated by builder.write(&a, 1..4).

So the levels should correspond to the 1st, 2nd, and 3rd rows:

  • def 0, rep 0 (row 1, null)
  • def 0, rep 0 (row 2, null)
  • def 3, rep 0 (row 3, child 0)
  • def 3, rep 1 (row 3, child 1)

Is that not right?

@sydduckworth
Copy link
Author

@tustvold thanks for the feedback!

I implemented most of your corrections, and I added the two tests you suggested.
I also updated the test that writes a list of structs so that the structs contain 2 fields instead of 1, to try to catch any potential errors in the def/rep level encoding.

Following those changes, I believe the rep/def level writer logic is correct as-is, but it's possible I'm missing something.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this, I think this is mostly good to go. I don't think it correctly encodes levels for FixedSizeList with repeated children, but I'm also happy for that to be fixed as a follow on PR if you would prefer. Stellar work on this 💪

// Mark the start of each list in the child array
for i in 0..num_items {
let idx = start + i * fixed_size;
rep_levels[idx] = ctx.rep_level - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah, I misread this the first time. I believe the logic isn't quite correct where you have a repeated child within the FixedSizeList.

In particular it is making the erroneous assumption that there will be fixed_size leaves.

I wonder if we could make write_list have the signature

fn write_list<I, O: OffsetSizeTrait>(
        &mut self,
        ranges: I,
        nulls: Option<&NullBuffer>,
        values: &dyn Array,
    ) where I: IntoIterator<Item=(O, O)>

And use this to allow sharing the same logic 🤔

let mut start_idx = None;
for idx in range.clone() {
if nulls.is_valid(idx) {
// Start a run of valid rows if not already inside of one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are quite correct, it has been a while since I've had to touch this code 😄

child_data_builder.extend_nulls(self.fixed_size);

if let Some(validity) = validity.as_mut() {
// Valid if empty list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah this is a good point, a zero-sized fixed size list is certainly a peculiar construction, but not technically invalid I don't think

Dexter Duckworth added 4 commits May 24, 2023 09:11
Converted the check to return an error instead of panicking.
Writer now correctly handles child arrays with variable row length.
Added new unit test to verify the new behavior is correct.
Test verifies that reader handles child arrays with variable length correctly.
@sydduckworth
Copy link
Author

@tustvold I've finally understood the encoding logic error you were pointing out 😅.
I added tests for both reader and writer to verify correct encoding/decoding of fixed-length lists containing variable-length lists, and the incorrect behavior on the write side should now be fixed.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you

@tustvold tustvold merged commit 741244d into apache:master May 24, 2023
alamb pushed a commit to alamb/arrow-rs that referenced this pull request May 30, 2023
* Initial implementation for writing fixed-size lists to Parquet.

The implementation still needs tests.
The implementation uses a new `write_fixed_size_list` method instead of `write_list`.
This is done to avoid the overhead of needlessly calculating list offsets.

* Initial implementation for reading fixed-size lists from Parquet.

The implementation still needs tests.

* Added tests for fixed-size list writer.

Fixed bugs in implementation found via tests.

* Added tests for fixed-size list reader.

Fixed bugs in implementation found via tests.

* Added correct behavior for writing empty fixed-length lists.

Writer now emits the correct definition levels for empty lists.
Added empty list unit test.

* Added correct behavior for reading empty fixed-length lists.

Reader now handles empty list definition levels correctly.
Added empty list unit test.

* Fixed linter warnings.

* Added license header to fixed_size_list_array.rs

* Added fixed-size list reader tests from PR review.

* Added fixed-size reader row length sanity checks.

* Simplified fixed-size list case in LevelInfoBuilder constructor.

* Removed dynamic dispatch inside fixed-length list writer.

* Expanded list of structs test for fixed-size list writer.

* Reverted expected levels in fixed-size list writer test.

* Fixed linter warnings.

* Updated list size check in fixed-size list reader.

Converted the check to return an error instead of panicking.

* Small tweak to row length check in fixed-size list reader.

* Fixed bug in fixed-size list level encoding.

Writer now correctly handles child arrays with variable row length.
Added new unit test to verify the new behavior is correct.

* Added fixed-size list reader test.

Test verifies that reader handles child arrays with variable length correctly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support writing FixedSizeList to Parquet
2 participants