This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Reading parquet file with multiple row groups and nested nullable struct types panics #1249

Closed
shaeqahmed opened this issue Sep 12, 2022 · 3 comments · Fixed by #1390
Labels
bug Something isn't working

Comments

@shaeqahmed
Contributor

shaeqahmed commented Sep 12, 2022

File: 6570499c-3be5-4de5-beb4-73b11c15ea39.parquet.zip

The file was generated after applying the following fix, which was needed to translate a deeply nested Avro file to a corresponding Parquet file using parquet2/arrow2:

#1248

Note: This file can be read with the official parquet-mr Java reader, but it breaks the pyarrow reader as well as arrow2.

error:


In [1]: import polars as pl

In [2]: pl.read_parquet("~/6570499c-3be5-4de5-beb4-73b11c15ea39.parquet")
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("The children must have an equal number of values.\n                         However, the values at index 2 have a length of 0, which is different from values at index 0, 9.")', /Users/runner/.cargo/git/checkouts/arrow2-8a2ad61d97265680/0b345ae/src/array/struct_/mod.rs:120:52
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("The children must have an equal number of values.\n                         However, the values at index 9 have a length of 0, which is different from values at index 0, 9.")', /Users/runner/.cargo/git/checkouts/arrow2-8a2ad61d97265680/0b345ae/src/array/struct_/mod.rs:120:52
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("The children must have an equal number of values.\n                         However, the values at index 2 have a length of 9, which is different from values at index 0, 0.")', /Users/runner/.cargo/git/checkouts/arrow2-8a2ad61d97265680/0b345ae/src/array/struct_/mod.rs:120:52
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-2-bd229b09d832> in <module>
----> 1 pl.read_parquet("~/6570499c-3be5-4de5-beb4-73b11c15ea39.parquet")

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/polars/io.py in read_parquet(source, columns, n_rows, use_pyarrow, memory_map, storage_options, parallel, row_count_name, row_count_offset, low_memory, **kwargs)
    933             row_count_name=row_count_name,
    934             row_count_offset=row_count_offset,
--> 935             low_memory=low_memory,
    936         )
    937

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/polars/internals/dataframe/frame.py in _read_parquet(cls, file, columns, n_rows, parallel, row_count_name, row_count_offset, low_memory)
    685             parallel,
    686             _prepare_row_count_args(row_count_name, row_count_offset),
--> 687             low_memory=low_memory,
    688         )
    689         return self

PanicException: called `Result::unwrap()` on an `Err` value: OutOfSpec("The children must have an equal number of values.\n                         However, the values at index 2 have a length of 0, which is different from values at index 0, 9.")
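The panic message describes an invariant of struct arrays: every child array must hold the same number of values. The sketch below illustrates that check in plain Python. It is not arrow2's code; the `OutOfSpec` class and the `check_struct_children` function are hypothetical names that only mirror the wording of the error above.

```python
from typing import Sequence


class OutOfSpec(ValueError):
    """Hypothetical stand-in for the error variant named in the panic message."""


def check_struct_children(children: Sequence[Sequence[object]]) -> int:
    """Return the length shared by all children, or raise OutOfSpec on a mismatch."""
    if not children:
        return 0
    expected = len(children[0])
    for index, child in enumerate(children):
        if len(child) != expected:
            raise OutOfSpec(
                "The children must have an equal number of values. "
                f"However, the values at index {index} have a length of {len(child)}, "
                f"which is different from values at index 0, {expected}."
            )
    return expected
```

In the failing read above, one child decoded to 9 values while another decoded to 0, so a check of this kind rejects the struct before it can be constructed.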

@shaeqahmed
Contributor Author

From #1248:

With this change, I can now read my deeply nested nullable structs into arrow2, but it looks like the arrow2/parquet2 parquet writer is writing corrupt files.

I see that in the code there are some explicitly unimplemented branches such as reading nulls from parquet, and the docs do mention lack of proper support for deeply nested parquet types (related issue: #1222).

I think we should 1) better document what is and is not currently supported by the arrow2 parquet functionality, and 2) fix the above-mentioned bug so that we at least don't write incorrect files. I've already opened an issue to track this here: #1249, so let's continue the discussion there. I'd appreciate it if you could take a look. Thanks for the help.

@shaeqahmed
Contributor Author

A minimal reproducible example of arrow2/parquet2 writing corrupt files:

In [125]: import pyarrow as pa

In [126]: import polars as pl

In [127]: t = pa.Table.from_pylist([{"a": {"a":{"1":["2"]}}}, {"a": {"1"
     ...: :["1"]}}])

In [128]: df = pl.from_arrow(t)

In [129]: df.write_parquet("/tmp/temp.parquet")

In [130]: pl.read_parquet("/tmp/temp.parquet")

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("The children must have an equal number of values.\n                         However, the values at index 1 have a length of 2, which is different from values at index 0, 0.")', /Users/runner/.cargo/git/checkouts/arrow2-8a2ad61d97265680/0b345ae/src/array/struct_/mod.rs:120:52

In [130]: pl.read_parquet("/tmp/temp.parquet", use_pyarrow=True)

OSError: Definition levels exceeded upper bound: 0 

# Both pyarrow/arrow2 readers failed to read the parquet file written by arrow2 using parquet2
# This suggests the parquet2 writer is not handling the nested struct correctly, let's validate
# by using the pyarrow writer impl., and seeing if we can read with parquet2/pyarrow readers.

In [133]: df.write_parquet("/tmp/temp.parquet", use_pyarrow=True)

In [134]: pl.read_parquet("/tmp/temp.parquet")
Out[134]:
shape: (2, 1)
┌────────────────┐
│ a              │
│ ---            │
│ struct[2]      │
╞════════════════╡
│ {null,{["2"]}} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {["1"],{null}} │
└────────────────┘


In [135]: pl.read_parquet("/tmp/temp.parquet", use_pyarrow=True)
Out[135]:
shape: (2, 1)
┌────────────────┐
│ a              │
│ ---            │
│ struct[2]      │
╞════════════════╡
│ {null,{["2"]}} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {["1"],{null}} │
└────────────────┘

# Yep, these look fine. 

@jorgecarleitao
Owner

I believe that this is a bug in the writer. Specifically, we cannot make the call here: https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/write/primitive/nested.rs#L34 without identifying which elements represent null slots of a struct array.
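To illustrate why the struct's own null slots matter, here is a minimal Python sketch of Parquet definition levels for an assumed schema: an optional struct with a single optional leaf field `x`. It is not arrow2's implementation; `definition_levels` and `leaf_values` are hypothetical helpers, and rows are modelled as `None` (null struct) or a `dict`.

```python
from typing import List, Optional


def definition_levels(rows: List[Optional[dict]], field: str) -> List[int]:
    """Definition level per row for an optional leaf inside an optional struct.

    0: the struct itself is null
    1: the struct is present but the leaf is null
    2: the leaf value is present
    """
    levels = []
    for row in rows:
        if row is None:
            levels.append(0)
        elif row.get(field) is None:
            levels.append(1)
        else:
            levels.append(2)
    return levels


def leaf_values(rows: List[Optional[dict]], field: str) -> List[Optional[int]]:
    """The child column alone, with the struct's validity discarded."""
    return [None if row is None else row.get(field) for row in rows]


rows_a = [None, {"x": None}, {"x": 5}]
rows_b = [{"x": None}, None, {"x": 5}]

print(definition_levels(rows_a, "x"))  # [0, 1, 2]
print(definition_levels(rows_b, "x"))  # [1, 0, 2]
print(leaf_values(rows_a, "x") == leaf_values(rows_b, "x"))  # True
```

Both inputs produce the same child column but different definition levels, so a writer that looks only at the leaf's validity cannot tell a null struct slot from a null leaf.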


2 participants