This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Bug reading parquet file with struct nested in list #937

Closed
cjermain opened this issue Apr 12, 2022 · 5 comments · Fixed by #1140
Labels
bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@cjermain
Contributor

I discovered this bug while working with polars, but I have traced it down to the arrow2 code. It seems that a None value is not properly handled when a List-type Array has a Struct-type Array nested inside it. This is most easily reproduced by generating a parquet file as follows:

import pandas as pd

data = [
    {'a': [{'b': 0}]},
    {'a': [{'b': 1}]},
]

pd.DataFrame(data).to_parquet("~/test.parquet")
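
To make the nesting explicit, here is a minimal sketch that describes the shape of column `a` using Arrow-like type names. The `describe` helper is hypothetical and written only for illustration; it does not call pandas, pyarrow, or arrow2, and the exact type string those libraries report may differ (for example in how list items are named).

```python
# Hypothetical helper: describes the nesting of plain Python values using
# Arrow-like type names. It does not inspect the parquet file itself.
def describe(value):
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        fields = ", ".join(f"{name}: {describe(v)}" for name, v in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        # Assumption: lists are homogeneous, so only the first element is checked.
        inner = describe(value[0]) if value else "null"
        return f"list<{inner}>"
    raise TypeError(f"unsupported value: {value!r}")


data = [
    {'a': [{'b': 0}]},
    {'a': [{'b': 1}]},
]

# Every row of column 'a' has the same shape: a list whose items are structs.
column_types = {describe(row['a']) for row in data}
print(column_types)
```

Both rows share a single shape, a list whose items are structs with one integer field, which is the list-over-struct combination this report is about.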

The bug can be observed by setting a breakpoint on rust_panic and running the following command, which loads the parquet file through arrow2 (this needs the most recent polars release, 0.13.21).

$ rust-gdb --args python -c "import polars; polars.read_parquet('~/test.parquet')"

I'm running into the error in the arrow2::io::parquet::read::deserialize::create_list function on line 46.

thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /.../.cargo/git/checkouts/arrow2-945af624853845da/da703ae/src/io/parquet/read/deserialize/mod.rs:46:63
...
#8  0x00007ffff4c5cd0e in arrow2::io::parquet::read::deserialize::create_list (data_type=..., nested=0x7fffd05fa330, 
    values=...)
    at /.../.cargo/git/checkouts/arrow2-945af624853845da/da703ae/src/io/parquet/read/deserialize/mod.rs:46
46                  let (mut offsets, validity) = nested.nested.pop().unwrap().inner();
(gdb) l
41          nested: &mut NestedState,
42          values: Arc<dyn Array>,
43      ) -> Result<Arc<dyn Array>> {
44          Ok(match data_type {
45              DataType::List(_) => {
46                  let (mut offsets, validity) = nested.nested.pop().unwrap().inner();
47                  offsets.push(values.len() as i64);
48
49                  let offsets = offsets.iter().map(|x| *x as i32).collect::<Vec<_>>();
50                  Arc::new(ListArray::<i32>::new(

That is as far as I've had time to investigate so far. Outside of that issue I've seen nested structs working fine for the test cases I've looked at.

@jorgecarleitao, @ritchie46, what do you think?

@jorgecarleitao jorgecarleitao added the bug Something isn't working label Apr 15, 2022
@jorgecarleitao
Owner

I agree that it is a bug ^^

@cjermain
Contributor Author

This is happening because of a remaining TODO in struct_.rs.

NestedState::new(vec![]), // todo
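
One way to read this: if the nested state is created empty, the enclosing list has no (offsets, validity) entry left to pop, and the `unwrap()` in create_list panics. The sketch below is a simplified Python model of that failure mode, not arrow2's implementation. The class and function names only echo the identifiers in the backtrace, and the data layout (a stack of offsets/validity pairs) is an assumption made for illustration.

```python
# Simplified, hypothetical model of the failure mode. The names echo the
# identifiers in the backtrace, but this is not arrow2's code.
class NestedState:
    def __init__(self, nested):
        # Assumed layout: one (offsets, validity) entry per enclosing list level.
        self.nested = list(nested)


def create_list(nested, values):
    # Models `nested.nested.pop().unwrap()`: an empty stack has nothing to pop.
    if not nested.nested:
        raise RuntimeError("called `Option::unwrap()` on a `None` value")
    offsets, validity = nested.nested.pop()
    # Models `offsets.push(values.len())`: close the final list slot.
    offsets = offsets + [len(values)]
    return {"offsets": offsets, "validity": validity, "values": values}


values = [{"b": 0}, {"b": 1}]

# A state that still carries the list level: one start offset per row.
ok = create_list(NestedState([([0, 1], [True, True])]), values)

# A state created empty, analogous to `NestedState::new(vec![])`.
try:
    create_list(NestedState([]), values)
    failed = False
    message = ""
except RuntimeError as err:
    failed = True
    message = str(err)

print(ok["offsets"], failed, message)
```

In this model the first call succeeds because the list level is still on the stack, while the second fails in the same way as the reported panic.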

@scimas

scimas commented May 6, 2022

Possibly related,

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /path/to/arrow2-0.11.2/src/io/parquet/read/deserialize/mod.rs:272:44

The line in question, with context

columns_to_iter_recursive(
vec![columns.pop().unwrap()],
vec![types.pop().unwrap()],

@jorgecarleitao
Owner

I will prioritize this.

@cjermain
Contributor Author

cjermain commented Jul 6, 2022

Thanks @jorgecarleitao!

@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Jul 31, 2022