Writing dataframes containing List types in rust is not readable in Python #3312

Closed
mpetri opened this issue May 5, 2022 · 8 comments
Labels
bug Something isn't working

Comments

@mpetri
mpetri commented May 5, 2022

What language are you using?

Rust -> Python

Which feature gates did you use?

parquet

Have you tried latest version of polars?

yes

What version of polars are you using?

rust: 0.21.1
python: 0.13.29

What operating system are you using polars on?

Ubuntu LTS 2022

What language version are you using?

rust 1.60.0
Python 3.8.10

Describe your bug.

A dataframe containing List types that is written to Parquet in Rust cannot be read in Python.

What are the steps to reproduce the behavior?

Rust code to create a df:

use std::io::Write;

use polars::prelude::{DataFrame, NamedFrom, ParquetWriter, Series};

fn main() {
    // Serialize into an in-memory buffer first; it is written to disk at the end.
    let mut buf = Vec::new();
    let writer = ParquetWriter::new(&mut buf);

    // 100 strings: "0", "1", ..., "99".
    let first: Vec<String> = (0..100).map(|id| id.to_string()).collect();

    // 100 rows, each holding the full list of strings (this becomes the List column).
    let second: Vec<Series> = (0..100).map(|_| Series::new("d", first.clone())).collect();

    let s1 = Series::new("s1", first);
    let s2 = Series::new("s2", second);

    let mut df = DataFrame::new(vec![s1, s2]).unwrap();
    println!("{}", df.head(Some(3)));
    writer.finish(&mut df).unwrap();

    let mut file = std::fs::File::create("debug.parquet").unwrap();
    file.write_all(&buf).unwrap();
}

produces the output:

shape: (3, 2)
┌─────┬──────────────────────┐
│ s1  ┆ s2                   │
│ --- ┆ ---                  │
│ str ┆ list [str]           │
╞═════╪══════════════════════╡
│ 0   ┆ ["0", "1", ... "99"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ ["0", "1", ... "99"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ ["0", "1", ... "99"] │
└─────┴──────────────────────┘

and a file debug.parquet being written to disk.

Python code to read the file:

import polars as pl
df = pl.read_parquet("debug.parquet")
print(df.head(2))
print(df.shape)

What is the actual behavior?

Produces the following error:

thread '<unnamed>' panicked at 'not yet implemented: LargeList(Field { name: "item", data_type: LargeUtf8, is_nullable: true, metadata: {} })', /github/home/.cargo/git/checkouts/arrow2-8a2ad61d97265680/3c64e7a/src/io/parquet/read/deserialize/mod.rs:242:18
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "./test.py", line 6, in <module>
    df = pl.read_parquet(file)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/polars/io.py", line 893, in read_parquet
    return DataFrame._read_parquet(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/polars/internals/frame.py", line 663, in _read_parquet
    self._df = PyDataFrame.read_parquet(

What is the expected behavior?

I would expect a file that I can write in rust to be readable in python.

@mpetri mpetri added the bug Something isn't working label May 5, 2022
@ritchie46
Member

Can you read it in rust?

@jorgecarleitao
Collaborator

Most likely not. This is fixed by jorgecarleitao/arrow2#978.

@mpetri
Author

mpetri commented May 5, 2022

> Can you read it in rust?

Let me find out.

From my limited understanding, I think polars produces LargeList data types with the code I pasted, which the arrow2 crate can't read yet. I tried casting to the regular List data type but couldn't make it work.
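
For context on the List vs. LargeList distinction: in the Arrow columnar format, a List array records where each row's child values start and end using 32-bit offsets, while a LargeList array uses 64-bit offsets; the child values themselves are laid out the same way. The sketch below only illustrates that layout with the Python standard library. It does not use polars or arrow2, and the helper name `to_offsets` is hypothetical.

```python
from array import array


def to_offsets(rows, typecode):
    """Flatten nested rows into (offsets, values), Arrow-style.

    typecode "i" stands in for 32-bit List offsets and "q" for 64-bit
    LargeList offsets. Hypothetical helper, not a polars/arrow2 API.
    """
    offsets = array(typecode, [0])
    values = []
    for row in rows:
        values.extend(row)
        # Each offset marks where the next row starts in the flat values.
        offsets.append(len(values))
    return offsets, values


rows = [["0", "1"], ["2"], []]
list_offsets, list_values = to_offsets(rows, "i")
large_offsets, large_values = to_offsets(rows, "q")

# Same logical content; only the width of each offset differs.
print(list(list_offsets), list_offsets.itemsize)
print(list(large_offsets), large_offsets.itemsize)
```

Because the two types differ in offset width, a reader presumably has to handle each one explicitly, which would be consistent with the "not yet implemented: LargeList" panic above.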

@cjermain
Contributor

cjermain commented May 5, 2022

Is this related to jorgecarleitao/arrow2#937?

@mpetri
Author

mpetri commented May 5, 2022

> Can you read it in rust?

Just to confirm: the same issue appears when reading the Parquet file from Rust:

let file = std::fs::File::open("debug.parquet").unwrap();
let reader = polars::prelude::ParquetReader::new(file);
let df = reader.finish().unwrap();
println!("{}", df.head(Some(3)));

So to make this work, the fix in jorgecarleitao/arrow2#978 needs to be merged, a new arrow2 version needs to be released, and polars needs to bump its dependency?

@jorgecarleitao
Collaborator

Merged and arrow2 0.11.2 released :)

@ritchie46
Member

> Merged and arrow2 0.11.2 released :)

On fire today!

@ritchie46
Member

Fixed in #3316
