This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Writing of ListArray does not preserve all values #1008

Closed
ahmedriza opened this issue May 24, 2022 · 2 comments · Fixed by #1007
Labels
bug Something isn't working

Comments

ahmedriza commented May 24, 2022

I'm not sure whether this is related to the ongoing work on supporting nested data (#1007). I noticed that when arrow2 writes a Parquet file containing a ListArray column, some of the list values are missing when the file is read back.

Example Parquet generation:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

d = {
    "id": pd.Series(['t1', 't2', 't3']),
    "prices": pd.Series([
        [10.0, 20.0, 30.0, 40.0],
        [100.0, 200.0, 300.0, 400.0],
        [1000.0, 2000.0, 3000.0, 4000.0],
    ])
}

df = pd.DataFrame(d)
table = pa.Table.from_pandas(df)
pq.write_table(table, '/tmp/demo_one_arrow.parquet')

Code to read the Parquet file generated above and write it out as another Parquet file:

use arrow2::{
    array::*,
    chunk::Chunk as AChunk,
    io::parquet::write::{FileWriter, RowGroupIterator, WriteOptions},
};
use arrow2::{datatypes::Schema, io::parquet::read::FileReader};
use parquet2::encoding::Encoding;
use std::{fs::File, sync::Arc};

type Chunk = AChunk<Arc<dyn Array>>;
type Result<T> = arrow2::error::Result<T>;

pub fn main() -> Result<()> {
    let filename = "/tmp/demo_one_arrow.parquet";
    // Read the Parquet created by PyArrow
    let (schema, chunks) = read_parquet(filename);
    println!("Expected chunks: {:#?}", chunks);
    
    let arrow2_parquet_filename = "/tmp/demo_one_arrow2.parquet";
    // Write a new Parquet file from what we just read
    write_parquet(arrow2_parquet_filename, schema, chunks)?;

    // Read what we just wrote
    let (_schema, written_chunks) = read_parquet(arrow2_parquet_filename);
    println!("Chunks written: {:#?}", written_chunks);

    Ok(())
}

fn read_parquet(filename: &str) -> (Schema, Vec<Result<Chunk>>) {
    let reader = File::open(filename).unwrap();
    let reader = FileReader::try_new(reader, None, None, None, None).unwrap();
    let schema: Schema = reader.schema().clone();
    // println!("schema: {:#?}", schema);
    let mut chunks = vec![];
    for chunk_result in reader {
        chunks.push(chunk_result);
    }
    (schema, chunks)
}

fn write_parquet(path: &str, schema: Schema, chunks: Vec<Result<Chunk>>) -> Result<()> {
    let options = WriteOptions {
        write_statistics: true,
        version: parquet2::write::Version::V2,
        compression: parquet2::compression::CompressionOptions::Snappy,
    };

    let iter = chunks.into_iter();
    let encodings = schema.fields.iter().map(|_| Encoding::Plain).collect();
    let row_groups = RowGroupIterator::try_new(iter, &schema, options, encodings)?;

    let file = std::fs::File::create(path)?;
    let mut writer = FileWriter::try_new(file, schema, options)?;
    writer.start()?;
    for group in row_groups {
        writer.write(group?)?;
    }
    writer.end(None)?;
    Ok(())
}

Output:

Expected chunks: [
    Ok(
        Chunk {
            arrays: [
                Utf8Array[t1, t2, t3],
                ListArray[[10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]],
            ],
        },
    ),
]
Chunks written: [
    Ok(
        Chunk {
            arrays: [
                Utf8Array[t1, t2, t3],
                ListArray[[10, 20, 30, 40], [100, 200], None],
            ],
        },
    ),
]

As the output shows, the rewritten Parquet file does not preserve the ListArray: the second list is truncated to two values and the third comes back as None. Perhaps this is already being addressed in #1007.
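To pin the mismatch down row by row, here is a minimal sketch in plain Python, with the values copied by hand from the output above. The helper name `diff_list_columns` is only illustrative and is not part of arrow2 or PyArrow, and the sketch assumes both columns have the same number of rows.

```python
from typing import List, Optional, Tuple

# A list column row: either a list of floats or None (a null list).
Row = Optional[List[float]]


def diff_list_columns(
    expected: List[Row], written: List[Row]
) -> List[Tuple[int, Row, Row]]:
    """Return (row index, expected, written) for every row that differs.

    Hypothetical helper; assumes both columns have the same row count.
    """
    return [
        (i, e, w)
        for i, (e, w) in enumerate(zip(expected, written))
        if e != w
    ]


# Values copied from the "Expected chunks" output.
expected = [
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0, 300.0, 400.0],
    [1000.0, 2000.0, 3000.0, 4000.0],
]

# Values copied from the "Chunks written" output.
written = [
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0],
    None,
]

mismatches = diff_list_columns(expected, written)
for i, e, w in mismatches:
    print(f"row {i}: expected {e}, got {w}")
```

With PyArrow available, the two lists could presumably be loaded from the files instead of typed in, for example with `pq.read_table(path).column("prices").to_pylist()`.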

@jorgecarleitao
Owner

Thank you so much for the detailed report and code demo - very useful to repro!

I think that this is fixed with #1007 - the example now yields

Expected chunks: [
    Ok(
        Chunk {
            arrays: [
                Utf8Array[t1, t2, t3],
                ListArray[[10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]],
            ],
        },
    ),
]
Chunks written: [
    Ok(
        Chunk {
            arrays: [
                Utf8Array[t1, t2, t3],
                ListArray[[10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]],
            ],
        },
    ),
]

as expected :)

jorgecarleitao changed the title from "Issue when writing ListArrays" to "Writing of ListArray does not preserve all values" on May 26, 2022
@ahmedriza
Author

Fantastic work. Many thanks for the quick fix.
