Writing of `ListArray` does not preserve all values #1008

ahmedriza · 2022-05-24T14:12:08Z

I'm not sure whether this is related to the ongoing work on supporting nested data (#1007). When writing Parquet files that contain a ListArray column, some of the list data goes missing.

Example Parquet generation:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

d = {
    "id": pd.Series(['t1', 't2', 't3']),
    "prices": pd.Series([
        [10.0, 20.0, 30.0, 40.0],
        [100.0, 200.0, 300.0, 400.0],
        [1000.0, 2000.0, 3000.0, 4000.0],
    ])
}

df = pd.DataFrame(d)
table = pa.Table.from_pandas(df)
pq.write_table(table, '/tmp/demo_one_arrow.parquet')

Code to read the Parquet file generated above and write it out as another Parquet file:

use arrow2::{
    array::*,
    chunk::Chunk as AChunk,
    io::parquet::write::{FileWriter, RowGroupIterator, WriteOptions},
};
use arrow2::{datatypes::Schema, io::parquet::read::FileReader};
use parquet2::encoding::Encoding;
use std::{fs::File, sync::Arc};

type Chunk = AChunk<Arc<dyn Array>>;
type Result<T> = arrow2::error::Result<T>;

pub fn main() -> Result<()> {
    let filename = "/tmp/demo_one_arrow.parquet";
    // Read the Parquet created by PyArrow
    let (schema, chunks) = read_parquet(filename);
    println!("Expected chunks: {:#?}", chunks);
    
    let arrow2_parquet_filename = "/tmp/demo_one_arrow2.parquet";
    // Write a new Parquet file from what we just read
    write_parquet(arrow2_parquet_filename, schema, chunks)?;

    // Read what we just wrote
    let (_schema, _chunks) = read_parquet(arrow2_parquet_filename);
    println!("Chunks written: {:#?}", _chunks);

    Ok(())
}

fn read_parquet(filename: &str) -> (Schema, Vec<Result<Chunk>>) {
    let reader = File::open(filename).unwrap();
    let reader = FileReader::try_new(reader, None, None, None, None).unwrap();
    let schema: Schema = reader.schema().clone();
    // println!("schema: {:#?}", schema);
    let mut chunks = vec![];
    for chunk_result in reader {
        chunks.push(chunk_result);
    }
    (schema, chunks)
}

fn write_parquet(path: &str, schema: Schema, chunks: Vec<Result<Chunk>>) -> Result<()> {
    let options = WriteOptions {
        write_statistics: true,
        version: parquet2::write::Version::V2,
        compression: parquet2::compression::CompressionOptions::Snappy,
    };

    let iter = chunks.into_iter();
    let encodings = schema.fields.iter().map(|_| Encoding::Plain).collect();
    let row_groups = RowGroupIterator::try_new(iter, &schema, options, encodings)?;

    let file = std::fs::File::create(path)?;
    let mut writer = FileWriter::try_new(file, schema, options)?;
    writer.start()?;
    for group in row_groups {
        writer.write(group?)?;
    }
    writer.end(None)?;
    Ok(())
}

Output:

Expected chunks: [
    Ok(
        Chunk {
            arrays: [
                Utf8Array[t1, t2, t3],
                ListArray[[10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]],
            ],
        },
    ),
]
Chunks written: [
    Ok(
        Chunk {
            arrays: [
                Utf8Array[t1, t2, t3],
                ListArray[[10, 20, 30, 40], [100, 200], None],
            ],
        },
    ),
]

As the output shows, the rewritten Parquet file does not preserve the ListArray: the second list is truncated to its first two values and the third list is read back as None. Perhaps this is already being addressed in #1007.
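For anyone who wants to check a round trip without eyeballing the debug output, here is a minimal sketch of a row-by-row comparison. The helper names `diff_list_column` and `load_list_column` are hypothetical (they are not part of arrow2 or PyArrow), and the sample values are copied by hand from the output above rather than read from the files.

```python
from typing import Any, List, Sequence, Tuple


def diff_list_column(
    expected: Sequence[Any], written: Sequence[Any]
) -> List[Tuple[int, Any, Any]]:
    """Return (row_index, expected_value, written_value) for each row that differs."""
    mismatches = []
    # Compare row by row; a row missing on either side is treated as None.
    for i in range(max(len(expected), len(written))):
        exp = expected[i] if i < len(expected) else None
        got = written[i] if i < len(written) else None
        if exp != got:
            mismatches.append((i, exp, got))
    return mismatches


def load_list_column(path: str, column: str) -> list:
    """Read one column of a Parquet file into plain Python lists (requires pyarrow)."""
    # Imported lazily so that diff_list_column itself has no dependencies.
    import pyarrow.parquet as pq

    return pq.read_table(path).column(column).to_pylist()


# Values copied from the "Expected chunks" and "Chunks written" output above.
expected = [
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0, 300.0, 400.0],
    [1000.0, 2000.0, 3000.0, 4000.0],
]
written = [[10.0, 20.0, 30.0, 40.0], [100.0, 200.0], None]

for row, exp, got in diff_list_column(expected, written):
    print(f"row {row}: expected {exp}, got {got}")
```

On real files, the two lists could instead come from `load_list_column('/tmp/demo_one_arrow.parquet', 'prices')` and `load_list_column('/tmp/demo_one_arrow2.parquet', 'prices')`, assuming PyArrow is able to read the file that arrow2 wrote.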


jorgecarleitao · 2022-05-26T18:27:15Z

Thank you so much for the detailed report and code demo - very useful to repro!

I think that this is fixed with #1007 - the example now yields

Expected chunks: [
    Ok(
        Chunk {
            arrays: [
                Utf8Array[t1, t2, t3],
                ListArray[[10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]],
            ],
        },
    ),
]
Chunks written: [
    Ok(
        Chunk {
            arrays: [
                Utf8Array[t1, t2, t3],
                ListArray[[10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]],
            ],
        },
    ),
]

as expected :)

ahmedriza · 2022-05-26T21:51:41Z

Fantastic work. Many thanks for the quick fix.

jorgecarleitao added the bug label May 25, 2022

jorgecarleitao mentioned this issue May 26, 2022

Added support to write nested parquet #1007

Merged

jorgecarleitao changed the title ~~Issue when writing ListArrays~~ Writing of ListArray does not preserve all values May 26, 2022

jorgecarleitao closed this as completed in #1007 May 27, 2022
