I'm not sure if this is related to the ongoing work on supporting nested data (#1007). I noticed that when creating Parquet files that contain a ListArray type, the list is missing some data.
Example Parquet generation:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A frame with a plain string column and a list-of-floats column
d = {
    "id": pd.Series(['t1', 't2', 't3']),
    "prices": pd.Series([
        [10.0, 20.0, 30.0, 40.0],
        [100.0, 200.0, 300.0, 400.0],
        [1000.0, 2000.0, 3000.0, 4000.0],
    ]),
}
df = pd.DataFrame(d)
table = pa.Table.from_pandas(df)
pq.write_table(table, '/tmp/demo_one_arrow.parquet')
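As a quick sanity check (not part of the original snippet), the file can be read back with PyArrow to confirm that the source file itself is correct, i.e. all four values are present in each prices list:

import pyarrow.parquet as pq

# The PyArrow-written file should contain all four values in each list
table = pq.read_table('/tmp/demo_one_arrow.parquet')
print(table.column('prices'))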
Code to read the Parquet file generated above and write it out as another Parquet file:
use std::{fs::File, sync::Arc};

use arrow2::{
    array::*,
    chunk::Chunk as AChunk,
    datatypes::Schema,
    io::parquet::read::FileReader,
    io::parquet::write::{FileWriter, RowGroupIterator, WriteOptions},
};
use parquet2::encoding::Encoding;

type Chunk = AChunk<Arc<dyn Array>>;
type Result<T> = arrow2::error::Result<T>;
pub fn main() -> Result<()> {
    let filename = "/tmp/demo_one_arrow.parquet";

    // Read the Parquet file created by PyArrow
    let (schema, chunks) = read_parquet(filename);
    println!("Expected chunks: {:#?}", chunks);

    let arrow2_parquet_filename = "/tmp/demo_one_arrow2.parquet";
    // Write a new Parquet file from what we just read
    write_parquet(arrow2_parquet_filename, schema, chunks)?;

    // Read back what we just wrote
    let (_schema, _chunks) = read_parquet(arrow2_parquet_filename);
    println!("Chunks written: {:#?}", _chunks);
    Ok(())
}
fn read_parquet(filename: &str) -> (Schema, Vec<Result<Chunk>>) {
    let reader = File::open(filename).unwrap();
    // All optional parameters left as None (read everything)
    let reader = FileReader::try_new(reader, None, None, None, None).unwrap();
    let schema: Schema = reader.schema().clone();
    // println!("schema: {:#?}", schema);

    let mut chunks = vec![];
    for chunk_result in reader {
        chunks.push(chunk_result);
    }
    (schema, chunks)
}
fn write_parquet(path: &str, schema: Schema, chunks: Vec<Result<Chunk>>) -> Result<()> {
    let options = WriteOptions {
        write_statistics: true,
        version: parquet2::write::Version::V2,
        compression: parquet2::compression::CompressionOptions::Snappy,
    };

    let iter = chunks.into_iter();
    // Plain encoding for every field
    let encodings = schema.fields.iter().map(|_| Encoding::Plain).collect();
    let row_groups = RowGroupIterator::try_new(iter, &schema, options, encodings)?;

    let file = std::fs::File::create(path)?;
    let mut writer = FileWriter::try_new(file, schema, options)?;
    writer.start()?;
    for group in row_groups {
        writer.write(group?)?;
    }
    writer.end(None)?;
    Ok(())
}
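One quick way to see the discrepancy is to load both files with PyArrow and compare the prices column. This is a minimal sketch added here for illustration, assuming the two file paths used above:

import pyarrow.parquet as pq

# Compare the PyArrow-written original against the arrow2-written copy
expected = pq.read_table('/tmp/demo_one_arrow.parquet')
actual = pq.read_table('/tmp/demo_one_arrow2.parquet')
print(expected.column('prices'))
print(actual.column('prices'))
# If the ListArray were written correctly, the two tables would be equal
print(expected.equals(actual))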
Hence, as the printed chunks show, the ListArray is not written properly when we write the new Parquet file. Perhaps this is already being addressed in #1007.