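Describe the bug
Writing with AsyncArrowWriter to a file (tokio::fs::File) produces a corrupted footer when the batch contains a large column. The same data is written correctly with the sync version of the writer.
See the MRE below; a comment marks the column that triggers the corruption.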
To Reproduce
// Imports are not part of the original report; these paths are assumed.
use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch, StringArray};
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::AsyncArrowWriter;
use rand::{distributions::Alphanumeric, thread_rng, Rng};

#[tokio::test]
async fn test_async_writer_to_file() {
    let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
    // This column of large random strings causes the footer corruption.
    let col2 = Arc::new(StringArray::from(vec![
        generate_random_string(500000),
        generate_random_string(500000),
        generate_random_string(500000),
    ])) as ArrayRef;
    // This variant works (the column is smaller):
    // let col2 = Arc::new(StringArray::from(vec![generate_random_string(50000), generate_random_string(50000), generate_random_string(50000)])) as ArrayRef;

    let to_write = RecordBatch::try_from_iter([("col", col), ("col2", col2)]).unwrap();

    // Write the batch with the async writer.
    let file = tokio::fs::File::create("/path/to/file/test.parquet").await.unwrap();
    let mut writer = AsyncArrowWriter::try_new(file, to_write.schema(), 0, None).unwrap();
    writer.write(&to_write).await.unwrap();
    writer.close().await.unwrap();

    // Read the file back with the sync reader and compare.
    let file = std::fs::File::open("/path/to/file/test.parquet").unwrap();
    let mut reader = ParquetRecordBatchReaderBuilder::try_new(file).unwrap().build().unwrap();
    let read = reader.next().unwrap().unwrap();
    assert_eq!(to_write, read);
}

// Builds a random alphanumeric string of the given length.
fn generate_random_string(length: usize) -> String {
    thread_rng().sample_iter(&Alphanumeric).take(length).map(char::from).collect()
}
Reader fails with:
called `Result::unwrap()` on an `Err` value: General("Invalid Parquet file. Corrupt footer")
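To narrow down what the reader is rejecting, it may help to look at the file trailer directly. A Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1. The sketch below uses only the standard library and is not part of the parquet crate; the names check_trailer and check_file are hypothetical helpers introduced here for illustration, and the check is only a rough sanity test, not the reader's actual validation logic.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Magic bytes that begin and end every Parquet file.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Checks the 8-byte trailer: a 4-byte little-endian metadata length
/// followed by the magic bytes. Returns the declared metadata length.
/// (Hypothetical helper, not an API of the parquet crate.)
fn check_trailer(trailer: &[u8; 8], file_len: u64) -> Result<u32, String> {
    if trailer[4..] != PARQUET_MAGIC[..] {
        return Err(format!("trailing magic is {:?}, expected PAR1", &trailer[4..]));
    }
    let metadata_len = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    // The file must hold the leading magic (4 bytes), the metadata,
    // and the trailer itself (8 bytes).
    if u64::from(metadata_len) + 12 > file_len {
        return Err(format!(
            "declared metadata length {} does not fit in a file of {} bytes",
            metadata_len, file_len
        ));
    }
    Ok(metadata_len)
}

/// Reads the last 8 bytes of the file at `path` and checks them.
/// The outer Result carries I/O errors, the inner one the trailer check.
fn check_file(path: &str) -> io::Result<Result<u32, String>> {
    let mut file = File::open(path)?;
    let file_len = file.metadata()?.len();
    if file_len < 12 {
        return Ok(Err(format!("file is only {} bytes long", file_len)));
    }
    file.seek(SeekFrom::End(-8))?;
    let mut trailer = [0u8; 8];
    file.read_exact(&mut trailer)?;
    Ok(check_trailer(&trailer, file_len))
}
```

Calling check_file("/path/to/file/test.parquet") on the output of the async writer and on the output of the sync writer should show whether the trailer itself differs between the two.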
This is definitely a problem in the writer: I tested other tools (e.g. parquet-fromcsv) and they wrote the same data correctly. The sync writer also appears to work fine.
Expected behavior
The output file should not be corrupted, matching the behavior of the sync writer.
Additional context