Parquet: AsyncArrowWriter to a file corrupts the footer for large columns #4526

Closed
DDtKey opened this issue Jul 14, 2023 · 1 comment · Fixed by #4527
Labels
bug parquet Changes to the parquet crate

Comments

@DDtKey
Contributor

DDtKey commented Jul 14, 2023

Describe the bug
Writing to a file with AsyncArrowWriter corrupts the footer when the batch contains a large column. The sync version of the writer handles the same data correctly.
See the minimal reproducible example (MRE) below and the comments in it.

To Reproduce

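    // Imports assumed for this reproduction; exact crate paths may differ by version.
    use std::sync::Arc;
    use arrow_array::{ArrayRef, Int64Array, RecordBatch, StringArray};
    use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
    use parquet::arrow::async_writer::AsyncArrowWriter;
    use rand::{distributions::Alphanumeric, thread_rng, Rng};

    #[tokio::test]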
    async fn test_async_writer_to_file() {
        let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
        // this column with random large strings will cause corruption of footer
        let col2 = Arc::new(StringArray::from(vec![generate_random_string(500000), generate_random_string(500000), generate_random_string(500000)])) as ArrayRef;
        // but this will work (column size is smaller)
        // let col2 = Arc::new(StringArray::from(vec![generate_random_string(50000), generate_random_string(50000), generate_random_string(50000)])) as ArrayRef;
        let to_write = RecordBatch::try_from_iter([("col", col), ("col2", col2)]).unwrap();

        let file = tokio::fs::File::create("/path/to/file/test.parquet").await.unwrap();
        let mut writer =
            AsyncArrowWriter::try_new(file, to_write.schema(), 0, None).unwrap();
        writer.write(&to_write).await.unwrap();
        writer.close().await.unwrap();

        let file = std::fs::File::open("/path/to/file/test.parquet").unwrap();
        let mut reader = ParquetRecordBatchReaderBuilder::try_new(file)
            .unwrap()
            .build()
            .unwrap();
        let read = reader.next().unwrap().unwrap();

        assert_eq!(to_write, read);
    }
    
    fn generate_random_string(length: usize) -> String {
        thread_rng()
            .sample_iter(&Alphanumeric)
            .take(length)
            .map(char::from)
            .collect()
    }

Reader fails with:

called `Result::unwrap()` on an `Err` value: General("Invalid Parquet file. Corrupt footer")

This is definitely a problem with the writer: I tested other tools (e.g. parquet-fromcsv) and they wrote the same files correctly. The sync writer also appears to work fine.
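
For anyone triaging a file like this, it can help to look at the last 8 bytes directly. Per the Parquet file format, a file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`. The sketch below uses only the standard library. The names `FooterTail`, `parse_footer_tail` and `inspect_footer` are hypothetical helpers, not part of the parquet crate, and this is a rough diagnostic rather than a full validation of the footer.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// What the last 8 bytes of a (supposed) Parquet file tell us.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
pub struct FooterTail {
    /// Metadata length declared by the file (4-byte little-endian).
    pub metadata_len: u32,
    /// True if the file ends with the `PAR1` magic bytes.
    pub magic_ok: bool,
    /// True if header magic (4) + metadata + tail (8) fit in the file.
    pub len_fits: bool,
}

/// Parse the 8-byte tail: 4-byte LE metadata length, then the 4-byte magic.
pub fn parse_footer_tail(tail: [u8; 8], file_len: u64) -> FooterTail {
    let metadata_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    let magic_ok = tail[4..] == *b"PAR1";
    let len_fits = u64::from(metadata_len) + 12 <= file_len;
    FooterTail { metadata_len, magic_ok, len_fits }
}

/// Read the tail of the file at `path` and parse it.
pub fn inspect_footer(path: &str) -> io::Result<FooterTail> {
    let mut file = File::open(path)?;
    let file_len = file.metadata()?.len();
    if file_len < 8 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "file is too short to contain a Parquet footer",
        ));
    }
    file.seek(SeekFrom::End(-8))?;
    let mut tail = [0u8; 8];
    file.read_exact(&mut tail)?;
    Ok(parse_footer_tail(tail, file_len))
}
```

Calling `inspect_footer` on the file written by the test above and printing the result with `{:?}` shows whether the magic bytes and declared length look sane. If `magic_ok` is false, the end of the file was likely not written as expected, which would be consistent with a writer-side problem.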

Expected behavior
The output file should not be corrupted, matching the behavior of the sync writer.

Additional context

@tustvold
Contributor

label_issue.py automatically added labels {'parquet'} from #4527

@tustvold tustvold added the parquet Changes to the parquet crate label Jul 14, 2023