This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Cannot read parquet file with arrow2 but can with pyarrow #1473

Closed
twitu opened this issue Apr 25, 2023 · 3 comments

twitu commented Apr 25, 2023

I have a parquet file for which the arrow2 0.17.0 parquet reader does not return any data.

I created the file using pyarrow, and I have double-checked that both pyarrow and datafusion can read it. I have also checked that the arrow2 reader loads the metadata and schema correctly, but the reader returns no chunks.

Unlike #1370, the schema for my file is pretty simple.

Schema {
    fields: [
        Field {
            name: "bid",
            data_type: Int64,
            is_nullable: true,
            metadata: {},
        },
        Field {
            name: "ask",
            data_type: Int64,
            is_nullable: true,
            metadata: {},
        },
        Field {
            name: "bid_size",
            data_type: UInt64,
            is_nullable: true,
            metadata: {},
        },
        Field {
            name: "ask_size",
            data_type: UInt64,
            is_nullable: true,
            metadata: {},
        },
        Field {
            name: "ts_event",
            data_type: UInt64,
            is_nullable: true,
            metadata: {},
        },
        Field {
            name: "ts_init",
            data_type: UInt64,
            is_nullable: true,
            metadata: {},
        },
    ],

Even the row groups are read correctly.

&row_groups = [
    RowGroupMetaData {
        columns: [
            ColumnChunkMetaData {
                column_chunk: ColumnChunk {
                    file_path: None,
                    file_offset: 6590314,
                    meta_data: Some(
                        ColumnMetaData {
                            type_: Type(
                                2,
                            ),
                            encodings: [
                                Encoding(
                                    8,
                                ),
                                Encoding(
                                    0,
                                ),
                                Encoding(
                                    3,
                                ),
                            ],
                            path_in_schema: [
                                "bid",
                            ],
                            codec: CompressionCodec(
                                1,
                            ),
                            num_values: 9689614,
                            total_uncompressed_size: 7215695,
                            total_compressed_size: 6590310,
                            key_value_metadata: None,
                            data_page_offset: 227,
                            index_page_offset: None,
                            dictionary_page_offset: Some(
                                4,
                            ),

I'm not sure what is going wrong here. Do you have any suggestions?


twitu commented Apr 26, 2023

Here's the test data file with the same schema and 10 records.

test_data.parquet.zip


twitu commented Apr 29, 2023

This test fails.

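use std::fs::File;

use arrow2::io::parquet::read::{self, FileReader};

#[test]
fn arrow2_test() {
    // Open the file, then read the parquet footer and infer the arrow schema.
    let mut reader = File::open("test_data.parquet").expect("Unable to open given file");
    let metadata = read::read_metadata(&mut reader).expect("Unable to read metadata");
    let schema = read::infer_schema(&metadata).expect("Unable to infer schema");
    // Read every row group in chunks of up to 1000 rows, with no row limit
    // and no page indexes.
    let mut fr = FileReader::new(
        reader,
        metadata.row_groups,
        schema,
        Some(1000),
        None,
        None,
    );
    // At least one chunk is expected; this is the assertion that fails.
    assert!(fr.next().is_some());
}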

twitu changed the title from "Cannot read parquet file with arrow2 but can with pyarrow and datafusion" to "Cannot read parquet file with arrow2 but can with pyarrow" on Apr 29, 2023

twitu commented Apr 29, 2023

This is a non-issue. I had to enable the io_parquet_compression feature to get this working.
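For anyone landing here with the same symptom, the metadata dump earlier in the thread already hints at the cause. The numeric ids it prints correspond to enum values in the Parquet Thrift definition, and `CompressionCodec(1)` is SNAPPY there. The following is only a small standalone sketch for decoding those ids by hand. The helper names (`codec_name`, `encoding_name`, `physical_type_name`, `needs_decompression`) are made up for illustration and are not part of arrow2.

```rust
// Hypothetical helpers (not arrow2 APIs) that translate the numeric ids
// printed in the row group metadata into Parquet Thrift enum names.

/// Name of a Parquet compression codec given its Thrift enum value.
fn codec_name(id: i32) -> &'static str {
    match id {
        0 => "UNCOMPRESSED",
        1 => "SNAPPY",
        2 => "GZIP",
        3 => "LZO",
        4 => "BROTLI",
        5 => "LZ4",
        6 => "ZSTD",
        7 => "LZ4_RAW",
        _ => "UNKNOWN",
    }
}

/// Name of a Parquet encoding given its Thrift enum value.
fn encoding_name(id: i32) -> &'static str {
    match id {
        0 => "PLAIN",
        2 => "PLAIN_DICTIONARY",
        3 => "RLE",
        4 => "BIT_PACKED",
        5 => "DELTA_BINARY_PACKED",
        6 => "DELTA_LENGTH_BYTE_ARRAY",
        7 => "DELTA_BYTE_ARRAY",
        8 => "RLE_DICTIONARY",
        9 => "BYTE_STREAM_SPLIT",
        _ => "UNKNOWN",
    }
}

/// Name of a Parquet physical type given its Thrift enum value.
fn physical_type_name(id: i32) -> &'static str {
    match id {
        0 => "BOOLEAN",
        1 => "INT32",
        2 => "INT64",
        3 => "INT96",
        4 => "FLOAT",
        5 => "DOUBLE",
        6 => "BYTE_ARRAY",
        7 => "FIXED_LEN_BYTE_ARRAY",
        _ => "UNKNOWN",
    }
}

/// A column chunk needs a decompressor whenever its codec is not UNCOMPRESSED.
fn needs_decompression(codec_id: i32) -> bool {
    codec_id != 0
}
```

Applied to the "bid" column above, `physical_type_name(2)` gives "INT64", the encodings 8, 0 and 3 decode to RLE_DICTIONARY, PLAIN and RLE, and `codec_name(1)` gives "SNAPPY". So the pages are compressed, which is consistent with the io_parquet_compression feature being what supplies the decompressor.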

twitu closed this as completed on Apr 29, 2023