Parquet SQL Benchmarks Broken  #1976

Closed
apache/arrow-rs
#1418
@tustvold

Description

Describe the bug

The Parquet SQL benchmarks no longer run cleanly. In particular, the following query returns an error:

select string_optional from t where dict_10_required = 'prefix#1' and dict_1000_required = 'prefix#1';
 Parquet argument error: Parquet error: 'block_size' must be a multiple of 128, got 90") for files: [PartitionedFile { file_meta: FileMeta { sized_file: SizedFile { path: "/tmp/parquet_query_sql20TObt.parquet", size: 201093448 }, last_modified: Some(2022-03-10T12:17:51.953953953Z) }, partition_values: [] }]

I suspected this was related to apache/arrow-rs#1284, which was included in the 9.1 release of arrow, but rolling back to before this upgrade just changes the error message:

Parquet argument error: EOF: eof decoding byte array") for files: [PartitionedFile { file_meta: FileMeta { sized_file: SizedFile { path: "/tmp/parquet_query_sqlg368pa.parquet", size: 200927005 }, last_modified: Some(2022-03-10T12:29:15.693863589Z) }, partition_values: [] }]

It is unclear at this stage whether the encoder is writing invalid data or a recent change has introduced a bug in the decoder. Either way, if this is an upstream bug, we should have caught it upstream in arrow-rs.
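For context on the first error message: its wording matches the header rule for Parquet's DELTA_BINARY_PACKED encoding, where the page header starts with a ULEB128 block size that must be a multiple of 128. The sketch below is a standalone illustration of that header check, not the arrow-rs implementation; the function names are hypothetical, and it is an assumption that the failing column's pages are being read as delta-encoded at all.

```python
from typing import Dict, Tuple


def read_uleb128(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one unsigned LEB128 varint at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("EOF while decoding varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def read_delta_header(buf: bytes) -> Dict[str, int]:
    """Parse a DELTA_BINARY_PACKED page header (hypothetical helper).

    Layout per the Parquet format spec:
    <block size> <miniblocks per block> <total value count> <first value>
    The first three are ULEB128 varints, the last is a zigzag varint.
    """
    block_size, pos = read_uleb128(buf, 0)
    if block_size % 128 != 0:
        raise ValueError(
            f"'block_size' must be a multiple of 128, got {block_size}"
        )
    miniblocks, pos = read_uleb128(buf, pos)
    total_values, pos = read_uleb128(buf, pos)
    zigzag, pos = read_uleb128(buf, pos)
    first_value = (zigzag >> 1) ^ -(zigzag & 1)  # zigzag decode
    return {
        "block_size": block_size,
        "miniblocks_per_block": miniblocks,
        "total_values": total_values,
        "first_value": first_value,
    }
```

Under this reading, a leading byte of 0x5A (decimal 90) in the page data would trigger the error. That would be consistent either with an encoder writing a bad header or with a decoder starting at the wrong offset or applying the wrong encoding, so it does not settle the question on its own.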

Unfortunately, my go-to approach of reading the file with alternative tools has not yet yielded anything useful. I guess I need to work out how to get Spark running...

>>> pq.read_table('/home/raphael/Downloads/borked.parquet', columns=['string_optional'])
OSError: Not yet implemented: Unsupported encoding.

>>> duckdb.query(f"select string_optional from '/home/raphael/Downloads/borked.parquet'").fetchall()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Unsupported page encoding

To Reproduce

Run the SQL benchmarks

Expected behavior

They run without errors

Additional context

There is a broader question of whether we should run this benchmark suite as part of a nightly CI job or similar. This potentially relates to #1377.

Labels: bug (Something isn't working)
