Describe the bug
The parquet SQL benchmarks no longer run cleanly; in particular, the following query returns an error:
select string_optional from t where dict_10_required = 'prefix#1' and dict_1000_required = 'prefix#1';
Parquet argument error: Parquet error: 'block_size' must be a multiple of 128, got 90") for files: [PartitionedFile { file_meta: FileMeta { sized_file: SizedFile { path: "/tmp/parquet_query_sql20TObt.parquet", size: 201093448 }, last_modified: Some(2022-03-10T12:17:51.953953953Z) }, partition_values: [] }]
I suspected this was related to apache/arrow-rs#1284, which was included in the 9.1 release of arrow, but rolling back to before that upgrade just changes the error message:
Parquet argument error: EOF: eof decoding byte array") for files: [PartitionedFile { file_meta: FileMeta { sized_file: SizedFile { path: "/tmp/parquet_query_sqlg368pa.parquet", size: 200927005 }, last_modified: Some(2022-03-10T12:29:15.693863589Z) }, partition_values: [] }]
It is unclear at this stage whether the encoder is writing gibberish, or whether a bug has been introduced in the decoder. Either way, if it is an upstream bug we should have caught it in arrow-rs.
Unfortunately, my go-to approach of trying alternative tools has not yet yielded fruit. I guess I need to work out how to get Spark running...
>>> import pyarrow.parquet as pq
>>> pq.read_table('/home/raphael/Downloads/borked.parquet', columns=['string_optional'])
OSError: Not yet implemented: Unsupported encoding.
>>> import duckdb
>>> duckdb.query(f"select string_optional from '/home/raphael/Downloads/borked.parquet'").fetchall()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Unsupported page encoding
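Since both readers reject the file with an "unsupported encoding" error, one thing that may be worth checking is which page encodings the writer actually recorded in the file metadata. A minimal sketch using pyarrow on the same local copy of the file (paths as above):
>>> import pyarrow.parquet as pq
>>> meta = pq.ParquetFile('/home/raphael/Downloads/borked.parquet').metadata
>>> # print the encodings declared for each column chunk in each row group
>>> for rg in range(meta.num_row_groups):
...     group = meta.row_group(rg)
...     for col in range(group.num_columns):
...         chunk = group.column(col)
...         print(rg, chunk.path_in_schema, chunk.encodings)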
To Reproduce
Run the SQL benchmarks
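e.g. cargo bench --bench parquet_query_sql from the datafusion crate (the bench target name is inferred from the temporary file names in the errors above).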
Expected behavior
They run without errors
Additional context
There is a broader question of whether we should be running this benchmark suite as part of some nightly CI job; this potentially relates to #1377.