Describe the bug
The parquet SQL benchmarks no longer run cleanly; in particular, the following query returns an error:
select string_optional from t where dict_10_required = 'prefix#1' and dict_1000_required = 'prefix#1';
Parquet argument error: Parquet error: 'block_size' must be a multiple of 128, got 90") for files: [PartitionedFile { file_meta: FileMeta { sized_file: SizedFile { path: "/tmp/parquet_query_sql20TObt.parquet", size: 201093448 }, last_modified: Some(2022-03-10T12:17:51.953953953Z) }, partition_values: [] }]
I suspected this was related to apache/arrow-rs#1284, which was included in the 9.1 release of arrow, but rolling back to before that upgrade just changes the error message:
Parquet argument error: EOF: eof decoding byte array") for files: [PartitionedFile { file_meta: FileMeta { sized_file: SizedFile { path: "/tmp/parquet_query_sqlg368pa.parquet", size: 200927005 }, last_modified: Some(2022-03-10T12:29:15.693863589Z) }, partition_values: [] }]
It is unclear at this stage whether the encoder is writing gibberish, or whether a bug has been introduced in the decoder. Either way, if it is an upstream bug we should have caught it in arrow-rs.
Unfortunately, my go-to approach of trying alternative tools has not yet yielded fruit. I guess I need to work out how to get Spark running...
>>> import pyarrow.parquet as pq
>>> pq.read_table('/home/raphael/Downloads/borked.parquet', columns=['string_optional'])
OSError: Not yet implemented: Unsupported encoding.
>>> import duckdb
>>> duckdb.query(f"select string_optional from '/home/raphael/Downloads/borked.parquet'").fetchall()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Unsupported page encoding
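Since both readers reject the file with an "unsupported encoding" error, one thing that may be worth checking is which page encodings the writer actually recorded in the file metadata. A minimal sketch using pyarrow on the same local copy of the file (paths as above):
>>> import pyarrow.parquet as pq
>>> meta = pq.ParquetFile('/home/raphael/Downloads/borked.parquet').metadata
>>> # print the encodings declared for each column chunk in each row group
>>> for rg in range(meta.num_row_groups):
...     group = meta.row_group(rg)
...     for col in range(group.num_columns):
...         chunk = group.column(col)
...         print(rg, chunk.path_in_schema, chunk.encodings)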
To Reproduce
Run the SQL benchmarks
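e.g. cargo bench --bench parquet_query_sql from the datafusion crate (the bench target name is inferred from the temporary file names in the errors above).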
Expected behavior
They run without errors
Additional context
There is a broader question of whether we should be running this benchmark suite as part of some nightly CI job; this potentially relates to #1377.