Parquet SQL Benchmarks Broken #1976

Closed
tustvold opened this issue Mar 10, 2022 · 1 comment · Fixed by apache/arrow-rs#1418
Labels
bug Something isn't working

Comments

@tustvold
Contributor

tustvold commented Mar 10, 2022

Describe the bug

The parquet SQL benchmarks no longer run cleanly. In particular, the following query returns an error:

select string_optional from t where dict_10_required = 'prefix#1' and dict_1000_required = 'prefix#1';
 Parquet argument error: Parquet error: 'block_size' must be a multiple of 128, got 90") for files: [PartitionedFile { file_meta: FileMeta { sized_file: SizedFile { path: "/tmp/parquet_query_sql20TObt.parquet", size: 201093448 }, last_modified: Some(2022-03-10T12:17:51.953953953Z) }, partition_values: [] }]

I suspected this was related to apache/arrow-rs#1284, which was included in the 9.1 release of arrow, but rolling back to before this upgrade just alters the error message:

Parquet argument error: EOF: eof decoding byte array") for files: [PartitionedFile { file_meta: FileMeta { sized_file: SizedFile { path: "/tmp/parquet_query_sqlg368pa.parquet", size: 200927005 }, last_modified: Some(2022-03-10T12:29:15.693863589Z) }, partition_values: [] }]

It is unclear at this stage whether the encoder is writing gibberish or a bug has been introduced in the decoder. Either way, if this is an upstream bug, we should have caught it upstream in arrow-rs.
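For reference, the wording of the first error matches a constraint in the Parquet format specification for the DELTA_BINARY_PACKED encoding, whose header begins with a block size that must be a multiple of 128. The sketch below is not the arrow-rs decoder. It is a minimal Python illustration, assuming the error originates from that header check, of how a reader that lands on the wrong bytes (or on bytes written incorrectly) could report a value such as 90. The function names and sample bytes are hypothetical.

```python
def read_uleb128(buf, pos):
    """Read an unsigned LEB128 varint from buf starting at pos.

    Returns (value, next_pos).
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def zigzag_decode(n):
    """Map a zigzag-encoded unsigned integer back to a signed integer."""
    return (n >> 1) ^ -(n & 1)


def parse_delta_header(buf):
    """Parse a DELTA_BINARY_PACKED header as laid out in the Parquet spec:
    <block size> <miniblocks per block> <total value count> <first value>.

    Hypothetical helper for illustration only; the error text mirrors the
    message quoted in this issue.
    """
    block_size, pos = read_uleb128(buf, 0)
    if block_size % 128 != 0:
        raise ValueError(
            f"'block_size' must be a multiple of 128, got {block_size}"
        )
    miniblocks, pos = read_uleb128(buf, pos)
    total_values, pos = read_uleb128(buf, pos)
    first_zigzag, pos = read_uleb128(buf, pos)
    return {
        "block_size": block_size,
        "miniblocks_per_block": miniblocks,
        "total_values": total_values,
        "first_value": zigzag_decode(first_zigzag),
        "header_len": pos,
    }


# A well-formed header: block size 128, 4 miniblocks, 10 values, first value 7.
good = bytes([0x80, 0x01, 0x04, 0x0A, 0x0E])
header = parse_delta_header(good)

# A reader that starts on an unrelated byte such as 0x5A decodes block size 90.
bad = bytes([0x5A, 0x04, 0x0A, 0x0E])
try:
    parse_delta_header(bad)
except ValueError as err:
    message = str(err)
```

A check like this cannot by itself distinguish a bad encoder from a bad decoder: both a writer that emits a malformed page and a reader that miscounts the bytes of a preceding page would surface the same message.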

Unfortunately, my go-to approach of reading the file with alternative tools has not yet borne fruit. I guess I need to go and work out how to get Spark running...

>>> pq.read_table('/home/raphael/Downloads/borked.parquet', columns=['string_optional'])
OSError: Not yet implemented: Unsupported encoding.

>>> duckdb.query(f"select string_optional from '/home/raphael/Downloads/borked.parquet'").fetchall()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Unsupported page encoding

To Reproduce

Run the SQL benchmarks

Expected behavior

They run without errors

Additional context

There is a broader question of whether we should be running this benchmark suite as part of a nightly CI job or similar. This potentially relates to #1377.

tustvold added the bug label Mar 10, 2022
@tustvold
Contributor Author

I was foiled by a lock file: downgrading to parquet 9.0.2 does resolve this issue, so apache/arrow-rs#1284 is likely related.
