This repository has been archived by the owner on Feb 18, 2024. It is now read-only.
Reading empty parquet leads to divide by zero. #1060
Labels: bug (Something isn't working), no-changelog (Issues whose changes are covered by a PR and thus should not be shown in the changelog)
Copied from pola-rs/polars#3565
What language are you using?
Python
Have you tried latest version of polars?
What version of polars are you using?
0.13.40
What operating system are you using polars on?
Windows 10
What language version are you using
Python 3.8.10
Describe your bug.
When trying to read or scan a parquet file with 0 rows (only metadata) that has a column of (logical) type Null, a PanicException is thrown. Such a file can be created, e.g., by saving an empty pandas DataFrame that contains at least one string (or other object) column (tested using pyarrow).
What are the steps to reproduce the behavior?
What is the actual behavior?
What is the expected behavior?
Read an empty DataFrame. Pandas can read this empty parquet file just fine.
Additional Information
This is how empty pandas DataFrames with object columns are saved to parquet by default. The object columns are saved with INT32 physical type and Null logical type in the parquet schema, the arrow schema has them as NULL fields as well, and only the additional pandas metadata has the "correct" object numpy type.
So, I know this seems a bit obscure, but I had this happen in an ETL pipeline where sometimes a batch could be empty. To still have a file for that batch, I created an empty DataFrame with the same columns as the expected output and saved that as a parquet file. However, I forgot to also specify the parquet schema while writing, so the (pandas) string columns got turned into Null columns (since pandas string columns are actually just object columns, I suppose). The next job in the pipeline was using polars and then crashed on read.