This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Reading empty parquet leads to divide by zero. #1060

Closed
ritchie46 opened this issue Jun 9, 2022 · 0 comments · Fixed by #1062
Labels
bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@ritchie46
Collaborator

Copied from pola-rs/polars#3565

What language are you using?

Python

Have you tried latest version of polars?

  • [yes]

What version of polars are you using?

0.13.40

What operating system are you using polars on?

Windows 10

What language version are you using

Python 3.8.10

Describe your bug.

Reading or scanning a parquet file that has 0 rows (metadata only) and a column of (logical) type Null raises a PanicException. Such a file can be created, for example, by saving an empty pandas DataFrame that contains at least one string (or other object) column (tested using pyarrow).

What are the steps to reproduce the behavior?

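import polars as pl
import pandas as pd

filepath = "/tmp/empty.parquet"
# Empty DataFrame with a single string (object) column.
df = pd.DataFrame({"a": []}, dtype="str")
# With no explicit schema, the column is written with logical type Null.
df.to_parquet(filepath)
# Panics with "attempt to divide by zero" on polars 0.13.40.
pl.read_parquet(filepath)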

What is the actual behavior?

thread '<unnamed>' panicked at 'attempt to divide by zero', /github/home/.cargo/git/checkouts/arrow2-945af624853845da/f7c3daf/src/io/parquet/read/deserialize/null.rs:21:27

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
/tmp/ipykernel_3424347/749936191.py in <module>
      3 df = pd.DataFrame({"a": []}, dtype="str")
      4 df.to_parquet(filepath)
----> 5 pl.read_parquet(filepath)

~/projects/jupyter/venv/lib/python3.8/site-packages/polars/io.py in read_parquet(source, columns, n_rows, use_pyarrow, memory_map, storage_options, parallel, row_count_name, row_count_offset, **kwargs)
    919             )
    920 
--> 921         return DataFrame._read_parquet(
    922             source_prep,
    923             columns=columns,

~/projects/jupyter/venv/lib/python3.8/site-packages/polars/internals/frame.py in _read_parquet(cls, file, columns, n_rows, parallel, row_count_name, row_count_offset)
    661         projection, columns = handle_projection_columns(columns)
    662         self = cls.__new__(cls)
--> 663         self._df = PyDataFrame.read_parquet(
    664             file,
    665             columns,

PanicException: attempt to divide by zero

What is the expected behavior?

Read an empty DataFrame. Pandas can read this empty parquet file just fine.

Additional Information

This is how empty pandas DataFrames with object columns are saved to parquet by default. The object columns are saved with INT32 physical type and Null logical type in the parquet schema:

>>> import pyarrow.parquet as pq
>>> pf = pq.ParquetFile(filepath)
>>> for col in pf.schema:
...     print(col)
<ParquetColumnSchema>
  name: a
  path: a
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT32
  logical_type: Null
  converted_type (legacy): NONE

The Arrow schema also has them as null fields:

>>> for col in pf.schema_arrow:
...     print(col)
pyarrow.Field<a: null>

Only the additional pandas metadata has the "correct" object numpy type:

>>> print(pf.schema_arrow.pandas_metadata)
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 0,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'a',
   'field_name': 'a',
   'pandas_type': 'empty',
   'numpy_type': 'object',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '4.0.1'},
 'pandas_version': '1.4.1'}

I know this seems a bit obscure, but I hit it in an ETL pipeline where a batch can sometimes be empty. To still have a file for that batch, I created an empty DataFrame with the same columns as the expected output and saved it as a parquet file. However, I forgot to also specify the parquet schema while writing, so the (pandas) string columns were turned into Null columns (since pandas string columns are actually just object columns, I suppose). The next job in the pipeline used polars and crashed on read.

@jorgecarleitao jorgecarleitao added bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog labels Jun 10, 2022