Reading empty parquet leads to divide by zero. #1060

ritchie46 · 2022-06-09T07:40:11Z

Copied from pola-rs/polars#3565

What language are you using?

Python

Have you tried latest version of polars?

[yes]

What version of polars are you using?

0.13.40

What operating system are you using polars on?

Windows 10

What language version are you using?

Python 3.8.10

Describe your bug.

When reading or scanning a parquet file with 0 rows (metadata only) that has a column of (logical) type Null, a PanicException is thrown. Such a file can be created, for example, by saving an empty pandas DataFrame that contains at least one string (or other object) column (tested using pyarrow).

What are the steps to reproduce the behavior?

import polars as pl
import pandas as pd

filepath = "/tmp/empty.parquet"
df = pd.DataFrame({"a": []}, dtype="str")
df.to_parquet(filepath)
pl.read_parquet(filepath)

What is the actual behavior?

thread '<unnamed>' panicked at 'attempt to divide by zero', /github/home/.cargo/git/checkouts/arrow2-945af624853845da/f7c3daf/src/io/parquet/read/deserialize/null.rs:21:27

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
/tmp/ipykernel_3424347/749936191.py in <module>
      3 df = pd.DataFrame({"a": []}, dtype="str")
      4 df.to_parquet(filepath)
----> 5 pl.read_parquet(filepath)

~/projects/jupyter/venv/lib/python3.8/site-packages/polars/io.py in read_parquet(source, columns, n_rows, use_pyarrow, memory_map, storage_options, parallel, row_count_name, row_count_offset, **kwargs)
    919             )
    920 
--> 921         return DataFrame._read_parquet(
    922             source_prep,
    923             columns=columns,

~/projects/jupyter/venv/lib/python3.8/site-packages/polars/internals/frame.py in _read_parquet(cls, file, columns, n_rows, parallel, row_count_name, row_count_offset)
    661         projection, columns = handle_projection_columns(columns)
    662         self = cls.__new__(cls)
--> 663         self._df = PyDataFrame.read_parquet(
    664             file,
    665             columns,

PanicException: attempt to divide by zero

What is the expected behavior?

Read an empty DataFrame. Pandas can read this empty parquet file just fine.

Additional Information

This is how an empty pandas DataFrame with object columns is saved to parquet by default. The object columns are saved with INT32 physical type and Null logical type in the parquet schema:

>> import pyarrow.parquet as pq
>> pf = pq.ParquetFile(filepath)
>> for col in pf.schema:
>>     print(col)
<ParquetColumnSchema>
  name: a
  path: a
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT32
  logical_type: Null
  converted_type (legacy): NONE

The Arrow schema has them as null fields as well:

>> for col in pf.schema_arrow:
>>    print(col)
pyarrow.Field<a: null>

and only the additional pandas metadata has the "correct" object numpy type:

>> print(pf.schema_arrow.pandas_metadata)
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 0,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'a',
   'field_name': 'a',
   'pandas_type': 'empty',
   'numpy_type': 'object',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '4.0.1'},
 'pandas_version': '1.4.1'}

I know this seems a bit obscure, but it happened to me in an ETL pipeline where a batch could sometimes be empty. To still have a file for that batch, I created an empty DataFrame with the same columns as the expected output and saved it as a parquet file. However, I forgot to also specify the parquet schema while writing, so the (pandas) string columns were turned into Null columns (since pandas string columns are actually just object columns, I suppose). The next job in the pipeline used polars and crashed on read.

jorgecarleitao added the bug and no-changelog labels Jun 10, 2022

jorgecarleitao mentioned this issue Jun 10, 2022

Fixed divide by zero on reading empty row group #1062

Merged

jorgecarleitao closed this as completed in #1062 Jun 10, 2022
