BUG: `DataFrame().to_parquet()` does not write Parquet compliant data for nested arrays #43689

judahrand · 2021-09-21T21:13:43Z

Reproducible Example

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


df = pd.DataFrame({'int_array_col': [[[1,2,3]], [[4,5,6]]]})
df.to_parquet('/tmp/test', engine='pyarrow')

pandas_parquet_table = pq.read_table('/tmp/test')

pyarrow_table = pa.Table.from_pandas(df)
writer = pa.BufferOutputStream()
pq.write_table(
    pyarrow_table,
    writer,
    use_compliant_nested_type=True
)
reader = pa.BufferReader(writer.getvalue())
parquet_table = pq.read_table(reader)

print("Pandas:", pandas_parquet_table.schema.types)
print("Non-compliant Parquet:", pyarrow_table.schema.types)
print("Compliant Parquet:", parquet_table.schema.types)
assert pandas_parquet_table.schema.types == pyarrow_table.schema.types
assert pandas_parquet_table.schema.types == parquet_table.schema.types


```python-traceback
Pandas: [ListType(list<item: list<item: int64>>)]
Non-compliant Parquet: [ListType(list<item: list<item: int64>>)]
Compliant Parquet: [ListType(list<element: list<element: int64>>)]
Traceback (most recent call last):
  File "/Users/judahrand/test_dir/pandas_parquet.py", line 25, in <module>
    assert pandas_parquet_table.schema.types == parquet_table.schema.types
AssertionError
```

Issue Description

This method currently does not write compliant Parquet Logical Types for nested arrays as defined here. This can cause problems when trying to use Parquet as an intermediate format, for example when loading data into BigQuery, which expects compliant data.

This was an issue in PyArrow itself, but it was fixed in ARROW-11497. I believe that the `use_compliant_nested_type` flag should be set in Pandas if we are to claim that the Pandas `.to_parquet()` method actually outputs Parquet.

Expected Behavior

Output compliant Parquet.

judahrand · 2021-09-21T21:51:52Z

I've opened an initial Pull Request to address this, but it will require moving the testing version of PyArrow to 4.0.0 at a minimum... This might be something Pandas is not willing to accept?

jreback · 2021-11-28T21:01:45Z

going to be addressed in pyarrow

JohnHerry · 2025-01-14T02:34:23Z

Is this bug fixed? I am using pandas 2.2.2 and found that the saved data type is not the same when the file is read back from disk. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['A'] = np.array([100, 200], dtype=np.int64)
df['B'] = np.array([22.0, 22.1])
df.to_parquet("save_file.parquet")

Then read it back:

import pyarrow.parquet as pq

data = pq.read_table("save_file.parquet").to_pandas()
data.loc[0]["A"]
>>> 
array([100.0, 200.0])

I saved int64 items but read back double-precision float items.

judahrand added the Bug and Needs Triage labels Sep 21, 2021

judahrand mentioned this issue Sep 21, 2021

BUG: Write compliant Parquet with pyarrow #43690

Closed

4 tasks

jreback added this to the 1.4 milestone Sep 22, 2021

jreback added the IO Parquet label and removed the Needs Triage label Sep 22, 2021

jreback modified the milestones: 1.4, Contributions Welcome Nov 28, 2021

NazyS mentioned this issue Feb 8, 2022

BUG: cannot read back columns of dtype interval[datetime64[ns]] from parquet file or pyarrow table #45881

Closed

3 tasks

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
