BUG: DataFrame().to_parquet() does not write Parquet compliant data for nested arrays #43689

Open
judahrand opened this issue Sep 21, 2021 · 3 comments
Labels
Bug, IO Parquet (parquet, feather)

Comments

judahrand commented Sep 21, 2021

Reproducible Example

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


df = pd.DataFrame({'int_array_col': [[[1,2,3]], [[4,5,6]]]})
df.to_parquet('/tmp/test', engine='pyarrow')

pandas_parquet_table = pq.read_table('/tmp/test')

pyarrow_table = pa.Table.from_pandas(df)
writer = pa.BufferOutputStream()
pq.write_table(
    pyarrow_table,
    writer,
    use_compliant_nested_type=True
)
reader = pa.BufferReader(writer.getvalue())
parquet_table = pq.read_table(reader)

print("Pandas:", pandas_parquet_table.schema.types)
print("Non-compliant Parquet:", pyarrow_table.schema.types)
print("Compliant Parquet:", parquet_table.schema.types)
assert pandas_parquet_table.schema.types == pyarrow_table.schema.types
assert pandas_parquet_table.schema.types == parquet_table.schema.types
```


```python-traceback
Pandas: [ListType(list<item: list<item: int64>>)]
Non-compliant Parquet: [ListType(list<item: list<item: int64>>)]
Compliant Parquet: [ListType(list<element: list<element: int64>>)]
Traceback (most recent call last):
  File "/Users/judahrand/test_dir/pandas_parquet.py", line 25, in <module>
    assert pandas_parquet_table.schema.types == parquet_table.schema.types
AssertionError
```

Issue Description

This method currently does not write compliant Parquet logical types for nested arrays, as defined in the Parquet format specification. This can cause problems when using Parquet as an intermediate format, for example when loading data into BigQuery, which expects compliant data.

This was originally an issue in PyArrow itself, but it was fixed in ARROW-11497. I believe that this flag (`use_compliant_nested_type`) should be set in pandas if we are to claim that the pandas `.to_parquet()` method actually outputs Parquet.

Expected Behavior

Output compliant Parquet.

judahrand added the Bug and Needs Triage labels Sep 21, 2021
Author

judahrand commented Sep 21, 2021

I've opened an initial pull request to address this, but it will require moving the testing version of PyArrow to 4.0.0 at a minimum. Is that something pandas would be willing to accept in order to address this?

jreback added this to the 1.4 milestone Sep 22, 2021
jreback added the IO Parquet label and removed the Needs Triage label Sep 22, 2021
jreback modified the milestones: 1.4, Contributions Welcome Nov 28, 2021
Contributor

jreback commented Nov 28, 2021

going to be addressed in pyarrow

@JohnHerry

Is this bug fixed? I am using pandas 2.2.2, and I found that the data type I saved is not the same as the one I get when reading the file back from disk. For example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['A'] = np.array([100, 200], dtype=np.int64)
df['B'] = np.array([22.0, 22.1])
df.to_parquet("save_file.parquet")
```

Then read it back:

```python
import pyarrow.parquet as pq

data = pq.read_table("save_file.parquet").to_pandas()
data.loc[0]["A"]
# >>> array([100.0, 200.0])
```

I saved int64 values but read back double-precision floats.

4 participants