BUG: DataFrame().to_parquet() does not write Parquet compliant data for nested arrays #43689
Comments
I've opened an initial Pull Request to address this, but it will require moving the testing version of PyArrow to …
Going to be addressed in pyarrow …
Is this bug fixed? I am using pandas 2.2.2, and I found that the data type is not the same when the file is read back from disk, e.g.:
Then reading it back:
I saved int64 items but got double (float) items on readout.
Reproducible Example
Issue Description
This method currently does not write compliant Parquet logical types for nested arrays as defined here. This can cause problems when trying to use Parquet as an intermediate format, for example when loading data into BigQuery, which expects compliant data.
This was an issue in PyArrow itself; however, it was fixed in ARROW-11497. I believe that this flag should be set in pandas if we are to claim that the pandas .to_parquet() method actually outputs compliant Parquet.
Expected Behavior
Output compliant Parquet.