BUG: Partial file write to disk on calling to_parquet() with engine='pyarrow' with unsupported dtype #44914

Closed
3 tasks done
Anirudhsekar96 opened this issue Dec 15, 2021 · 1 comment · Fixed by #45480
Labels: Bug, IO Parquet

@Anirudhsekar96
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

data = np.arange(2, 10, dtype=np.float16)
df = pd.DataFrame(data=data, columns=['fp16'])
df.to_parquet('./fp16.parquet')

Issue Description

Calling to_parquet() with engine='pyarrow' on a DataFrame that contains a dtype pyarrow cannot write (float16 in the example above) leaves a partially written file on disk.

Ref:

Expected Behavior

Raise ValueError before the partial write.
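
Until the writer itself fails before touching the target, a caller can get the same guarantee with a temporary file. The following is a minimal sketch of that general pattern, not pandas or pyarrow code; `write_atomically` and `write_fn` are hypothetical names introduced for illustration.

```python
import os
import tempfile


def write_atomically(final_path, write_fn):
    """Call write_fn(tmp_path) and move the result to final_path only on success.

    Hypothetical helper: if write_fn raises, the temporary file is removed
    and final_path is never created or modified.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file next to the target so os.replace stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)  # write_fn reopens the file by path
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up whatever was partially written, then re-raise the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With the example above, this could be called as `write_atomically('./fp16.parquet', lambda tmp: df.to_parquet(tmp))`, assuming the error raised by the writer should simply propagate to the caller.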

Installed Versions

python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-41-generic
Version : #45~20.04.1-Ubuntu SMP Wed Nov 10 10:20:10 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.0.dev0+1356.gb7991da361.dirty
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 59.4.0
Cython : 0.29.25
pytest : 6.2.5
hypothesis : 6.31.3
sphinx : 4.3.1
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : 4.6.4
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.30.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 2021.11.0
fastparquet : 0.7.2
gcsfs : 2021.11.0
matplotlib : 3.5.0
numexpr : 2.8.0
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 6.0.1
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.7.3
sqlalchemy : 1.4.28
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

Anirudhsekar96 added the Bug and Needs Triage labels on Dec 15, 2021
rhshadrach added the IO Parquet label and removed the Needs Triage label on Dec 16, 2021
rhshadrach added this to the Contributions Welcome milestone on Dec 16, 2021
github-actions bot added the Stale label on Jan 16, 2022
mroeschke removed the Stale label on Jan 17, 2022
Anirudhsekar96 added a commit to Anirudhsekar96/pandas that referenced this issue Jan 19, 2022
@Anirudhsekar96
Contributor Author

This pull request closes the issue where a partial file is flushed to disk when a DataFrame is passed to to_parquet without partition columns. The issue could not be reproduced by calling pyarrow's write_table function directly, because there the path was passed as a string, whereas a string-like path passed through to_parquet is converted to an io.BufferedWriter object before it reaches pyarrow.parquet.write_table.

This pull request adds code to convert the path object back to a string if it is an io.BufferedWriter.
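
As a rough illustration of that conversion (a sketch only, not the code from the pull request), the path behind a buffered binary handle can be recovered from the standard-library attributes `raw` and `name`. The helper name `path_from_handle` is hypothetical.

```python
import io
import os


def path_from_handle(path_or_handle):
    """Return the file path behind an io.BufferedWriter, if there is one.

    Hypothetical helper: anything that is not a BufferedWriter opened on a
    named file (a plain string, a BytesIO, a handle opened from a file
    descriptor) is returned unchanged.
    """
    if isinstance(path_or_handle, io.BufferedWriter):
        # open(path, "wb") wraps an io.FileIO whose .name is the path given.
        name = getattr(path_or_handle.raw, "name", None)
        if isinstance(name, (str, bytes)):
            return os.fsdecode(name)
    return path_or_handle
```

Note that opening a file in "wb" mode has already created (or truncated) it on disk, so a helper like this only changes what is handed to the writer; it does not by itself undo the open.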

The issue still persists when partition columns are not None: an error leaves an empty or partially populated folder behind. This case was reproducible with PyArrow alone, so I have filed a bug report on the PyArrow JIRA:
https://issues.apache.org/jira/browse/ARROW-15375

Anirudhsekar96 added a commit to Anirudhsekar96/pandas that referenced this issue Jan 20, 2022
… for typecasting to string in path object for to_parquet
@jreback jreback modified the milestones: Contributions Welcome, 1.5 Jan 22, 2022
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this issue Jul 13, 2022
4 participants