BUG: Partial file write to disk on calling to_parquet() with engine='pyarrow' with unsupported dtype #44914
…s of io.BufferedWriter type.
This pull request closes the issue where calling to_parquet on a DataFrame without partition columns can leave a partially written file on disk. The issue was not reproducible by calling pyarrow's write_table function directly, because the path passed there was a string, whereas a string-like path passed through to_parquet is converted to an io.BufferedWriter object before being handed to pyarrow.parquet.write_table. This pull request adds code to convert the path object back to a string if it is an io.BufferedWriter. The issue still persists when partition columns are not None, where an error leaves behind an empty or partially populated folder. That case is reproducible in PyArrow itself, so I have raised a bug report on the PyArrow JIRA.
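A minimal sketch of the conversion described above, using only the standard library. `path_from_handle` and `demo_path_from_handle` are hypothetical names for illustration, not the code in the pull request. The sketch assumes that a handle opened with `open(path, "wb")` is an `io.BufferedWriter` whose `name` attribute holds the original path.

```python
import io
import os
import tempfile


def path_from_handle(handle):
    """Return the underlying path for a BufferedWriter, else the handle itself.

    Hypothetical helper: `name` can also be an integer file descriptor,
    so only str/bytes names are treated as paths.
    """
    if isinstance(handle, io.BufferedWriter) and isinstance(
        getattr(handle, "name", None), (str, bytes)
    ):
        return handle.name
    return handle


def demo_path_from_handle():
    """Show the helper on a real file handle and on an in-memory buffer."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.parquet")
        with open(target, "wb") as fh:
            from_file = path_from_handle(fh)  # the path string
    buf = io.BytesIO()
    from_buffer = path_from_handle(buf)  # unchanged: no path to recover
    return from_file == target, from_buffer is buf
```

In-memory buffers such as `io.BytesIO` pass through untouched, so callers that write to a buffer on purpose would not be affected by a check like this.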
… for typecasting to string in path object for to_parquet
…s of io.BufferWriter (pandas-dev#45480)
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
Issue Description
Calling to_parquet() with engine='pyarrow' on a DataFrame that contains a dtype pyarrow cannot write leaves a partially written file on disk.
Ref:
Expected Behavior
Raise ValueError before the partial write.
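The expected behavior can be illustrated with a validate-then-write pattern. This is a sketch under stated assumptions, not how pandas or pyarrow implement it: `write_values`, `SUPPORTED_TYPES`, and `demo_no_partial_file` are hypothetical, and a plain text file stands in for the Parquet output.

```python
import os
import tempfile

# Hypothetical whitelist standing in for "dtypes the writer supports".
SUPPORTED_TYPES = (int, float, str, bytes, bool, type(None))


def write_values(path, values):
    """Validate every value first, then write.

    Raises ValueError before the output file is opened, so a failed call
    never creates or truncates `path`.
    """
    for position, value in enumerate(values):
        if not isinstance(value, SUPPORTED_TYPES):
            raise ValueError(
                f"unsupported type {type(value).__name__} at position {position}"
            )
    with open(path, "w", encoding="utf-8") as fh:
        for value in values:
            fh.write(f"{value!r}\n")


def demo_no_partial_file():
    """Return whether a file exists after a failed write (expected: False)."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "out.txt")
        try:
            write_values(target, [1, 2.5, 3 + 4j])  # complex is not supported
        except ValueError:
            pass
        return os.path.exists(target)
```

Because validation runs before `open`, the failure path touches no files at all, which is the property the issue asks for.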
Installed Versions
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-41-generic
Version : #45~20.04.1-Ubuntu SMP Wed Nov 10 10:20:10 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.0.dev0+1356.gb7991da361.dirty
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 59.4.0
Cython : 0.29.25
pytest : 6.2.5
hypothesis : 6.31.3
sphinx : 4.3.1
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : 4.6.4
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.30.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 2021.11.0
fastparquet : 0.7.2
gcsfs : 2021.11.0
matplotlib : 3.5.0
numexpr : 2.8.0
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 6.0.1
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.7.3
sqlalchemy : 1.4.28
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1