BUG: to_parquet does not accept pathlib.PosixPath if partition_cols are defined #35902

Closed
2 of 3 tasks
vfilimonov opened this issue Aug 26, 2020 · 2 comments · Fixed by #36491
Labels
Bug IO Parquet parquet, feather
Comments

@vfilimonov
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pathlib

import pandas as pd

df = pd.DataFrame({'A':[1,2,3,4], 'B':'C'})

df.to_parquet('tmp_path1.parquet')  # OK
df.to_parquet(pathlib.Path('tmp_path2.parquet'))  # OK

df.to_parquet('tmp_path3.parquet', partition_cols=['B'])  # OK
df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B'])  # TypeError

Problem description

The to_parquet method raises a TypeError when it is given a pathlib.Path as the path argument and partition_cols is not None. If no partition columns are provided, a pathlib.Path is accepted without error.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-53-cae5a944d982> in <module>
      3 
      4 df.to_parquet('tmp_path3.parquet', partition_cols=['B']) # OK
----> 5 df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B'])  # TypeError

~/miniconda3/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

~/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py in to_parquet(self, path, engine, compression, index, partition_cols, **kwargs)
   2370             index=index,
   2371             partition_cols=partition_cols,
-> 2372             **kwargs,
   2373         )
   2374 

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
    274         index=index,
    275         partition_cols=partition_cols,
--> 276         **kwargs,
    277     )
    278 

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in write(self, df, path, compression, index, partition_cols, **kwargs)
    117                 compression=compression,
    118                 partition_cols=partition_cols,
--> 119                 **kwargs,
    120             )
    121         else:

~/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py in write_to_dataset(table, root_path, partition_cols, partition_filename_cb, filesystem, **kwargs)
   1790             subtable = pa.Table.from_pandas(subgroup, schema=subschema,
   1791                                             safe=False)
-> 1792             _mkdir_if_not_exists(fs, '/'.join([root_path, subdir]))
   1793             if partition_filename_cb:
   1794                 outfile = partition_filename_cb(keys)

TypeError: sequence item 0: expected str instance, PosixPath found
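
Judging by the last frame of the traceback, the failure comes from the `'/'.join([root_path, subdir])` call: `str.join` only accepts strings, so a `PosixPath` in the list raises a TypeError. A minimal sketch that reproduces just that step without pyarrow (the `subdir` value here is a made-up partition directory name, used only for illustration):

```python
import pathlib

root_path = pathlib.Path("tmp_path4.parquet")
subdir = "B=C"  # hypothetical partition directory name, for illustration only

# str.join requires every item to be a str, so a Path object is rejected.
try:
    "/".join([root_path, subdir])
    join_error = None
except TypeError as exc:
    join_error = str(exc)

print(join_error)

# Converting the Path to a string first lets the same join succeed.
joined = "/".join([str(root_path), subdir])
print(joined)
```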

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2ca0a2
python : 3.7.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
Version : Darwin Kernel Version 18.7.0: Thu Jun 18 20:50:10 PDT 2020; root:xnu-4903.278.43~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.1
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.2.2
setuptools : 42.0.1.post20191125
Cython : None
pytest : 5.3.0
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.1
pymysql : 0.9.3
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.13.0
pandas_datareader: 0.9.0
bs4 : 4.6.3
bottleneck : None
fsspec : 0.6.0
fastparquet : 0.3.2
gcsfs : None
matplotlib : 3.3.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.5.1
sqlalchemy : 1.3.13
tables : 3.4.4
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.1.0
xlwt : None
numba : 0.46.0

@vfilimonov added the Bug and Needs Triage labels Aug 26, 2020
@jorisvandenbossche
Member

@vfilimonov thanks for the report!
This is actually an issue with the underlying library being used, pyarrow, so I opened an issue with your description over there: https://issues.apache.org/jira/browse/ARROW-9864

@jorisvandenbossche added the IO Parquet label and removed the Needs Triage label Aug 26, 2020
@jorisvandenbossche
Member

That said, we could also already convert the pathlib object into a string on our side in pandas.
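
One way the string conversion suggested above might look, as a rough sketch only: `as_str_path` is a hypothetical helper name, not an existing pandas function, and this is not necessarily how the eventual fix was implemented. It relies on the standard-library `os.fspath`, and it leaves anything that is not path-like (for example an open buffer) untouched.

```python
import os
import pathlib


def as_str_path(path):
    """Hypothetical helper: return path-like objects as plain strings.

    Objects that do not implement os.PathLike (plain strings, buffers)
    are returned unchanged.
    """
    if isinstance(path, os.PathLike):
        return os.fspath(path)
    return path


target = as_str_path(pathlib.Path("tmp_path4.parquet"))
print(type(target).__name__, target)
```

Under that assumption, a call such as `df.to_parquet(as_str_path(pathlib.Path('tmp_path4.parquet')), partition_cols=['B'])` would hand pyarrow a plain string, the same as the `tmp_path3.parquet` case in the report that already works.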

4 participants