BUG: read_parquet gives TypeError if dtype is a pyarrow list #57411
Comments
Thanks for the report. Can you give this issue a meaningful title? Currently it is
Thanks! Title updated.
Thanks for the report - further investigations and PRs to fix are welcome!
take
I have done some digging into the problem. By inspecting the stack trace:
Traceback (most recent call last):
  File "/Users/kinianlo/github/pandas/scripts/debug.py", line 19, in <module>
    pd.read_parquet('ex.parquet', engine='auto')
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pandas/io/parquet.py", line 670, in read_parquet
    return impl.read(
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pandas/io/parquet.py", line 279, in read
    result = pa_table.to_pandas(**to_pandas_kwargs)
  File "pyarrow/array.pxi", line 867, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 4085, in pyarrow.lib.Table._to_pandas
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 764, in table_to_blockmanager
    ext_columns_dtypes = _get_extension_dtypes(
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 817, in _get_extension_dtypes
    pandas_dtype = _pandas_api.pandas_dtype(dtype)
  File "pyarrow/pandas-shim.pxi", line 140, in pyarrow.lib._PandasAPIShim.pandas_dtype
  File "pyarrow/pandas-shim.pxi", line 143, in pyarrow.lib._PandasAPIShim.pandas_dtype
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pandas/core/dtypes/common.py", line 1636, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: data type 'list<item: int64>[pyarrow]' not understood
it appears that the problem comes from the
Example of metadata:
The metadata that I was referring to can be obtained through a pyarrow table, e.g.
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 2, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "col", "field_name": "col", "pandas_type": "list[int64]", "numpy_type": "list<item: int64>[pyarrow]", "metadata": null}], "creator": {"library": "pyarrow", "version": "11.0.0"}, "pandas_version": "2.1.4"}'}
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 2, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "col", "field_name": "col", "pandas_type": "list[int64]", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "11.0.0"}, "pandas_version": "2.1.4"}'}
temporary fix
Use
potential long term fix
Ensure that
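The two metadata examples above differ only in the `numpy_type` recorded for the column: one stores the string `list<item: int64>[pyarrow]` (the same string that appears in the TypeError), the other stores `object`. As a rough sketch, the snippet below decodes metadata of that shape and lists the columns whose `numpy_type` ends in `[pyarrow]`. The helper name `pyarrow_backed_columns` and the abridged dictionaries are assumptions made for illustration; they are not part of pandas or pyarrow. With pyarrow installed, the real metadata could presumably be obtained from something like `pyarrow.parquet.read_table('ex.parquet').schema.metadata`.

```python
import json


def pyarrow_backed_columns(schema_metadata):
    """Hypothetical helper: names of columns whose numpy_type ends in '[pyarrow]'."""
    # The b'pandas' entry is a JSON document stored as bytes.
    pandas_meta = json.loads(schema_metadata[b"pandas"])
    return [
        col["name"]
        for col in pandas_meta["columns"]
        if str(col["numpy_type"]).endswith("[pyarrow]")
    ]


def make_metadata(numpy_type):
    """Build an abridged copy of the metadata above (only the 'columns' key)."""
    column = {
        "name": "col",
        "field_name": "col",
        "pandas_type": "list[int64]",
        "numpy_type": numpy_type,
        "metadata": None,
    }
    return {b"pandas": json.dumps({"columns": [column]}).encode("utf-8")}


# Abridged versions of the two metadata examples shown above.
meta_arrow = make_metadata("list<item: int64>[pyarrow]")
meta_object = make_metadata("object")

print(pyarrow_backed_columns(meta_arrow))   # ['col']
print(pyarrow_backed_columns(meta_object))  # []
```

A check like this only inspects the stored metadata; it does not read the file or change how `read_parquet` behaves.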
Hello everyone!
What do you think about this solution?
After some further digging, I propose the following two potential solutions:
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The example code produces
TypeError: data type 'list<item: int64>[pyarrow]' not understood
Expected Behavior
No error should occur, and
pd.read_parquet('ex.parquet')
should give a data frame identical to df
Installed Versions
INSTALLED VERSIONS
commit : fd3f571
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 6.1.58+
Version : #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.0
numpy : 1.25.2
pytz : 2023.4
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.8
pytest : 7.4.4
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.4
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 7.34.0
pandas_datareader : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.1
numba : 0.58.1
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.19.2
pyarrow : 10.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.25
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None