BUG: Category based on int do not survive a round/trip in parquet #39480

Closed
2 of 3 tasks
yohplala opened this issue Jan 30, 2021 · 1 comment
Labels
Bug IO Parquet parquet, feather Upstream issue Issue related to pandas dependency

Comments

@yohplala

  • I have checked that this issue has not already been reported. (As far as I could see, it has not been reported in the pandas repo, but it has come up incidentally in the pyarrow repo.)

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
file = '/home/whatever_test_directory/test.parquet'

# Does not work: category based on int
df_test = pd.DataFrame({'a':[1,2,3]}).astype({'a':'category'})
print(df_test['a'])        # check category has been correctly created

Name: a, dtype: category
Categories (3, int64): [1, 2, 3]

df_test.to_parquet(file)
df_read = pd.read_parquet(file)
print(df_read['a'])        # no longer a category

Name: a, dtype: int64

# Does work: category based on string
df_test = pd.DataFrame({'a':['1','2','3']}).astype({'a':'category'})
print(df_test['a'])        # check category has been correctly created

Name: a, dtype: category
Categories (3, object): ['1', '2', '3']

df_test.to_parquet(file)
df_read = pd.read_parquet(file)
print(df_read['a'])        # still a category!

Name: a, dtype: category
Categories (3, object): ['1', '2', '3']

Problem description

A categorical column should survive a round trip through parquet even when its categories are integers, just as it does when they are strings.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 9d598a5
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-41-generic
Version : #46~20.04.1-Ubuntu SMP Mon Jan 18 17:52:23 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 1.2.1
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : 0.29.21
pytest : 6.1.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : 0.5.0
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2

@yohplala yohplala added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 30, 2021
@jorisvandenbossche jorisvandenbossche added IO Parquet parquet, feather and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 2, 2021
@jorisvandenbossche
Member

@yohplala thanks for the report. This is a known issue with the pyarrow parquet support. See a recent issue about this on the pyarrow side: https://issues.apache.org/jira/browse/ARROW-11157

I am going to close the issue here, since this is something that needs to be handled/fixed in pyarrow, but input on the aforementioned arrow issue is certainly welcome.
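
Until this is handled upstream, one possible workaround is to record the dtypes before writing and re-apply the categorical ones after reading. The snippet below is only a minimal sketch of that idea, not something proposed in this thread: `restore_categoricals` is a hypothetical helper name, and the lossy parquet round trip is simulated with a plain `astype('int64')` so the example runs without a parquet engine installed.

```python
import pandas as pd


def restore_categoricals(df_read, original_dtypes):
    """Re-apply categorical dtypes that were lost on read (hypothetical helper)."""
    out = df_read.copy()
    for column, dtype in original_dtypes.items():
        # Only touch columns that were categorical before writing.
        if isinstance(dtype, pd.CategoricalDtype) and column in out.columns:
            out[column] = out[column].astype(dtype)
    return out


df_test = pd.DataFrame({'a': [1, 2, 3]}).astype({'a': 'category'})
original_dtypes = df_test.dtypes.to_dict()   # remember the dtypes before writing

# Assumption: stand-in for df_test.to_parquet(file); pd.read_parquet(file),
# which in the report above returns the column as int64.
df_read = df_test.astype({'a': 'int64'})

df_fixed = restore_categoricals(df_read, original_dtypes)
print(df_fixed['a'].dtype)
```

Passing the saved `CategoricalDtype` rather than the string `'category'` keeps the original category list and ordering, instead of inferring categories from whatever values happen to be in the file.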

@jorisvandenbossche jorisvandenbossche added the Upstream issue Issue related to pandas dependency label Feb 2, 2021
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Feb 2, 2021