I have checked that this issue has not already been reported. (not reported in pandas repo as I could see, but has been incidentally in pyarrow repo)
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
file = '/home/whatever_test_directory/test.parquet'
# Does not work: category based on int
df_test = pd.DataFrame({'a': [1, 2, 3]}).astype({'a': 'category'})
print(df_test['a']) # check category has been correctly created
Name: a, dtype: category
Categories (3, int64): [1, 2, 3]
df_test.to_parquet(file)
df_read=pd.read_parquet(file)
print(df_read['a']) # no longer a category
Name: a, dtype: int64
# Does work: category based on string
df_test = pd.DataFrame({'a': ['1', '2', '3']}).astype({'a': 'category'})
print(df_test['a']) # check category has been correctly created
Name: a, dtype: category
Categories (3, object): ['1', '2', '3']
df_test.to_parquet(file)
df_read=pd.read_parquet(file)
print(df_read['a']) # still a category!
Name: a, dtype: category
Categories (3, object): ['1', '2', '3']
Problem description
Even when its categories are ints, a categorical column should survive a round trip through Parquet.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 9d598a5
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-41-generic
Version : #46~20.04.1-Ubuntu SMP Mon Jan 18 17:52:23 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8
I am going to close the issue here, since this is something that needs to be handled/fixed in pyarrow, but input on the aforementioned arrow issue is certainly welcome.
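Until this is handled in pyarrow, one possible workaround on the pandas side is to re-cast the affected columns after reading, using the dtypes of the frame that was written. The following is only a minimal sketch: `restore_categories` is a hypothetical helper, not a pandas or pyarrow API, and the round trip is simulated with `astype` rather than an actual `to_parquet` / `read_parquet` call so that the example does not depend on a Parquet engine.

```python
import pandas as pd


def restore_categories(df_read, reference):
    """Re-cast columns of df_read that were categorical in reference.

    Hypothetical helper for illustration only.
    """
    out = df_read.copy()
    for name, dtype in reference.dtypes.items():
        if isinstance(dtype, pd.CategoricalDtype) and name in out.columns:
            # Reuse the original dtype so categories and ordering are preserved.
            out[name] = out[name].astype(dtype)
    return out


df_test = pd.DataFrame({'a': [1, 2, 3]}).astype({'a': 'category'})

# Stand-in for pd.read_parquet(file): as reported above, the int-based
# category comes back as plain int64.
df_read = df_test.astype({'a': 'int64'})

df_fixed = restore_categories(df_read, df_test)
print(df_fixed['a'].dtype)
```

This assumes the original frame (or at least its dtypes) is still available when the file is read back; if it is not, the categorical column names would have to be stored separately.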
Output of pd.show_versions()
INSTALLED VERSIONS
pandas : 1.2.1
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : 0.29.21
pytest : 6.1.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : 0.5.0
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2