BUG: Error writing DataFrame with categorical type column and "Int" data to a CSV file ("int" works of course) #46812

Closed
3 tasks done
eason9 opened this issue Apr 20, 2022 · 4 comments · Fixed by #47347
Labels
Bug, Categorical, IO CSV, NA - MaskedArrays, Regression

@eason9

eason9 commented Apr 20, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

d = {'name': ["bob", "todd", "sarah", "john"],
     'gp': [1, 2, np.nan, 2],  # np.nan rather than np.NaN, which NumPy 2.0 removed
     'score': [90, 40, 80, 98]}
df = pd.DataFrame(d)
df.name = df.name.astype("category")
df.gp = df.gp.astype("Int16")
df.gp = df.gp.astype("category")

print('-pandas version: ', pd.__version__)
print('-df dtypes:\n', df.dtypes)
print('-df.gp:\n', df.gp)
df.to_csv("test.csv")

Issue Description

This Bug is similar to: #46297. Executing the example above produces the following error in pandas 1.4.2 (this code example works fine on older pandas versions, e.g. 1.2.4). The problem occurs when writing a DataFrame to a .csv file if a column is categorical and its categories have a nullable "Int" dtype such as Int16 (not a default NumPy "int" dtype).

-pandas version:  1.4.2
-df dtypes:
 name     category
gp       category
score       int64
dtype: object
-df.gp:
 0      1
1      2
2    NaN
3      2
Name: gp, dtype: category
Categories (2, Int16): [1, 2]
Traceback (most recent call last):

  File "<ipython-input-28-995d10a4922f>", line 15, in <module>
    df.to_csv("test.csv")

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\generic.py", line 3551, in to_csv
    return DataFrameRenderer(formatter).to_csv(

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\io\formats\format.py", line 1180, in to_csv
    csv_formatter.save()

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\io\formats\csvs.py", line 261, in save
    self._save()

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\io\formats\csvs.py", line 266, in _save
    self._save_body()

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\io\formats\csvs.py", line 304, in _save_body
    self._save_chunk(start_i, end_i)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\io\formats\csvs.py", line 311, in _save_chunk
    res = df._mgr.to_native_types(**self._number_format)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\internals\managers.py", line 473, in to_native_types
    return self.apply("to_native_types", **kwargs)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\internals\managers.py", line 304, in apply
    applied = getattr(b, f)(**kwargs)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\internals\blocks.py", line 634, in to_native_types
    result = to_native_types(self.values, na_rep=na_rep, quoting=quoting, **kwargs)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\internals\blocks.py", line 2163, in to_native_types
    values = take_nd(

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\array_algos\take.py", line 114, in take_nd
    return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\arrays\masked.py", line 653, in take
    return type(self)(result, mask, copy=False)

  File "C:\Users\Sade\anaconda3\envs\asdf\lib\site-packages\pandas\core\arrays\integer.py", line 315, in __init__
    raise TypeError(

TypeError: values should be integer numpy array. Use the 'pd.array' function instead
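
To make the "Int" versus "int" distinction concrete, here is a minimal sketch (not part of the original report) that builds the same kind of column both ways and inspects the dtype of the resulting categories. The variable names are illustrative, and the result assumes a pandas version where an Index can hold nullable extension dtypes, as in the output above.

```python
import pandas as pd

# Nullable extension dtype ("Int16"): missing values are allowed.
nullable = pd.Series([1, 2, None, 2], dtype="Int16").astype("category")

# Plain NumPy dtype ("int16"): the column cannot hold missing values.
plain = pd.Series([1, 2, 2], dtype="int16").astype("category")

# The dtype of the categories is what differs between the two cases.
nullable_kind = str(nullable.cat.categories.dtype)
plain_kind = str(plain.cat.categories.dtype)
print(nullable_kind, plain_kind)
```

Only the first case, where the categories carry the nullable dtype, matches the failing column in the report.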

Expected Behavior

I'd expect the above example to write a dataframe to a .csv as in older pandas versions instead of ending in an error traceback.
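
As a rough sketch of how the expected behavior could be checked, the snippet below writes the same frame to an in-memory buffer instead of a file and collects the CSV lines. Using io.StringIO and pd.array here is my own choice for illustration, not something from the report, and the snippet is only expected to succeed on a pandas version without this regression.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "name": ["bob", "todd", "sarah", "john"],
        "gp": pd.array([1, 2, None, 2], dtype="Int16"),
        "score": [90, 40, 80, 98],
    }
)
df["name"] = df["name"].astype("category")
df["gp"] = df["gp"].astype("category")  # categorical over nullable Int16

# Write to a buffer so the check does not touch the filesystem.
buffer = io.StringIO()
df.to_csv(buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

On an unaffected version, the missing gp value should appear as an empty field in the third data row.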

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 8 Stepping 2, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en
LOCALE : English_United States.1252

pandas : 1.4.2
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : 0.29.21
pytest : 6.1.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
brotli :
fastparquet : None
fsspec : 0.8.3
gcsfs : None
markupsafe : 1.1.1
matplotlib : 3.3.2
numba : 0.51.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
snappy : None
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
zstandard : None

eason9 added the Bug and Needs Triage labels Apr 20, 2022
@phofl
Member

phofl commented Apr 22, 2022

Hi, thanks for your report. As a note: this also fails for Int64, not only Int16.

phofl added the IO CSV and NA - MaskedArrays labels and removed the Needs Triage label Apr 22, 2022
eason9 changed the title from 'BUG: Error writing DataFrame with categorical type column and "Int16" data to a CSV file ("int16" works of course)' to 'BUG: Error writing DataFrame with categorical type column and "Int" data to a CSV file ("int" works of course)' Apr 26, 2022
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue May 9, 2022
simonjayhawkins added the Categorical label May 9, 2022
@simonjayhawkins
Member

Thanks @eason9 for the report.

This Bug is similar to: #46297.

Indeed, but this issue appears to arise from a different code path, and the regression comes from a different commit, so we will keep both issues open.

Executing the example above produces the following error in pandas 1.4.2 (this code example works fine on older pandas versions, e.g. 1.2.4).

first bad commit: [e750c94] ENH: allow storing ExtensionArrays in Index (#43930)

cc @jbrockmendel

simonjayhawkins added the Regression label May 9, 2022
simonjayhawkins added this to the 1.4.3 milestone May 9, 2022
@jbrockmendel
Member

The first clear place where things go wrong is in take_nd, where we have an IntegerArray arr and a string fill_value. For ndarray cases this goes through maybe_promote, which ensures that we don't raise. For EAs it does not, leading to the exception we see. Changing this is one way to avoid the failure here, but I'm not yet confident it is the right way to do so.
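
The promotion behavior described above can be sketched with public API. This uses pandas.api.extensions.take as a stand-in for the internal take_nd, which is an assumption made for illustration, and it deliberately avoids passing a string fill value to the masked array, since that is the combination reported to raise in 1.4.2.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import take

# ndarray case: an int16 array taken with a string fill value.
# The result is promoted to object dtype so the string fits.
plain = np.array([1, 2], dtype="int16")
promoted = take(plain, [0, -1], allow_fill=True, fill_value="")

# Masked (nullable) case with the default missing-value fill:
# the result stays Int16 and the filled slot becomes pd.NA.
masked = pd.array([1, 2], dtype="Int16")
filled = masked.take([0, -1], allow_fill=True)

print(promoted.dtype, filled.dtype)
```

The contrast is the point: the ndarray path widens its dtype to accommodate the fill value, while the masked array keeps its integer dtype and so has no room for a string.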

I think there was a recent-ish PR that started special-casing Categorical inside blocks.to_native_types. It looks like changing the condition there to if isinstance(values, Categorical) and values.categories.dtype.kind in "Mm" solves this (I haven't run the test suite with this yet).
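
To illustrate what the proposed condition selects, here is a small sketch that evaluates it outside pandas internals. The helper name is_datetimelike_categorical is hypothetical and exists only for this example; the condition itself is the one quoted above.

```python
import pandas as pd


def is_datetimelike_categorical(values) -> bool:
    """Hypothetical helper mirroring the condition quoted above."""
    return (
        isinstance(values, pd.Categorical)
        and values.categories.dtype.kind in "Mm"
    )


# Categorical over nullable Int16, built the same way as in the report.
int_cat = pd.Series([1, 2, None, 2], dtype="Int16").astype("category").array
# Categoricals over datetime64 and timedelta64 categories.
date_cat = pd.Categorical(pd.to_datetime(["2022-04-20", "2022-04-22"]))
delta_cat = pd.Categorical(pd.to_timedelta(["1 day", "2 days"]))

checks = [
    is_datetimelike_categorical(c) for c in (int_cat, date_cat, delta_cat)
]
print(checks)
```

With this condition, only categoricals whose categories are datetime-like ("M") or timedelta-like ("m") take the special-cased branch, so the nullable integer case from this issue falls through to the general path.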

@phofl
Member

phofl commented Jun 14, 2022

Thanks, that was exactly what I wanted to accomplish with the special-casing.
