Columns lose category dtype after calling replace on the dataframe #23305

Closed
somiandras opened this issue Oct 23, 2018 · 5 comments · Fixed by #35234
Labels: Categorical (Categorical Data Type), good first issue, Needs Tests (Unit test(s) needed to prevent regressions)

@somiandras

somiandras commented Oct 23, 2018

xref #25521

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> pd.__version__
'0.23.4'

# Sample dataframe with one float and two categorical columns
>>> df = pd.DataFrame(
...     [[1., 'A', 'x'], [2, 'B', 'y'], [3, 'C', 'z']],
...     columns=['first', 'second', 'third']
... ).astype({'second': 'category', 'third': 'category'})

# Replace a numeric value (only present in column `first`)
>>> df = df.replace(1, 10)

# Both categorical columns turned into object...
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
first     3 non-null float64
second    3 non-null object
third     3 non-null object
dtypes: float64(1), object(2)
memory usage: 152.0+ bytes

# Sample dataframe with two categorical columns
>>> df = pd.DataFrame(
...     [['A', 'x'], ['B', 'y'], ['C', 'z']],
...     columns=['first', 'second',]
... ).astype({'first': 'category', 'second': 'category'})

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
first     3 non-null category
second    3 non-null category
dtypes: category(2)
memory usage: 294.0 bytes

# Replace values in column `first`
>>> df = df.replace('A', 'B')

# Dtype of column `second` becomes object
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
first     3 non-null category
second    3 non-null object
dtypes: category(1), object(1)
memory usage: 211.0+ bytes

Problem description

Calling replace() on a dataframe changes the dtype of category columns to object in an apparently inconsistent manner:

  1. When replace() is called with non-categorical values that are not present in any categorical column, all categorical columns turn into object.
  2. When replace() is called with a category from one categorical column, that column keeps its dtype, but the other categorical columns turn into object.

See the examples above for both. Even if this is somehow intentional, I find it quite confusing.

Expected Output

I would expect the categorical columns to keep their category dtype after replace is called on the dataframe (at least for those categorical columns that are unaffected by replace).
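
The expectation above could be pinned down with a small regression check. The following is only a sketch: the function name `check_replace_keeps_category` is hypothetical, and the assertions describe the desired behavior, so they fail on pandas 0.23.4 (as reported) and should only pass on versions where the dtype is kept.

```python
import pandas as pd


def check_replace_keeps_category():
    # Same frame as in the report: one float column, two categorical columns.
    df = pd.DataFrame(
        [[1.0, "A", "x"], [2.0, "B", "y"], [3.0, "C", "z"]],
        columns=["first", "second", "third"],
    ).astype({"second": "category", "third": "category"})

    result = df.replace(1, 10)

    # The replaced value only occurs in the float column ...
    assert result["first"].tolist() == [10.0, 2.0, 3.0]
    # ... so the untouched categorical columns should keep their dtype.
    assert str(result["second"].dtype) == "category"
    assert str(result["third"].dtype) == "category"
    return result
```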

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Oct 24, 2018

Hmm, this does look strange. When called with a dict, it doesn't seem to exhibit the same behavior:

In [14]: df.replace(1, 10).dtypes                                               
Out[14]: 
first     float64
second     object
third      object
dtype: object

In [15]: df.replace({1: 10}).dtypes                                             
Out[15]: 
first      float64
second    category
third     category
dtype: object

Investigation and PRs certainly welcome

@WillAyd added the Bug and Categorical (Categorical Data Type) labels Oct 24, 2018
@WillAyd added this to the Contributions Welcome milestone Oct 24, 2018
@ghost

ghost commented Oct 25, 2018

import pandas as pd 

>>> pd.__version__
'0.23.4'
>>> df = pd.DataFrame(
    [[1., 10, True, 'x'],
    [2, 20, False, 'y']],
    columns=['float', 'int', 'bool', 'str'])
>>> df.dtypes
float    float64 
int        int64 
bool        bool 
str       object 
dtype: object    
>>> df2 = df.astype({
...     'float': 'category',
...     'int': 'category',
...     'bool': 'category',
...     'str': 'category'
... })
>>> df2.dtypes # all columns' type is category
float    category
int      category
bool     category
str      category
dtype: object
>>> df2_replace_float_column = df2.replace(1, 10)
>>> df2_replace_float_column.dtypes # only `int` keeps its category dtype
float     float64
int      category
bool       object
str        object
dtype: object
>>> df2_replace_float_column_new = df2.replace({1:10})
>>> df2_replace_float_column_new.dtypes # bool changed
float       int64   
int      category   
bool        int64   
str      category   
dtype: object       

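So using a dict may not be the best way either.
A simple workaround is to call replace() on the original frame and cast to category afterwards:

>>> df2_new = df.replace({1:10}).astype({
...     'float': 'category',
...     'int': 'category',
...     'bool': 'category',
...     'str': 'category'
... })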
>>> df2_new.dtypes
float    category
int      category
bool     category
str      category
dtype: object

Note that replace() returns a new DataFrame object:

>>> id(df2) == id(df2.replace({1:10}))
False
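
The re-cast workaround above can be wrapped in a small helper so the list of categorical columns does not have to be typed out by hand. This is a minimal sketch, not part of pandas: `replace_then_recast` is a hypothetical name, and it assumes that re-inferring the categories after the replacement is acceptable.

```python
import pandas as pd


def replace_then_recast(df, mapping):
    """Hypothetical helper: replace values, then restore category dtypes."""
    # Remember which columns were categorical before the call.
    cat_cols = list(df.select_dtypes(include="category").columns)
    out = df.replace(mapping)
    for col in cat_cols:
        # Re-casting infers the categories again from the values, so a custom
        # category order or ordered=True is not carried over.
        out[col] = out[col].astype("category")
    return out


df = pd.DataFrame(
    {"num": [1.0, 2.0], "label": ["x", "y"]}
).astype({"label": "category"})
out = replace_then_recast(df, {1.0: 10.0})
print(out.dtypes)
```

Because replace() returns a new object, the original `df` is left unchanged.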

@fuglede

fuglede commented Dec 27, 2019

When something happens here, you may want to add the details to this related StackOverflow question.

@mroeschke
Member

Looks like the categorical dtypes are kept now. Could use a test

In [103]: >>> df = pd.DataFrame(
     ...: ...     [[1., 'A', 'x'], [2, 'B', 'y'], [3, 'C', 'z']],
     ...: ...     columns=['first', 'second', 'third']
     ...: ... ).astype({'second': 'category', 'third': 'category'})
     ...:
     ...: # Replace int values
     ...: >>> df = df.replace(1, 10)
     ...:
     ...: # Both categorical columns turned into object...
     ...: >>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   3 non-null      float64
 1   second  3 non-null      category
 2   third   3 non-null      category
dtypes: category(2), float64(1)
memory usage: 366.0 bytes

In [104]: >>> df = pd.DataFrame(
     ...: ...     [['A', 'x'], ['B', 'y'], ['C', 'z']],
     ...: ...     columns=['first', 'second',]
     ...: ... ).astype({'first': 'category', 'second': 'category'})
     ...:
     ...: >>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   3 non-null      category
 1   second  3 non-null      category
dtypes: category(2)
memory usage: 342.0 bytes

In [105]: >>> df = df.replace('A', 'B')
     ...:
     ...: # Dtype of column `second` becomes object
     ...: >>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   first   3 non-null      category
 1   second  3 non-null      category
dtypes: category(2)
memory usage: 334.0 bytes

In [106]: pd.__version__
Out[106]: '1.1.0.dev0+1974.g0159cba6e'

@mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Bug and Categorical (Categorical Data Type) labels Jun 28, 2020
mathurk1 pushed a commit to mathurk1/pandas that referenced this issue Jul 11, 2020
@mathurk1
Contributor

It looks like the issue is only partially fixed. The dtypes get changed when we use a dictionary to replace values.

>>> import pandas as pd
>>> pd.__version__
'1.1.0.dev0+2073.g280efbfcc'

>>> input_dict = {
...     "col1": [1, 2, 3, 4],
...     "col2": ["a", "b", "c", "d"],
...     "col3": [1.5, 2.5, 3.5, 4.5],
...     "col4": ["cat1", "cat2", "cat3", "cat4"],
...     "col5": ["obj1", "obj2", "obj3", "obj4"],
... }
>>> input_df = pd.DataFrame(data=input_dict).astype({"col2": "category", "col4": "category"})
>>> input_df["col2"] = input_df["col2"].cat.reorder_categories(["a", "b", "c", "d"], ordered=True)
>>> input_df["col4"] = input_df["col4"].cat.reorder_categories(["cat1", "cat2", "cat3", "cat4"], ordered=True)

>>> input_df
   col1 col2  col3  col4  col5
0     1    a   1.5  cat1  obj1
1     2    b   2.5  cat2  obj2
2     3    c   3.5  cat3  obj3
3     4    d   4.5  cat4  obj4
>>> input_df.dtypes
col1       int64
col2    category
col3     float64
col4    category
col5      object
dtype: object

>>> input_df = input_df.replace({"d": "z", "obj1": "obj9", "cat2": "catX"})
>>> input_df
   col1 col2  col3  col4  col5
0     1    a   1.5  cat1  obj9
1     2    b   2.5  catX  obj2
2     3    c   3.5  cat3  obj3
3     4    z   4.5  cat4  obj4
>>> input_df.dtypes
col1      int64
col2     object
col3    float64
col4     object
col5     object
dtype: object
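
Until the dict case keeps the dtype, one possible workaround is to rename the categories directly with `Series.cat.rename_categories` instead of going through replace(). The sketch below is an assumption about how that could look, not the fix that was merged: `replace_in_categories` is a hypothetical helper, and it only covers the case where the new names do not collide with existing categories (rename_categories raises if the result is not unique).

```python
import pandas as pd


def replace_in_categories(df, mapping):
    """Hypothetical helper: apply a value mapping without losing category dtypes."""
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            # Rename the categories themselves; keys that are not categories
            # of this column are ignored, and ordered=True is preserved.
            out[col] = out[col].cat.rename_categories(mapping)
        else:
            out[col] = out[col].replace(mapping)
    return out


df = pd.DataFrame({
    "col2": pd.Categorical(
        ["a", "b", "c", "d"], categories=["a", "b", "c", "d"], ordered=True
    ),
    "col5": ["obj1", "obj2", "obj3", "obj4"],
})
out = replace_in_categories(df, {"d": "z", "obj1": "obj9"})
print(out.dtypes)
```

Renaming keeps the category codes as they are, which is why the ordering of an ordered categorical survives.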

@jreback added the Categorical (Categorical Data Type) label Jul 13, 2020
@jreback modified the milestones: Contributions Welcome, 1.1 Jul 13, 2020