Columns lose category dtype after calling replace on the dataframe #23305
Comments
Hmm this does look strange. If called with a dict this doesn't seem to exhibit the same behavior:
In [14]: df.replace(1, 10).dtypes
Out[14]:
first float64
second object
third object
dtype: object
In [15]: df.replace({1: 10}).dtypes
Out[15]:
first float64
second category
third category
dtype: object
Investigation and PRs certainly welcome.
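For anyone investigating, a small helper along these lines may make the difference between the two calls easier to see. This is only a sketch: `dtype_changes` is a hypothetical name, not a pandas function, and the frame below is made up for illustration rather than taken from the report.

```python
import pandas as pd


def dtype_changes(before, after):
    """Hypothetical helper: map column name -> (old dtype, new dtype)
    for every shared column whose dtype differs between the two frames."""
    return {
        col: (str(before[col].dtype), str(after[col].dtype))
        for col in before.columns
        if col in after.columns and before[col].dtype != after[col].dtype
    }


# Illustrative data only (assumed, not the frame from the issue).
before = pd.DataFrame({
    "first": [1.0, 2.0],
    "second": pd.Categorical(["a", "b"]),
})
# Simulate the reported symptom by casting the categorical column to object.
after = before.astype({"second": object})

print(dtype_changes(before, after))
```

Calling `dtype_changes(df, df.replace(1, 10))` and `dtype_changes(df, df.replace({1: 10}))` on the frame from the report should then list exactly the columns each form of the call affects on a given pandas version.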
>>> import pandas as pd
>>> pd.__version__
'0.23.4'
>>> df = pd.DataFrame(
    [[1., 10, True, 'x'],
     [2, 20, False, 'y']],
    columns=['float', 'int', 'bool', 'str'])
>>> df.dtypes
float float64
int int64
bool bool
str object
dtype: object
>>> df2 = df.astype({
    'float': 'category',
    'int': 'category',
    'bool': 'category',
    'str': 'category'
})
>>> df2.dtypes # all columns' type is category
float category
int category
bool category
str category
dtype: object
>>> df2_replace_float_column = df2.replace(1, 10)
>>> df2_replace_float_column.dtypes # only 'int' keeps its category dtype
float float64
int category
bool object
str object
dtype: object
>>> df2_replace_float_column_new = df2.replace({1:10})
>>> df2_replace_float_column_new.dtypes # 'float' and 'bool' changed to int64
float int64
int category
bool int64
str category
dtype: object
So using a dict may not be the best way.
>>> df2_new = df.replace({1:10}).astype({
    'float': 'category',
    'int': 'category',
    'bool': 'category',
    'str': 'category'
})
>>> df2_new.dtypes
float category
int category
bool category
str category
dtype: object
>>> id(df2) == id(df2.replace({1:10}))
False
When something happens here, you may want to add the details to this related StackOverflow question.
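The replace-then-cast order used for df2_new above can be wrapped in a small function. This is a minimal sketch under stated assumptions: `replace_then_categorize` is a hypothetical name, it casts every column to category (as the session above does), and the sample frame leaves out the bool column because `True == 1` in Python and the effect of `{1: 10}` on it differs between the outputs shown above.

```python
import pandas as pd


def replace_then_categorize(df, mapping):
    """Hypothetical helper: run replace() on the plain (non-categorical)
    frame first, then cast every column to 'category'."""
    replaced = df.replace(mapping)
    return replaced.astype({col: "category" for col in replaced.columns})


# Same shape as the frame in the session above, minus the bool column.
df = pd.DataFrame(
    [[1.0, 10, "x"], [2.0, 20, "y"]],
    columns=["float", "int", "str"],
)

df2_new = replace_then_categorize(df, {1: 10})
print(df2_new.dtypes)
```

The trade-off is that the categories are rebuilt from the replaced values, so any existing category order or unused categories would have to be restored separately.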
Looks like the categorical dtypes are kept now. Could use a test
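A regression test for this might look roughly like the sketch below. It assumes a pandas version in which the fix has landed; the column names follow the output quoted earlier in the thread, but the values are invented for illustration, and the test name is hypothetical rather than one from the pandas test suite.

```python
import pandas as pd


def test_replace_keeps_unaffected_category_dtype():
    # Illustrative frame: one float column and two categorical columns
    # whose categories do not contain the value being replaced.
    df = pd.DataFrame({
        "first": [1.0, 2.0],
        "second": pd.Categorical(["a", "b"]),
        "third": pd.Categorical(["c", "d"]),
    })

    result = df.replace(1, 10)

    # The float column is the only one that should be touched.
    assert list(result["first"]) == [10.0, 2.0]
    # Categorical columns unaffected by the replacement keep their dtype.
    assert str(result["second"].dtype) == "category"
    assert str(result["third"].dtype) == "category"


test_replace_keeps_unaffected_category_dtype()
```

This only covers the scalar form of the call with untouched categorical columns; the dict form discussed in the next comment would need its own case.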
It looks like the issue is only partially fixed. The dtypes get changed when we use a dictionary to replace values.
xref #25521
Code Sample, a copy-pastable example if possible
Problem description
Calling replace() on a dataframe changes category columns' dtype to object in an apparently inconsistent manner:
- If replace() contains non-categorical values that are not present in either categorical column, all categorical dtypes turn into object.
- If replace() contains categories from a category dtype column, then that column keeps its dtype, but other categorical columns turn into object.
See the examples above for both. Even if it is somehow intentional I find it quite confusing.
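For columns whose categories do contain the value being replaced, one possible way to sidestep the inconsistency is to rename the category itself with `Series.cat.rename_categories` instead of going through replace(). The sketch below is an assumption about how that could be wired up, not the behaviour of replace(): `replace_in_categories` is a hypothetical helper and the frame is made up for illustration.

```python
import pandas as pd


def replace_in_categories(df, mapping):
    """Hypothetical helper: apply `mapping` column by column, renaming
    categories in categorical columns so their dtype is preserved."""
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            # Dict-like rename: categories missing from the mapping
            # are passed through unchanged.
            out[col] = out[col].cat.rename_categories(mapping)
        else:
            out[col] = out[col].replace(mapping)
    return out


# Illustrative data only (assumed, not the frame from the issue).
df = pd.DataFrame({
    "first": [1.0, 2.0],
    "second": pd.Categorical(["a", "b"]),
    "third": pd.Categorical([1, 2]),
})

result = replace_in_categories(df, {1: 10})
print(result.dtypes)
```

One limitation to keep in mind: if the new name collides with an existing category (for example replacing 1 with 2 in the third column), the rename would produce duplicate categories and pandas raises an error, so that case needs separate handling.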
Expected Output
I would expect the categorical columns to keep their category dtype after replace is called on the dataframe (at least for those categorical columns that are unaffected by replace).
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None