Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: dtype not being preserved for replace on a CategoricalDtype #46672

Closed
2 of 3 tasks
galipremsagar opened this issue Apr 6, 2022 · 5 comments · Fixed by #48168
Closed
2 of 3 tasks

BUG: dtype not being preserved for replace on a CategoricalDtype #46672

galipremsagar opened this issue Apr 6, 2022 · 5 comments · Fixed by #48168
Labels
Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version replace replace method
Milestone

Comments

@galipremsagar
Copy link

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> pd.__version__
'1.4.2'
>>> pdf = pd.DataFrame(
...             {
...                 "a": ["one", "two", None, "three"],
...                 "b": ["one", None, "two", "three"],
...             },
...             dtype="category",
...         )
>>> pdf.dtypes
a    category
b    category
dtype: object
>>> pdf
       a      b
0    one    one
1    two    NaN
2    NaN    two
3  three  three
>>> new_df = pdf.replace(to_replace=[".", "def"], value=["_", None])
>>> new_df
       a      b
0    one    one
1    two    NaN
2    NaN    two
3  three  three
>>> new_df.dtypes
a    object
b    object
dtype: object

Issue Description

When the series is of category dtype, calling replace with above inputs is not preserving the dtype.

Expected Behavior

>>> import pandas as pd
>>> pd.__version__
'1.3.5'
>>> pdf = pd.DataFrame(
...             {
...                 "a": ["one", "two", None, "three"],
...                 "b": ["one", None, "two", "three"],
...             },
...             dtype="category",
...         )
>>> new_df = pdf.replace(to_replace=[".", "def"], value=["_", None])
>>> new_df
       a      b
0    one    one
1    two    NaN
2    NaN    two
3  three  three
>>> new_df.dtypes
a    category
b    category
dtype: object

Installed Versions

pd.show_versions()
Traceback (most recent call last):
File "", line 1, in
File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/util/_print_versions.py", line 109, in show_versions
deps = _get_dependency_info()
File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/util/_print_versions.py", line 88, in _get_dependency_info
mod = import_optional_dependency(modname, errors="ignore")
File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/compat/_optional.py", line 138, in import_optional_dependency
module = importlib.import_module(name)
File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 843, in exec_module
File "", line 219, in _call_with_frames_removed
File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/setuptools/init.py", line 8, in
import _distutils_hack.override # noqa: F401
File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/_distutils_hack/override.py", line 1, in
import('_distutils_hack').do_override()
File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/_distutils_hack/init.py", line 72, in do_override
ensure_local_distutils()
File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/_distutils_hack/init.py", line 59, in ensure_local_distutils
assert '_distutils' in core.file, core.file
AssertionError: /nvme/0/pgali/envs/cudfdev/lib/python3.8/distutils/core.py

@galipremsagar galipremsagar added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 6, 2022
@rhshadrach
Copy link
Member

Thanks for the report @galipremsagar! Within the BlockManager, there is special logic to handle None

else:
if value is None:
# gh-45601, gh-45836
nb = self.astype(np.dtype(object), copy=False)
if nb is self and not inplace:
nb = nb.copy()
putmask_inplace(nb.values, mask, value)
return [nb]
return self.replace(
to_replace=to_replace, value=value, inplace=inplace, mask=mask
)

but I'm not sure it was ever meant to handle Categorical blocks. If the code were to take the return self.replace(...) path, then the result is indeed Categorical. Further investigations and PRs to fix are welcome!

@rhshadrach rhshadrach added Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions replace replace method and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 11, 2022
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Apr 11, 2022
@rhshadrach rhshadrach added the Regression Functionality that used to work in a prior pandas version label Apr 11, 2022
@rhshadrach rhshadrach modified the milestones: Contributions Welcome, 1.4.3 Apr 11, 2022
@simonjayhawkins
Copy link
Member

Thanks @galipremsagar for the report, I noticed this while fixing #46634

The cause of this regression is #46404

from #46636 (comment)

my biggest concern (need to check this though) is categorical since the special casing was also moved in #44940.

We need a fix that preserves all the bug fixes in #44940 that were included in 1.4 and also restores the 1.3.5 behavior when None is the replacement value #45601 and #45836 while at the same time making these None cases consistent with other replacement values such as copying and splitting blocks.

This work is ongoing and being discussed in #46636, #46443 and #46393

The underlying issue is that the bug fixes introduced in 1.4 (#44940) were achieved by dispatching to Block.replace which appears handle None values differently for single replacement than is the existing behavior for None in a list like.

Ideally we would fix this on main in Block.replace, but it is my opinion that this may be difficult to backport (due to other changes that we would not normally backport for a patch release). However, if we continue with small patches like #46634 we would also need to revert the special case handling of Categorical to pre #44940 and would still not solve all the issue around copying and splitting.

There is also an extra issue on main, #45725 which needs to be fixed before we can dispatch to Block.replace

but I'm not sure it was ever meant to handle Categorical blocks.

this was done elsewhere.

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Apr 12, 2022

Looking some more, this code sample is actually a duplicate of #46634 since the replacement is a no-op?

The regression for categorical comes when there is a replacement performed.

in 1.3.5

>>> new_df = pdf.replace(to_replace=[".", "one"], value=["_", None])
>>> new_df
       a      b
0   None   None
1    two    NaN
2    NaN    two
3  three  three
>>> 
>>> new_df.dtypes
a    object
b    object
dtype: object
>>> 
>>> new_df["a"]
0     None
1      two
2      NaN
3    three
Name: a, dtype: object
>>> 
>>> new_df.dtypes.a
dtype('O')
>>> 
>>> pdf.dtypes.a
CategoricalDtype(categories=['one', 'three', 'two'], ordered=False)
>>> 
>>> 

in 1.4.0

>>> new_df = pdf.replace(to_replace=[".", "one"], value=["_", None])
>>> new_df
       a      b
0    NaN    NaN
1    two    NaN
2    NaN    two
3  three  three
>>> 
>>> new_df.dtypes
a    category
b    category
dtype: object
>>> 
>>> new_df["a"]
0      NaN
1      two
2      NaN
3    three
Name: a, dtype: category
Categories (2, object): ['three', 'two']
>>> 
>>> new_df.dtypes.a
CategoricalDtype(categories=['three', 'two'], ordered=False)
>>> 
>>> pdf.dtypes.a
CategoricalDtype(categories=['one', 'three', 'two'], ordered=False)
>>> 

in 1.4.2

>>> new_df = pdf.replace(to_replace=[".", "one"], value=["_", None])
>>> new_df
       a      b
0   None   None
1    two    NaN
2    NaN    two
3  three  three
>>> 
>>> new_df.dtypes
a    object
b    object
dtype: object
>>> 
>>> new_df["a"]
0     None
1      two
2      NaN
3    three
Name: a, dtype: object
>>> 
>>> new_df.dtypes.a
dtype('O')
>>> 
>>> pdf.dtypes.a
CategoricalDtype(categories=['one', 'three', 'two'], ordered=False)
>>> 

So in 1.4.2 we have reverted to 1.3.5 behavior and convert to object when replacing a value.

In 1.4.0 we retain the categorical dtype, but with a different CategoricalDtype (and convert the None to a missing value). I'm assuming, but not convinced, this is now the intended behavior. @jbrockmendel ?

@jbrockmendel
Copy link
Member

In 1.4.0 we retain the categorical dtype, but with a different CategoricalDtype (and convert the None to a missing value). I'm assuming, but not convinced, this is now the intended behavior. @jbrockmendel ?

Haven't looked super-closely, but I think the if value is None check @rhshadrach linked to above was written
a) with non-Cateogorical in mind, so his suggestion of not going through that path for Categorical would make sense (though it would further complicate an already messy method)
b) specifically for cases where non-None NA values are being replaced by None, which does not appear to be the case here.

@simonjayhawkins
Copy link
Member

moving to 1.4.4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version replace replace method
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants