BUG: CategoricalIndex.get_indexer issue with NaNs (#45361) #45373

Shashank-Shet · 2022-01-14T16:51:30Z

Bug Example:

ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])

res = ci.get_indexer(other)

>>> res
array([1, 3, 2])

ci.get_indexer(other) yields incorrect results when ci contains NaNs and
other does not. The reason is that any elements of other that do not match
a category in ci are replaced by NaNs. If ci also contains NaNs, those
elements of other are then mapped to the position of the NaN in ci
instead of -1.

For example:

ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])
ci.get_indexer(other)

In the implementation of get_indexer, other becomes [2, 3, NaN], and the
NaN is then mapped to index 2 in ci.
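
The mechanism described above can be modelled without pandas. The following is a minimal pure-Python sketch, not the pandas implementation; the function names `position_of`, `get_indexer_buggy`, and `get_indexer_fixed` are hypothetical and exist only for illustration. It assumes the index is a plain list in which NaN marks a missing value.

```python
import math


def is_nan(x):
    """True only for float NaN values."""
    return isinstance(x, float) and math.isnan(x)


def position_of(values, item):
    """First position of item in values, treating NaN as equal to NaN; -1 if absent."""
    for pos, v in enumerate(values):
        if v == item or (is_nan(v) and is_nan(item)):
            return pos
    return -1


def get_indexer_buggy(index_values, target):
    categories = [v for v in index_values if not is_nan(v)]
    # Step 1: target elements that match no category are replaced by NaN.
    coerced = [t if t in categories else math.nan for t in target]
    # Step 2: the inserted NaN now matches the NaN already present in the index.
    return [position_of(index_values, t) for t in coerced]


def get_indexer_fixed(index_values, target):
    categories = [v for v in index_values if not is_nan(v)]
    result = []
    for t in target:
        if is_nan(t) or t in categories:
            # Genuine NaNs and known categories are looked up normally.
            result.append(position_of(index_values, t))
        else:
            # Unmatched elements are reported as missing.
            result.append(-1)
    return result


index_values = [1, 2, math.nan, 3]
print(get_indexer_buggy(index_values, [2, 3, 4]))  # [1, 3, 2]
print(get_indexer_fixed(index_values, [2, 3, 4]))  # [1, 3, -1]
```

The key point of the sketch is that the fixed version distinguishes NaNs that were in the target originally from NaNs introduced by the conversion step.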

closes #xxxx
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry


Update: np.isnan(target) was breaking the existing codebase. As a solution, I have enclosed this line in a try-except block.

debnathshoham · 2022-01-14T18:13:25Z

Hey, thanks for the PR!
Has this issue been reported? If so, could you please link to it? Otherwise, please consider creating an issue (for tracking, discussion, etc.).

Shashank-Shet · 2022-01-14T18:17:21Z

@debnathshoham this is actually my first PR. Could you tell me what needs to be done? I'll look into it right away.

debnathshoham · 2022-01-14T18:35:43Z

You can find a lot of open issues in the Issues tab (around 3.3k). It seems to me that you have identified an issue and want to submit a fix for it.
The first step should be to check whether someone else has already opened an issue for it and, if so, to point this PR to that issue (in closes #xxxx). If there is no such issue, please create a new one with a detailed description and reproducible examples.

jbrockmendel · 2022-01-14T19:22:41Z

@Shashank-Shet thanks for taking a look at this.

For any PR fixing a bug, there should always be a test for the fixed behavior. In this case it would go somewhere like tests/indexes/categorical/test_indexing.py

It looks like this broke some existing tests. That suggests that either the existing tests have a problem, or this fix isn't quite right.

If this is your first PR, I suggest looking for issues with the "Good first issue" label.

Shashank-Shet · 2022-01-15T13:04:34Z

@debnathshoham will do. I'll work on identifying whether this is a new issue and reporting if necessary.
@jbrockmendel yes, the fix has some flaws. I'll look into them as well. Worst case, I can at least let another person know what I've dug up.

Update: Replaced `np.isnan` with `pandas.core.dtypes.missing.isna`
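
One way `np.isnan` can break on index targets is that it only accepts numeric input, whereas pandas' `isna` also handles object arrays. The sketch below illustrates this with a made-up object-dtype array; it uses the public `pd.isna` rather than the internal `pandas.core.dtypes.missing.isna` import named above, and it is an illustration of the general difference, not a reproduction of the failures seen in this PR.

```python
import numpy as np
import pandas as pd

# Hypothetical object-dtype target, e.g. string labels with missing values.
target = np.array(["a", None, np.nan], dtype=object)

try:
    np.isnan(target)
    isnan_failed = False
except TypeError:
    # np.isnan is a numeric ufunc and does not support object arrays.
    isnan_failed = True

# pd.isna accepts object arrays and treats both None and NaN as missing.
mask = pd.isna(target)

print(isnan_failed)   # True
print(mask.tolist())  # [False, True, True]
```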

Shashank-Shet · 2022-01-21T09:17:00Z

@jbrockmendel I have made a small change to the commit. Now all but one of the checks are passing. Unfortunately, I cannot get much information about the failing test case itself; it appears to be a build error (timeout).

jbrockmendel · 2022-01-21T21:05:20Z

> Unfortunately, I cannot get much data regarding the testcase itself. Apparently, it seems to be a build error (timeout).

I'll take a look. The timeout is unrelated; it is affecting all PRs at the moment.

pandas/core/indexes/base.py

jbrockmendel · 2022-01-21T21:06:59Z

Please add test(s) for the behavior you are trying to fix.

jreback

This needs a test; please see @jbrockmendel's comments on where it should go. This is going to cause a large performance hit, so it would need careful consideration.

Shashank-Shet · 2022-01-23T12:55:16Z

@jbrockmendel I have added a test case (let me know if I need to add more). Also, the checks failing on this build seem to be related to a file I haven't changed (parquet.py). Some test cases which are failing are apparently due to a new test case added called test_unsupported_float16_cleanup. While I am not sure whether these tests are failing because of my commit, is it possible to find out if this is plaguing other PRs as well?
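
For reference, a regression test for this behaviour could look roughly like the sketch below. It is not the test added in this PR: the function name is hypothetical, it uses `np.testing.assert_array_equal` instead of pandas' internal testing helpers, and it assumes a pandas version that includes the fix (1.5.0 or later).

```python
import numpy as np
import pandas as pd


def test_get_indexer_unmatched_with_nans():
    # GH#45361: the index contains NaN, the first target does not.
    ci = pd.CategoricalIndex([1, 2, np.nan, 3])

    # 4 matches no category, so it must map to -1 rather than to the
    # position of NaN in the index.
    res = ci.get_indexer(pd.Index([2, 3, 4]))
    np.testing.assert_array_equal(res, np.array([1, 3, -1]))

    # A genuine NaN in the target must still be found at position 2.
    res_nan = ci.get_indexer([2, 3, 4, np.nan])
    np.testing.assert_array_equal(res_nan, np.array([1, 3, -1, 2]))


test_get_indexer_unmatched_with_nans()
```

Covering both a target without NaNs and one with a genuine NaN guards against a fix that simply maps every NaN to -1.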

pandas/core/indexes/base.py

jbrockmendel · 2022-01-23T20:31:54Z

> is it possible to find out if this is plaguing other PRs as well?

It is. You can ignore this failure for now.

pandas/tests/indexes/categorical/test_indexing.py

jbrockmendel · 2022-01-23T20:51:52Z

Long term solution is #37930

jbrockmendel · 2022-01-26T23:11:21Z

Looks good. Can you add a whatsnew note for 1.5.0?

Shashank-Shet · 2022-01-27T05:38:32Z

> Looks good. Can you add a whatsnew note for 1.5.0

I am not sure what this entails. May I know what I am required to add?

jbrockmendel · 2022-01-27T18:32:14Z

> > Looks good. Can you add a whatsnew note for 1.5.0
>
> I am not sure what this entails. May I know what I am required to add?

In doc/source/whatsnew/v1.5.0.rst there should be a section for Indexing bugs. Add an entry describing the bug this fixes. (use the existing entries to get an idea for how much to write)

Shashank-Shet · 2022-01-28T04:35:16Z

> > > Looks good. Can you add a whatsnew note for 1.5.0
> >
> > I am not sure what this entails. May I know what I am required to add?
>
> In doc/source/whatsnew/v1.5.0.rst there should be a section for Indexing bugs. Add an entry describing the bug this fixes. (use the existing entries to get an idea for how much to write)

Cool. I have made the changes. Thanks for the help!

jbrockmendel · 2022-01-28T16:29:32Z

thanks @Shashank-Shet

…andas-dev#45373)

* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)

  categorical_index_obj.get_indexer(target) yields incorrect results when categorical_index_obj contains NaNs, and target does not. The reason for this is that, if target contains elements which do not match any category in categorical_index_obj, they are replaced by NaNs. In such a situation, if categorical_index_obj also has NaNs, then the corresponding elements in target are mapped to an index which is not -1

  e.g.:
  ci = pd.CategoricalIndex([1, 2, np.nan, 3])
  other = pd.Index([2, 3, 4])
  ci.get_indexer(other)

  In the implementation of get_indexer, other becomes [2, 3, NaN] which is mapped to index 2, in ci

* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)

  categorical_index_obj.get_indexer(target) yields incorrect results when categorical_index_obj contains NaNs, and target does not. The reason for this is that, if target contains elements which do not match any category in categorical_index_obj, they are replaced by NaNs. In such a situation, if categorical_index_obj also has NaNs, then the corresponding elements in target are mapped to an index which is not -1

  e.g.:
  ci = pd.CategoricalIndex([1, 2, np.nan, 3])
  other = pd.Index([2, 3, 4])
  ci.get_indexer(other)

  In the implementation of get_indexer, other becomes [2, 3, NaN] which is mapped to index 2, in ci

  Update: np.isnan(target) was breaking the existing codebase. As a solution, I have enclosed this line in a try-except block

* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)

  Update: Replaced `np.isnan` with `pandas.core.dtypes.missing.isna`

* Added a testcase to verify output behaviour
* Made pre-commit changes
* Added a test case without NaNs
* Moved NaN test to avoid unnecessary execution
* Re-aligned test cases
* Removed try-except block
* Cleaned up base.py
* Add GH#45361 comment to code
* Added whatsnew entry
* Resolved merge conflict
* Moved whatsnew entry to indexing section

Shashank-Shet added 2 commits January 14, 2022 22:07

debnathshoham added the Categorical (Categorical Data Type) label Jan 14, 2022

debnathshoham added the Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) label Jan 15, 2022

BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)

6133f48

Update: Replaced `np.isnan` with `pandas.core.dtypes.missing.isna`

jbrockmendel reviewed Jan 21, 2022

pandas/core/indexes/base.py

jreback requested changes Jan 23, 2022

Shashank-Shet added 2 commits January 23, 2022 09:31

Added a testcase to verify output behaviour

0b75419

Made pre-commit changes

7d05154

jreback requested changes Jan 23, 2022

pandas/core/indexes/base.py (outdated)

jbrockmendel reviewed Jan 23, 2022

pandas/tests/indexes/categorical/test_indexing.py

jbrockmendel reviewed Jan 23, 2022

pandas/tests/indexes/categorical/test_indexing.py (outdated)

Shashank-Shet added 2 commits January 24, 2022 06:50

Added a test case without NaNs

826bb0f

Moved NaN test to avoid unnecessary execution

361d40c

jbrockmendel reviewed Jan 24, 2022

pandas/tests/indexes/categorical/test_indexing.py (outdated)

jbrockmendel reviewed Jan 24, 2022

pandas/core/indexes/base.py (outdated)

Shashank-Shet added 3 commits January 24, 2022 21:55

Re-aligned test cases

b6fcd44

Removed try-except block

6628e53

Cleaned up base.py

d6d7b0f

jbrockmendel reviewed Jan 26, 2022

pandas/core/indexes/base.py

Shashank-Shet added 4 commits January 28, 2022 09:20

Add GH#45361 comment to code

53f1ca1

Added whatsnew entry

99c9f4e

Resolved merge conflict

61a85d0

Moved whatsnew entry to indexing section

0be77e5

Merge branch 'main' into shashank-dev

6c31b87

jbrockmendel merged commit fcfd19f into pandas-dev:main Jan 28, 2022

Shashank-Shet deleted the shashank-dev branch January 28, 2022 17:17

jbrockmendel mentioned this pull request Feb 9, 2022

BUG: CategoricalIndex.get_indexer with #45361

Closed
