Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: CategoricalIndex.get_indexer issue with NaNs (#45361) #45373

Merged
merged 15 commits into from
Jan 28, 2022

Conversation

Shashank-Shet
Copy link
Contributor

@Shashank-Shet Shashank-Shet commented Jan 14, 2022

Bug Example:

ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])

res = ci.get_indexer(other)

>>> res
array([1, 3, 2])

ci.get_indexer(other) yields incorrect results when ci contains NaNs, and
other does not. The reason is that, if other contains elements that do not
match any category in ci, they are replaced by NaNs. In such a situation,
if ci also has NaNs, then the corresponding elements in ci are mapped to
an index that is not -1

eg:

ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])
ci.get_indexer(other)

In the implementation of get_indexer, other becomes [2, 3, NaN]
which is mapped to index 2, in ci

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

categorical_index_obj.get_indexer(target) yields incorrect results when
categorical_index_obj contains NaNs, and target does not. The reason
for this is that, if target contains elements which do not match any
category in categorical_index_obj, they are replaced by NaNs. In such
a situation, if categorical_index_obj also has NaNs, then the corresp
elements in target are mapped to an index which is not -1

eg:
ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])

ci.get_indexer(other)
In the implementation of get_indexer, other becomes [2, 3, NaN]
which is mapped to index 2, in ci
categorical_index_obj.get_indexer(target) yields incorrect results when
categorical_index_obj contains NaNs, and target does not. The reason
for this is that, if target contains elements which do not match any
category in categorical_index_obj, they are replaced by NaNs. In such
a situation, if categorical_index_obj also has NaNs, then the corresp
elements in target are mapped to an index which is not -1

eg:
ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])

ci.get_indexer(other)
In the implementation of get_indexer, other becomes [2, 3, NaN]
which is mapped to index 2, in ci

Update:
np.isnan(target) was breaking the existing codebase.
As a solution, I have enclosed this line in a try-except block
@debnathshoham
Copy link
Member

Hey, thanks for the PR!
Is this issue reported (if yes, could you please link to the issue), otherwise please consider creating an issue (for tracking, discussion, etc).

@debnathshoham debnathshoham added the Categorical Categorical Data Type label Jan 14, 2022
@Shashank-Shet
Copy link
Contributor Author

@debnathshoham this is actually my first PR. Could you tell me what needs to be done? I'll look into it right away.

@debnathshoham
Copy link
Member

So you can find a lot of open issues in the issues tab (around 3.3k). What it seems to me is you have identified an issue, and want to put a fix to that.
First step should be to check if there are any already open issue that someone else has opened, and point this PR to that particular issue (in closs #xxxx). If there is no such issue, please create a new issue with detailed description and reproducible examples.

@jbrockmendel
Copy link
Member

@Shashank-Shet thanks for taking a look at this.

For any PR fixing a bug, there should always be a test for the fixed behavior. In this case it would go somewhere like tests/indexes/categorical/tests_indexing.py

It looks like this broke some existing tests. That suggests that either the existing tests have a problem, or this fix isn't quite right.

If this is your first PR, I suggest looking for issues with the "Good first issue" label.

@debnathshoham debnathshoham added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Jan 15, 2022
@Shashank-Shet
Copy link
Contributor Author

@debnathshoham will do. I'll work on identifying whether this is a new issue and reporting if necessary.
@jbrockmendel yes, the fix has some flaws. I'll look into them as well. Worst case, I can at least let another person know what I've dug up.

Update: Replaced `np.isnan` with `pandas.core.dtypes.missing.isna`
@Shashank-Shet
Copy link
Contributor Author

@jbrockmendel I have made a small change to the commit. Now all but one check is passing. Unfortunately, I cannot get much data regarding the testcase itself. Apparently, it seems to be a build error (timeout).

@jbrockmendel
Copy link
Member

Unfortunately, I cannot get much data regarding the testcase itself. Apparently, it seems to be a build error (timeout).

ill take a look. the timeout is unrelated, is affecting all PRs ATM.

@jbrockmendel
Copy link
Member

Please add test(s) for the behavior you are trying to fix.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs a test and pls see @jbrockmendel comments on location, this is going to cause a large perf hit and so would need careful consideration

@Shashank-Shet
Copy link
Contributor Author

@jbrockmendel I have added a test case (let me know if I need to add more). Also, the checks failing on this build seem to be related to a file I haven't changed (parquet.py). Some test cases which are failing are apparently due to a new test case added called test_unsupported_float16_cleanup. While I am not sure whether these tests are failing because of my commit, is it possible to find out if this is plaguing other PRs as well?

@jbrockmendel
Copy link
Member

it possible to find out if this is plaguing other PRs as well?

it is. You can ignore this failure for now.

@jbrockmendel
Copy link
Member

Long term solution is #37930

@jbrockmendel
Copy link
Member

Looks good. Can you add a whatsnew note for 1.5.0

@Shashank-Shet
Copy link
Contributor Author

Looks good. Can you add a whatsnew note for 1.5.0

I am not sure what this entails. May I know what I am required to add?

@jbrockmendel
Copy link
Member

Looks good. Can you add a whatsnew note for 1.5.0

I am not sure what this entails. May I know what I am required to add?

In doc/source/whatsnew/v1.5.0.rst there should be a section for Indexing bugs. Add an entry describing the bug this fixes. (use the existing entries to get an idea for how much to write)

@Shashank-Shet
Copy link
Contributor Author

Shashank-Shet commented Jan 28, 2022

Looks good. Can you add a whatsnew note for 1.5.0

I am not sure what this entails. May I know what I am required to add?

In doc/source/whatsnew/v1.5.0.rst there should be a section for Indexing bugs. Add an entry describing the bug this fixes. (use the existing entries to get an idea for how much to write)

Cool. I have made the changes. Thanks for the help!

@jbrockmendel jbrockmendel merged commit fcfd19f into pandas-dev:main Jan 28, 2022
@jbrockmendel
Copy link
Member

thanks @Shashank-Shet

@Shashank-Shet Shashank-Shet deleted the shashank-dev branch January 28, 2022 17:17
phofl pushed a commit to phofl/pandas that referenced this pull request Feb 14, 2022
…andas-dev#45373)

* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)

categorical_index_obj.get_indexer(target) yields incorrect results when
categorical_index_obj contains NaNs, and target does not. The reason
for this is that, if target contains elements which do not match any
category in categorical_index_obj, they are replaced by NaNs. In such
a situation, if categorical_index_obj also has NaNs, then the corresp
elements in target are mapped to an index which is not -1

eg:
ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])

ci.get_indexer(other)
In the implementation of get_indexer, other becomes [2, 3, NaN]
which is mapped to index 2, in ci

* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)

categorical_index_obj.get_indexer(target) yields incorrect results when
categorical_index_obj contains NaNs, and target does not. The reason
for this is that, if target contains elements which do not match any
category in categorical_index_obj, they are replaced by NaNs. In such
a situation, if categorical_index_obj also has NaNs, then the corresp
elements in target are mapped to an index which is not -1

eg:
ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])

ci.get_indexer(other)
In the implementation of get_indexer, other becomes [2, 3, NaN]
which is mapped to index 2, in ci

Update:
np.isnan(target) was breaking the existing codebase.
As a solution, I have enclosed this line in a try-except block

* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)

Update: Replaced `np.isnan` with `pandas.core.dtypes.missing.isna`

* Added a testcase to verify output behaviour

* Made pre-commit changes

* Added a test case without NaNs

* Moved NaN test to avoid unnecessary execution

* Re-aligned test cases

* Removed try-except block

* Cleaned up base.py

* Add GH#45361 comment to code

* Added whatsnew entry

* Resolved merge conflict

* Moved whatsnew entry to indexing section
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
…andas-dev#45373)

* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)

categorical_index_obj.get_indexer(target) yields incorrect results when
categorical_index_obj contains NaNs, and target does not. The reason
for this is that, if target contains elements which do not match any
category in categorical_index_obj, they are replaced by NaNs. In such
a situation, if categorical_index_obj also has NaNs, then the corresp
elements in target are mapped to an index which is not -1

eg:
ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])

ci.get_indexer(other)
In the implementation of get_indexer, other becomes [2, 3, NaN]
which is mapped to index 2, in ci

* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)

categorical_index_obj.get_indexer(target) yields incorrect results when
categorical_index_obj contains NaNs, and target does not. The reason
for this is that, if target contains elements which do not match any
category in categorical_index_obj, they are replaced by NaNs. In such
a situation, if categorical_index_obj also has NaNs, then the corresp
elements in target are mapped to an index which is not -1

eg:
ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])

ci.get_indexer(other)
In the implementation of get_indexer, other becomes [2, 3, NaN]
which is mapped to index 2, in ci

Update:
np.isnan(target) was breaking the existing codebase.
As a solution, I have enclosed this line in a try-except block

* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)

Update: Replaced `np.isnan` with `pandas.core.dtypes.missing.isna`

* Added a testcase to verify output behaviour

* Made pre-commit changes

* Added a test case without NaNs

* Moved NaN test to avoid unnecessary execution

* Re-aligned test cases

* Removed try-except block

* Cleaned up base.py

* Add GH#45361 comment to code

* Added whatsnew entry

* Resolved merge conflict

* Moved whatsnew entry to indexing section
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants