BUG: CategoricalIndex.get_indexer issue with NaNs (#45361) #45373
categorical_index_obj.get_indexer(target) yields incorrect results when categorical_index_obj contains NaNs and target does not. The reason is that any elements of target that do not match a category in categorical_index_obj are replaced by NaNs. If categorical_index_obj also contains NaNs, those elements of target are then mapped to the position of the NaN instead of -1. For example:
ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])
ci.get_indexer(other)
In the implementation of get_indexer, other becomes [2, 3, NaN], so its last element is mapped to index 2 in ci.
Update: np.isnan(target) was breaking the existing codebase. As a solution, I have enclosed this line in a try-except block.
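The mechanism described above can be modelled in plain Python. This is only a simplified sketch, not pandas' actual implementation: the helper names `is_nan`, `lookup`, `naive_get_indexer` and `masked_get_indexer` are hypothetical and exist purely to show why coercing unmatched values to NaN produces a false hit, and how remembering the unmatched positions avoids it.

```python
import math


def is_nan(value):
    """Return True only for float NaN; other types are never NaN here."""
    return isinstance(value, float) and math.isnan(value)


def lookup(index_values, value):
    """Position of value in index_values, treating NaN as equal to NaN; -1 if absent."""
    for pos, item in enumerate(index_values):
        if item == value or (is_nan(item) and is_nan(value)):
            return pos
    return -1


def naive_get_indexer(index_values, categories, target):
    """Buggy model: unmatched targets are coerced to NaN before the lookup."""
    coerced = [t if t in categories else math.nan for t in target]
    return [lookup(index_values, v) for v in coerced]


def masked_get_indexer(index_values, categories, target):
    """Fixed model: targets that match no category (and are not NaN) get -1."""
    result = []
    for t in target:
        if t not in categories and not is_nan(t):
            result.append(-1)  # genuinely unmatched, never a NaN hit
        else:
            result.append(lookup(index_values, t))
    return result


index_values = [1, 2, math.nan, 3]
categories = [1, 2, 3]

print(naive_get_indexer(index_values, categories, [2, 3, 4]))   # [1, 3, 2]
print(masked_get_indexer(index_values, categories, [2, 3, 4]))  # [1, 3, -1]
print(masked_get_indexer(index_values, categories, [2, 3, 4, math.nan]))  # [1, 3, -1, 2]
```

In the naive model the value 4 is turned into NaN and then found at position 2, which is the behaviour reported here. The masked model still lets a real NaN in the target match the NaN in the index.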
Hey, thanks for the PR!
@debnathshoham this is actually my first PR. Could you tell me what needs to be done? I'll look into it right away.
So you can find a lot of open issues in the issues tab (around 3.3k). It seems to me that you have identified an issue and want to put in a fix for it.
@Shashank-Shet thanks for taking a look at this. For any PR fixing a bug, there should always be a test for the fixed behavior. In this case it would go somewhere like tests/indexes/categorical/tests_indexing.py. It looks like this broke some existing tests. That suggests that either the existing tests have a problem, or this fix isn't quite right. If this is your first PR, I suggest looking for issues with the "Good first issue" label.
@debnathshoham will do. I'll work on identifying whether this is a new issue and reporting if necessary.
Update: Replaced `np.isnan` with `pandas.core.dtypes.missing.isna`
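A likely reason for this swap, shown as a small sketch: `np.isnan` is a NumPy ufunc that raises `TypeError` on object-dtype arrays, while pandas' `isna` handles them element-wise. The sketch assumes that the public `pd.isna` behaves the same as `pandas.core.dtypes.missing.isna` for this input, and the array `mixed` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Object-dtype values, as a target of arbitrary labels might contain.
mixed = np.array(["a", None, np.nan], dtype=object)

try:
    np.isnan(mixed)
    isnan_failed = False
except TypeError as exc:
    # np.isnan only supports numeric dtypes
    isnan_failed = True
    print(f"np.isnan failed: {exc}")

# pd.isna returns a boolean mask and treats both None and NaN as missing.
mask = pd.isna(mixed)
print(mask)
```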
@jbrockmendel I have made a small change to the commit. Now all but one check is passing. Unfortunately, I cannot get much data regarding the test case itself. It seems to be a build error (timeout).
I'll take a look. The timeout is unrelated; it is affecting all PRs ATM.
Please add test(s) for the behavior you are trying to fix.
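A test for this behavior could look roughly like the sketch below. The function name `check_get_indexer_with_nans` is hypothetical and this is not necessarily the test that was added in the PR; it assumes a pandas version that includes the fix (1.5.0 or later) and uses only public pandas and NumPy calls.

```python
import numpy as np
import pandas as pd


def check_get_indexer_with_nans():
    # GH#45361: the index contains NaN
    ci = pd.CategoricalIndex([1, 2, np.nan, 3])

    # Target without NaN: the unmatched value 4 must map to -1, not to the NaN slot.
    res_no_nan = ci.get_indexer(pd.Index([2, 3, 4]))
    np.testing.assert_array_equal(res_no_nan, np.array([1, 3, -1]))

    # Target with NaN: a real NaN should still match the NaN at position 2.
    res_with_nan = ci.get_indexer(pd.Index([2, 3, 4, np.nan]))
    np.testing.assert_array_equal(res_with_nan, np.array([1, 3, -1, 2]))

    return res_no_nan, res_with_nan


res_no_nan, res_with_nan = check_get_indexer_with_nans()
```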
This needs a test, and please see @jbrockmendel's comments on location. This is going to cause a large perf hit and so would need careful consideration.
@jbrockmendel I have added a test case (let me know if I need to add more). Also, the checks failing on this build seem to be related to a file I haven't changed (parquet.py). Some test cases which are failing are apparently due to a new test case added called
It is. You can ignore this failure for now.
Long term solution is #37930
Looks good. Can you add a whatsnew note for 1.5.0?
I am not sure what this entails. May I know what I am required to add?
In doc/source/whatsnew/v1.5.0.rst there should be a section for Indexing bugs. Add an entry describing the bug this fixes. (Use the existing entries to get an idea of how much to write.)
Cool. I have made the changes. Thanks for the help!
thanks @Shashank-Shet
…andas-dev#45373)
* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)
  categorical_index_obj.get_indexer(target) yields incorrect results when categorical_index_obj contains NaNs, and target does not. The reason for this is that, if target contains elements which do not match any category in categorical_index_obj, they are replaced by NaNs. In such a situation, if categorical_index_obj also has NaNs, then the corresponding elements in target are mapped to an index which is not -1, e.g.:
  ci = pd.CategoricalIndex([1, 2, np.nan, 3])
  other = pd.Index([2, 3, 4])
  ci.get_indexer(other)
  In the implementation of get_indexer, other becomes [2, 3, NaN], which is mapped to index 2 in ci
* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)
  Update: np.isnan(target) was breaking the existing codebase. As a solution, I have enclosed this line in a try-except block
* BUG: CategoricalIndex.get_indexer issue with NaNs (pandas-dev#45361)
  Update: Replaced `np.isnan` with `pandas.core.dtypes.missing.isna`
* Added a testcase to verify output behaviour
* Made pre-commit changes
* Added a test case without NaNs
* Moved NaN test to avoid unnecessary execution
* Re-aligned test cases
* Removed try-except block
* Cleaned up base.py
* Add GH#45361 comment to code
* Added whatsnew entry
* Resolved merge conflict
* Moved whatsnew entry to indexing section
Bug Example:
ci.get_indexer(other) yields incorrect results when ci contains NaNs and other does not. The reason is that, if other contains elements that do not match any category in ci, they are replaced by NaNs. In such a situation, if ci also has NaNs, then the corresponding elements in other are mapped to an index that is not -1, e.g.:
ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])
ci.get_indexer(other)
In the implementation of get_indexer, other becomes [2, 3, NaN], so its last element is mapped to index 2 in ci.