BUG: Fix Categorical comparison with Series of dtype 'category' #16667
pandas/core/categorical.py
Outdated
try:
    return (self.categories.equals(other.categories) and
            self.ordered == other.ordered)
instead simply
other = Categorical(other)
pandas/tests/test_categorical.py
Outdated
@@ -152,6 +152,11 @@ def test_is_equal_dtype(self):
    CategoricalIndex(c1, categories=list('cab'))))
assert not c1.is_dtype_equal(CategoricalIndex(c1, ordered=True))

s1 = pd.Series(c1)
add the issue as a comment
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -130,6 +130,8 @@ Numeric
Categorical
^^^^^^^^^^^

- Bug in ``Categorical.is_dtype_equal()`` where comparison with Series whose dtype is 'category' is not handled correctly (:issue:`16659`)
comparison with a Series with dtype='category'.
remove the 'is not handled correctly' (that's what the Bug indicates).
pandas/core/categorical.py
Outdated
    other_categorical = other.values
else:
    other_categorical = other
other = Categorical(other)
you can always do this, no if needed (pass copy=False)
pandas/core/categorical.py
Outdated
from pandas.core.series import Series

if isinstance(other, Series):
    other = Categorical(other)
or even better just do this if not Categorical or CategoricalIndex
Thanks for your guidance.
It seems that the following code
if not isinstance(other, (Categorical, CategoricalIndex)):
    other = Categorical(other)
will cause the following test case to fail:
assert not c1.is_dtype_equal(Index(list('aabca')))
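One possible reading of why that test fails, sketched in plain Python rather than pandas: coercing any input unconditionally infers categories from its unique values, so a plain Index holding the same labels ends up with a matching dtype. The names `infer_categories` and `naive_is_dtype_equal` are hypothetical stand-ins, and the sketch assumes c1 is unordered with categories a, b, c.

```python
# Simplified model, NOT pandas internals: a "dtype" here is just
# (categories, ordered). All names below are hypothetical.

def infer_categories(values):
    """Mimic category inference: the sorted unique values."""
    return tuple(sorted(set(values)))


def naive_is_dtype_equal(self_dtype, other_values):
    """Coerce `other_values` unconditionally, then compare dtypes."""
    other_dtype = (infer_categories(other_values), False)
    return self_dtype == other_dtype


# Assumption: c1 in the test is unordered with categories a, b, c.
c1_dtype = (("a", "b", "c"), False)

# A plain (non-categorical) sequence with the same labels now compares
# equal, which is the opposite of what the test asserts.
plain_index_values = list("aabca")
result = naive_is_dtype_equal(c1_dtype, plain_index_values)
print(result)
```

In this model the coercion erases the difference between "is categorical" and "could be made categorical", which would explain why the check has to look at the input's type before converting it.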
# GH 16659
s1 = pd.Series(c1)
assert c1.is_dtype_equal(s1)
assert not c2.is_dtype_equal(s1)
can you add some tests for things that are not is_dtype_equal (I don't remember if these are sufficiently covered)
e.g. pass in scalars, ndarray, DataFrame
and cycle thru all Indexes (except CI)
can you update for this
Codecov Report
@@ Coverage Diff @@
## master #16667 +/- ##
==========================================
- Coverage 91.43% 91.01% -0.43%
==========================================
Files 163 161 -2
Lines 50091 49353 -738
==========================================
- Hits 45800 44917 -883
- Misses 4291 4436 +145
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -130,6 +130,8 @@ Numeric
Categorical
^^^^^^^^^^^

- Bug in ``Categorical.is_dtype_equal()`` where comparison with a Series with dtype='category' (:issue:`16659`)
- Bug in ``Categorical.is_dtype_equal()`` where comparison with a Series with `'category'` dtype incorrectly returned False (:issue:`16659`)
pandas/core/categorical.py
Outdated
try:
    from pandas.core.series import Series

    if isinstance(other, Series):
how about
if is_categorical(other):
    other = Categorical(other)
Thank you for the advice. However, this does not seem to work either.
It makes at least the following test case fail:
assert c3.is_dtype_equal(c3)
One reason is that the new instance created by Categorical(c3) does not preserve the ordered attribute of c3. I am not sure whether this is by design.
that doesn't sound right. can you show a self-contained repro
Ah, that makes sense I guess.
In [5]: pd.Categorical(pd.Categorical([0, 1], ordered=True))
Out[5]:
[0, 1]
Categories (2, int64): [0, 1]
so not ordered. The default for the constructor is ordered=False.
I would say just explicitly check for the known cases of Categorical, Series w/ category dtype, and CategoricalIndex, and just extract the categories and ordered off of them and build a new other = Categorical(categories, ordered=ordered).
Once I finish #16015 this will be much simpler (just use .dtype), but that won't be for a few weeks probably.
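A minimal sketch of that suggestion in plain Python, not pandas code: check only for the known categorical inputs and pull the categories and ordered flag off them. `FakeCategorical`, `FakeSeries`, `extract_dtype` and `is_dtype_equal` are hypothetical stand-ins, and the sketch compares (categories, ordered) pairs directly instead of building a new Categorical.

```python
# Simplified model, NOT pandas code. FakeCategorical stands in for
# Categorical/CategoricalIndex and FakeSeries for a Series (hypothetical).

class FakeCategorical:
    def __init__(self, categories, ordered=False):
        self.categories = tuple(categories)
        self.ordered = ordered


class FakeSeries:
    """Wraps arbitrary values; categorical only if it holds a FakeCategorical."""

    def __init__(self, values):
        self.values = values


def extract_dtype(obj):
    """Return (categories, ordered) for known categorical inputs, else None."""
    if isinstance(obj, FakeSeries):
        obj = obj.values  # unwrap, then fall through to the check below
    if isinstance(obj, FakeCategorical):
        return (obj.categories, obj.ordered)
    return None  # scalars, plain lists, non-categorical Series, ...


def is_dtype_equal(left, right):
    right_dtype = extract_dtype(right)
    return right_dtype is not None and extract_dtype(left) == right_dtype


c1 = FakeCategorical("abc")
c3 = FakeCategorical("cab", ordered=True)
checks = [
    is_dtype_equal(c1, FakeSeries(c1)),             # categorical Series, same dtype
    is_dtype_equal(c3, c3),                         # ordered flag is preserved
    is_dtype_equal(c1, FakeSeries(list("aabca"))),  # non-categorical Series
    is_dtype_equal(c1, c3),                         # different categories/order
]
print(checks)
```

Because nothing is re-constructed, the ordered flag cannot be lost along the way, and inputs that are not categorical never get coerced into looking equal.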
no this is a bug
could be fixed in his PR or independently before this PR
don't work around this pls put in place the correct patch
Doing Categorical(thing, ordered=False) and having it come out ordered would be very odd. If we do this, we would need to change the default to ordered=None.
yep I agree.
@jreback With what? That Categorical(thing, ordered=False) should return ordered=False? Or with changing the default to None?
I don't find this necessarily a bug. Personally I would find keeping ordered=False as the default much clearer.
the bug is this.
This should take the ordered flag from the passed Categorical/CategoricalIndex
In [2]: c = pd.Categorical(list('bac'), ordered=True)
In [3]: c
Out[3]:
[b, a, c]
Categories (3, object): [a < b < c]
In [4]: pd.Categorical(c)
Out[4]:
[b, a, c]
Categories (3, object): [a, b, c]
so it needs to default to ordered=None. If it's not specified it can take the attribute of a passed categorical (we do this already for categories); otherwise it will be False.
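The proposed default could be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not the pandas constructor: `resolve_ordered` and `FakeCategorical` are hypothetical names used only to show the three cases discussed in this thread.

```python
# Simplified model, NOT the pandas constructor. `resolve_ordered` is a
# hypothetical helper illustrating the proposed ordered=None default.

def resolve_ordered(values, ordered=None):
    """Pick the `ordered` flag for a new categorical built from `values`."""
    if ordered is not None:
        return bool(ordered)  # an explicit argument always wins
    # Not specified: inherit from the input if it carries the flag,
    # otherwise fall back to False.
    return bool(getattr(values, "ordered", False))


class FakeCategorical:
    """Hypothetical stand-in for an existing Categorical/CategoricalIndex."""

    def __init__(self, ordered):
        self.ordered = ordered


c = FakeCategorical(ordered=True)
inherited = resolve_ordered(c)                  # flag taken from c
overridden = resolve_ordered(c, ordered=False)  # explicit False is respected
from_list = resolve_ordered(list("bac"))        # plain values default to False
print(inherited, overridden, from_list)
```

With None as the sentinel, passing ordered=False explicitly still yields an unordered result, which addresses the concern raised earlier about Categorical(thing, ordered=False) coming out ordered.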
can you update
I thought an agreement had yet to be reached. So I will try to fix the two bugs mentioned above together in this PR.
can you rebase / update. #16667 (comment) could be done as a separate PR prior to this (or here).
Sorry for the late reply. So, by "rebase", do you mean that I can make a new branch from the current topic branch, fix #16667 (comment) on that new branch, and rebase its content onto master first? Something like,
Or should I create a separate PR to fix #16667 (comment) and have it merged into master first, then rebase the topic branch in this PR onto that commit (on master)? Thanks for your guidance.
Rebase means the following:
To do this, just do the following:
No new branch is needed, nor is there any need for a new PR.
So this is an option that @jreback presented to you in terms of addressing his comment. You are also welcome to address the issue directly in this PR (so you wouldn't need to create a new one for it).
No worries! We've had to wait MUCH longer than this for PRs before 😄
Force-pushed from 2581275 to 2fb416d.
Thanks for your detailed explanation. I was overthinking the rebase. So what I have done is as follows.
Update my upstream/master with
Switch to local master branch and update it
Switch back to local branch fix_bug_in_categorical_comparison
Rebase it to fresh master branch
Finally, push to origin/fix_bug_in_categorical_comparison with force
Since I am the only one working on this branch, I believe the force push here should not cause any trouble. Please let me know if this is not considered a good practice. Thank you.
@funnycrab : That should be just fine!
@gfyoung
can you rebase
Force-pushed from 2fb416d to 0697610.
Hello @funnycrab! Thanks for updating the PR.
Apologies for the conflicts @funnycrab. Let me know if you need help cleaning them up.
It is OK. Actually, it is my bad for keeping this issue open for such a long time. Hopefully, I can fix this in early October. I will let you know when clarification is needed. Thanks in advance!
if isinstance(other, Series):
    other = Categorical(other)

<<<<<<< a581a743fe6740011e4fb0a7031ee92ce57b480b
merge issues
can you rebase / update
This is a fix attempt for issue pandas-dev#16659.
…d tighten the wording in doc whatsnew
Force-pushed from 0697610 to 2771e57.
can you rebase
closing as stale, but if you want to continue working ping and we can reopen
This is a fix attempt for issue #16659.
git diff upstream/master --name-only -- '*.py' | flake8 --diff