BUG pd.NA not treated correctly in where and mask operations #53124
The problem is also present if we operate using a BooleanArray instead of a Series:
Can you fix this case also?
@topper-123 Thanks for your review. I have pushed a fix when using
Looks good, just a couple of changes.
@topper-123 I'm a little confused about the following case:
>>> df = pd.DataFrame([[1, pd.NA], [pd.NA, 2]], dtype=pd.Int64Dtype())
>>> df
      0     1
0     1  <NA>
1  <NA>     2
>>> df.mask(df[0] % 2 == 1, 0)
      0  1
0     0  0
1  <NA>  2
Is this really the desired behavior? Here the case is:
>>> df[0] % 2 == 1
0    True
1    <NA>
The first row has
Thank you very much! (PS: I will assume that the above is the correct behavior for now.)
The new behavior looks correct to me:
I think rewording could make it clearer; it would be good if you could update it a bit.
pandas/core/generic.py
@@ -9869,6 +9869,8 @@ def _where(
         # align the cond to same shape as myself
         cond = common.apply_if_callable(cond, self)
         if isinstance(cond, NDFrame):
+            # GH #52955: if cond is NA, element propagates in mask and where
+            cond = cond.fillna(True)
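For readers following this hunk, the effect of filling NA in cond with True can be approximated from user code. This is only a sketch built from public pandas calls, not the PR's code path; the example Series is made up, and treating mask as where on the inverted condition is an assumption of the sketch.

```python
import pandas as pd

# Example data (an assumption for illustration, not taken from the PR).
ser = pd.Series([1, 2, pd.NA], dtype="Int64")
cond = ser % 2 == 1  # nullable boolean: True, False, <NA>

# where: NA in cond is filled with True, so the original entry is kept.
where_keep = cond.fillna(True).astype(bool)
where_result = ser.where(where_keep, 0)

# mask: emulated here as where on the inverted condition (sketch assumption).
# Inverting <NA> still gives <NA>, so the same fill keeps the entry here too.
mask_keep = (~cond).fillna(True).astype(bool)
mask_result = ser.where(mask_keep, 0)

print(where_result.tolist())
print(mask_result.tolist())
```

In both results the last entry, whose condition is NA, passes through unchanged, which is what "propagates in mask and where" in the hunk's comment refers to.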
has the option of just raising on NAs been discussed? seems ambiguous and a general PITA.
If you mean raising in where and mask, no, we haven't discussed that yet. If you mean raising in _where, I don't think that is desirable, because the following would then stop working:
>>> df = pd.DataFrame(np.random.random((3, 3)), dtype=pd.Float64Dtype())
>>> df.loc[0, 0] = pd.NA
>>> df
          0         1         2
0      <NA>  0.609241  0.419094
1  0.274784  0.342904  0.026101
2  0.670259  0.218889  0.177126
>>> df[df >= 0.5] = 0  # would raise if _where raised on NA, which I assume is undesired
>>> df
          0         1         2
0      <NA>       0.0  0.419094
1  0.274784  0.342904  0.026101
2       0.0  0.218889  0.177126
i would just have that raise too, yes.
@jbrockmendel I think the above code snippet actually works for versions v2.0.x. Do we really want to change its behavior? @topper-123 I think we may need further discussion about the desired behavior of _where, i.e., propagate or raise. I will postpone the rewording mentioned in #53124 (comment) until maintainers reach an agreement.
IMO we should accept BooleanArrays (and Series/DataFrame containing BooleanArrays/ArrowArray[bool]) as the condition here. I think it would be surprising if those data structures worked in loc and not here.
Does similar functionality raise in any other methods? I don't recall any.
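As a point of comparison for the loc remark, here is a minimal sketch of indexing with a nullable boolean mask. The data is made up for illustration; the behavior shown (NA in the mask counted as False) is the one described in the 1.0.2 release notes referenced later in this thread.

```python
import pandas as pd

# Example data (an assumption for illustration).
ser = pd.Series([1, 2, 3, 4])
mask = pd.array([True, True, False, pd.NA], dtype="boolean")

# Indexing accepts the nullable mask; the NA position is not selected.
selected_loc = ser.loc[mask]
selected_getitem = ser[mask]

print(selected_loc.tolist())
print(selected_getitem.tolist())
```

Only the first two entries are selected, so an NA in the mask acts like False for selection, unlike the propagate behavior proposed for where and mask in this PR.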
Hi @jbrockmendel any updates on this?
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
I'm still interested in working on this, but maintainers have not reached an agreement yet.
@MarcoGorelli @phofl @mroeschke what would you expect from each of
Makes sense to raise to me
Sure, I will make the change soon.
Pretty sure that we changed this to be used as False before 2.0 came out, that's a bit annoying
Sorry for the late follow-up @jbrockmendel @mroeschke. I have made the suggested changes: now
There seems to be a lot more to do, since, as @phofl has also mentioned, NA has been treated as False in nullable boolean arrays since 1.0.2. There will be more code to change (updating error messages and updating tests), but I just want to make sure I'm on the right track. (See also #31591 and What's new 1.0.2)
So at the sprint we decided on a long-term plan where pd.NA would be treated as False in these cases. I'm not sure if there is a plan for how to get there. Apologies for the indecisiveness.
Thanks for the PR, but it appears this probably needs more discussion on the issue before proceeding with a solution here. Closing for now, but happy for you to engage in the discussion there.
doc/source/whatsnew/v2.1.0.rst file
Suppose we have
In the above example, if we use a condition such as ser % 2 == 1, then there will be pd.NA in cond. I'm not sure which of the following would be the desired behavior: (1) an entry propagates through both where and mask (except for some really special cases) if cond evaluates to pd.NA, (2) we raise an error if the cond of any entry evaluates to pd.NA (in other words, users should fillna themselves in advance), or (3) we provide an additional keyword for users to specify how they want to treat entries for which cond evaluates to pd.NA. (Or maybe none of the above is the desired behavior, I'm not sure about that.)
This PR currently implements the first approach. Please let me know if maintainers prefer another approach.
To be more specific
(1)
(2)
I don't think this is the right way to go. This can affect the behavior of the following:
(3)
Provide a new keyword that defaults to True.