ENH: Modify .any() to return Null if all values are null #41967

Closed
rapatel0 opened this issue Jun 12, 2021 · 5 comments
Labels
Enhancement; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); Reduction Operations (sum, mean, min, max, etc.)
Comments

@rapatel0

rapatel0 commented Jun 12, 2021

Is your feature request related to a problem?

For sparse data it is particularly helpful to preserve null values when doing reduction operations. I end up using the wrapper function below to achieve the intended result.

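import pandas as pd

def any_nan(x):
    # Preserve the null when every value is missing
    if pd.isnull(x).all():
        return pd.NA
    # Otherwise reduce over the non-null values
    return x.any(skipna=True)

output = df.column_to_operate_on.apply(any_nan)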

Describe the solution you'd like

The ask is to add a configuration parameter to .any() and .all() that optionally preserves null values in the result. skipna doesn't seem to have the correct functionality, as it returns False if all the values of the array are null. Ideally, with the toggleable parameter, .any() would return np.nan/pd.NA if all values in the array are null.
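To make the request concrete, here is a minimal sketch of the current behaviour next to the requested one. The helper name `any_preserve_null` is hypothetical (it is not a pandas API), and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd


def any_preserve_null(values: pd.Series):
    """Hypothetical helper: like Series.any(), but NA when every value is null."""
    if values.isna().all():
        return pd.NA
    return bool(values.any(skipna=True))


all_null = pd.Series([np.nan, np.nan])     # float64, every value missing
some_true = pd.Series([np.nan, 1.0, 0.0])  # one truthy value among the nulls

# Current behaviour: with skipna=True the nulls are ignored, so nothing
# truthy is found and the reduction returns False.
current = all_null.any(skipna=True)

# Requested behaviour: the all-null case is reported as NA instead.
requested = any_preserve_null(all_null)
mixed = any_preserve_null(some_true)
```

The sketch only covers the all-null case described above; how a built-in parameter would be named or interact with skipna is left open.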

API breaking implications

This shouldn't change the API. It will just add an optional parameter.

Describe alternatives you've considered

pd.NA

@rapatel0 rapatel0 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 12, 2021
@mzeitlin11
Member

Thanks for the request @rapatel0! Does https://pandas.pydata.org/pandas-docs/stable/user_guide/boolean.html satisfy what you are looking for? For the new pandas nullable data types, the behavior will be as you ask, e.g.

>>> ser = pd.Series([pd.NA, pd.NA], dtype="boolean")
>>> ser.any(skipna=False)
<NA>
>>> ser.all(skipna=False)
<NA>

I think there are still some inconsistencies with this behavior, though, for example

>>> ser = pd.Series([pd.NA, pd.NA], dtype="Int64")
>>> ser.all(skipna=False)
True

should be pd.NA I think.
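As a sketch of what the linked user guide describes: the nullable boolean dtype follows Kleene (three-valued) logic, so with skipna=False the result is NA only when a missing value could still change the answer. The values below are made up for illustration.

```python
import pandas as pd

with_true = pd.Series([True, pd.NA], dtype="boolean")
with_false = pd.Series([False, pd.NA], dtype="boolean")

# A True is present, so the NA cannot change the outcome of any().
any_with_true = with_true.any(skipna=False)

# The NA could be True or False, so the outcome of any() is unknown.
any_with_false = with_false.any(skipna=False)

# A False is present, so all() is already settled.
all_with_false = with_false.all(skipna=False)

# The NA could be False, so the outcome of all() is unknown.
all_with_true = with_true.all(skipna=False)
```

This means skipna=False on a masked dtype does not return NA for every input that contains a null, only for those where the null matters.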

@mzeitlin11 mzeitlin11 added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reduction Operations sum, mean, min, max, etc. and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 12, 2021
@rapatel0
Author

rapatel0 commented Jun 16, 2021

The functionality works in your test case, but I'm not seeing it when wrapped in a groupby to operate on a dataframe. The pd.NAs are filled in with False. Am I doing something wrong?

Note: pd.__version__ is 1.2.3

image

@rapatel0
Author

rapatel0 commented Jun 16, 2021

Running a simple groupby ahead of the Series reduction suggests that either some masking logic or type promotion is failing.


[53]: df.groupby(['a','b']).Positive.any(skipna=True)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-53-5757a49c0321> in <module>
----> 1 df.groupby(['a','b']).Positive.any(skipna=True)

/opt/conda/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in any(self, skipna)
   1407             is True within its respective group, False otherwise.
   1408         """
-> 1409         return self._bool_agg("any", skipna)
   1410 
   1411     @final

/opt/conda/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in _bool_agg(self, val_test, skipna)
   1376             return result.astype(inference, copy=False)
   1377 
-> 1378         return self._get_cythonized_result(
   1379             "group_any_all",
   1380             aggregate=True,

/opt/conda/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in _get_cythonized_result(self, how, cython_dtype, aggregate, numeric_only, needs_counts, needs_values, needs_2d, min_count, needs_mask, needs_ngroups, result_is_index, pre_processing, post_processing, **kwargs)
   2681         # error_msg is "" on an frame/series with no rows or columns
   2682         if not output and error_msg != "":
-> 2683             raise TypeError(error_msg)
   2684 
   2685         if aggregate:

TypeError: boolean value of NA is ambiguous

@mzeitlin11
Member

mzeitlin11 commented Jun 16, 2021

When you ran df.a.fillna(pd.NA), the dtype of a became object. Rather than doing that, you can convert to (or start with) one of the new masked types using astype. That failure on object dtype is a known bug (#37501), but the operation works fine for masked types.

(EDIT: what I said above is only valid for version 1.3; before that there were other issues)
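A minimal sketch of the suggested conversion, assuming pandas 1.3 or later. The column names `a` and `Positive` are taken from the traceback above; the data itself is made up, since the original dataframe is not shown.

```python
import pandas as pd

# Object-dtype column holding booleans and pd.NA, similar to what
# fillna(pd.NA) leaves behind. The values are illustrative only.
df = pd.DataFrame(
    {
        "a": ["x", "x", "y", "y"],
        "Positive": pd.Series([True, pd.NA, pd.NA, pd.NA], dtype="object"),
    }
)

# Convert to the masked (nullable) boolean dtype before grouping.
df["Positive"] = df["Positive"].astype("boolean")

grouped = df.groupby("a")["Positive"]

# skipna=True ignores the nulls, so the all-null group "y" yields False.
skipped = grouped.any(skipna=True)

# skipna=False keeps the nulls, so the all-null group "y" yields NA,
# while group "x" is still True because a True value is present.
kept = grouped.any(skipna=False)
```

With the masked dtype, both results keep the boolean dtype rather than falling back to object.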

@jreback jreback added this to the 1.4 milestone Sep 9, 2021
@mzeitlin11
Member

Closing for now, but please ping to reopen if you think there's any other weird behavior here @rapatel0
