Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby any/all raising with pd.NA object data #42085

Merged
merged 4 commits into from
Jun 21, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -268,6 +268,7 @@ Other enhancements
- :meth:`read_csv` and :meth:`read_json` expose the argument ``encoding_errors`` to control how encoding errors are handled (:issue:`39450`)
- :meth:`.GroupBy.any` and :meth:`.GroupBy.all` use Kleene logic with nullable data types (:issue:`37506`)
- :meth:`.GroupBy.any` and :meth:`.GroupBy.all` return a ``BooleanDtype`` for columns with nullable data types (:issue:`33449`)
- :meth:`.GroupBy.any` and :meth:`.GroupBy.all` raising with ``object`` data containing ``pd.NA`` even when ``skipna=True`` (:issue:`37501`)
- :meth:`.GroupBy.rank` now supports object-dtype data (:issue:`38278`)
- Constructing a :class:`DataFrame` or :class:`Series` with the ``data`` argument being a Python iterable that is *not* a NumPy ``ndarray`` consisting of NumPy scalars will now result in a dtype with a precision the maximum of the NumPy scalars; this was already the case when ``data`` is a NumPy ``ndarray`` (:issue:`40908`)
- Add keyword ``sort`` to :func:`pivot_table` to allow non-sorting of the result (:issue:`39143`)
Expand Down
6 changes: 5 additions & 1 deletion pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -1519,7 +1519,11 @@ def _bool_agg(self, val_test, skipna):

def objs_to_bool(vals: ArrayLike) -> tuple[np.ndarray, type]:
if is_object_dtype(vals):
vals = np.array([bool(x) for x in vals])
# GH#37501: don't raise on pd.NA when skipna=True
if skipna:
vals = np.array([bool(x) if not isna(x) else True for x in vals])
else:
vals = np.array([bool(x) for x in vals])
elif isinstance(vals, BaseMaskedArray):
vals = vals._data.astype(bool, copy=False)
else:
Expand Down
26 changes: 26 additions & 0 deletions pandas/tests/groupby/test_any_all.py
Original file line number Diff line number Diff line change
Expand Up @@ -152,3 +152,29 @@ def test_masked_bool_aggs_skipna(bool_agg_func, dtype, skipna, frame_or_series):

result = obj.groupby([1, 1]).agg(bool_agg_func, skipna=skipna)
tm.assert_equal(result, expected)


@pytest.mark.parametrize(
"bool_agg_func,data,expected_res",
[
("any", [pd.NA, np.nan], False),
("any", [pd.NA, 1, np.nan], True),
("all", [pd.NA, pd.NaT], True),
("all", [pd.NA, False, pd.NaT], False),
],
)
def test_object_type_missing_vals(bool_agg_func, data, expected_res, frame_or_series):
# GH#37501
obj = frame_or_series(data, dtype=object)
result = obj.groupby([1] * len(data)).agg(bool_agg_func)
expected = frame_or_series([expected_res], index=[1], dtype="bool")
tm.assert_equal(result, expected)


@pytest.mark.filterwarnings("ignore:Dropping invalid columns:FutureWarning")
@pytest.mark.parametrize("bool_agg_func", ["any", "all"])
def test_object_NA_raises_with_skipna_false(bool_agg_func):
# GH#37501
ser = Series([pd.NA], dtype=object)
with pytest.raises(TypeError, match="boolean value of NA is ambiguous"):
ser.groupby([1]).agg(bool_agg_func, skipna=False)