Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix SeriesGroupBy.quantile for nullable integers #33138

Merged
merged 13 commits into from
Apr 7, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -445,6 +445,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupby.transform` produces incorrect result with transformation functions (:issue:`30918`)
- Bug in :meth:`GroupBy.count` causes segmentation fault when grouped-by column contains NaNs (:issue:`32841`)
- Bug in :meth:`DataFrame.groupby` and :meth:`Series.groupby` produces inconsistent type when aggregating Boolean series (:issue:`32894`)
- Bug in :meth:`SeriesGroupBy.quantile` raising on nullable integers (:issue:`33136`)
- Bug in :meth:`SeriesGroupBy.first`, :meth:`SeriesGroupBy.last`, :meth:`SeriesGroupBy.min`, and :meth:`SeriesGroupBy.max` returning floats when applied to nullable Booleans (:issue:`33071`)
- Bug in :meth:`DataFrameGroupBy.agg` with dictionary input losing ``ExtensionArray`` dtypes (:issue:`32194`)
- Bug in :meth:`DataFrame.resample` where an ``AmbiguousTimeError`` would be raised when the resulting timezone aware :class:`DatetimeIndex` had a DST transition at midnight (:issue:`25758`)
Expand Down
10 changes: 8 additions & 2 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,9 @@ class providing the base-class of operations.
from pandas.core.dtypes.cast import maybe_cast_result
from pandas.core.dtypes.common import (
ensure_float,
is_bool_dtype,
is_datetime64_dtype,
is_extension_array_dtype,
is_integer_dtype,
is_numeric_dtype,
is_object_dtype,
Expand Down Expand Up @@ -1867,9 +1869,13 @@ def pre_processor(vals: np.ndarray) -> Tuple[np.ndarray, Optional[Type]]:
)

inference = None
if is_integer_dtype(vals):
if is_integer_dtype(vals.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine for here, but I think we need a generic 'convert_this_to_something_we_can_use_in_cython)` method on EA. @jbrockmendel @jorisvandenbossche @TomAugspurger

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed here, i think the EA.convert_this_to_something_we_can_use_in_cython is going to look something like _ordinal_values, at least for methods like quantile that are ordering-based

if is_extension_array_dtype(vals.dtype):
vals = vals.to_numpy(dtype=float, na_value=np.nan)
inference = np.int64
elif is_datetime64_dtype(vals):
elif is_bool_dtype(vals.dtype) and is_extension_array_dtype(vals.dtype):
vals = vals.to_numpy(dtype=float, na_value=np.nan)
elif is_datetime64_dtype(vals.dtype):
inference = "datetime64[ns]"
vals = np.asarray(vals).astype(np.float)

Expand Down
24 changes: 24 additions & 0 deletions pandas/tests/groupby/test_function.py
Original file line number Diff line number Diff line change
Expand Up @@ -1519,6 +1519,30 @@ def test_quantile_missing_group_values_correct_results():
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(
"values",
[
pd.array([1, 0, None] * 2, dtype="Int64"),
pd.array([True, False, None] * 2, dtype="boolean"),
],
)
@pytest.mark.parametrize("q", [0.5, [0.0, 0.5, 1.0]])
def test_groupby_quantile_nullable_array(values, q):
# https://github.com/pandas-dev/pandas/issues/33136
df = pd.DataFrame({"a": ["x"] * 3 + ["y"] * 3, "b": values})
result = df.groupby("a")["b"].quantile(q)

if isinstance(q, list):
idx = pd.MultiIndex.from_product((["x", "y"], q), names=["a", None])
true_quantiles = [0.0, 0.5, 1.0]
else:
idx = pd.Index(["x", "y"], name="a")
true_quantiles = [0.5]

expected = pd.Series(true_quantiles * 2, index=idx, name="b")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracking down #51424 it looks like support for nullable-bool was added here, but it seems weird that we'd return Float64 for that. only discussion is see is a comment from @jreback saying it would need a separate test. is this intentional?

tm.assert_series_equal(result, expected)


# pipe
# --------------------------------

Expand Down