ENH: improve the resulting dtype for groupby operations on nullable dtypes #37494

jorisvandenbossche · 2020-10-29T19:29:51Z

Follow-up on #37433, and partly related to #37493

Currently, after groupby operations we try to cast back to the original dtype when possible (at least in case of extension arrays). But this is not always correct, and also not done consistently. Some examples using the test case from the mentioned PR using a nullable Int64 column as input:

In [1]: df = pd.DataFrame(
   ...:     {
   ...:         "A": ["A", "B"] * 5,
   ...:         "B": pd.array([1, 2, 3, 4, 5, 6, 7, 8, 9, pd.NA], dtype="Int64"),
   ...:     }
   ...: )

In [2]: df.groupby("A")["B"].sum()
Out[2]: 
A
A    25
B    20
Name: B, dtype: Int64

In [3]: df.groupby("A")["B"].std()
Out[3]: 
A
A    3.162278
B    2.581989
Name: B, dtype: float64

In [4]: df.groupby("A")["B"].mean()
Out[4]: 
A
A    5
B    5
Name: B, dtype: Int64

In [5]: df.groupby("A")["B"].count()
Out[5]: 
A
A    5
B    4
Name: B, dtype: int64

So some observations:

For sum(), we correctly have Int64 for the result
For std(), we could use the nullable Float64 instead of float64 dtype
For mean(), we incorrectly cast back to Int64 dtype, as the result of mean should always be floating (in this case the casting just happened to work because the means were whole numbers)
For count(), we did not create a nullable Int64 dtype for the result, although this could be done if the input is nullable
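
To illustrate the mean() observation, here is a minimal pure-Python sketch of why a value-based cast-back is fragile. The helper names (group_means, naive_cast_back) are hypothetical and are not pandas internals; they only mimic the idea of "cast back to the integer dtype whenever the values allow it".

```python
def group_means(keys, values):
    """Per-group mean as a float, skipping missing values (None stands in for pd.NA)."""
    groups = {}
    for key, value in zip(keys, values):
        if value is not None:
            groups.setdefault(key, []).append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def naive_cast_back(values):
    """Hypothetical value-based cast-back: use an integer dtype only if every value is integral."""
    if all(float(v).is_integer() for v in values):
        return [int(v) for v in values], "Int64"
    return list(values), "Float64"


keys = ["A", "B"] * 5

# Data from the example above: both group means are whole numbers.
means = group_means(keys, [1, 2, 3, 4, 5, 6, 7, 8, 9, None])
print(means, naive_cast_back(list(means.values()))[1])

# Change a single input value and the "dtype" of the result flips.
means2 = group_means(keys, [2, 2, 3, 4, 5, 6, 7, 8, 9, None])
print(means2, naive_cast_back(list(means2.values()))[1])
```

With this kind of logic the result dtype depends on the data rather than on the operation, which is the inconsistency described above.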


TomAugspurger · 2020-10-29T19:56:43Z

This feels similar to #31359, right? I guess that was (just?) for UDFs. In this case things should be much easier since each dtype should exactly know the right output dtype for each method, right?

jorisvandenbossche · 2020-10-30T09:19:00Z

That PR does a try_cast_to_ea, which we introduced to try to preserve the extension dtype. But it is exactly that usage that can also give wrong results for known operations (like mean should not preserve int dtype, but give float). Indeed, as you say, for the built-in methods we know exactly the expected output dtype, and so we can have more specific / correct logic for this.
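
As a rough sketch of what "more specific logic" for built-in methods could look like, the result dtype can be derived from the operation name instead of from the result values. The function and table below are hypothetical illustrations, not pandas code, and they only cover the four operations and the dtypes named in the original post.

```python
# Hypothetical lookup: which built-in reductions keep the input dtype,
# which always produce a floating result, and which always produce a count.
DTYPE_PRESERVING_OPS = {"sum"}
FLOAT_RESULT_OPS = {"mean", "std"}
COUNT_OPS = {"count"}


def result_dtype(op: str, input_dtype: str) -> str:
    """Return the nullable result dtype for a known groupby op (illustrative only)."""
    if op in DTYPE_PRESERVING_OPS:
        return input_dtype
    if op in FLOAT_RESULT_OPS:
        # Simplification: always Float64, regardless of the input width.
        return "Float64"
    if op in COUNT_OPS:
        return "Int64"
    raise ValueError(f"no known result dtype for op {op!r}")


for op in ("sum", "std", "mean", "count"):
    print(op, "->", result_dtype(op, "Int64"))
```

Because nothing here inspects the computed values, the outcome is the same for every input, which is the property the value-based cast-back lacks.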

jbrockmendel · 2020-12-01T20:53:20Z

TL;DR: there is one usage of maybe_cast_to_extension_array that is liable to produce incorrect results on UDFs, but the rest we can make reliable.

I've spent some time the last few days tracking down the EA-casting done in the groupby code. Most of them are done within calls to dtypes.cast.maybe_cast_result, which unfortunately special-cases dt64tz and categorical. In total we have:

1. a try/except for _from_sequence in DataFrameGroupBy._cython_agg_blocks
   - pinned down by REF: avoid try/except in wrapping in cython_agg_blocks #38164
2. maybe_cast_result and DTA/PA constructor call in BaseGrouper._cython_operation
   - refactored to isolate EA casting/construction in REF: implement _ea_wrap_cython_operation #38162
   - incorrect casting for dt64tz/period fixed in BUG: groupby.rank for dt64tz, period dtypes #38187
3. one maybe_cast_result call in BaseGroupBy._cython_transform
   - can be replaced with numpy-only casting consolidated into one place at the end of BaseGrouper._cython_operation (no PR yet)
4. two maybe_cast_result calls in BaseGroupBy._cython_agg_general
   - included in the fix for 3)
5. two maybe_cast_result calls in BaseGroupBy._python_agg_general
   - This can be reduced to two calls to numpy-only maybe_downcast_to_dtype and one maybe_cast_to_ea/equivalent at the end of BaseGrouper._aggregate_series_pure_python.

It's this last one in _aggregate_series_pure_python that is the problem child. It is needed for 12 tests that use DecimalArray, 8 of them of the form:

from pandas import DataFrame, Series
import pandas._testing as tm
from pandas.tests.extension.decimal.array import DecimalArray, make_data, to_decimal

data = make_data()[:5]
df = DataFrame(
    {"id1": [0, 0, 0, 1, 1], "id2": [0, 1, 0, 1, 1], "decimals": DecimalArray(data)}
)

expected = Series(to_decimal([data[0], data[3]]))

err_cls = RuntimeError  # parametrized over a bunch of Exception subclasses

def weird_func(x):
    # weird function that raises something other than TypeError or IndexError
    #  in _python_agg_general
    if len(x) == 0:
        raise err_cls
    return x.iloc[0]

result = df["decimals"].groupby(df["id1"]).agg(weird_func)
tm.assert_series_equal(result, expected, check_names=False)

Without the attempted EA construction we would end up with an object ndarray of Decimal objects.

The catch is that this assumes that _from_sequence is sufficiently strict (xref #33254) (which DecimalArray._from_sequence is). We are currently avoiding getting incorrect results for dt64tz cases because of a kludge in maybe_cast_result to explicitly exclude dt64tz.

Until _from_sequence is sufficiently strict, I'm inclined to break the Decimal casting (to return ndarray[object] of Decimals) rather than risk silently returning incorrect results for arbitrary EAs.
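
To make the strictness concern concrete, here is a small standalone sketch using the stdlib Decimal type. The three functions are hypothetical stand-ins (not the ExtensionArray API): one lenient and one strict constructor, plus a cast-back helper that tries the constructor and falls back to plain objects on failure.

```python
from decimal import Decimal


def lenient_from_sequence(scalars):
    """Coerces anything Decimal() accepts, including plain ints."""
    return [Decimal(x) for x in scalars]


def strict_from_sequence(scalars):
    """Accepts only values that already are Decimal instances."""
    for x in scalars:
        if not isinstance(x, Decimal):
            raise TypeError(f"expected Decimal, got {type(x).__name__}")
    return list(scalars)


def cast_back(results, from_sequence):
    """Try to rebuild the original 'dtype'; fall back to plain objects on failure."""
    try:
        return from_sequence(results)
    except (TypeError, ValueError):
        return list(results)


groups = {0: [Decimal("1.5"), Decimal("2.5"), Decimal("3.0")], 1: [Decimal("4.0")]}

# A UDF that returns something of a different kind than the input: group sizes.
sizes = [len(vals) for vals in groups.values()]

lenient_result = cast_back(sizes, lenient_from_sequence)  # ints silently become Decimals
strict_result = cast_back(sizes, strict_from_sequence)    # ints are left alone

# A UDF that returns elements of the input keeps its type under the strict constructor.
firsts = cast_back([vals[0] for vals in groups.values()], strict_from_sequence)

print([type(x).__name__ for x in lenient_result])
print([type(x).__name__ for x in strict_result])
print([type(x).__name__ for x in firsts])
```

In this sketch the lenient constructor changes the type of the UDF's output without any error, while the strict one only casts back when the results already belong to the dtype.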

jorisvandenbossche · 2020-12-04T16:10:55Z

@jbrockmendel thanks for the overview!

> Until _from_sequence is sufficiently strict, I'm inclined to break the Decimal casting (to return ndarray[object] of Decimals) rather than risk silently returning incorrect results for arbitrary EAs.

AFAIK this doesn't only break decimal, but any EA with a user-defined function in groupby?

Copying your comment from https://github.com/pandas-dev/pandas/pull/38164/files#r536193430:

> [about casting back in general] i am trying to get rid of that altogether, and only do cast-back in cases where we know it is correct. If/when _from_sequence is reliably strict we can reconsider.

Considering for a moment the _from_sequence issue is solved: are you then OK with this "try to cast back" for generic ops?

jbrockmendel · 2020-12-04T17:44:04Z

> Considering for a moment the _from_sequence issue is solved: are you then OK with this "try to cast back" for generic ops?

Once _from_sequence is reliably strict, I wouldn't have a serious problem with the "try to cast back" approach.

jbrockmendel · 2023-07-27T21:51:53Z

The examples in the OP all now cast to Masked dtypes. Closable?

jorisvandenbossche added the Groupby, ExtensionArray, and NA - MaskedArrays labels Oct 29, 2020

jorisvandenbossche mentioned this issue Oct 29, 2020

REGR: fix groupby std() with nullable dtypes #37433

Merged

arw2019 mentioned this issue Oct 29, 2020

BUG: Groupby any/all raises TypeError for pd.NA #37501

Closed

3 tasks

jbrockmendel mentioned this issue Dec 2, 2020

BUG: groupby.rank for dt64tz, period dtypes #38187

Merged

5 tasks

jorisvandenbossche mentioned this issue Dec 4, 2020

BUG: preserve categorical & sparse types when grouping / pivot #27071

Merged

This was referenced Dec 5, 2020

API: add EA._from_scalars / stricter casting of result values back to EA dtype #38315

Closed

ENH: use correct dtype in groupby cython ops when it is known (without try/except) #38291

Merged

rhshadrach mentioned this issue May 6, 2021

BUG: groupby sum, mean, var should always be floats #41139

Merged

4 tasks

ppletscher mentioned this issue May 31, 2021

BUG: groupby().agg( ) with min/max on Int64 leads to incorrect results #41743

Closed

1 task

mroeschke added the Enhancement label Aug 14, 2021

phofl mentioned this issue Aug 31, 2021

BUG: groupby.std() casts to float64 #43330

Closed

3 tasks

jbrockmendel mentioned this issue Feb 22, 2022

EA interface - requirements for "hashable, value+order-preserving ndarray" #33276

Open

jbrockmendel added the Closing Candidate label Jul 27, 2023

mroeschke closed this as completed May 10, 2025
