Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: improve the resulting dtype for groupby operations on nullable dtypes #37494

Open
jorisvandenbossche opened this issue Oct 29, 2020 · 6 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@jorisvandenbossche
Copy link
Member

Follow-up on #37433, and partly related to #37493

Currently, after groupby operations we try to cast back to the original dtype when possible (at least in case of extension arrays). But this is not always correct, and also not done consistently. Some examples using the test case from the mentioned PR using a nullable Int64 column as input:

In [1]: df = DataFrame(
   ...:     {
   ...:         "A": ["A", "B"] * 5,
   ...:         "B": pd.array([1, 2, 3, 4, 5, 6, 7, 8, 9, pd.NA], dtype="Int64"),
   ...:     }
   ...: )

In [2]: df.groupby("A")["B"].sum()
Out[2]: 
A
A    25
B    20
Name: B, dtype: Int64

In [3]: df.groupby("A")["B"].std()
Out[3]: 
A
A    3.162278
B    2.581989
Name: B, dtype: float64

In [4]: df.groupby("A")["B"].mean()
Out[4]: 
A
A    5
B    5
Name: B, dtype: Int64

In [5]: df.groupby("A")["B"].count()
Out[5]: 
A
A    5
B    4
Name: B, dtype: int64

So some observations:

  • For sum(), we correctly have Int64 for the result
  • For std(), we could use the nullable Float64 instead of float64 dtype
  • For mean(), we incorrectly cast back to Int64 dtype, as the result of mean should always be floating (in this case the casting just happened to work because the means were rounded numbers)
  • For count(), we did not create a nullable Int64 dtype for the result, while this could be done in the input is nullable
@jorisvandenbossche jorisvandenbossche added Groupby ExtensionArray Extending pandas with custom dtypes or arrays. NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Oct 29, 2020
@TomAugspurger
Copy link
Contributor

This feels similar to #31359, right? I guess that was (just?) for UDFs. In this case things should be much easier since each dtype should exactly know the right output dtype for each method, right?

@jorisvandenbossche
Copy link
Member Author

That PR does a try_cast_to_ea, which we introduced to try to preserve the extension dtype. But it is exactly that usage that can also give wrong results for known operations (like mean should not preserve int dtype, but give float). Indeed, as you say, for the built-in methods we know exactly the expected output dtype, and so we can have more specific / correct logic for this.

@jbrockmendel
Copy link
Member

TL;DR: there is one usage of maybe_cast_to_extension_array that is liable to produce incorrect results on UDFs, but the rest we can make reliable.

I've spent some time the last few days tracking down the EA-casting done in the groupby code. Most of them are done within calls to dtypes.cast.maybe_cast_result, which unfortunately special-cases dt64tz and categorical. In total we have:

  1. a try/except for _from_sequence in DataFrameGroupBy_cython_agg_blocks
  2. maybe_cast_result and DTA/PA constructor call in BaseGrouper._cython_operation
  3. 1 maybe_cast_result call in BaseGroupBy._cython_transform
    • can be replaced with numpy-only casting consolidated into one place at the end of BaseGrouper._cython_operation (no PR yet)
  4. 2 maybe_cast_result calls in BaseGroupBy._cython_agg_general
    • included in the fix for 3)
  5. 2 maybe_cast_result calls in BaseGroupBy._python_agg_general
    • This can be reduced to two calls to numpy-only maybe_downcast_to_dtype and one maybe_cast_to_ea/equivalent at the end of BaseGrouper._aggregate_series_pure_python.

It's this last one in _aggregate_series_pure_python that is the problem child. It is needed for 12 tests that use DecimalArray, 8 of them of the form:

from pandas.tests.extension.decimal.array import DecimalArray, make_data, to_decimal

data = make_data()[:5]
df = DataFrame(
    {"id1": [0, 0, 0, 1, 1], "id2": [0, 1, 0, 1, 1], "decimals": DecimalArray(data)}
)

expected = Series(to_decimal([data[0], data[3]]))

err_cls = RuntimeError  # parametrized over a bunch of Exception subclasses

def weird_func(x):
    # weird function that raise something other than TypeError or IndexError
    #  in _python_agg_general
    if len(x) == 0:
        raise err_cls
    return x.iloc[0]

result = df["decimals"].groupby(df["id1"]).agg(weird_func)
tm.assert_series_equal(result, expected, check_names=False)

Without the attempted EA construction we would end up with an object ndarray of Decimal objects.

The catch is that this assumes that _from_sequence is sufficiently strict (xref #33254) (which DecimalArray._from_sequence is). We are currently avoiding getting incorrect results for dt64tz cases because of a kludge in maybe_cast_result to explicitly exclude dt64tz.

Until _from_sequence is sufficiently strict, I'm inclined to break the Decimal casting (to return ndarray[object] of Decimals) rather than risk silently returning incorrect results for arbitrary EAs.

@jorisvandenbossche
Copy link
Member Author

@jbrockmendel thanks for the overview!

Until _from_sequence is sufficiently strict, I'm inclined to break the Decimal casting (to return ndarray[object] of Decimals) rather than risk silently returning incorrect results for arbitrary EAs.

AFAIK this doesn't only break decimal, but any EA with a user-defined function in groupby?

Copying your comment from https://github.com/pandas-dev/pandas/pull/38164/files#r536193430:

[about casting back in general] i am trying to get rid of that altogether, and only do cast-back in cases where we know it is correct. If/when _from_sequence is reliably strict we can reconsider.

Considering for a moment the _from_sequence issue is solved: are you then OK with this "try to cast back" for generic ops?

@jbrockmendel
Copy link
Member

Considering for a moment the _from_sequence issue is solved: are you then OK with this "try to cast back" for generic ops?

Once _from_sequence is reliably strict, I wouldn't have a serious problem with the "try to cast back" approach.

@jbrockmendel
Copy link
Member

The examples in the OP all now cast to Masked dtypes. Closable?

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Jul 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

No branches or pull requests

4 participants