ENH: improve the resulting dtype for groupby operations on nullable dtypes #37494
Comments
This feels similar to #31359, right? I guess that was (just?) for UDFs. In this case things should be much easier since each dtype should exactly know the right output dtype for each method, right?
That PR does a |
TL;DR: there is one usage of I've spent some time the last few days tracking down the EA-casting done in the groupby code. Most of them are done within calls to
It's this last one in
Without the attempted EA construction we would end up with an object ndarray of Decimal objects. The catch is that this assumes that Until _from_sequence is sufficiently strict, I'm inclined to break the Decimal casting (to return ndarray[object] of Decimals) rather than risk silently returning incorrect results for arbitrary EAs. |
@jbrockmendel thanks for the overview!
AFAIK this doesn't only break decimal, but any EA with a user-defined function in groupby? Copying your comment from https://github.com/pandas-dev/pandas/pull/38164/files#r536193430:
Considering for a moment the |
Once _from_sequence is reliably strict, I wouldn't have a serious problem with the "try to cast back" approach.
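For readers following along, here is a minimal sketch of what a "try to cast back" step could look like. `try_cast_back` is a hypothetical helper written for illustration, not the pandas internals discussed in this thread; it assumes the extension array's `_from_sequence` raises `TypeError` or `ValueError` when the values do not fit the dtype, which is exactly the strictness the approach depends on.

```python
import numpy as np
import pandas as pd


def try_cast_back(values, dtype):
    """Hypothetical helper: try to rebuild an extension array of `dtype`.

    Falls back to a plain ndarray when construction is rejected.
    """
    array_type = dtype.construct_array_type()
    try:
        return array_type._from_sequence(values, dtype=dtype)
    except (TypeError, ValueError):
        # The EA refused the values, so keep the uncast result.
        return np.asarray(values)


int64 = pd.Int64Dtype()

# Whole-number floats can be cast back, which hides the fact that the
# operation (e.g. a mean) really produced floating-point results.
whole = try_cast_back(np.array([2.0, 3.0]), int64)

# Fractional floats are rejected by the nullable integer array.
fractional = try_cast_back(np.array([1.5, 2.5]), int64)

print(type(whole).__name__, type(fractional).__name__)
```

The first call shows why casting back is only safe when the operation is known to preserve the dtype: the cast succeeds whenever the values happen to fit, regardless of what the operation means.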
The examples in the OP all now cast to Masked dtypes. Closable?
Follow-up on #37433, and partly related to #37493
Currently, after groupby operations we try to cast back to the original dtype when possible (at least in case of extension arrays). But this is not always correct, and also not done consistently. Some examples using the test case from the mentioned PR using a nullable Int64 column as input:
So some observations:
- For sum(), we correctly have Int64 for the result
- For std(), we could use the nullable Float64 instead of float64 dtype
- For mean(), we incorrectly cast back to Int64 dtype, as the result of mean should always be floating (in this case the casting just happened to work because the means were rounded numbers)
- For count(), we did not create a nullable Int64 dtype for the result, while this could be done if the input is nullable
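The code for the examples did not survive on this page, so the following is a rough sketch for checking the observations on your own installation, not the original snippet. The frame, the column names `key` and `value`, and the data are made up for illustration; the resulting dtypes depend on the pandas version, so none are asserted here.

```python
import pandas as pd

# Made-up data: a nullable Int64 column with one missing value.
df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b"],
        "value": pd.array([1, 3, 2, None], dtype="Int64"),
    }
)

grouped = df.groupby("key")["value"]

# Collect the result dtype of each reduction discussed above.
result_dtypes = {
    name: getattr(grouped, name)().dtype
    for name in ("sum", "std", "mean", "count")
}

for name, dtype in result_dtypes.items():
    print(f"{name}: {dtype}")

# The sums themselves (missing values are skipped by default).
totals = grouped.sum()
print(totals)
```

Comparing the printed dtypes against the list above shows which of the four cases your version still handles inconsistently.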