mean of int64 results in int64 instead of float64 #11199
Comments
Yeah, we try to cast back to the original dtype. The check for whether we can do this is not strict enough. Pull requests are welcome. The computation itself is actually done as float.
I'll take a peek. |
@iwschris, fyi I would start here - https://github.com/pydata/pandas/blob/master/pandas/core/common.py#L1293
we might need an additional test where you have |
Why do we try to cast back to the original dtype for the mean of integers? I think we should follow numpy's lead and have the output dtype only depend on the input dtypes. The mean of an integer column should always be a float. |
In general it is nice to try to cast back to the input type if you can; e.g., I wouldn't want to always upcast float32.
I was flying yesterday, but should get to look at this on the plane today. Might see some of you at Strata + Hadoop. I'll let you know if stuff goes wonky. |
Possibly related: #10972 (that's for transform). |
@iwschris any luck with this? |
Sorry Jeff, just went through conference season at work and haven't had a chance to review it. I'll look at it this weekend. |
np just sorting thru issues :) |
Alright. I have a solid test failure now, and have proven that the issue is using I'm going to add several more tests so that I can make sure we're covering the full spectrum of integer sizes. |
I was thinking we could just drop
The internal test for numpy is
and when that happens it ends up returning Before I go down that path further, do we want the cast-back behavior to be some absolute tolerance? If so, are we just dealing with an acceptable issue related to certain floats not being representable? I would think that we'd only want to cast back to an int when we're sure that we can do so safely, and I'm not 100% certain that we can know that. Any thoughts on this would be appreciated. |
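To make the representability concern concrete, here is a minimal pure-Python sketch. It is not the pandas implementation: `float_mean`, `loose_cast_back`, and `strict_cast_back` are hypothetical helpers, and Python's `float` (an IEEE 754 double) stands in for float64. It shows how a check that only asks "is the float result a whole number?" can cast back to an integer that is off by one.

```python
def float_mean(values):
    # Mean computed in double precision, as a float64 reduction would do.
    return float(sum(values)) / len(values)


def loose_cast_back(values):
    # Hypothetical "not strict enough" check: cast back to int whenever the
    # float result happens to be a whole number.
    mean = float_mean(values)
    return int(mean) if mean.is_integer() else mean


def strict_cast_back(values):
    # Hypothetical stricter check: return an int only when exact integer
    # arithmetic proves the mean is an integer; otherwise keep the float.
    quotient, remainder = divmod(sum(values), len(values))
    return quotient if remainder == 0 else float_mean(values)


# 2**53 + 1 is the smallest positive integer a double cannot represent.
big = [2**53 + 1, 2**53 + 1]
print(loose_cast_back(big))   # 9007199254740992 (off by one)
print(strict_cast_back(big))  # 9007199254740993 (exact)
```

The strict variant sidesteps the float entirely when deciding whether to cast, which is one way to be "sure that we can do so safely"; whether that cost is worth it is exactly the question raised above.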
Also, remember that this is only applicable for non-arithmetic aggregation functions on integers. The cast-back isn't necessary on Just wondering if we want to do the |
My two cents (and I know @jreback probably disagrees ;) ) is that we should not be casting back to integers for the result of mean under any circumstances. Just because it's "intuitive" does not mean it's a good thing. These sort of precision issues illustrate exactly why this is problematic. There is tremendous value in making operations predictable, and an important part of that is consistent dtypes. |
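The predictability argument can be illustrated with a small pure-Python sketch. This is an assumption-laden toy, not pandas code: `mean_with_cast_back` and `mean_always_float` are hypothetical names. With cast-back, the result type depends on the values in the column rather than on the input type alone.

```python
def mean_with_cast_back(values):
    # Hypothetical cast-back behavior: return an int when the mean is whole.
    mean = sum(values) / len(values)
    return int(mean) if mean.is_integer() else mean


def mean_always_float(values):
    # Output type depends only on the operation: always a float.
    return sum(values) / len(values)


for column in ([1, 3], [1, 2]):
    print(
        column,
        type(mean_with_cast_back(column)).__name__,
        type(mean_always_float(column)).__name__,
    )
# [1, 3] int float
# [1, 2] float float
```

Two integer columns of the same type produce different result types under cast-back, which is the kind of inconsistency that makes downstream code harder to reason about.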
Actually, I was going to say that I think we should not cast back on mean (e.g., essentially always return a float). See what tests break with this type of change. Further, we want an intuitive, predictable API that is as simple as possible. IIRC that is really the rationale here: what you put in is what you get out (but in this case it's a bit surprising).
Awesome. I'll get it done. |
22 test failures. I'm going to be working through each of them to make sure that we aren't overlooking anything. |
Ok, after playing around with some ideas and watching the test failures, I think this should be done as two PRs. In the first PR, I'd like to just tighten up our use of In the second PR, I'd like to look at removing downcasting entirely from operations where it doesn't make sense. Certainly Does that approach make sense?
Yep, seems reasonable to me. On Tue, Nov 17, 2015 at 7:38 PM, Chris Reynolds notifications@github.com
sure |
@jreback : Yep, can do. This bug is patched but needs a test. |
xref #15091
xref #3707
Dear pandas team
My environment is:
On it, is it possible that the bug #10172 has been re-introduced? For instance, with a csv file (mini.csv) with three int columns generated as per the code below: Once we load the file into pandas
results in the price mean being an int64
but if we add an extra column to the selection,
the price mean is a float64
Thanks in advance
Cristobal