mean of int64 results in int64 instead of float64 #11199

Closed
c-garcia opened this issue Sep 27, 2015 · 24 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby Numeric Operations Arithmetic, Comparison, and Logical operations
Comments

@c-garcia

c-garcia commented Sep 27, 2015

xref #15091
xref #3707

Dear pandas team

My environment is:

numpy==1.9.3
pandas==0.16.2

On it, is it possible that bug #10172 has been re-introduced? For instance, take a CSV file (mini.csv) with three int columns, generated by the code below:

import random

def contents(f, sc1=10, sc2=1000, cnt=10000):
    for i in range(cnt):
        zone = random.choice(range(10))
        val1 = random.randint(0, zone * sc1)
        val2 = random.randint(0, zone * sc2)
        f.write("%d,%d,%d\n" % (zone, val1, val2))

with open("mini.csv", "w") as f:
    f.write("zipcode,sqft,price\n")
    contents(f, sc2=1000000, cnt=50000)

Once we load the file into pandas

import pandas
mini = pandas.read_csv("mini.csv")
mini.groupby('zipcode')[['price']].mean()

the resulting price mean is an int64:

          price
zipcode
0              0
1         499960
2        1005490
3        1465088
4        2001135
5        2495200
6        2993253
7        3569320
8        4076548
9        4416133

but if we add an extra column to the selection,

mini.groupby('zipcode')[['price','sqft']].mean()

the price mean is a float64:

                  price       sqft
zipcode
0              0.000000   0.000000
1         499960.563138   5.000400
2        1005490.239062   9.928459
3        1465088.765919  14.922507
4        2001135.222045  20.148962
5        2495200.657097  25.160088
6        2993253.872691  29.624297
7        3569320.017926  35.089428
8        4076548.696706  39.605981
9        4416133.246409  44.851756

Thanks in advance

Cristobal

@jreback
Contributor

jreback commented Sep 27, 2015

yeah, we try to cast back to the original dtype, but the check for whether we can do this safely is not strict enough (the computation itself is actually done as float). Pull requests are welcome.

@jreback jreback added Bug Groupby Dtype Conversions Unexpected or buggy dtype conversions labels Sep 27, 2015
@jreback jreback added this to the 0.17.1 milestone Sep 27, 2015
@jreback
Contributor

jreback commented Sep 27, 2015

xref to #9345 (though IIRC some other PR's since then)

cc @iwschris

@chrisbyboston

I'll take a peek.

@chris-b1
Contributor

@iwschris, fyi I would start here - https://github.com/pydata/pandas/blob/master/pandas/core/common.py#L1293

np.allclose seems to have relatively liberal defaults:

In [1]: np.allclose([49973200.3181996], [49973200])
Out[1]: True

In [2]: np.allclose([49973200.3181996], [49973200], rtol=1e-10)
Out[2]: False

@jreback
Contributor

jreback commented Sep 27, 2015

we might need an additional test where you have float32 columns (which get computed using float64), then downcast back to float32 (This should work for that)

@shoyer
Member

shoyer commented Sep 28, 2015

Why do we try to cast back to the original dtype for the mean of integers? I think we should follow numpy's lead and have the output dtype only depend on the input dtypes. The mean of an integer column should always be a float.
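For reference, numpy already behaves this way; a minimal check:

```python
import numpy as np

# numpy's mean of an integer array is always a float,
# even when the result happens to be a whole number
arr = np.array([2, 4, 6], dtype=np.int64)
print(arr.mean())        # 4.0
print(arr.mean().dtype)  # float64
```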

@jreback
Contributor

jreback commented Sep 28, 2015

@shoyer

in general it is a nice thing to try to cast back to the input dtype if you can;
in this case it's a bug, as the precision check is too loose

e.g. I wouldn't want to always upcast float32,
or return float just because I am doing a max, for example
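The distinction shows up directly in numpy: exact reductions like max keep the input dtype, and float32 stays float32 through mean (a quick illustration):

```python
import numpy as np

# max is exact, so numpy preserves the integer dtype
ints = np.array([1, 5, 3], dtype=np.int32)
print(ints.max().dtype)  # int32

# mean of float32 comes back as float32, not upcast to float64
floats = np.array([1.5, 2.5], dtype=np.float32)
print(floats.mean().dtype)  # float32
```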

@chrisbyboston

I was flying yesterday, but should get to look at this on the plane today. Might see some of you at Strata + Hadoop. I'll let you know if stuff goes wonky.

@TomAugspurger
Contributor

Possibly related: #10972 (that's for transform).

@jreback
Contributor

jreback commented Nov 13, 2015

@iwschris any luck with this?

@jreback jreback modified the milestones: Next Major Release, 0.17.1 Nov 13, 2015
@chrisbyboston

Sorry Jeff, just went through conference season at work and haven't had a chance to review it. I'll look at it this weekend.

@jreback
Contributor

jreback commented Nov 13, 2015

np

just sorting thru issues :)

@chrisbyboston

Alright. I have a solid test failure now, and have confirmed that the issue is the use of np.allclose with its defaults. In particular, the relative tolerance argument rtol is too high.

I'm going to add several more tests so that I can make sure we're covering the full spectrum of integer sizes.

@chrisbyboston

I was thinking we could just drop rtol to 0, and use only absolute tolerance. Here I'm using the default atol of 1e-08:

In [1]: np.allclose([2147483647.0000001], [2147483647], rtol=0, atol=1e-08)
Out[1]: True

In [2]: np.allclose([2147483647.0000002], [2147483647], rtol=0, atol=1e-08)
Out[2]: False

In [3]: np.allclose([2147483648.0000002], [2147483648], rtol=0, atol=1e-08)
Out[3]: True

In [4]: np.allclose([2147483648.00000024], [2147483648], rtol=0, atol=1e-08)
Out[4]: False

numpy's internal test is absolute(a - b) <= (atol + rtol * absolute(b)), but it looks like some of those subtractions end up giving us 0:

In [5]: 2147483647.0000001 - 2147483647.0
Out[5]: 0.0

and when that happens it ends up returning True. I was hoping to make the behavior consistent so that within a certain absolute tolerance, it would be understood that pandas was going to cast back to an integer.

Before I go down that path further, do we want the cast-back behavior to be some absolute tolerance? If so, are we just dealing with an acceptable issue related to certain floats not being representable? I would think that we'd only want to cast back to an int when we're sure that we can do so safely, and I'm not 100% certain that we can know that.

Any thoughts on this would be appreciated.
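The zero subtractions come straight from float64 granularity near 2**31: adjacent representable doubles there are a few 1e-7 apart, so the fractional part of the literal is already lost at parse time. A quick illustration:

```python
import numpy as np

# ULP (gap to the next representable double) around 2**31
print(np.spacing(2147483647.0))  # 2**-22, about 2.38e-07
print(np.spacing(2147483648.0))  # 2**-21, about 4.77e-07

# 2147483647.0000001 rounds to exactly 2147483647.0 when parsed,
# so the subtraction inside allclose is exactly zero
print(2147483647.0000001 - 2147483647.0)  # 0.0
```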

@chrisbyboston

Also, remember that this is only applicable for non-arithmetic aggregation functions on integers. The cast-back isn't necessary on first, last, nth, min, max, or count, because of the change I did earlier this year where we keep the integers as ints throughout the cython functions and back: 5f6cbf8

Just wondering if we want to do the allclose thing at all or just yank it out.

@shoyer
Member

shoyer commented Nov 16, 2015

My two cents (and I know @jreback probably disagrees ;) ) is that we should not be casting back to integers for the result of mean under any circumstances. Just because it's "intuitive" does not mean it's a good thing. These sort of precision issues illustrate exactly why this is problematic. There is tremendous value in making operations predictable, and an important part of that is consistent dtypes.

@jreback
Contributor

jreback commented Nov 16, 2015

actually I was going to say that I think we should not cast back on mean (eg essentially return a float always)

see what tests break with this type of change - further we want an intuitive / predictable API that is as least complex as possible

iirc that is really the rationale here - what you put in is what you get out (but in this case it's a bit surprising)

@chrisbyboston

Awesome. I'll get it done.

@chrisbyboston

22 test failures. I'm going to be working through each of them to make sure that we aren't overlooking anything.

@chrisbyboston

Ok after playing around with some ideas and watching the test failures, I think this should be done as two PR's.

In the first PR, I'd like to just tighten up our use of np.allclose, because it's way too loose right now. In that PR, I'm going to ignore the fact that mean shouldn't downcast at all, because that's really a separate issue. Currently _possibly_downcast_to_dtype is used in quite a few places and the result can be rather staggering with the current default args for np.allclose.

In the second PR, I'd like to look at removing downcasting entirely from operations where it doesn't make sense. Certainly mean is one of those, but there may be others as well.

Does that approach make sense?

@shoyer
Member

shoyer commented Nov 18, 2015

Yep, seems reasonable to me.


@jreback
Contributor

jreback commented Nov 18, 2015

sure

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 12, 2016
@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Aug 21, 2016
@jreback jreback added the Numeric Operations Arithmetic, Comparison, and Logical operations label Jan 9, 2017
@jreback jreback modified the milestones: 0.20.0, 0.21.0 Mar 23, 2017
@jreback
Contributor

jreback commented May 31, 2017

#16549 should fix this.

if someone wants to add a test.

cc @gfyoung

@gfyoung
Member

gfyoung commented Jun 11, 2017

@jreback : Yep, can do. This bug is patched but needs a test.
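A regression test for this could look roughly like the following (a sketch, not the actual test added in the PR):

```python
import pandas as pd

# After the fix, groupby().mean() on an int64 column should return
# float64, even when every group mean happens to be a whole number
df = pd.DataFrame({"zipcode": [0, 0, 1, 1], "price": [2, 4, 10, 20]})
result = df.groupby("zipcode")[["price"]].mean()
print(result["price"].dtype)  # float64, not cast back to int64
```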

gfyoung added a commit to forking-repos/pandas that referenced this issue Jun 11, 2017
jreback pushed a commit that referenced this issue Jun 11, 2017