DataFrame.groupby() causes loss of precision on large integer values #5260


Closed
nisaggarwal opened this issue Oct 18, 2013 · 4 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Groupby
Comments

@nisaggarwal

Here's what I have in a DataFrame to start with:

In [298]: g
Out[298]: 
                     ts  level
33  1382016600011617669     32
34  1382016600011625872     32
44  1382016600013590377     43
45  1382016600013598606     43

Running groupby() changes my 'ts' values, losing precision in the lowest digits. My guess is that somewhere internally this column
is converted to np.float64 and then cast back to np.int64.

In [299]: g.groupby('level').last()
Out[299]: 
                        ts
level                     
32     1382016600011625984
43     1382016600013598720
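The corrupted values are consistent with an int64 → float64 round trip: float64 carries a 53-bit significand, while these timestamps need about 61 bits, so the low bits get rounded away. A minimal, pandas-free sketch of the effect:

```python
import numpy as np

ts = np.int64(1382016600011625872)  # value from the frame above

# float64 has a 53-bit significand; integers of this magnitude (~2**61)
# are only representable to the nearest multiple of 2**(61-53) = 256.
roundtrip = np.int64(np.float64(ts))

print(roundtrip)       # 1382016600011625984, matching the groupby output
print(roundtrip - ts)  # 112 (rounded up to the nearest multiple of 256)
```

This reproduces the exact corrupted value in the `Out[299]` result above, without pandas involved at all.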

Type information:

In [300]: g.dtypes
Out[300]: 
ts       int64
level    int64
dtype: object

Version info:

In [303]: pd.__version__
Out[303]: '0.12.0'

In [304]: np.__version__
Out[304]: '1.7.1'
@jreback
Contributor

jreback commented Oct 18, 2013

yep that's exactly what happens
related #3007

not hard to fix by expanding the code generator to cover more types and then
dispatching them properly

would welcome a PR for this

@jtratner
Contributor

how would fused types help this?

@jreback
Contributor

jreback commented Oct 18, 2013

It's not fused types. For example, we need a definition for group_last_int64 (this part is easy, just a slight modification in the code generator). The harder part is then executing the groupby block-wise (e.g. by dtype) and dispatching, rather than just casting to float64 (and then back). It's a bit tricky because of the splitting/recombining, but more so because you have to handle errors at the Cython level when you have overflow, then upcast and do it again with a wider type. In some cases this can be predicted, so it's not a big deal (e.g. in the case of last, you know you always have the same return type, but sum could cause upcasting).
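The last point, that sum may need upcasting while last never does, is easy to see with plain NumPy, where int64 reductions silently wrap on overflow. This is only an illustration of the dtype problem, not of pandas internals:

```python
import numpy as np

a = np.array([2**62, 2**62], dtype=np.int64)

# "last" is trivially dtype-preserving: the result is just an element.
last = a[-1]
assert last.dtype == np.int64

# "sum" is not: 2**62 + 2**62 == 2**63 overflows int64 and wraps around.
wrapped = a.sum()
print(wrapped)  # -9223372036854775808, i.e. -2**63

# A dtype-aware kernel would have to detect this and retry with a
# wider type, e.g. float64.
exact = a.sum(dtype=np.float64)
print(exact)    # 9.223372036854776e+18
```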

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 19, 2014
@jreback
Contributor

jreback commented Mar 3, 2015

this was fixed by #9311

In [5]: df.groupby('level').last()
Out[5]: 
                        ts
level                     
32     1382016600011625872
43     1382016600013598606

@jreback jreback closed this as completed Mar 3, 2015