pivot_table: precision lost #15091

Closed
mstanichenko opened this issue Jan 9, 2017 · 14 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Groupby Numeric Operations Arithmetic, Comparison, and Logical operations

Comments

@mstanichenko

Code Sample, a copy-pastable example if possible

In [29]: df = pd.DataFrame([[1, 2, 11111111111111111]], columns=['index', 'type', 'value'])

In [30]: df.dtypes
Out[30]: 
index    int64
type     int64
value    int64
dtype: object

In [31]: df.pivot_table(index='index', columns='type', values='value')
Out[31]: 
type                   2
index                   
1      11111111111111112

Problem description

Since value is a 64-bit integer, we should be able to present it with zero precision loss.
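(A sketch of where the loss actually comes from, using only plain Python: float64 has a 53-bit significand, so integers above 2**53 are not all exactly representable, and the example value is past that threshold.)

```python
n = 11111111111111111

# float64 has a 53-bit significand, so exact integer representation
# stops being guaranteed above 2**53.
assert n > 2**53

# A round trip through float lands on the nearest representable double,
# which is a different integer.
print(int(float(n)))  # 11111111111111112
```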

Expected Output

In [31]: df.pivot_table(index='index', columns='type', values='value')
Out[31]: 
type                   2
index                   
1      11111111111111111

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0.post20161110
Cython: 0.23.4
numpy: 1.11.2
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.6.0
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.2
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0

@jreback
Contributor

jreback commented Jan 9, 2017

this is passing through float

In [5]: np.int64(11111111111111111).astype(float).astype('int64')
Out[5]: 11111111111111112

very similar to this

In [7]: df.groupby('index').value.mean()
Out[7]: 
index
1    11111111111111112
Name: value, dtype: int64

so even though it's cast back, it will lose precision.

not really sure of a great way to prevent this and simultaneously deal with overflow.

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Groupby Numeric Operations Arithmetic, Comparison, and Logical operations labels Jan 9, 2017
@jorisvandenbossche
Member

You can use pivot instead of pivot_table for now to circumvent this:

In [6]: df.pivot(index='index', columns='type', values='value')
Out[6]: 
type                   2
index                   
1      11111111111111111

The thing is that pivot_table takes a mean, and does not just reproduce the original value, so this is the reason you see the precision error.

In [10]: df['value'].mean()
Out[10]: 1.1111111111111112e+16

So not sure there is something to do about it.
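(If exactness matters, one workaround, sketched here with arbitrary-precision Python ints rather than any pandas built-in, is to do the aggregation in object space and only then reshape.)

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 11111111111111111]],
                  columns=['index', 'type', 'value'])

# Python ints are arbitrary precision, so summing them directly avoids
# the float64 intermediate that mean() passes through.
exact = df.groupby(['index', 'type'])['value'].apply(
    lambda s: sum(int(v) for v in s) // len(s))

print(exact.unstack('type'))  # keeps 11111111111111111 exactly
```

Note that `//` floors the result, so this only coincides with mean() when the group mean is integral.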

@jreback
Contributor

jreback commented Jan 9, 2017

xref to #3707

@jorisvandenbossche
Member

@jreback Do we have a general issue about the precision of int64 algos like mean? As this is not specific to pivot_table, I would close this.
Although it is specific to pivot_table/groupby that we cast back to int instead of leaving it as a float (which masks the reason for the precision error), that is covered here: #11199 (comment)

@jreback
Contributor

jreback commented Jan 9, 2017

oh that whole issue is better anyhow #11199

@jreback jreback closed this as completed Jan 9, 2017
@jreback jreback added this to the No action milestone Jan 9, 2017
@mstanichenko
Author

You can use pivot instead of pivot_table for now

Unfortunately, pivot is not able to handle a MultiIndex.
I deliberately simplified the example to present the point.

@jorisvandenbossche
Member

@mstanichenko pivot is specifically defined not to handle aggregations (pure reshaping function), but otherwise there should not be more restrictions in comparison to pivot_table (or otherwise this may be a bug). Can you give a small example of how you want to use multiindex? (if it is still without aggregation of course)

@mstanichenko
Author

@jorisvandenbossche, I don't really need a multiindex.
You are right that I need just to reshape a dataframe. Unfortunately, pivot didn't work for me.

Can you give a small example of how you want to use multiindex?

I have a set of components and subcomponents which regularly take timestamps.
The timestamps can be of various types. Therefore the dataframe looks like the following.

In [41]: df_ts.columns
Out[41]: Index(['component_id', 'subcomponent_id', 'timestamp_type', 'timestamp_value'], dtype='object')

All I want to do is

df = df_ts.pivot(index=['component_id', 'subcomponent_id'], columns='timestamp_type', values='timestamp_value').reset_index()

@jorisvandenbossche
Member

So with a dummy example, something like this?

In [73]: df = pd.DataFrame({'A':[0,0,0,1,1,1], 'B':[0,1,0,1,0,1], 'C':[0,0,1,1,1,0], 'D':[1,2,3,4,5,6]})

In [74]: df
Out[74]: 
   A  B  C  D
0  0  0  0  1
1  0  1  0  2
2  0  0  1  3
3  1  1  1  4
4  1  0  1  5
5  1  1  0  6

In [75]: df.pivot(index=['A', 'B'], columns='C', values='D')
...
ValueError: Wrong number of items passed 6, placement implies 2

In [76]: df.pivot_table(index=['A', 'B'], columns='C', values='D')
Out[76]: 
C      0    1
A B          
0 0  1.0  3.0
  1  2.0  NaN
1 0  NaN  5.0
  1  6.0  4.0

where you would expect the output of pivot being the same as pivot_table ?

@mstanichenko
Author

mstanichenko commented Jan 9, 2017

@jorisvandenbossche, sorry, but I don't quite follow.

Your dummy example looks fine to me. I need the output that pivot_table gives us:

In [48]: df = pd.DataFrame({'A':[0,0,0,1,1,1], 'B':[0,1,0,1,0,1], 'C':[0,0,1,1,1,0], 'D':[1,2,3,4,5,6]})

In [49]: df.pivot_table(index=['A', 'B'], columns='C', values='D').reset_index()
Out[49]: 
C  A  B    0    1
0  0  0  1.0  3.0
1  0  1  2.0  NaN
2  1  0  NaN  5.0
3  1  1  6.0  4.0

I never claimed that pivot and pivot_table are the same. I just don't understand how one can come up with the desired output with pivot.

@jorisvandenbossche
Member

My point is that pivot should (maybe) also be able to do this (so give the same result as pivot_table in this case). As my original comment was: if you don't do any aggregation but pure reshape, you can use pivot instead of pivot_table (which didn't work for you because of the above).

BTW, you can mimic pivot with a combination of set_index and unstack:

In [95]: df.set_index(['index', 'type']).unstack()
Out[95]: 
                   value
type                   2
index                   
1      11111111111111111

but whether that helps against the precision loss will depend on whether there are NaNs in the result or not (as it is cast to float anyway in that case)
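(To make that caveat concrete with the dummy frame from a few comments up: as soon as an index/column combination is missing, unstack fills it with NaN and the result is cast to float.)

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 1, 1, 1],
                   'B': [0, 1, 0, 1, 0, 1],
                   'C': [0, 0, 1, 1, 1, 0],
                   'D': [1, 2, 3, 4, 5, 6]})

res = df.set_index(['A', 'B', 'C'])['D'].unstack('C')

# Two (A, B, C) combinations are absent from the data, so two cells
# are NaN and both columns come back as float64 instead of int64.
print(res)
print(res.isna().sum().sum())  # 2
```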

@mstanichenko
Author

Could you please provide a dataframe for which pivot would work?
Although I get your point that pivot just reshapes the data, it does not even work for the simplest case:

In [53]: df = pd.DataFrame({'A':[0], 'B':[0], 'C':[1], 'D':[2]})

In [54]: df
Out[54]: 
   A  B  C  D
0  0  0  1  2

In [55]: df.pivot(index=['A', 'B'], columns='C', values='D')
IndexError: index 1 is out of bounds for size 1

@jreback
Contributor

jreback commented Jan 9, 2017

In [28]: df.set_index(['A', 'B', 'C']).unstack()
Out[28]: 
     D
C    1
A B   
0 0  2

what would you expect [55] to yield?

@mstanichenko
Author

@jreback, I expect [55] to give the output [28] of your example.

@jreback, @jorisvandenbossche, thank you very much for the useful tips. I managed to mimic pivot_table behavior without mangling the underlying data.

Nevertheless, it's rather difficult to spot that the data were actually changed by pandas, as there are no warnings that might catch one's attention. In the meantime I managed to bump into another (although very similar) issue (#8596).
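(One way to catch such a silent change, sketched as a manual check rather than anything pandas provides: round-trip the reshaped values back to int64 and compare with the originals. Comparing through float would hide the loss, because the original int is cast to the same rounded double.)

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 11111111111111111]],
                  columns=['index', 'type', 'value'])

out = df.pivot_table(index='index', columns='type', values='value')

original = df.set_index(['index', 'type'])['value']
roundtrip = out.stack().astype('int64')

# equals() compares values and dtype exactly; the float detour shows up
# as a mismatch in the last digit.
if not roundtrip.sort_index().equals(original.sort_index()):
    print('warning: values changed during reshape')
```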

From my limited experience, data types are the most frequent source of issues and the most annoying thing for me :-)

Again, thank you very much for your assistance :-)
