pivot_table: precision lost #15091

Closed
mstanichenko opened this issue Jan 9, 2017 · 14 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Groupby Numeric Operations Arithmetic, Comparison, and Logical operations

Comments

@mstanichenko

Code Sample, a copy-pastable example if possible

In [29]: df = pd.DataFrame([[1, 2, 11111111111111111]], columns=['index', 'type', 'value'])

In [30]: df.dtypes
Out[30]: 
index    int64
type     int64
value    int64
dtype: object

In [31]: df.pivot_table(index='index', columns='type', values='value')
Out[31]: 
type                   2
index                   
1      11111111111111112

Problem description

Since value is a 64-bit integer, we should be able to present it with zero precision loss.
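(A sketch of where the loss actually comes from, using only plain Python: float64 has a 53-bit significand, so integers above 2**53 are not all exactly representable, and the example value is past that threshold.)

```python
n = 11111111111111111

# float64 has a 53-bit significand, so exact integer representation
# stops being guaranteed above 2**53.
assert n > 2**53

# A round trip through float lands on the nearest representable double,
# which is a different integer.
print(int(float(n)))  # 11111111111111112
```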

Expected Output

In [31]: df.pivot_table(index='index', columns='type', values='value')
Out[31]: 
type                   2
index                   
1      11111111111111111

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0.post20161110
Cython: 0.23.4
numpy: 1.11.2
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.6.0
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.2
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0

@jreback
Contributor

jreback commented Jan 9, 2017

this is passing through float

In [5]: np.int64(11111111111111111).astype(float).astype('int64')
Out[5]: 11111111111111112

very similar to this

In [7]: df.groupby('index').value.mean()
Out[7]: 
index
1    11111111111111112
Name: value, dtype: int64

so even though it's cast back, it will lose precision.

not really sure of a great way to prevent this and simultaneously deal with overflow.

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Groupby Numeric Operations Arithmetic, Comparison, and Logical operations labels Jan 9, 2017
@jorisvandenbossche
Member

You can use pivot instead of pivot_table for now to circumvent this:

In [6]: df.pivot(index='index', columns='type', values='value')
Out[6]: 
type                   2
index                   
1      11111111111111111

The thing is that pivot_table takes a mean, and does not just reproduce the original value, so this is the reason you see the precision error.

In [10]: df['value'].mean()
Out[10]: 1.1111111111111112e+16

So not sure there is something to do about it.
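(If exactness matters, one workaround, sketched here with arbitrary-precision Python ints rather than any pandas built-in, is to do the aggregation in object space and only then reshape.)

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 11111111111111111]],
                  columns=['index', 'type', 'value'])

# Python ints are arbitrary precision, so summing them directly avoids
# the float64 intermediate that mean() passes through.
exact = df.groupby(['index', 'type'])['value'].apply(
    lambda s: sum(int(v) for v in s) // len(s))

print(exact.unstack('type'))  # keeps 11111111111111111 exactly
```

Note that `//` floors the result, so this only coincides with mean() when the group mean is integral.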

@jreback
Contributor

jreback commented Jan 9, 2017

xref to #3707

@jorisvandenbossche
Member

@jreback Do we have a general issue about the precision of int64 algos like mean? As this is not specific to pivot_table, I would close this.
Although it is specific to pivot_table/groupby that we cast back to int instead of leaving it as a float (which masks the reason for the precision error), that is covered here: #11199 (comment)

@jreback
Contributor

jreback commented Jan 9, 2017

oh that whole issue is better anyhow #11199

@jreback jreback closed this as completed Jan 9, 2017
@jreback jreback added this to the No action milestone Jan 9, 2017
@mstanichenko
Author

You can use pivot instead of pivot_table for now

Unfortunately, pivot is not able to handle a MultiIndex.
I deliberately simplified the example to present the point.

@jorisvandenbossche
Member

@mstanichenko pivot is specifically defined not to handle aggregations (pure reshaping function), but otherwise there should not be more restrictions in comparison to pivot_table (or otherwise this may be a bug). Can you give a small example of how you want to use multiindex? (if it is still without aggregation of course)

@mstanichenko
Author

@jorisvandenbossche, I don't really need a multiindex.
You are right that I need just to reshape a dataframe. Unfortunately, pivot didn't work for me.

Can you give a small example of how you want to use multiindex?

I have a set of components and subcomponents which regularly take timestamps.
The timestamps can be of various types. Therefore the dataframe looks like the following.

In [41]: df_ts.columns
Out[41]: Index(['component_id', 'subcomponent_id', 'timestamp_type', 'timestamp_value'], dtype='object')

All I want to do is

df = df_ts.pivot(index=['component_id', 'subcomponent_id'], columns='timestamp_type', values='timestamp_value').reset_index()

@jorisvandenbossche
Member

So with a dummy example, something like this?

In [73]: df = pd.DataFrame({'A':[0,0,0,1,1,1], 'B':[0,1,0,1,0,1], 'C':[0,0,1,1,1,0], 'D':[1,2,3,4,5,6]})

In [74]: df
Out[74]: 
   A  B  C  D
0  0  0  0  1
1  0  1  0  2
2  0  0  1  3
3  1  1  1  4
4  1  0  1  5
5  1  1  0  6

In [75]: df.pivot(index=['A', 'B'], columns='C', values='D')
...
ValueError: Wrong number of items passed 6, placement implies 2

In [76]: df.pivot_table(index=['A', 'B'], columns='C', values='D')
Out[76]: 
C      0    1
A B          
0 0  1.0  3.0
  1  2.0  NaN
1 0  NaN  5.0
  1  6.0  4.0

where you would expect the output of pivot being the same as pivot_table ?

@mstanichenko
Author

mstanichenko commented Jan 9, 2017

@jorisvandenbossche, sorry, but I don't quite follow.

Your dummy example looks fine to me. I need the output that pivot_table gives us:

In [48]: df = pd.DataFrame({'A':[0,0,0,1,1,1], 'B':[0,1,0,1,0,1], 'C':[0,0,1,1,1,0], 'D':[1,2,3,4,5,6]})

In [49]: df.pivot_table(index=['A', 'B'], columns='C', values='D').reset_index()
Out[49]: 
C  A  B    0    1
0  0  0  1.0  3.0
1  0  1  2.0  NaN
2  1  0  NaN  5.0
3  1  1  6.0  4.0

I never claimed that pivot and pivot_table are the same. I just don't understand how one can come up with the desired output with pivot.

@jorisvandenbossche
Member

My point is that pivot should (maybe) also be able to do this (so give the same result as pivot_table in this case). As my original comment was: if you don't do any aggregation but pure reshape, you can use pivot instead of pivot_table (which didn't work for you because of the above).

BTW, you can mimic pivot with a combination of set_index and unstack:

In [95]: df.set_index(['index', 'type']).unstack()
Out[95]: 
                   value
type                   2
index                   
1      11111111111111111

but whether that helps against the precision loss will depend on whether there are NaNs in the result or not (as it is cast to float anyway in that case)
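(To make that caveat concrete with the dummy frame from a few comments up: as soon as an index/column combination is missing, unstack fills it with NaN and the result is cast to float.)

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 1, 1, 1],
                   'B': [0, 1, 0, 1, 0, 1],
                   'C': [0, 0, 1, 1, 1, 0],
                   'D': [1, 2, 3, 4, 5, 6]})

res = df.set_index(['A', 'B', 'C'])['D'].unstack('C')

# Two (A, B, C) combinations are absent from the data, so two cells
# are NaN and both columns come back as float64 instead of int64.
print(res)
print(res.isna().sum().sum())  # 2
```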

@mstanichenko
Author

Could you please provide a dataframe for which pivot would work?
Although I get your point that pivot just reshapes the data, it does not even work for the simplest case:

In [53]: df = pd.DataFrame({'A':[0], 'B':[0], 'C':[1], 'D':[2]})

In [54]: df
Out[54]: 
   A  B  C  D
0  0  0  1  2

In [55]: df.pivot(index=['A', 'B'], columns='C', values='D')
IndexError: index 1 is out of bounds for size 1

@jreback
Contributor

jreback commented Jan 9, 2017

In [28]: df.set_index(['A', 'B', 'C']).unstack()
Out[28]: 
     D
C    1
A B   
0 0  2

what would you expect [55] to yield?

@mstanichenko
Author

@jreback, I expect [55] to give the output [28] of your example.

@jreback, @jorisvandenbossche, thank you very much for the useful tips. I managed to mimic pivot_table behavior without mangling the underlying data.

Nevertheless, it's rather difficult to spot that the data were actually changed by pandas, as there are no warnings that might catch one's attention. In the meantime I managed to bump into another (although very similar) issue (#8596).
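(One way to catch such a silent change, sketched as a manual check rather than anything pandas provides: round-trip the reshaped values back to int64 and compare with the originals. Comparing through float would hide the loss, because the original int is cast to the same rounded double.)

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 11111111111111111]],
                  columns=['index', 'type', 'value'])

out = df.pivot_table(index='index', columns='type', values='value')

original = df.set_index(['index', 'type'])['value']
roundtrip = out.stack().astype('int64')

# equals() compares values and dtype exactly; the float detour shows up
# as a mismatch in the last digit.
if not roundtrip.sort_index().equals(original.sort_index()):
    print('warning: values changed during reshape')
```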

From my limited experience, data types are the most frequent source of issues and the most annoying thing for me :-)

Again, thank you very much for your assistance :-)
