Skip to content

Floating point precision in DataFrame.to_csv #2069

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wesm opened this issue Oct 14, 2012 · 14 comments
Closed

Floating point precision in DataFrame.to_csv #2069

wesm opened this issue Oct 14, 2012 · 14 comments
Labels
IO CSV read_csv, to_csv Output-Formatting __repr__ of pandas objects, to_string
Milestone

Comments

@wesm
Copy link
Member

wesm commented Oct 14, 2012

http://stackoverflow.com/questions/12877189/float64-with-pandas-to-csv

What does R (or others) do?

@pmorissette
Copy link

Hey all,

I just started using Pandas a few days ago and ran into a related issue.

Basically I am reading in data from a .csv file. I have been writing some unit tests and was getting some errors because my expected values were different from the ones I calculated in Excel. At first, I assumed it was due to rounding but when I inspected my data frame, I realized that I was getting errors because of floating point issues. Basically, an input price of 7.34 was now 7.3399999999999999 (I am working with stock prices).

I was just wondering what the recommended way of dealing with this is, if any? Should I be converting my data frame to another type once imported?

Thanks in advance for your help and great job on this solid library.

@wesm
Copy link
Member Author

wesm commented Nov 3, 2012

It seems that CPython does a better job of float formatting than NumPy. I'll see what I can do

@wesm
Copy link
Member Author

wesm commented Nov 3, 2012

I can't manage to find a standalone reproduction of this. The csv module uses str (via PyObject_Str) to format the numbers, and that appears to work fine on numbers like 0.085 or 7.34. If someone can post an example illustrating this breaking down, I'll see what I can do

@adamobeng
Copy link
Contributor

I think I've been able to reproduce this:

    df = pa.DataFrame({'float' : [9.728141, 4.810295]})
    df.to_csv('floats.csv')

floats.csv looks like:

,float                                                                                                                                                                 
0,9.7281410000000008
1,4.810295

@wesm
Copy link
Member Author

wesm commented Nov 28, 2012

What OS/Python/NumPy combination are you using?

@adamobeng
Copy link
Contributor

uname -a

 Darwin boron 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64

sys.version

'2.7.3 (default, Nov  3 2012, 17:31:26) \n[GCC 4.2.1 Compatible Apple Clang 4.0 ((tags/Apple/clang-421.0.57))]'

np.version

1.6.2

Edit: This does not happen (i.e. the output is as expected) on an EC2 node running starcluster with:

uname -a

Linux master 3.0.0-14-virtual #23-Ubuntu SMP Mon Nov 21 21:09:11 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

sys.version

'2.7.2+ (default, Oct  4 2011, 20:06:09) \n[GCC 4.6.1]'

np.version

'1.6.2'

@wesm
Copy link
Member Author

wesm commented Nov 28, 2012

Urgh I've dug down into the belly of the Python interpreter and believe that the formatting is eventually happening in the C stdlib, which means that Linux and OS X (BSD) have slightly different implementations. This is annoying is crap.

@adamobeng
Copy link
Contributor

If I understand correctly, the problem comes from trying to write the underlying ndarray directly.

Is there a philosophical reason why there could not be a DataFrameFormatter for the CSV format, given that FloatArrayFormatter already takes care of this problem when outputting to LaTeX, HTML and plain text?

@wesm
Copy link
Member Author

wesm commented Nov 28, 2012

I guess the concern would be loss of precision

@adamobeng
Copy link
Contributor

It depends whether you're using the CSV file for display or storage (i.e. as a faithful reproduction of the DataFrame). You might argue that using CSVs for storage is a bad idea anyway, because if the DataFrame contains arbitrary objects, you'll only end up with their string representations. Especially when you can serialize the same data very easily.

@antonywu
Copy link

antonywu commented Apr 6, 2013

So the current workaround is to use Linux, instead of Mac to get the results we wanted in csv file?
Honestly, for display purpose, I would prefer the option to intentionally drop trailing digits (yes, I mean rounding)... I wonder if there is a way to make it happen with .to_csv()..or would I have to write my own .to_csv() with dataframe iteration + round()

@frgomes
Copy link

frgomes commented May 8, 2013

I detected that read_csv has this bug too.

It's not a Python format issue. It's not a general floating point issue, despite it's true that floating point arithmetic is a subject which demands some care from the programmer. This article below clarifies a bit this subject:

http://docs.python.org/2/tutorial/floatingpoint.html

The problem is that it's necessary to employ fixed point arithmetic and only convert to floating point in the end, applying a convenient divisor.

A classic one-liner which shows the "problem" is ...

0.1 + 0.1 + 0.1
0.30000000000000004

... which does not display 0.3 as one would expect. On the other hand, if you handle the calculation using fixed point arithmetic and only in the last step you employ floating point arithmetic, it will work as you expect. See this:

(1 + 1 + 1) * 1.0 / 10
0.3

So, it's necessary to account to the position of the decimal point, ignore it initially and go ahead with the algorithm which converts text to integers (not floats!). The last step consists on converting an integer to a float by dividing by an adequate power of 10.

If you desperately need to circumvent this problem quickly, I recommend you create another CSV file which contains all figures as integers, for example multiplying by 100, 1000 or other factor which turns out to be convenient. Inside your application, read the CSV file as usual and you will get those integer values back. Then convert those values to floating point, dividing by the same factor you multiplied before.

@jreback
Copy link
Contributor

jreback commented Sep 21, 2013

closing in favor of #4668

@jiahe224
Copy link

@pmorissette Hi, Have you found a solution? I found this problem whenever read decimals to dataframe and save as other file, I don't want to use solutions like round or format

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

No branches or pull requests

7 participants