Floating point precision in DataFrame.to_csv #2069
Comments
Hey all, I just started using Pandas a few days ago and ran into a related issue. Basically I am reading in data from a .csv file. I have been writing some unit tests and was getting some errors because my expected values were different from the ones I calculated in Excel. At first, I assumed it was due to rounding but when I inspected my data frame, I realized that I was getting errors because of floating point issues. Basically, an input price of 7.34 was now 7.3399999999999999 (I am working with stock prices). I was just wondering what the recommended way of dealing with this is, if any? Should I be converting my data frame to another type once imported? Thanks in advance for your help and great job on this solid library. |
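A minimal standard-library sketch of what is going on with the 7.34 price (an illustration only, not pandas internals): the value stored is the same binary double in every case, and only the number of digits requested for display changes. The variable names are made up for the example.

```python
from decimal import Decimal

# Parse the text "7.34" into a binary double, as a CSV reader would.
price = float("7.34")

# Shortest string that round-trips back to the same double.
shortest = repr(price)

# 17 significant digits: always enough to round-trip, but exposes the
# binary approximation error.
seventeen = "%.17g" % price

# The exact value actually stored in the double.
exact = Decimal(price)

print(shortest)
print(seventeen)
print(exact)

# Both strings denote the same float; nothing was lost on import.
same_float = float(seventeen) == price

# If exact decimal prices are required, Decimal built from the text keeps them.
price_dec = Decimal("7.34")
```

For unit tests against values computed in Excel, comparing with a tolerance (for example `math.isclose`) is usually more robust than comparing for exact equality.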
It seems that CPython does a better job of float formatting than NumPy. I'll see what I can do |
I can't manage to find a standalone reproduction of this. The csv module uses |
I think I've been able to reproduce this:
import pandas as pa
df = pa.DataFrame({'float' : [9.728141, 4.810295]})
df.to_csv('floats.csv')
|
What OS/Python/NumPy combination are you using? |
uname -a
sys.version
np.version
Edit: This does not happen (i.e. the output is as expected) on an EC2 node running starcluster with: uname -a
sys.version
np.version
|
Urgh, I've dug down into the belly of the Python interpreter and believe that the formatting eventually happens in the C stdlib, which means that Linux and OS X (BSD) have slightly different implementations. This is annoying as crap. |
If I understand correctly, the problem comes from trying to write the underlying ndarray directly. Is there a philosophical reason why there could not be a |
I guess the concern would be loss of precision |
It depends whether you're using the CSV file for display or storage (i.e. as a faithful reproduction of the DataFrame). You might argue that using CSVs for storage is a bad idea anyway, because if the DataFrame contains arbitrary objects, you'll only end up with their string representations. Especially when you can serialize the same data very easily. |
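To make the display-versus-storage distinction concrete, here is a small sketch using only the standard `csv` module (not pandas' writer). It assumes two hypothetical formatting policies: a fixed two-decimal format for display and `repr` for faithful storage.

```python
import csv
import io

values = [9.728141, 4.810295, 7.34]

def write_rows(fmt):
    """Write one value per row using the given formatter; return the cells."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for v in values:
        writer.writerow([fmt(v)])
    return buf.getvalue().split()

# Display: short and readable, but precision is discarded.
display = write_rows(lambda v: "%.2f" % v)

# Storage: repr() gives the shortest string that parses back to the same float.
storage = write_rows(repr)

round_tripped = [float(s) for s in storage]
print(display)
print(storage)
```

With the display policy the original values cannot be recovered from the file; with the storage policy they can.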
So the current workaround is to use Linux instead of Mac to get the results we want in the CSV file? |
I detected that read_csv has this bug too. It's not a Python format issue. It's not a general floating point issue either, although it's true that floating point arithmetic demands some care from the programmer. The article below clarifies this subject a bit: http://docs.python.org/2/tutorial/floatingpoint.html The problem is that it's necessary to employ fixed point arithmetic and only convert to floating point at the end, applying a convenient divisor. A classic one-liner which shows the "problem" is ...
... which does not display 0.3 as one would expect. On the other hand, if you handle the calculation using fixed point arithmetic and employ floating point arithmetic only in the last step, it will work as you expect. See this:
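A minimal sketch of the contrast being described, assuming the classic `0.1 + 0.2` example (this is an illustration, not necessarily the commenter's own snippet):

```python
# Floating point from the start: 0.1 and 0.2 have no exact binary form,
# so their sum is not the double closest to 0.3.
float_sum = 0.1 + 0.2
print(repr(float_sum))

# Fixed point: work in integer tenths, convert to float only at the end.
tenths = 1 + 2
fixed_sum = tenths / 10
print(repr(fixed_sum))
```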
So, it's necessary to account for the position of the decimal point, ignore it initially, and go ahead with the algorithm which converts text to integers (not floats!). The last step consists of converting an integer to a float by dividing by an adequate power of 10. If you desperately need to circumvent this problem quickly, I recommend you create another CSV file which contains all figures as integers, for example by multiplying by 100, 1000 or another factor which turns out to be convenient. Inside your application, read the CSV file as usual and you will get those integer values back. Then convert those values to floating point, dividing by the same factor you multiplied by before. |
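The text-to-integer approach described above could be sketched as follows. `parse_scaled` is a hypothetical helper written for this example (it is not part of pandas or the standard library), and it assumes plain decimal input such as "7.34" with no exponent or thousands separators.

```python
def parse_scaled(text, places=2):
    """Parse a decimal string into an int scaled by 10**places.

    Digits beyond `places` are truncated, not rounded.
    """
    text = text.strip()
    sign = -1 if text.startswith("-") else 1
    whole, _, frac = text.lstrip("+-").partition(".")
    # Pad or cut the fractional part to exactly `places` digits.
    frac = (frac + "0" * places)[:places]
    return sign * (int(whole or "0") * 10 ** places + int(frac or "0"))

# Prices held as integer cents; arithmetic on them is exact.
cents = [parse_scaled(p) for p in ["7.34", "2.66"]]
total_cents = sum(cents)

# Convert to floating point only in the last step.
total = total_cents / 100
print(cents, total_cents, total)
```

The standard library's `decimal.Decimal` offers a similar guarantee without manual scaling, at some cost in speed.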
closing in favor of #4668 |
@pmorissette Hi, have you found a solution? I run into this problem whenever I read decimals into a DataFrame and save them to another file. I don't want to use solutions like round or format. |
http://stackoverflow.com/questions/12877189/float64-with-pandas-to-csv
What does R (or others) do?