Weak precision performance for floating point values with non-negligible negative exponents #109
@steven-buytaert : Thank you for noticing this. I will look into it later this week, and try to figure out whether this is a minor bug/oversight somewhere or something more fundamental. Can you list the exact CMake option settings you were using?
Indeed.
I did a normal build, as follows:
And built my small sample program mentioned above. I think the issue is fundamental, i.e. underflow happening during the double-precision operations. You can increase the precision requirement in the print_config.h file, but the fundamental issue remains. What tipped me off was the fact that the autotest program only produced positive exponents, never negative ones. I was interested in this embedded printf because I wrote one myself. Kind regards, Steven
I meant fundamental in the sense that it can't easily be worked around (we already have some workarounds for precision issues in there: playing with the exponent, keeping track of remainders, etc.)
OK, here are the results of the investigation... the problem (not a bug) occurs in this piece of code:
And the two numbers: … Now we need to think about what to do about this. Some ideas:
Thoughts?
More info: the linear approximation of exp10 for the e-23 gives ~ -22.83, and for the e+23 gives ~ 23.15. So... maybe it's the (int) conversion that's the problem. Hmmm...
…nt. CAVEAT: This exposes an issue with printing values having exponent -308 (the minimum exponent for doubles); and we see some divergence from perfect correctness with higher-precision values.
@steven-buytaert : With the recent commit, your program now yields:
i.e. about 12 digits of precision. (Although it exposes an unrelated issue.) What do you think?
@eyalroz That is already much closer to the real Boltzmann constant, so it is an improvement, yes. With some more rounding, you could turn that 8999 into 9000, but the point is that you don't know where to start and stop rounding. The fundamental issue is that the floating-point operations you use are but an approximation. Operations that yield bits beyond the mantissa bits lose precision. I therefore decided not to use floating-point operations in my ecvt routine.
As I said, it depends on what you want to achieve. If you really need the precision to be 17 digits, floating-point operations are not good enough, unless you could use a higher-precision floating-point type, like the IEEE 754 128- or 256-bit variants. Of course, to format those themselves, you would still need to fall back to even wider integer arithmetic, as floating point would again be too limiting. In short, to format floating-point values with full precision, you always need a type wider than the one you want to format. Don't worry too much. Floating-point formatting is still a major minefield and black art, as you have undoubtedly found out. :-) BTW, when I look at that code of David Gay, he also seems to be using a much wider floating-point type when USE_BF96 is defined, exactly to achieve the proper precision for formatting 64-bit doubles. When this is enabled, a large table of powers of ten is included, inflating the code size. In my code, I have a very small table, but I do several exact scale-up or scale-down operations to bring the mantissa and the exponent into the proper range. Regards, Steven
* Added a couple of test cases exposing the behavior of the `#` modifier (alternative mode) together with `ll` (the long long modifier), specifically exercising the example format string mentioned in bug eyalroz#114.
* Our fix for eyalroz#109 was too eager to keep `FLAG_HASH`: it dropped the flag only when a precision wasn't specified. We're now going part of the way back, dropping `FLAG_HASH` even when a precision was specified, except for octal.
* The `long long` version of ntoa now behaves just like the `long` version.
Hi, I noticed that the autotest floating-point testing is biased towards large values. I tried the following example against the libc printf.
Output:
libc 1.3806515690000000e-23
this 1.3806515648376314e-23
As you can see, there is a serious precision issue due to arithmetic underflow. Depending on the application, this may or may not be problematic. In any case, it's something you should also check in the test framework.
If floating-point precision is or becomes an issue, I have a standalone implementation of ecvt, based on Grisu, that uses (very wide) integer multiplication/division to format floating point with the proper precision; as such, the formatting code does no floating-point arithmetic itself to format the floats. The total text size for ARM Thumb code with -Os is 1979 bytes, depending only on strcpy from libc. A good reference for this kind of floating-point formatting can be found here. This code produces the exact same bit representation of the Boltzmann constant as GNU libc. It can also format 32-bit IEEE 754 values directly, without going through the 64-bit IEEE 754 format. You can find this ecvt and the required wide-integer arithmetic code here.
Kind regards,
Steven