[C++] Decimal{128,256}::FromReal accuracy loss on non-small scale values #35576
The first part of the error is the down-cast from float64 (Python's default representation) to float32.
Same happens with numpy:
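A minimal sketch of that down-cast (assuming numpy; float32's 24-bit significand cannot represent 1234567890 exactly, so the value moves to the nearest representable float32):

```python
>>> import numpy as np
>>> float(np.float32(1234567890.0))  # nearest float32 to 1234567890
1234567936.0
```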
The second part of the error is, I think, likely due to the implementation: it looks like we might somehow be doing an extra cast through an intermediate value.
I think the second source of error might be arrow/cpp/src/arrow/util/decimal.cc, line 158 (at cd6e2a4), since this looks like it is done in float space (instead of casting to double), which potentially causes further loss of precision.
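As a rough illustration of the extra loss from staying in single precision (a numpy sketch, not the actual C++ code path; the scale factor of 100 is arbitrary):

```python
>>> import numpy as np
>>> x32 = np.float32(1234567936.0)              # value after the float32 down-cast
>>> float(x32 * np.float32(100.0))              # scaling in float32 space rounds again
123456790528.0
>>> float(np.float64(x32) * np.float64(100.0))  # scaling in float64 space stays exact here
123456793600.0
```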
```
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 18))
<pyarrow.lib.Decimal128Array object at 0x7f05f4a3f0a0>
[
  1234567889.999999973827018752
]
```
There are also very strange precision-related phenomena going on:

```
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 10))
<pyarrow.lib.Decimal128Array object at 0x7f05f49238e0>
[
  1234567890.0000000000
]
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 11))
<pyarrow.lib.Decimal128Array object at 0x7f05f4a3f1c0>
[
  1234567889.99999995904
]
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 12))
<pyarrow.lib.Decimal128Array object at 0x7f05f494f9a0>
[
  1234567890.000000057344
]
```
@rok @felipecrv Do either of you want to take a look at this?
The issue is with the algorithm used in `FromReal`:

```
>>> powers_of_ten = [1e-38, 1e-37, 1e-36, 1e-35, 1e-34, 1e-33, 1e-32, 1e-31, 1e-30, 1e-29, 1e-28,
...                  1e-27, 1e-26, 1e-25, 1e-24, 1e-23, 1e-22, 1e-21, 1e-20, 1e-19, 1e-18, 1e-17,
...                  1e-16, 1e-15, 1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6,
...                  1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5,
...                  1e6, 1e7, 1e8, 1e9, 1e10, 1e11, 1e12, 1e13, 1e14, 1e15, 1e16,
...                  1e17, 1e18, 1e19, 1e20, 1e21, 1e22, 1e23, 1e24, 1e25, 1e26, 1e27,
...                  1e28, 1e29, 1e30, 1e31, 1e32, 1e33, 1e34, 1e35, 1e36, 1e37, 1e38]
>>> int(powers_of_ten[38+10]*1234567890.)
12345678900000000000
>>> int(powers_of_ten[38+11]*1234567890.)
123456788999999995904
>>> int(powers_of_ten[38+12]*1234567890.)
1234567890000000057344
```

Also, it looks like I'm the culprit.
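One way to see why scale 10 happens to round-trip exactly while scales 11 and 12 do not (a plain-Python sketch; the deciding factor is whether the exact product fits into float64's 53-bit significand):

```python
>>> 1234567890. * 1e10 == 1234567890 * 10**10  # exact: the product fits in 53 bits
True
>>> 1234567890. * 1e11 == 1234567890 * 10**11  # inexact: the significand overflows
False
```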
The original algorithm for real-to-decimal conversion did its computations in the floating-point domain, accumulating rounding errors especially for large scale or precision values, such as:

```
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 11))
<pyarrow.lib.Decimal128Array object at 0x7f05f4a3f1c0>
[
  1234567889.99999995904
]
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 12))
<pyarrow.lib.Decimal128Array object at 0x7f05f494f9a0>
[
  1234567890.000000057344
]
```

The new algorithm strives to avoid precision loss by doing all its computations in the decimal domain. However, negative scales, which are presumably infrequent, fall back on the old algorithm.

* Closes: #35576

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
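A minimal sketch of the decimal-domain idea in plain Python (a hypothetical helper, not the actual C++ implementation): decompose the float into an exact integer ratio, then scale and round purely with integer arithmetic, so no floating-point rounding can occur:

```python
def float_to_scaled_int(x: float, scale: int) -> int:
    # Sketch only: assumes x is finite and non-negative.
    num, den = x.as_integer_ratio()  # exact binary value of x, no rounding
    # Compute round(x * 10**scale) entirely over the integers.
    q, r = divmod(num * 10**scale, den)
    if 2 * r >= den:                 # round half up
        q += 1
    return q

print(float_to_scaled_int(1234567890.0, 11))  # 123456789000000000000
```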
Describe the bug, including details regarding any error messages, version, and platform.
While the encodings of decimal128 and float32 are very different, the difference after the cast seems much larger than expected.
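A reproduction along the lines of the thread above (assuming pyarrow; the input value and target scale are taken from the earlier comments):

```python
import pyarrow as pa

# Down-cast to float32 first, then cast to a high-scale decimal; both
# steps can move the value away from 1234567890.
arr = pa.array([1234567890.0], type=pa.float32())
print(arr.cast(pa.decimal128(38, 18)))
```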
Component(s)
Python