
[C++] Decimal{128,256}::FromReal accuracy loss on non-small scale values #35576

Closed
wirable23 opened this issue May 12, 2023 · 5 comments · Fixed by #35997

@wirable23
Describe the bug, including details regarding any error messages, version, and platform.

>>> a = pa.array([545803904.0], type=pa.float32())
>>> a.cast(pa.decimal128(38, 18))
<pyarrow.lib.Decimal128Array object at 0x000001DE8FF59840>
[
  545803886.966396699654750208
]

While the encodings of decimal128 and float32 are very different, the difference after the cast seems much larger than expected.
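For what it's worth, 545803904 is a multiple of 64 below 2**30, so it fits float32's 24-bit significand exactly; the value itself survives a float32 round-trip, which suggests the loss happens inside the cast. A quick stdlib-only check (`struct` packs to IEEE binary32):

```python
import struct

# 545803904 = 8528186 * 2**6 and 8528186 < 2**24, so the value is an
# exact float32; packing to binary32 and back loses nothing.
x = 545803904.0
roundtrip = struct.unpack("f", struct.pack("f", x))[0]
print(roundtrip == x)  # True
```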

Component(s)

Python


emkornfield commented May 15, 2023

The first part of the error comes from down-casting float64 (Python's default representation) to float32:

>>> a = pa.array([545803904.0], type=pa.float32())
>>> a
<pyarrow.lib.FloatArray object at 0x3ebb737bb400>
[
  545803900
]

The same happens with numpy:

>>> numpy.float32(545803904.0)
545803900.0

The second part of the error is, I think, likely due to the implementation: it looks like we somehow do an extra cast through an intermediate value:

a = pa.array([545803900.0], type=pa.float64())
print(a.cast(pa.decimal128(38, 18)))
print(a.cast(pa.float32()).cast(pa.decimal128(38, 18)))

gives:

[
  545803899.999999976169013248
]
[
  545803886.966396699654750208
]

I think the second source of error might be:

const auto high = std::floor(std::ldexp(x, -64));

since this looks like it is done in single-precision float space (instead of first casting to double), which potentially causes further loss of precision.
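If that reading is correct, the size of the error can be estimated without any Arrow code (a rough sketch, not Arrow's actual implementation): rounding the scaled value to float32 can move it by up to half a float32 ulp, which at this magnitude is about 1.8e19, i.e. up to roughly 18 units once divided back by 10**18 — the same order as the ~17-unit deviation observed above. Here `struct` emulates a float32 intermediate:

```python
import struct

def to_f32(v: float) -> float:
    """Round a double to the nearest float32 and back."""
    return struct.unpack("f", struct.pack("f", v))[0]

scale = 18
x = 545803904.0 * 10.0**scale  # scaled value, about 5.46e26
x32 = to_f32(x)                # a single-precision intermediate drops ~29 bits

# Error re-expressed in units of the original input:
print(abs(x - x32) / 10.0**scale)
```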

@westonpace westonpace changed the title Unexpected float32 to decimal128 cast result [C++] Unexpected float32 to decimal128 cast result May 18, 2023
@pitrou pitrou changed the title [C++] Unexpected float32 to decimal128 cast result [C++] Unexpected float32/64 to decimal128 cast result May 25, 2023

pitrou commented May 25, 2023

float32 seems to be a red herring here, as even float64 exhibits the issue:

>>> pa.array([1234567890.]).cast(pa.decimal128(38, 18))
<pyarrow.lib.Decimal128Array object at 0x7f05f4a3f0a0>
[
  1234567889.999999973827018752
]
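Notably, 1234567890.0 carries no representation error of its own — every integer up to 2**53 is an exact double — so the deviation here is introduced entirely by the conversion algorithm:

```python
# Integers below 2**53 round-trip exactly through a double.
x = 1234567890
print(float(x) == x)  # True: the input has no representation error
print(x < 2**53)      # True
```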


pitrou commented May 25, 2023

There are also very strange precision-related phenomena going on:

>>> pa.array([1234567890.]).cast(pa.decimal128(38, 10))
<pyarrow.lib.Decimal128Array object at 0x7f05f49238e0>
[
  1234567890.0000000000
]
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 11))
<pyarrow.lib.Decimal128Array object at 0x7f05f4a3f1c0>
[
  1234567889.99999995904
]
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 12))
<pyarrow.lib.Decimal128Array object at 0x7f05f494f9a0>
[
  1234567890.000000057344
]
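A plausible explanation for the pattern (an inference from the double's 53-bit significand, not something verified against the Arrow code): the scaled integer 1234567890 * 10**10 happens to be exactly representable as a double, while the scale-11 and scale-12 values are not, so any algorithm that forms the scaled value as a float product cannot be exact for the latter two:

```python
# Check which scaled integers survive a round-trip through a double.
for scale in (10, 11, 12):
    scaled = 1234567890 * 10**scale
    print(scale, float(scaled) == scaled)  # 10 True, 11 False, 12 False
```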


pitrou commented May 25, 2023

@rok @felipecrv Does any of you want to take a look at this?


pitrou commented May 25, 2023

The issue is with the algorithm used in Decimal128::FromReal. We can reproduce the underlying precision problem by repeating the first steps of the algorithm in Python:

>>> powers_of_ten = [1e-38, 1e-37, 1e-36, 1e-35, 1e-34, 1e-33, 1e-32, 1e-31, 1e-30, 1e-29, 1e-28,
...     1e-27, 1e-26, 1e-25, 1e-24, 1e-23, 1e-22, 1e-21, 1e-20, 1e-19, 1e-18, 1e-17,
...     1e-16, 1e-15, 1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-9,  1e-8,  1e-7,  1e-6,
...     1e-5,  1e-4,  1e-3,  1e-2,  1e-1,  1e0,   1e1,   1e2,   1e3,   1e4,   1e5,
...     1e6,   1e7,   1e8,   1e9,   1e10,  1e11,  1e12,  1e13,  1e14,  1e15,  1e16,
...     1e17,  1e18,  1e19,  1e20,  1e21,  1e22,  1e23,  1e24,  1e25,  1e26,  1e27,
...     1e28,  1e29,  1e30,  1e31,  1e32,  1e33,  1e34,  1e35,  1e36,  1e37,  1e38]
>>> int(powers_of_ten[38+10]*1234567890.)
12345678900000000000
>>> int(powers_of_ten[38+11]*1234567890.)
123456788999999995904
>>> int(powers_of_ten[38+12]*1234567890.)
1234567890000000057344
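For comparison, doing the scaling in exact integer arithmetic keeps every digit; the float product above is off because, even though 1e11 and 1234567890.0 are both exact doubles, their product must be rounded back to 53 bits:

```python
# powers_of_ten[38 + 11] is 1e11; the float product rounds to 53 bits:
via_float = int(1e11 * 1234567890.0)
# Exact (arbitrary-precision) integer arithmetic keeps all the digits:
via_int = 1234567890 * 10**11
print(via_float)  # 123456788999999995904
print(via_int)    # 123456789000000000000
```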

Also, it looks like I'm the culprit: #7612 :-)

@pitrou pitrou changed the title [C++] Unexpected float32/64 to decimal128 cast result [C++] Decimal{128,256}::FromReal accuracy loss on non-small scale values May 25, 2023
@pitrou pitrou self-assigned this Jun 5, 2023
pitrou added a commit to pitrou/arrow that referenced this issue Jun 8, 2023
The original algorithm for real-to-decimal conversion did its computations in the floating-point domain,
accumulating rounding errors especially for large scale or precision values, such as:
```
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 11))
<pyarrow.lib.Decimal128Array object at 0x7f05f4a3f1c0>
[
  1234567889.99999995904
]
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 12))
<pyarrow.lib.Decimal128Array object at 0x7f05f494f9a0>
[
  1234567890.000000057344
]
```

The new algorithm strives to avoid precision loss by doing all its computations in the decimal domain.
However, negative scales, which are presumably infrequent, fall back on the old algorithm.
pitrou added a commit to pitrou/arrow that referenced this issue Jun 8, 2023
pitrou added a commit to pitrou/arrow that referenced this issue Jun 14, 2023
pitrou added a commit that referenced this issue Jun 14, 2023

* Closes: #35576

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 13.0.0 milestone Jun 14, 2023