GH-35576: [C++] Make Decimal{128,256}::FromReal more accurate #35997

pitrou · 2023-06-08T15:59:10Z

The original algorithm for real-to-decimal conversion did its computations in the floating-point domain, accumulating rounding errors especially for large scale or precision values, such as:

>>> pa.array([1234567890.]).cast(pa.decimal128(38, 11))
<pyarrow.lib.Decimal128Array object at 0x7f05f4a3f1c0>
[
  1234567889.99999995904
]
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 12))
<pyarrow.lib.Decimal128Array object at 0x7f05f494f9a0>
[
  1234567890.000000057344
]

The new algorithm strives to avoid precision loss by doing all its computations in the decimal domain. However, negative scales, which are presumably infrequent, fall back on the old algorithm.
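
To illustrate why leaving the floating-point domain helps, here is a minimal Python sketch. It is not Arrow's C++ implementation: it contrasts a naive conversion that multiplies by a floating-point power of ten with one that uses exact rational arithmetic from the standard library. The names `naive_from_real`, `exact_from_real` and `format_unscaled` are hypothetical, and the round-half-to-even behavior of `round()` is an assumption of this sketch, not a statement about Arrow's rounding mode.

```python
from fractions import Fraction


def naive_from_real(x: float, scale: int) -> int:
    """Scale in the floating-point domain, then convert to an integer.

    The product is rounded to the nearest double before the conversion,
    so trailing digits of large results are not reliable.
    """
    return round(x * float(10 ** scale))


def exact_from_real(x: float, scale: int) -> int:
    """Scale with exact rational arithmetic (illustrative only).

    Fraction(x) holds the exact binary value of the double, so rounding
    happens only once, when producing the unscaled integer.
    """
    if scale < 0:
        raise ValueError("this sketch only handles non-negative scales")
    return round(Fraction(x) * 10 ** scale)


def format_unscaled(unscaled: int, scale: int) -> str:
    """Render an unscaled integer as a decimal string with `scale` digits."""
    sign = "-" if unscaled < 0 else ""
    digits = str(abs(unscaled)).rjust(scale + 1, "0")
    if scale == 0:
        return sign + digits
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"


if __name__ == "__main__":
    for scale in (11, 12):
        print(scale,
              format_unscaled(naive_from_real(1234567890.0, scale), scale),
              format_unscaled(exact_from_real(1234567890.0, scale), scale))
```

For the input from the example above, the exact variant yields `1234567890` followed by only zeros, while the naive variant inherits the rounding error of the floating-point product.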

Closes: [C++] Decimal{128,256}::FromReal accuracy loss on non-small scale values #35576

pitrou · 2023-06-08T19:53:11Z

@felipecrv @benibus Would either of you have time to review this?

westonpace

I won't promise to follow the conversion algorithm entirely :) but this looks well tested and the bit shifting routines make sense.

cpp/src/arrow/util/decimal_internal.h

westonpace · 2023-06-09T12:18:21Z

cpp/src/arrow/util/basic_decimal.cc

-bool operator==(const BasicDecimal128& left, const BasicDecimal128& right) {
-  return left.high_bits() == right.high_bits() && left.low_bits() == right.low_bits();
-}
-
-bool operator!=(const BasicDecimal128& left, const BasicDecimal128& right) {
-  return !operator==(left, right);
-}


Is this going away because it is superseded by the version in GenericBasicDecimal?

cpp/src/arrow/util/decimal.cc

westonpace · 2023-06-09T12:43:38Z

cpp/src/arrow/util/decimal_test.cc

@@ -802,34 +814,21 @@ class TestDecimalFromReal : public ::testing::Test {
    const std::vector<ParamType> params{
        // clang-format off
        {0.0f, 1, 0, "0"},
-        {-0.0f, 1, 0, "0"},


Why remove these cases? We can still convert negative numbers, right? Isn't it just less precise?

Because CheckDecimalFromReal and CheckDecimalFromRealIntegerString now automatically deduce the negative test cases from the positive ones.
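
As an illustration of that approach, here is a small Python sketch of a helper that derives negative cases from positive ones. It is not the C++ test code from this PR: `FromRealCase`, `negate_expected` and `with_negated_cases` are hypothetical names, and the field layout (real, precision, scale, expected string) is only modeled on the test parameters in the diff above.

```python
from typing import List, NamedTuple


class FromRealCase(NamedTuple):
    real: float
    precision: int
    scale: int
    expected: str  # expected decimal, rendered as a string


def negate_expected(expected: str) -> str:
    """Negate a decimal string, keeping zero unsigned (there is no "-0")."""
    if expected.strip("0.") == "":
        return expected
    return expected[1:] if expected.startswith("-") else "-" + expected


def with_negated_cases(cases: List[FromRealCase]) -> List[FromRealCase]:
    """Return the given cases followed by their sign-flipped counterparts."""
    negated = [
        FromRealCase(-c.real, c.precision, c.scale, negate_expected(c.expected))
        for c in cases
    ]
    return list(cases) + negated
```

With such a helper, an explicit `{-0.0f, 1, 0, "0"}` entry becomes redundant, because it is generated from the `{0.0f, 1, 0, "0"}` entry.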

westonpace · 2023-06-09T12:46:22Z

python/pyarrow/tests/test_compute.py

+            f"diff_digits = {diff_digits!r}")
+
+
+# XXX Cannot test float32 as case generators above assume float64


Is this a TODO (should we create an issue?) or is it a fact, in which case we can get rid of the XXX?

Not sure. I wouldn't create an issue for it as the effort-benefit ratio is probably low.

felipecrv

I can't claim I went through all the details of the implementation, but it looks like the kind of code that doesn't have rarely-taken branches that wouldn't be exercised from the unit tests. LGTM.


pitrou · 2023-06-14T14:12:24Z

I think I have addressed all review comments. Would you like to take another look? @westonpace @felipecrv

conbench-apache-arrow · 2023-06-16T09:56:12Z

Conbench analyzed the 7 benchmark runs on commit b6eab1f4.

There were 35 benchmark results indicating a performance regression:

Commit Run at 2023-06-15 05:59:13Z
- params=/1048576/100, source=cpp-micro, suite=arrow-acero-aggregate-benchmark
- params=/1048576/0, source=cpp-micro, suite=arrow-acero-aggregate-benchmark
and 33 more (see the report linked below)

The full Conbench report has more details.

huberylee · 2023-10-31T08:39:38Z

cpp/src/arrow/util/decimal_test.cc

+    return {
+        // -- Stress the 24 bits of precision of a float
+        // 2**63 + 2**40
+        FromFloatTestParam{9.223373e+18f, 19, 0, "9223373136366403584"},


@pitrou Hi, the expected return value of FromFloatTestParam{5.76460752e13f, 18, 4, "57646075230342.3488"} is 57646073774080.0000, which seems to differ from the original value. Is that intended?

What do you call "the original value" in that context?
Both values are actually equal in float32:

>>> np.float32(5.76460752e13) == np.float32(57646073774080)
True
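
The same check can be reproduced without NumPy. The sketch below uses only the Python standard library and rounds a Python float (a float64) to the nearest float32 via `struct`; `to_float32` is a hypothetical helper name. It also checks the 2**63 + 2**40 value from the test case quoted above, which needs exactly 24 significant bits.

```python
import struct


def to_float32(x: float) -> float:
    """Round a float64 to the nearest float32, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Both literals from the discussion round to the same float32 value.
a = to_float32(5.76460752e13)
b = to_float32(57646073774080.0)
same = a == b

# A float32 of this magnitude is an exact integer, so int() shows its digits.
exact_digits = int(a)

# 2**63 + 2**40 fits exactly in the 24-bit significand of a float32.
stress_digits = int(to_float32(float(2**63 + 2**40)))

if __name__ == "__main__":
    print(same, exact_digits, stress_digits)
```

Here `same` is `True` and `exact_digits` is `57646073774080`: the float32 literal `5.76460752e13f` already denotes that exact value before any decimal conversion takes place.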


I see what you mean. Thanks!

github-actions bot added the Component: C++, Component: Python, and awaiting review labels Jun 8, 2023

pitrou force-pushed the gh-35576-float-to-decimal branch from 18155dc to f8a6443 June 8, 2023 16:02

pitrou marked this pull request as ready for review June 8, 2023 19:52

pitrou requested review from AlenkaF and westonpace as code owners June 8, 2023 19:52

westonpace approved these changes Jun 9, 2023


github-actions bot added the awaiting merge label and removed the awaiting review label Jun 9, 2023

felipecrv approved these changes Jun 9, 2023



pitrou added 4 commits June 14, 2023 12:57

Fix out of bounds read

daef12b

Compatibility with older Pythons

747dfa6

Address review comments and add a test

87be93a

pitrou force-pushed the gh-35576-float-to-decimal branch from 5f0abee to 87be93a June 14, 2023 12:47

felipecrv approved these changes Jun 14, 2023


pitrou merged commit b6eab1f into apache:main Jun 14, 2023

pitrou deleted the gh-35576-float-to-decimal branch June 14, 2023 14:49

js8544 mentioned this pull request Jul 12, 2023

[C++] power_checked incorrectly returns NaN #36602

Closed

galipremsagar mentioned this pull request Oct 24, 2023

[BUG] casting from float32 to Decimal64Dtype is resulting in incorrect values rapidsai/cudf#14169

Closed

huberylee reviewed Oct 31, 2023


AlenkaF removed their request for review November 1, 2023 11:58
