ARROW-9747: [Java][C++] Initial Support for 256-bit Decimals #8475

emkornfield · 2020-10-15T22:57:36Z

This provides sufficient coverage to support round trip between C++ and Java. There are still some gaps in Python. Based on review, I will open JIRAs to track missing functionality (e.g. Parquet support in C++). Marking as draft until I can triage CI failures, but early feedback is welcome.

Open questions I have:

[C++]

Should we retain the logic in the decimal() factory function that adjusts the type based on scale/precision, take an explicit argument, or keep it as an alias for decimal128?

[Java]

Naming: Would Decimal256 be better than BigDecimal?

github-actions · 2020-10-15T23:06:27Z

https://issues.apache.org/jira/browse/ARROW-9747

liyafan82 · 2020-10-16T02:27:47Z

IMO, Decimal256 is better, as it avoids confusion with java.math.BigDecimal.

java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java

liyafan82 · 2020-10-16T04:53:11Z

Maybe some CI failures can be fixed by referencing #8455

liyafan82 · 2020-10-16T06:50:05Z

java/vector/src/main/codegen/templates/UnionVector.java

  public ${name}Vector get${name}Vector() {
    if (${uncappedName}Vector == null) {
-      throw new IllegalArgumentException("No Decimal Vector present. Provide ArrowType argument to create a new vector");
+      throw new IllegalArgumentException("No Decimal ${uncappedName} present. Provide ArrowType argument to create a new vector");


Should it be "No ${uncappedName} vector present ..."?

Yes, I think so. Nice catch.

liyafan82 · 2020-10-16T08:52:13Z

java/vector/src/main/java/org/apache/arrow/vector/BigDecimalVector.java

+    holder.buffer = valueBuffer;
+    holder.precision = precision;
+    holder.scale = scale;
+    holder.start = index * TYPE_WIDTH;


The index should be cast to long before the multiplication, i.e. (long) index * TYPE_WIDTH.

liyafan82 · 2020-10-16T10:29:39Z

java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java

    byte temp;
-    final int startIndex = index * DECIMAL_BYTE_LENGTH;
+    final long startIndex = index * byteWidth;


Maybe we need a cast here

Otherwise, it first multiplies two 32-bit integers and only then promotes the result to a long.
If the multiplication overflows, it just promotes the already-overflowed value to a long, which is useless.
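
To illustrate the point above, here is a minimal sketch of the difference between widening after and before the multiplication. The class and method names (OffsetOverflowDemo, offsetWithoutCast, offsetWithCast) are hypothetical and not part of the Arrow code base; TYPE_WIDTH is assumed to be 32 bytes, matching a 256-bit value.

```java
// Hypothetical demo class; not part of Arrow.
final class OffsetOverflowDemo {
    // Assumed byte width of one 256-bit decimal value.
    static final int TYPE_WIDTH = 32;

    // int * int is evaluated in 32-bit arithmetic; the (possibly wrapped)
    // result is widened to long only afterwards.
    static long offsetWithoutCast(int index) {
        return index * TYPE_WIDTH;
    }

    // Casting one operand first forces 64-bit multiplication.
    static long offsetWithCast(int index) {
        return (long) index * TYPE_WIDTH;
    }

    public static void main(String[] args) {
        int index = 100_000_000; // 100M * 32 exceeds Integer.MAX_VALUE
        System.out.println(offsetWithoutCast(index)); // wrapped, negative value
        System.out.println(offsetWithCast(index));    // 3200000000
    }
}
```

With an index of 100,000,000 the uncast version wraps around to a negative offset, while the cast version yields 3,200,000,000.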

emkornfield

@liyafan82 I think I addressed your feedback except for the rename of the new class. Unfortunately I had to force push due to a rebase. I'll tag you when I rename.

pitrou · 2020-10-19T14:43:20Z

cpp/src/arrow/array/array_dict_test.cc

-                             static_cast<uint8_t>(i % 128)};
+    // Decimal256Builder takes 32 bytes, while Decimal128Builder takes only the first 16
+    // bytes.
+    const uint8_t bytes[32] = {0,


Are we sure the remaining bytes will be zeroed?

According to https://en.cppreference.com/w/c/language/array_initialization, it should be.

cpp/src/arrow/array/builder_decimal.h

pitrou · 2020-10-19T14:46:26Z

cpp/src/arrow/array/array_test.cc

+  this->TestCreate(precision, draw, valid_bytes, 2);
+}
+
+INSTANTIATE_TEST_SUITE_P(Decimal256Test, Decimal256Test, ::testing::Range(1, 76));


Do we really want to test every precision between 1 and 76? (note the same comment applies to Decimal128Test above).

I'm concerned about the readability of test output here.

Something like ::testing::Values(1, 2, 5, 10, 75, 76) should be sufficient (untested).

Testing every value between 1 and 38 for Decimal128 appears to be the previous behavior. I think these tests are fairly lightweight, but I'll update for Decimal256.

pitrou · 2020-10-19T14:48:27Z

cpp/src/arrow/array/validate.cc

@@ -64,6 +64,13 @@ struct ValidateArrayVisitor {
    return Status::OK();
  }

+  Status Visit(const Decimal256Array& array) {


Hmm, we could have a BaseDecimalArray class like we already have BaseListArray and BaseBinaryArray.

Yes, I tried to make this work, but at the moment it seems it would make this change bigger than I would feel comfortable with. There are a lot of type_traits with confusing hierarchies (is_primitive and is_binary_like would both include Decimal, and SFINAE doesn't work out well, so it would be an intrusive change).

FWIW, https://github.com/apache/arrow/pull/8417/files is probably what some of it would look like but I haven't reviewed it fully.

pitrou · 2020-10-19T14:50:58Z

cpp/src/arrow/c/bridge_test.cc

@@ -740,6 +741,7 @@ TEST_F(TestArrayExport, Primitive) {
  TestPrimitive(large_utf8(), R"(["foo", "bar", null])");

  TestPrimitive(decimal(16, 4), R"(["1234.5670", null])");
+  TestPrimitive(decimal256(16, 4), R"(["1234.5670", null])");


Can you also add import and/or roundtrip tests?

Done. Added additional schema tests and a round-trip test.

cpp/src/arrow/ipc/metadata_internal.cc

cpp/src/arrow/python/decimal.cc

pitrou · 2020-10-19T14:59:09Z

cpp/src/arrow/type.cc

@@ -131,6 +133,7 @@ std::string ToString(Type::type id) {
    TO_STRING_CASE(FLOAT)
    TO_STRING_CASE(DOUBLE)
    TO_STRING_CASE(DECIMAL)


Shouldn't this be changed to DECIMAL128?
(In general, do a search for DECIMAL in all the C++ code; this may catch some overlooked instances.)

This one should. We have DECIMAL for backwards compatibility. I think the remaining places where it is used are places we will need to update to support Decimal256. By leaving them as DECIMAL, we can find them easily by commenting out the alias. Does this sound reasonable?

Ok. Maybe we can deprecate the alias at some point?

Yes, that is the intent. I'll be opening a number of JIRAs to track down the remaining usages and remove them. Partial list so far:

CSV

Ruby/Gobj bindings

Implementation for Parquet

Finish Python implementation (rescaling is needed)

Gandiva

Computation kernels (in particular casts)

Likely some others ...

pitrou · 2020-10-19T15:02:38Z

cpp/src/arrow/type_traits.h

@@ -614,7 +635,7 @@ template <typename T>
 using is_list_type =
    std::integral_constant<bool, std::is_same<T, ListType>::value ||
                                     std::is_same<T, LargeListType>::value ||
-                                     std::is_same<T, FixedSizeListType>::valuae>;
+                                     std::is_same<T, FixedSizeListType>::value>;


cpp/src/arrow/util/basic_decimal.cc

pitrou · 2020-10-19T15:08:29Z

cpp/src/arrow/util/basic_decimal.cc

+    reinterpret_cast<int64_t*>(out)[0] = little_endian_array_[3];
+    reinterpret_cast<int64_t*>(out)[1] = little_endian_array_[2];
+    reinterpret_cast<int64_t*>(out)[2] = little_endian_array_[1];
+    reinterpret_cast<int64_t*>(out)[3] = little_endian_array_[0];


Nit: wrong indentation here.

for some reason this is how "make format" wants it to be

cpp/src/arrow/util/basic_decimal.cc

pitrou · 2020-10-19T15:13:43Z

cpp/src/arrow/util/decimal_benchmark.cc

+
+  for (auto _ : state) {
+    for (int x = 0; x < kValueSize; x += 5) {
+      benchmark::DoNotOptimize(v1[x + 2] * v2[x + 2]);


Why only this line? Ideally we would do the same operations as in BinaryMathOp.

As of right now we've only added support for operator*. As we add more operators, this benchmark can be expanded to reach parity with the other.

cpp/src/arrow/util/decimal_benchmark.cc

pitrou · 2020-10-19T15:16:35Z

cpp/src/arrow/util/decimal_test.cc

@@ -26,6 +26,7 @@
 #include <vector>

 #include <gtest/gtest.h>
+#include <boost/multiprecision/cpp_int.hpp>


Can we include arrow/util/int128_internal.h instead?

Ah, I see that we also use int256_t...

emkornfield · 2020-10-23T03:57:59Z

Thanks for the reviews @liyafan82 and @pitrou. I rebased, and I'll merge when green and open some follow-up JIRAs.

emkornfield · 2020-10-23T04:38:12Z

Mac CI failures seem unrelated. Going to merge.

jorisvandenbossche · 2020-10-26T07:54:40Z

This might have "broken" the spark integration builds: https://github.com/ursa-labs/crossbow/runs/1304128112

Error: ] /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:47: not enough arguments for constructor Decimal: (x$1: Int, x$2: Int, x$3: Int)org.apache.arrow.vector.types.pojo.ArrowType.Decimal.
Unspecified value parameter x$3.

(now I am not familiar enough with spark to know what kind of "broken" it is, but in any case the integration build is failing)

liyafan82 · 2020-10-26T08:23:30Z

This might have "broken" the spark integration builds: https://github.com/ursa-labs/crossbow/runs/1304128112
Error: ] /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:47: not enough arguments for constructor Decimal: (x$1: Int, x$2: Int, x$3: Int)org.apache.arrow.vector.types.pojo.ArrowType.Decimal.
Unspecified value parameter x$3.
(now I am not familiar enough with spark to know what kind of "broken" it is, but in any case the integration build is failing)

@jorisvandenbossche Thanks for reporting the problem.
The problem was caused by adding a new parameter to the constructor. Maybe we can solve it by restoring the original two-argument constructor and marking it as deprecated.
Let's open an issue for it.
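
A rough sketch of what the suggestion above could look like, using a hypothetical stand-in class (DecimalTypeSketch) rather than the real ArrowType.Decimal. The parameter names and the choice of 128 as the bit width for the restored overload are assumptions for illustration; the thread only establishes that the new constructor takes three Int arguments.

```java
// Hypothetical stand-in for a type descriptor whose constructor gained a parameter.
final class DecimalTypeSketch {
    private final int precision;
    private final int scale;
    private final int bitWidth;

    // New constructor with the additional (assumed) bitWidth parameter.
    DecimalTypeSketch(int precision, int scale, int bitWidth) {
        this.precision = precision;
        this.scale = scale;
        this.bitWidth = bitWidth;
    }

    /**
     * Restored overload so existing two-argument callers keep compiling.
     *
     * @deprecated pass the bit width explicitly instead.
     */
    @Deprecated
    DecimalTypeSketch(int precision, int scale) {
        this(precision, scale, 128); // assumed default: the pre-existing 128-bit width
    }

    int getPrecision() { return precision; }
    int getScale() { return scale; }
    int getBitWidth() { return bitWidth; }
}
```

Callers such as the Spark code in the error message would then compile unchanged (with a deprecation warning) until they migrate to the three-argument form.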

This provides sufficient coverage to support round trip between C++ and Java. There are still some gaps in Python. Based on review, I will open JIRAs to track missing functionality (e.g. Parquet support in C++). Marking as draft until I can triage CI failures, but early feedback is welcome.

Open questions I have:

[C++]
* Should we retain logic in decimal() factory function to adjust type on scale/precision or take an explicit argument or keep it as an alias for decimal128?

[Java]
* Naming: Would Decimal256 be better than BigDecimal?

Closes apache#8475 from emkornfield/decimal256

Lead-authored-by: Mingyu Zhong <69326943+MingyuZhong@users.noreply.github.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Ezra <eumen@google.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

emkornfield requested review from pitrou, praveenbingo and liyafan82 and removed request for praveenbingo October 15, 2020 22:57

emkornfield marked this pull request as draft October 15, 2020 23:02

emkornfield force-pushed the decimal256 branch from 838c785 to c9fa7c2 Compare October 17, 2020 04:39
