Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9747: [Java][C++] Initial Support for 256-bit Decimals #8475

Closed
wants to merge 31 commits into from

Conversation

emkornfield
Copy link
Contributor

This provides sufficient coverage to support round trip between C++ and Java. There are still some gaps in python. Based on review, I will open JIRAs to track missing functionality (i.e. parquet support in C++). Marking as draft until i can triage CI failures but early feedback is welcome.

Open questions I have:

[C++]

  • Should we retain logic in decimal() factory function to adjust type on scale/precision or take an explicit argument or keep it as an alias for decimal128?

[Java]

  • Naming: Would Decimal256 be better then BigDecimal?

@emkornfield emkornfield requested review from pitrou, praveenbingo and liyafan82 and removed request for praveenbingo October 15, 2020 22:57
@emkornfield emkornfield marked this pull request as draft October 15, 2020 23:02
@github-actions
Copy link

@liyafan82
Copy link
Contributor

IMO, Decimal256 is better, as it avoids confusing with java.math.BigDecimal.

@liyafan82
Copy link
Contributor

Maybe some CI failures can be fixed by referencing #8455

public ${name}Vector get${name}Vector() {
if (${uncappedName}Vector == null) {
throw new IllegalArgumentException("No Decimal Vector present. Provide ArrowType argument to create a new vector");
throw new IllegalArgumentException("No Decimal ${uncappedName} present. Provide ArrowType argument to create a new vector");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be "No ${uncappedName} vector present ..." ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think so. Nice catch.

holder.buffer = valueBuffer;
holder.precision = precision;
holder.scale = scale;
holder.start = index * TYPE_WIDTH;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be cast to long

byte temp;
final int startIndex = index * DECIMAL_BYTE_LENGTH;
final long startIndex = index * byteWidth;
Copy link
Contributor

@liyafan82 liyafan82 Oct 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need a cast here

Otherwise, it first multiplies two (32-bit) integers, and then promotes it to a long.
If the result of the multiplication overflows, it just promotes the overflown value to a long, which is useless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor Author

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liyafan82 I think I addressed your feedback except for the rename of the new class. Unfortunately I had to force push due to a rebase. I'll tag you when I rename.

static_cast<uint8_t>(i % 128)};
// Decimal256Builder takes 32 bytes, while Decimal128Builder takes only the first 16
// bytes.
const uint8_t bytes[32] = {0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure the remaining bytes will be zeroed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this->TestCreate(precision, draw, valid_bytes, 2);
}

INSTANTIATE_TEST_SUITE_P(Decimal256Test, Decimal256Test, ::testing::Range(1, 76));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to test every precision between 1 and 76? (note the same comment applies to Decimal128Test above).

I'm concerned about the readability of test output here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like ::testing::Values(1, 2, 5, 10, 75, 76) would sound sufficient (untested).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The testing every value between 1 and 38 for decimal 128 appears to be the previous behavior I think these tests are fairly light weight but I'll update for Decimal256

@@ -64,6 +64,13 @@ struct ValidateArrayVisitor {
return Status::OK();
}

Status Visit(const Decimal256Array& array) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we could have a BaseDecimalArray class like we already have BaseListArray and BaseBinaryArray.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I tried to make this work, and at the moment making this seems like it would make this change bigger then I would feel comfortable with. There are a lot of type_traits that have confusing hierarchies (is_primitive and is_binary_like both would include Decimal and SFINAE doesn't work out well, so it would be an intrusive change).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, https://github.com/apache/arrow/pull/8417/files is probably what some of it would look like but I haven't reviewed it fully.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@@ -740,6 +741,7 @@ TEST_F(TestArrayExport, Primitive) {
TestPrimitive(large_utf8(), R"(["foo", "bar", null])");

TestPrimitive(decimal(16, 4), R"(["1234.5670", null])");
TestPrimitive(decimal256(16, 4), R"(["1234.5670", null])");
Copy link
Member

@pitrou pitrou Oct 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add import and/or roundtrip tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, added additional schema tests and a round trip test.

@@ -131,6 +133,7 @@ std::string ToString(Type::type id) {
TO_STRING_CASE(FLOAT)
TO_STRING_CASE(DOUBLE)
TO_STRING_CASE(DECIMAL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be changed to DECIMAL128?
(in general, do a search for DECIMAL in all the C++ code, this may catch some overloooked instances)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one should. We have DECIMAL for backwards compatibility, I think the remaining places that it is used are places we will need to update to support Decimal256. By leaving them as DECIMAL we can find them easily with by commenting out the alias. Does this sound reasonable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Maybe we can deprecate the alias at some point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that is the intent. I'll be opening up a bunch of JIRA work to track down usages and remove. Partial list so far:

  • CSV
  • Ruby/Gobj bindings
  • Implementation for Parquet
  • Finish Python implementation (rescaling is needed)
  • Gandiva
  • Computation kernels (in particular casts)

Likely some others ...

@@ -614,7 +635,7 @@ template <typename T>
using is_list_type =
std::integral_constant<bool, std::is_same<T, ListType>::value ||
std::is_same<T, LargeListType>::value ||
std::is_same<T, FixedSizeListType>::valuae>;
std::is_same<T, FixedSizeListType>::value>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:-)

reinterpret_cast<int64_t*>(out)[0] = little_endian_array_[3];
reinterpret_cast<int64_t*>(out)[1] = little_endian_array_[2];
reinterpret_cast<int64_t*>(out)[2] = little_endian_array_[1];
reinterpret_cast<int64_t*>(out)[3] = little_endian_array_[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: wrong indentation here.

Copy link
Contributor Author

@emkornfield emkornfield Oct 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for some reason this is how "make format" wants it to be

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah :-/


for (auto _ : state) {
for (int x = 0; x < kValueSize; x += 5) {
benchmark::DoNotOptimize(v1[x + 2] * v2[x + 2]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only this line? Ideally we would to the same operations as in BinaryMathOp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of right now we've only added support for operator*, I think as we add more operators this benchmark can be expanded to reach parity with the other.

@@ -26,6 +26,7 @@
#include <vector>

#include <gtest/gtest.h>
#include <boost/multiprecision/cpp_int.hpp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include arrow/util/int128_internal.h instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see that we also use int256_t...

@emkornfield
Copy link
Contributor Author

Thanks for the reviews @liyafan82 and @pitrou. I rebased an i'll merge when green an open up some follow-up JIRAs.

@emkornfield
Copy link
Contributor Author

Mac CI failures seem unrelated. going to merge.

@jorisvandenbossche
Copy link
Member

This might have "broken" the spark integration builds: https://github.com/ursa-labs/crossbow/runs/1304128112

Error: ] /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:47: not enough arguments for constructor Decimal: (x$1: Int, x$2: Int, x$3: Int)org.apache.arrow.vector.types.pojo.ArrowType.Decimal.
Unspecified value parameter x$3.

(now I am not familiar enough with spark to know what kind of "broken" it is, but in any case the integration build is failing)

@liyafan82
Copy link
Contributor

This might have "broken" the spark integration builds: https://github.com/ursa-labs/crossbow/runs/1304128112

Error: ] /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:47: not enough arguments for constructor Decimal: (x$1: Int, x$2: Int, x$3: Int)org.apache.arrow.vector.types.pojo.ArrowType.Decimal.
Unspecified value parameter x$3.

(now I am not familiar enough with spark to know what kind of "broken" it is, but in any case the integration build is failing)

@jorisvandenbossche Thanks for reporting the problem.
The problem was caused by adding a new parameter to the constructor. Maybe we can solve it by restoring the default constructor and mark it as deprecated.
Let's open an issue for it.

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This provides sufficient coverage to support round trip between C++ and Java.  There are still some gaps in python.  Based on review, I will open JIRAs to track missing functionality (i.e. parquet support in C++).  Marking as draft until i can triage CI failures but early feedback is welcome.

Open questions I have:

[C++]
* Should we retain logic in decimal() factory function to adjust type on scale/precision or take an explicit argument or keep it as an alias for decimal128?

[Java]
* Naming:  Would Decimal256 be better then BigDecimal?

Closes apache#8475 from emkornfield/decimal256

Lead-authored-by: Mingyu Zhong <69326943+MingyuZhong@users.noreply.github.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Ezra <eumen@google.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants