-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9747: [Java][C++] Initial Support for 256-bit Decimals #8475
Conversation
IMO, Decimal256 is better, as it avoids confusing with |
java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java
Outdated
Show resolved
Hide resolved
Maybe some CI failures can be fixed by referencing #8455 |
public ${name}Vector get${name}Vector() { | ||
if (${uncappedName}Vector == null) { | ||
throw new IllegalArgumentException("No Decimal Vector present. Provide ArrowType argument to create a new vector"); | ||
throw new IllegalArgumentException("No Decimal ${uncappedName} present. Provide ArrowType argument to create a new vector"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be "No ${uncappedName} vector present ..." ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think so. Nice catch.
holder.buffer = valueBuffer; | ||
holder.precision = precision; | ||
holder.scale = scale; | ||
holder.start = index * TYPE_WIDTH; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be cast to long
byte temp; | ||
final int startIndex = index * DECIMAL_BYTE_LENGTH; | ||
final long startIndex = index * byteWidth; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need a cast here
Otherwise, it first multiplies two (32-bit) integers, and then promotes it to a long.
If the result of the multiplication overflows, it just promotes the overflown value to a long, which is useless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
838c785
to
c9fa7c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liyafan82 I think I addressed your feedback except for the rename of the new class. Unfortunately I had to force push due to a rebase. I'll tag you when I rename.
static_cast<uint8_t>(i % 128)}; | ||
// Decimal256Builder takes 32 bytes, while Decimal128Builder takes only the first 16 | ||
// bytes. | ||
const uint8_t bytes[32] = {0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure the remaining bytes will be zeroed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According o: https://en.cppreference.com/w/c/language/array_initialization is should be.
cpp/src/arrow/array/array_test.cc
Outdated
this->TestCreate(precision, draw, valid_bytes, 2); | ||
} | ||
|
||
INSTANTIATE_TEST_SUITE_P(Decimal256Test, Decimal256Test, ::testing::Range(1, 76)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to test every precision between 1 and 76? (note the same comment applies to Decimal128Test
above).
I'm concerned about the readability of test output here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like ::testing::Values(1, 2, 5, 10, 75, 76)
would sound sufficient (untested).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The testing every value between 1 and 38 for decimal 128 appears to be the previous behavior I think these tests are fairly light weight but I'll update for Decimal256
@@ -64,6 +64,13 @@ struct ValidateArrayVisitor { | |||
return Status::OK(); | |||
} | |||
|
|||
Status Visit(const Decimal256Array& array) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we could have a BaseDecimalArray
class like we already have BaseListArray
and BaseBinaryArray
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I tried to make this work, and at the moment making this seems like it would make this change bigger then I would feel comfortable with. There are a lot of type_traits that have confusing hierarchies (is_primitive and is_binary_like both would include Decimal and SFINAE doesn't work out well, so it would be an intrusive change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, https://github.com/apache/arrow/pull/8417/files is probably what some of it would look like but I haven't reviewed it fully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
@@ -740,6 +741,7 @@ TEST_F(TestArrayExport, Primitive) { | |||
TestPrimitive(large_utf8(), R"(["foo", "bar", null])"); | |||
|
|||
TestPrimitive(decimal(16, 4), R"(["1234.5670", null])"); | |||
TestPrimitive(decimal256(16, 4), R"(["1234.5670", null])"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add import and/or roundtrip tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, added additional schema tests and a round trip test.
cpp/src/arrow/type.cc
Outdated
@@ -131,6 +133,7 @@ std::string ToString(Type::type id) { | |||
TO_STRING_CASE(FLOAT) | |||
TO_STRING_CASE(DOUBLE) | |||
TO_STRING_CASE(DECIMAL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be changed to DECIMAL128
?
(in general, do a search for DECIMAL
in all the C++ code, this may catch some overloooked instances)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one should. We have DECIMAL for backwards compatibility, I think the remaining places that it is used are places we will need to update to support Decimal256. By leaving them as DECIMAL we can find them easily with by commenting out the alias. Does this sound reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Maybe we can deprecate the alias at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that is the intent. I'll be opening up a bunch of JIRA work to track down usages and remove. Partial list so far:
- CSV
- Ruby/Gobj bindings
- Implementation for Parquet
- Finish Python implementation (rescaling is needed)
- Gandiva
- Computation kernels (in particular casts)
Likely some others ...
@@ -614,7 +635,7 @@ template <typename T> | |||
using is_list_type = | |||
std::integral_constant<bool, std::is_same<T, ListType>::value || | |||
std::is_same<T, LargeListType>::value || | |||
std::is_same<T, FixedSizeListType>::valuae>; | |||
std::is_same<T, FixedSizeListType>::value>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:-)
reinterpret_cast<int64_t*>(out)[0] = little_endian_array_[3]; | ||
reinterpret_cast<int64_t*>(out)[1] = little_endian_array_[2]; | ||
reinterpret_cast<int64_t*>(out)[2] = little_endian_array_[1]; | ||
reinterpret_cast<int64_t*>(out)[3] = little_endian_array_[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: wrong indentation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for some reason this is how "make format" wants it to be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah :-/
|
||
for (auto _ : state) { | ||
for (int x = 0; x < kValueSize; x += 5) { | ||
benchmark::DoNotOptimize(v1[x + 2] * v2[x + 2]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only this line? Ideally we would to the same operations as in BinaryMathOp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of right now we've only added support for operator*, I think as we add more operators this benchmark can be expanded to reach parity with the other.
@@ -26,6 +26,7 @@ | |||
#include <vector> | |||
|
|||
#include <gtest/gtest.h> | |||
#include <boost/multiprecision/cpp_int.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we include arrow/util/int128_internal.h
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see that we also use int256_t
...
438ca49
to
71735a7
Compare
Thanks for the reviews @liyafan82 and @pitrou. I rebased an i'll merge when green an open up some follow-up JIRAs. |
Mac CI failures seem unrelated. going to merge. |
This might have "broken" the spark integration builds: https://github.com/ursa-labs/crossbow/runs/1304128112
(now I am not familiar enough with spark to know what kind of "broken" it is, but in any case the integration build is failing) |
@jorisvandenbossche Thanks for reporting the problem. |
This provides sufficient coverage to support round trip between C++ and Java. There are still some gaps in python. Based on review, I will open JIRAs to track missing functionality (i.e. parquet support in C++). Marking as draft until i can triage CI failures but early feedback is welcome. Open questions I have: [C++] * Should we retain logic in decimal() factory function to adjust type on scale/precision or take an explicit argument or keep it as an alias for decimal128? [Java] * Naming: Would Decimal256 be better then BigDecimal? Closes apache#8475 from emkornfield/decimal256 Lead-authored-by: Mingyu Zhong <69326943+MingyuZhong@users.noreply.github.com> Co-authored-by: Micah Kornfield <micahk@google.com> Co-authored-by: Micah Kornfield <emkornfield@gmail.com> Co-authored-by: emkornfield <emkornfield@gmail.com> Co-authored-by: Ezra <eumen@google.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
This provides sufficient coverage to support round trip between C++ and Java. There are still some gaps in python. Based on review, I will open JIRAs to track missing functionality (i.e. parquet support in C++). Marking as draft until i can triage CI failures but early feedback is welcome.
Open questions I have:
[C++]
[Java]