-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BasicDecimal256 Multiplication Support (PR for decimal256 branch, not master) #8344
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format?
See also: |
Added benchmark, multiplication takes ~21ns. |
@Luminarys have you looked at the CI errors (I think there might be a few flaky things going on but wanted to check that you were ok merging)? |
I'll take a closer look tommorow, but we should also wait for feedback from @MingyuZhong before proceeding. |
I've looked through the CI failures, it seems there are a few kinds:
|
cpp/src/arrow/util/basic_decimal.cc
Outdated
// Multiply two N bit word components into a 2*N bit result, with high bits | ||
// stored in hi and low bits in lo. | ||
template <typename Word> | ||
void ExtendAndMultiplyUint(Word x, Word y, Word* hi, Word* lo) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's simpler if this method handles only uint64_t, and there is another method that takes std::array<uint64_t, n> and uses for loops like https://github.com/google/zetasql/blob/master/zetasql/common/multiprecision_int.h#L723. This way, ExtendAndMultiplyUint128 doesn't need to repeat the similar pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. This saves a lot of code, though does take 60 ns for multiplication as opposed to 20 ns prior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try making ExtendAndMultiplyUint inline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that I wasn't using the native path prior, which is why the benchmark was so slow. Updated, new results are 32 ns when __uint128_t is used and 65 ns when uint64_t is used, which I think is more reasonable.
cpp/src/arrow/util/basic_decimal.cc
Outdated
// Multiply two N bit word components into a 2*N bit result, with high bits | ||
// stored in hi and low bits in lo. | ||
template <typename Word> | ||
void ExtendAndMultiplyUint(Word x, Word y, Word* hi, Word* lo) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try making ExtendAndMultiplyUint inline?
cpp/src/arrow/util/basic_decimal.cc
Outdated
// Multiply two N bit word components into a 2*N bit result, with high bits | ||
// stored in hi and low bits in lo. | ||
template <typename Word> | ||
inline void ExtendAndMultiplyUint(Word x, Word y, Word* hi, Word* lo) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now this method only needs to handle uint64 inputs, and it only needs to be defined in the #else block, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done,
cpp/src/arrow/util/basic_decimal.cc
Outdated
#endif | ||
|
||
// Multiplies two N * 64 bit unsigned integer types, represented by a uint64_t | ||
// array into a same sized output. Overflow in multiplication is considered UB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does UB mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undefined Behavior, clarified in comments.
#endif | ||
|
||
// Multiplies two N * 64 bit unsigned integer types, represented by a uint64_t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please comment that the elements in the array inputs and output have little-endian order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/arrow/util/basic_decimal.cc
Outdated
__uint128_t val_; | ||
}; | ||
|
||
uint128_t operator*(const uint128_t& left, const uint128_t& right) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try defining operator*= instead of operator*. Maybe this can help the compiler generate more efficient code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (or perhaps some other change I made) seems to have improved performance significantly, it takes 13 ns~ with native int128 and 40 ns~ with uint64 fallback.
It turns out one of the check failures is due to a compiler bug in Clang, I've tweaked the definition structure of the BasicDecimal256 header to handle this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
cpp/src/arrow/util/basic_decimal.cc
Outdated
// Multiplies two N * 64 bit unsigned integer types, represented by a uint64_t | ||
// array into a same sized output. Elements in the array should be in | ||
// little endian order, and output will be the same. Overflow in multiplication | ||
// is considered undefined behavior and will not be reported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really undefined? Isn't the output the lower N * 64 bits of the actual result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I say undefined here, I mean the value should not be relied on and is an implementation detail, i.e. people should only be calling this if they know the result will not overflow or do not care what happens if it does. Undefined Behavior maybe isn't correct because it implies the same kind of UB you get when you dereference a nullptr, etc.
I've tweaked the documentation though to reflect what actually happens since this file is the only consumer of the function anyways.
No description provided.