-
Notifications
You must be signed in to change notification settings - Fork 837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster i256 Division (2-100x) (#4663) #4672
Conversation
//! | ||
//! Implementation heavily inspired by [uint] | ||
//! | ||
//! [uint]: https://github.com/paritytech/parity-common/blob/d3a9327124a66e52ca1114bb8640c02c18c134b8/uint/src/uint.rs#L844 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I debated using uint directly, but this brought in a lot of code and logic that we didn't need, and wouldn't have easily generalised to support the 512-bit division necessary for precision-loss decimal arithmetic.
eef08b0
to
189f6ee
Compare
/// An array of N + 1 elements | ||
/// | ||
/// This is a hack around lack of support for const arithmetic | ||
struct ArrayPlusOne<T, const N: usize>([T; N], T); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a hack, but it is a hack I am quite pleased with 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit surprised that miri is pleased with it too 🤔
I would suggest adding #[repr(C)]
to the struct as the compiler is allowed to reorder fields otherwise. Adding -Z randomize-layout
to RUST_FLAGS might expose this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot, I did not realise Rust would reorder fields with the same alignment
arrow-buffer/src/bigint/div.rs
Outdated
} | ||
|
||
/// Divide a u128 by a u64 divisor, returning the quotient and remainder | ||
fn div_rem_word(hi: u64, lo: u64, y: u64) -> (u64, u64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On x86_64 this gets converted into a single instruction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange, as I couldn't get the compiler to play nice with that function:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are quite right - https://stackoverflow.com/questions/62257103/emit-div-instruction-instead-of-udivti3
There may be further room for improvement in that case 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, I was hopeful I was going to learn a new compiler option or feature to enable 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a37aee3 uses some inline assembly to force the correct compilation 😄
Shaves off a further 7 microseconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll review this in this week.
|
||
// LLVM fails to use the div instruction as it is not able to prove | ||
// that hi < divisor, and therefore the result will fit into 64-bits | ||
#[cfg(target_arch = "x86_64")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This improves performance by ~25%, but would be the first use of inline assembly in this project. I personally think it is constrained and limited in scope enough to not be a concern, but defer to the consensus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay to me. Just wonder if we want to have assembly for other arch (e.g. AArch64).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe aarch64 has narrowing division support, so I don't think we can do better than udivti3 (the compiler built-in LLVM inserts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find a narrowing division instruction in aarch64 ISA reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, looks like udivti3 is what we could have on aarch64.
) -> ([u64; N], [u64; N]) { | ||
let numerator_bits = bits(numerator); | ||
let divisor_bits = bits(divisor); | ||
assert_ne!(divisor_bits, 0, "division by zero"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be better if returning a Err for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrowError isn't defined in this crate, we could return an Option though, but this seemed overkill
arrow-buffer/src/bigint/div.rs
Outdated
(q, remainder) | ||
} | ||
|
||
/// Divide a u128 by a u64 divisor, returning the quotient and remainder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment here that this is not for general u128/u64 division.
fn sub_assign(a: &mut [u64], b: &[u64]) -> bool { | ||
binop_slice(a, b, u64::overflowing_sub) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, does this work for cases like a = [1, 0, 0]
and b = [0, 1, 1]
? I got a overflow (true) and a = [1, 18446744073709551615, 18446744073709551614]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are you expecting, you are doing 1 - 2^64 - 2^128
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't look where this function is used, but just from its description a -= b
, so playing it with fake inputs. The output seems not a -= b
? no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you mean a = [1, 0, 0]
is 1? Got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the digits are little endian
} | ||
q_hat | ||
} else { | ||
u64::MAX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I compare this implementation to some resources I can find, e.g. https://skanthak.homepage.t-online.de/division.html. Is this else branch for initial overflow check and return the largest possible quotient? If so, seems it is possibly to simply set rem to an impossible value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If u_jn is larger than v_n_1, our guess of q_hat would overflow the 64-bit word, which in turn would cause div_rem_word to trap, so we just use u64::MAX as our guess.
I'll try to add some further docs for what is going on here
Which issue does this PR close?
Closes #4663
Relates to #4664
Rationale for this change
The current algorithm for performing integer division is simple but has time complexity linear w.r.t the length of the quotient. This results in very poor performance when performing division by small divisors.
N-digit division is also a precursor to performing precision-loss decimal arithmetic (#4664)
What changes are included in this PR?
Are there any user-facing changes?