Faster i256 Division (2-100x) (#4663) #4672

tustvold · 2023-08-09T17:37:11Z

Which issue does this PR close?

Closes #4663
Relates to #4664

Rationale for this change

The current integer division algorithm is simple, but its time complexity is linear in the length of the quotient. This results in very poor performance when dividing by small divisors, where the quotient is long.

i256_div_rem small quotient
                        time:   [27.982 µs 28.062 µs 28.145 µs]
                        change: [-57.210% -57.132% -57.051%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) high mild
  3 (3.00%) high severe

i256_div_rem small divisor
                        time:   [19.786 µs 19.793 µs 19.800 µs]
                        change: [-99.299% -99.299% -99.298%] (p = 0.00 < 0.05)
                        Performance has improved.

N-digit division is also a precursor to performing precision-loss decimal arithmetic (#4664).

What changes are included in this PR?

Are there any user-facing changes?

tustvold · 2023-08-09T17:38:23Z

arrow-buffer/src/bigint/div.rs

+//!
+//! Implementation heavily inspired by [uint]
+//!
+//! [uint]: https://github.com/paritytech/parity-common/blob/d3a9327124a66e52ca1114bb8640c02c18c134b8/uint/src/uint.rs#L844


I debated using uint directly, but this brought in a lot of code and logic that we didn't need, and wouldn't have easily generalised to support the 512-bit division necessary for precision-loss decimal arithmetic.

tustvold · 2023-08-09T17:48:49Z

arrow-buffer/src/bigint/div.rs

+/// An array of N + 1 elements
+///
+/// This is a hack around lack of support for const arithmetic
+struct ArrayPlusOne<T, const N: usize>([T; N], T);


This is a hack, but it is a hack I am quite pleased with 😅

I'm a bit surprised that miri is pleased with it too 🤔

I would suggest adding #[repr(C)] to the struct, as the compiler is otherwise allowed to reorder fields. Adding -Z randomize-layout to RUSTFLAGS might expose this issue.

Good spot, I did not realise Rust would reorder fields with the same alignment
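To illustrate the layout concern discussed above, here is a minimal sketch of the struct with #[repr(C)] applied. The `as_slice` helper is hypothetical and not taken from the PR; it only shows why a fixed field order matters when the N + 1 elements are viewed as one contiguous slice.

```rust
/// An array of N + 1 elements.
///
/// `#[repr(C)]` fixes the field order: the array first, then the extra element.
/// Without it, the compiler may place the trailing `T` before the array.
#[repr(C)]
struct ArrayPlusOne<T, const N: usize>([T; N], T);

impl<T, const N: usize> ArrayPlusOne<T, N> {
    /// Hypothetical helper (an assumption, not the PR's API):
    /// view all N + 1 elements as a single slice.
    fn as_slice(&self) -> &[T] {
        // SAFETY: with repr(C) the array sits at offset 0 and the trailing
        // element follows it without padding, because both fields have the
        // alignment of `T`. The pointer is derived from the whole struct.
        unsafe { std::slice::from_raw_parts(self as *const Self as *const T, N + 1) }
    }
}
```

With this layout, `ArrayPlusOne([1u64, 2, 3, 4], 5u64).as_slice()` yields the five elements in order, and the struct occupies exactly five `u64` words.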

tustvold · 2023-08-09T17:57:21Z

arrow-buffer/src/bigint/div.rs

+}
+
+/// Divide a u128 by a u64 divisor, returning the quotient and remainder
+fn div_rem_word(hi: u64, lo: u64, y: u64) -> (u64, u64) {


On x86_64 this gets converted into a single instruction

Strange, as I couldn't get the compiler to play nice with that function:

https://rust.godbolt.org/z/xr7vEnMhb

You are quite right - https://stackoverflow.com/questions/62257103/emit-div-instruction-instead-of-udivti3

There may be further room for improvement in that case 😄

No worries, I was hopeful I was going to learn a new compiler option or feature to enable 😊

a37aee3 uses some inline assembly to force the correct compilation 😄

Shaves off a further 7 microseconds

viirya

I'll review this this week.

tustvold · 2023-08-09T22:03:50Z

arrow-buffer/src/bigint/div.rs

+
+    // LLVM fails to use the div instruction as it is not able to prove
+    // that hi < divisor, and therefore the result will fit into 64-bits
+    #[cfg(target_arch = "x86_64")]


This improves performance by ~25%, but would be the first use of inline assembly in this project. I personally think it is constrained and limited enough in scope not to be a concern, but I defer to the consensus.

Looks okay to me. I just wonder whether we want assembly for other architectures (e.g. AArch64).

I don't believe aarch64 has narrowing division support, so I don't think we can do better than udivti3 (the compiler built-in LLVM inserts)

I couldn't find a narrowing division instruction in aarch64 ISA reference

Yea, looks like udivti3 is what we could have on aarch64.
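For readers following the discussion, the narrowing division can be sketched roughly as below. This is an illustration under stated assumptions, not the code from a37aee3: the function name `narrowing_div` is made up, and the non-x86_64 version simply divides through u128, which is where the compiler is expected to fall back to its built-in 128-bit division routine.

```rust
/// Divide the 128-bit value `hi:lo` by `divisor`, returning (quotient, remainder).
/// x86_64 version: a single `div`, which takes rdx:rax as the dividend and
/// leaves the quotient in rax and the remainder in rdx.
#[cfg(target_arch = "x86_64")]
fn narrowing_div(hi: u64, lo: u64, divisor: u64) -> (u64, u64) {
    // `div` traps if the quotient does not fit in 64 bits, so check first.
    assert!(hi < divisor, "quotient would not fit in 64 bits");
    let mut quot = lo;
    let mut rem = hi;
    // SAFETY: hi < divisor guarantees divisor != 0 and a 64-bit quotient,
    // so the instruction cannot fault. It touches no memory and no stack.
    unsafe {
        std::arch::asm!(
            "div {d}",
            d = in(reg) divisor,
            inout("rax") quot,
            inout("rdx") rem,
            options(pure, nomem, nostack),
        );
    }
    (quot, rem)
}

/// Portable version for other architectures: widen to u128 and divide.
#[cfg(not(target_arch = "x86_64"))]
fn narrowing_div(hi: u64, lo: u64, divisor: u64) -> (u64, u64) {
    assert!(hi < divisor, "quotient would not fit in 64 bits");
    let x = (u128::from(hi) << 64) | u128::from(lo);
    let y = u128::from(divisor);
    ((x / y) as u64, (x % y) as u64)
}
```

The explicit `hi < divisor` check is the precondition that LLVM cannot prove on its own; here it is asserted at runtime purely to keep the sketch safe to call.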

arrow-buffer/benches/i256.rs

viirya · 2023-08-09T22:24:04Z

arrow-buffer/src/bigint/div.rs

+) -> ([u64; N], [u64; N]) {
+    let numerator_bits = bits(numerator);
+    let divisor_bits = bits(divisor);
+    assert_ne!(divisor_bits, 0, "division by zero");


Would it be better to return an Err for this?

ArrowError isn't defined in this crate. We could return an Option instead, but that seemed like overkill.

arrow-buffer/src/bigint/div.rs

viirya · 2023-08-10T00:20:12Z

arrow-buffer/src/bigint/div.rs

+    (q, remainder)
+}
+
+/// Divide a u128 by a u64 divisor, returning the quotient and remainder


Maybe add a comment here that this is not for general u128/u64 division.

viirya · 2023-08-10T00:42:50Z

arrow-buffer/src/bigint/div.rs

+fn sub_assign(a: &mut [u64], b: &[u64]) -> bool {
+    binop_slice(a, b, u64::overflowing_sub)
+}


Hmm, does this work for cases like a = [1, 0, 0] and b = [0, 1, 1]? I got an overflow (true) and a = [1, 18446744073709551615, 18446744073709551614].

What are you expecting? You are computing 1 - 2^64 - 2^128.

I haven't looked at where this function is used; going only by its description (a -= b), I tried it with made-up inputs. The output doesn't look like a -= b, does it?

Oh, you mean a = [1, 0, 0] is 1? Got it.

Yeah the digits are little endian
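The exchange above can be reproduced with a small sketch of borrow-propagating subtraction over little-endian digits. The body of `binop_slice` below is an assumption written for illustration (only its name and the `sub_assign` wrapper appear in the diff), so treat it as one plausible shape rather than the PR's exact code.

```rust
/// Apply `binop` digit by digit, least significant digit first,
/// threading the carry/borrow through. Returns the final carry/borrow.
fn binop_slice(
    a: &mut [u64],
    b: &[u64],
    binop: impl Fn(u64, u64) -> (u64, bool),
) -> bool {
    let mut carry = false;
    for (x, y) in a.iter_mut().zip(b.iter()) {
        // Fold the incoming carry into the right-hand digit first.
        let (rhs, overflow1) = y.overflowing_add(u64::from(carry));
        let (res, overflow2) = binop(*x, rhs);
        *x = res;
        carry = overflow1 || overflow2;
    }
    carry
}

/// Performs `a -= b` on little-endian digits, returning true on underflow.
fn sub_assign(a: &mut [u64], b: &[u64]) -> bool {
    binop_slice(a, b, u64::overflowing_sub)
}
```

With a = [1, 0, 0] (the value 1) and b = [0, 1, 1] (the value 2^64 + 2^128), the call returns true and leaves a = [1, u64::MAX, u64::MAX - 1], matching the values reported above: the subtraction underflows, which is the expected result for these inputs.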

viirya · 2023-08-10T14:44:00Z

arrow-buffer/src/bigint/div.rs

+            }
+            q_hat
+        } else {
+            u64::MAX


Hmm, I compared this implementation with some resources I could find, e.g. https://skanthak.homepage.t-online.de/division.html. Is this else branch an initial overflow check that returns the largest possible quotient? If so, could we simply set rem to an impossible value instead?

If u_jn is larger than v_n_1, our guess of q_hat would overflow the 64-bit word, which in turn would cause div_rem_word to trap, so we just use u64::MAX as our guess.

I'll try to add some further docs for what is going on here
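As a rough illustration of the branch being discussed, the quotient-digit estimate can be sketched as follows. This follows the textbook form of Knuth's Algorithm D rather than the PR's exact code: the function name `estimate_q_hat`, the parameters `u_jn1`, `u_jn2` and `v_n_2`, and the correction loop are assumptions, and `div_rem_word` here is a portable stand-in built on u128.

```rust
/// Portable stand-in for `div_rem_word`: divide `hi:lo` by `y`.
/// Requires `hi < y` so the quotient fits in 64 bits.
fn div_rem_word(hi: u64, lo: u64, y: u64) -> (u64, u64) {
    debug_assert!(hi < y);
    let x = (u128::from(hi) << 64) | u128::from(lo);
    ((x / u128::from(y)) as u64, (x % u128::from(y)) as u64)
}

/// Estimate one quotient digit from the top three numerator digits
/// (`u_jn`, `u_jn1`, `u_jn2`) and the top two divisor digits (`v_n_1`, `v_n_2`).
/// Assumes the divisor is normalized, i.e. the top bit of `v_n_1` is set.
fn estimate_q_hat(u_jn: u64, u_jn1: u64, u_jn2: u64, v_n_1: u64, v_n_2: u64) -> u64 {
    if u_jn < v_n_1 {
        // Safe to divide: the quotient of (u_jn, u_jn1) by v_n_1 fits in a word.
        let (mut q_hat, mut r_hat) = div_rem_word(u_jn, u_jn1, v_n_1);
        loop {
            // Compare q_hat * v_n_2 against r_hat * 2^64 + u_jn2.
            let lhs = u128::from(q_hat) * u128::from(v_n_2);
            let rhs = (u128::from(r_hat) << 64) | u128::from(u_jn2);
            if lhs <= rhs {
                break;
            }
            // The guess was too large: lower it and adjust the remainder.
            q_hat -= 1;
            let (r, overflow) = r_hat.overflowing_add(v_n_1);
            r_hat = r;
            if overflow {
                // r_hat no longer fits in a word, so the test cannot fail again.
                break;
            }
        }
        q_hat
    } else {
        // The guess would not fit in 64 bits and div_rem_word would trap,
        // so start from the largest possible digit instead.
        u64::MAX
    }
}
```

For example, with divisor digits (2^63, u64::MAX) and numerator digits (1, 2^63, 0), the initial guess is 3 and the correction loop lowers it to 2.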

Faster i256 Division (2-100x) (apache#4663)

189f6ee

github-actions bot added the arrow label (Changes to the arrow crate) Aug 9, 2023

tustvold commented Aug 9, 2023

tustvold force-pushed the fast-n-digit-div branch from eef08b0 to 189f6ee August 9, 2023 17:38

Clippy

5c978a3

tustvold commented Aug 9, 2023

tustvold requested a review from viirya August 9, 2023 17:54

tustvold commented Aug 9, 2023

viirya reviewed Aug 9, 2023

tustvold added 2 commits August 9, 2023 22:54

Use inline assembly

a37aee3

Fix non-x64

2c07100

tustvold commented Aug 9, 2023

viirya reviewed Aug 9, 2023

arrow-buffer/benches/i256.rs (resolved)

viirya reviewed Aug 9, 2023

arrow-buffer/src/bigint/div.rs (resolved)

viirya reviewed Aug 10, 2023

Dandandan approved these changes Aug 10, 2023

Add repr(C)

190238f

viirya reviewed Aug 10, 2023

viirya approved these changes Aug 10, 2023

tustvold added 2 commits August 10, 2023 17:04

More docs

cc9ad46

Format

12a777e

tustvold merged commit c618438 into apache:master Aug 10, 2023

This was referenced Aug 15, 2023

Faster i256 Division #4663

Closed

Precision-Loss Decimal Arithmetic #4664

Closed

tustvold Aug 9, 2023

tustvold Aug 9, 2023

jhorstmann Aug 10, 2023

tustvold Aug 10, 2023

tustvold Aug 9, 2023

stuartcarnie Aug 9, 2023

tustvold Aug 9, 2023

stuartcarnie Aug 9, 2023

tustvold Aug 9, 2023

viirya left a comment

tustvold Aug 9, 2023

viirya Aug 9, 2023

tustvold Aug 9, 2023

stuartcarnie Aug 9, 2023

viirya Aug 9, 2023

viirya Aug 9, 2023

tustvold Aug 9, 2023

viirya Aug 10, 2023

viirya Aug 10, 2023

tustvold Aug 10, 2023

viirya Aug 10, 2023

viirya Aug 10, 2023

tustvold Aug 10, 2023

viirya Aug 10, 2023

tustvold Aug 10, 2023
