-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize parse_str_radix_10
#456
Conversation
The culprit is where `maybe_round` prevented the number from overflowing , and ended up returning an `Ok` in the end. The fix is to round a number only after we have passed the decimal point.
This commit flattened add_by_internal noting the fact that it is only ever called with a fixed size array plus one u32. In addition, instead of accumulating the coefficients in `ArrayVec` and process it later, we try to compute the output number in one pass. This also removed the manual overflow rounding on the `coeff` array and replacing with a plain arithmetic +1. cargo bench --bench lib_benches -- decimal_from_str Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] decimal_from_str ... bench: 554 ns/iter (+/- 21) [Ryzen 3990x] Current: decimal_from_str ... bench: 193 ns/iter (+/- 1) [M1 Pro] decimal_from_str ... bench: 320 ns/iter (+/- 3) [Ryzen 3990x]
Instead of manually doing 96-bit integer operations, use a 128 bit integer and check for overflowing 96 bits worth of state cargo bench --bench lib_benches -- decimal_from_str Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] decimal_from_str ... bench: 554 ns/iter (+/- 21) [Ryzen 3990x] Current: decimal_from_str ... bench: 182 ns/iter (+/- 3) [M1 Pro] decimal_from_str ... bench: 299 ns/iter (+/- 4) [Ryzen 3990x]
Accumulate parse state into a 64-bit integer until we would observe an overflow of this integer, and fall back to maintaining 128-bit state cargo bench --bench lib_benches -- decimal_from_str Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] decimal_from_str ... bench: 554 ns/iter (+/- 21) [Ryzen 3990x] Current: decimal_from_str ... bench: 133 ns/iter (+/- 4) [M1 Pro] decimal_from_str ... bench: 190 ns/iter (+/- 5) [Ryzen 3990x]
This commit changes the parsing code to operate as a series of tail calls. This allows loop and overflow conditions to be checked only when exactly needed, as well as makes it very easy for the compiler to generate quality code on the fast path. The parsing functions are also parameterized by const generics to generate optimized codepaths for various situations (i.e. has/hasn't seen digits yet, has/hasn't seen the decimal point) and remove carried state. cargo bench --bench lib_benches -- decimal_from_str Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] decimal_from_str ... bench: 554 ns/iter (+/- 21) [Ryzen 3990x] Current: decimal_from_str ... bench: 85 ns/iter (+/- 2) [M1 Pro] decimal_from_str ... bench: 129 ns/iter (+/- 2) [Ryzen 3990x]
Pasting the interesting part of a comment here so I can respond in the new PR:
Cold basically informs LLVM that we do not expect this function will get called. Practically speaking, that means LLVM might
Because this is all variations of might, the impact varies. It has much less of an impact in the common case with small strings because we statically remove the various overflow checks, so it will impact larger strings more. Pre-size-check (the BIG variable), the cold annotations took the time from ~150 ns/iter to ~140, but after it is less impactful. It doesn't look like large decimals are very well represented in the benchmarks, although it's probably reasonable to optimize for strings with less than 18 digits in practice anyways. |
You can play around with this godbolt link and see: https://godbolt.org/z/rnhb5anhP. You can see that test3 is the non-cold one so the compiler optimizes for it (forward jumps are by default predicted false on intel so the compiler makes test3 the fallthrough case) |
For what it's worth, the tail call approach won't be great on architectures that don't support tails or want to pass parameters on the stack. I think the only relevant architecture here for rust is webassembly (and maybe arm32). The options I see are:
|
This change gives each of the possible operations for the u64 accumulation a dedicated dispatch site, giving each operation dedicated branch prediction. cargo bench --bench lib_benches -- decimal_from_str Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] decimal_from_str ... bench: 554 ns/iter (+/- 21) [Ryzen 3990x] Current: decimal_from_str ... bench: 81 ns/iter (+/- 1) [M1 Pro] decimal_from_str ... bench: 120 ns/iter (+/- 2) [Ryzen 3990x]
Remove up-front handling of digits like '+', '-', and inline handling of '_' and put all of it into a function dedicated to not common digits. This improved the up-front cost for numbers with no prefix and has similar startup cost to the checking code for rare digits if they exist Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] Current: decimal_from_str ... bench: 71 ns/iter (+/- 2) [M1 Pro]
Latest numbers are down to 70ns/iter on M1, will benchmark on an x86 processor in the next day or two. |
This makes sure that the parsing behaviour is same as before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite an innovative approach to this!
I see how the tail recursion could be problematic if optimization isn't available/used. Would it be worthwhile to feature flag this in the meantime? That way it could be an opt in (or opt out) feature.
You're right that it does have the overhead of maintaining two different versions however we should be able to slightly cater for that with CI. It's not without other downsides of course (e.g. discovery, deprecation) but for now it may make sense.
Thoughts? I can always help wrap up the PR with feature flags + CI if you like.
if str.is_empty() { | ||
return Err(Error::from("Invalid decimal: empty")); | ||
let bytes = str.as_bytes(); | ||
// handle the sign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: looks like this is leftover
offset += 1; | ||
len -= 1; | ||
#[inline] | ||
fn byte_dispatch_u64<const POINT: bool, const NEG: bool, const HAS: bool, const BIG: bool, const FIRST: bool>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My one qualm with these const generics is perhaps knowing what each does. Even though this is an internal method it'd be great to add a little bit of documentation just to cover what the purpose of each is for.
POINT
- a decimal point has been seenNEG
- we've encountered a-
and the number is negativeHAS
- a digit has been encountered (whenHAS
isfalse
it's invalid)BIG
- a number that uses 96 bits instead of only 64 bitsFIRST
- true if it is the first byte in the string
@@ -92,6 +93,25 @@ pub(crate) fn add_by_internal(value: &mut [u32], by: &[u32]) -> u32 { | |||
carry as u32 | |||
} | |||
|
|||
pub(crate) fn add_by_internal_flattened(value: &mut [u32; 3], by: u32) -> u32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I previously had something similar - add_by_internal3
. At some stage I'd love to move away from these functions however it makes sense to introduce it for this PR.
Sure! Let me work on the changes |
As for the tail-call-or-not question: Everything before the tail calls is still a significant improvement and shouldn't really introduce a lot of platform dependency. For the tail calls, it might make sense to keep the later commits here as a WIP-PR, add the non-tail optimizations to the generalized parser, and see what performance looks like there. If that's performing acceptably choice 3 (only use good one on x86/arm/riscv) might be fine? |
That makes sense. I'm going to merge this as is for now, and we can iterate from there. |
* Fix `from_str` overflow rounding issue The culprit is where `maybe_round` prevented the number from overflowing , and ended up returning an `Ok` in the end. The fix is to round a number only after we have passed the decimal point. * Remove coeff ArrayVec from `parse_str_radix_10` This commit flattened add_by_internal noting the fact that it is only ever called with a fixed size array plus one u32. In addition, instead of accumulating the coefficients in `ArrayVec` and process it later, we try to compute the output number in one pass. This also removed the manual overflow rounding on the `coeff` array and replacing with a plain arithmetic +1. cargo bench --bench lib_benches -- decimal_from_str Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] decimal_from_str ... bench: 554 ns/iter (+/- 21) [Ryzen 3990x] Current: decimal_from_str ... bench: 193 ns/iter (+/- 1) [M1 Pro] decimal_from_str ... bench: 320 ns/iter (+/- 3) [Ryzen 3990x] * Use 128 bit integer to speed up `parse_str_radix_10` Instead of manually doing 96-bit integer operations, use a 128 bit integer and check for overflowing 96 bits worth of state cargo bench --bench lib_benches -- decimal_from_str Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] decimal_from_str ... bench: 554 ns/iter (+/- 21) [Ryzen 3990x] Current: decimal_from_str ... bench: 182 ns/iter (+/- 3) [M1 Pro] decimal_from_str ... bench: 299 ns/iter (+/- 4) [Ryzen 3990x] * Use 64 bit integer with 128 bit fallback to speed up further Accumulate parse state into a 64-bit integer until we would observe an overflow of this integer, and fall back to maintaining 128-bit state cargo bench --bench lib_benches -- decimal_from_str Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] decimal_from_str ... bench: 554 ns/iter (+/- 21) [Ryzen 3990x] Current: decimal_from_str ... bench: 133 ns/iter (+/- 4) [M1 Pro] decimal_from_str ... bench: 190 ns/iter (+/- 5) [Ryzen 3990x] * Convert into series of tail-calls This commit changes the parsing code to operate as a series of tail calls. This allows loop and overflow conditions to be checked only when exactly needed, as well as makes it very easy for the compiler to generate quality code on the fast path. The parsing functions are also parameterized by const generics to generate optimized codepaths for various situations (i.e. has/hasn't seen digits yet, has/hasn't seen the decimal point) and remove carried state. cargo bench --bench lib_benches -- decimal_from_str Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] decimal_from_str ... bench: 554 ns/iter (+/- 21) [Ryzen 3990x] Current: decimal_from_str ... bench: 85 ns/iter (+/- 2) [M1 Pro] decimal_from_str ... bench: 129 ns/iter (+/- 2) [Ryzen 3990x] * Give each operation a dedicated dispatch This change gives each of the possible operations for the u64 accumulation a dedicated dispatch site, giving each operation dedicated branch prediction. cargo bench --bench lib_benches -- decimal_from_str Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] decimal_from_str ... bench: 554 ns/iter (+/- 21) [Ryzen 3990x] Current: decimal_from_str ... bench: 81 ns/iter (+/- 1) [M1 Pro] decimal_from_str ... bench: 120 ns/iter (+/- 2) [Ryzen 3990x] * Move 'rare' digit handling into dedicated function Remove up-front handling of digits like '+', '-', and inline handling of '_' and put all of it into a function dedicated to not common digits. This improved the up-front cost for numbers with no prefix and has similar startup cost to the checking code for rare digits if they exist Baseline: decimal_from_str ... bench: 566 ns/iter (+/- 21) [M1 Pro] Current: decimal_from_str ... bench: 71 ns/iter (+/- 2) [M1 Pro] * Remove unimportant const generic from digit dispatch * Verify test cases in unpacked representation This makes sure that the parsing behaviour is same as before Co-authored-by: Tomatillo <tomatillo@cantortrading.fi>
Ok, that's weird. This is merged now in fca96a5? I'm going to close this PR since it remained open. |
Replacing #454, this PR resolves #453 and:
parse_str_radix_10
Running
cargo bench --bench lib_benches -- decimal_from_str
On M1 Pro, the numbers are:
On Ryzen 3990x, the numbers are:
I know this is a lot of code changes. No rush at all and feel free to ask us questions if pieces of the code seems unclear.
Have a happy new year too!