Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize char::to_digit and assert radix is at least 2 #132709

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

programmerjake
Copy link
Member

@programmerjake programmerjake commented Nov 6, 2024

approved by t-libs: rust-lang/libs-team#475 (comment)

let me know if this needs an assembly test or similar.

@rustbot
Copy link
Collaborator

rustbot commented Nov 6, 2024

r? @thomcc

rustbot has assigned @thomcc.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Nov 6, 2024
@programmerjake programmerjake marked this pull request as draft November 6, 2024 20:31
@programmerjake

This comment was marked as resolved.

@rust-log-analyzer

This comment has been minimized.

@okaneco
Copy link
Contributor

okaneco commented Nov 7, 2024

I wrote this comment before the last force push. The latest commit is branchless but from my testing in the past, that regresses the greater-than-10 radix case because the decimal fast path is removed. It'd be helpful to see the radix benchmark results to confirm this is an improvement.

https://rust.godbolt.org/z/9h14szYM7


Previously written comment
From my experience benchmarking this function, it's difficult to optimize this further than the current implementation. The current code generated tends to be the fastest for non-constant radix, decimal radix, and hexadecimal radix (or any greater-than-10 radix).

It looks like this will worsen performance for greater than decimal radices. Note the code path for converting 0-9 digits in example::to_digit_hex.
https://rust.godbolt.org/z/4a1947nsG - Current to_digit without the new assert conditions

@programmerjake
Copy link
Member Author

programmerjake commented Nov 7, 2024

fixed, and this should be decently optimized on x86_64:
https://rust.godbolt.org/z/MxW8MWa4Y
a function that calls to_digit in a loop over a slice of bytes to implement parsing
an integer (doesn't check for overflow) triggers LLVM's loop unswitching
to produce a separate loop for base <= 10 and base > 10:

base <= 10:

        ; slice pointer in rdi, slice len in rsi
        ; loop index in r8, radix in rcx
        xor     edx, edx ; retval = 0
.LBB0_5:
        movzx   r9d, byte ptr [rdi + r8]
        add     r9d, -48 ; value = byte - '0'
        cmp     r9, rcx  ; if value >= radix {
        jae     .LBB0_9  ; branch to error exit }
        imul    rdx, rcx ; retval *= radix
        add     rdx, r9  ; retval += value
        inc     r8
        cmp     rsi, r8
        jne     .LBB0_5

base > 10:

        ; slice pointer in rdi, slice len in rsi
        ; loop index in r8, radix in rcx
        xor     edx, edx        ; retval = 0
.LBB0_8:
        movzx   r10d, byte ptr [rdi + r8]
        mov     r11d, r10d
        or      r11d, 32        ; lower = byte | 0x20_u32
        add     r11d, -97       ; temp = lower - 'a' as u32
        add     r11, 10         ; letter_value = temp as u64 + 10
        lea     r9d, [r10 - 48] ; value = byte - '0' as u32
        cmp     r10d, 58        ; if byte > '9' as u32 {
        cmovae  r9, r11         ; value = letter_value }
        cmp     r9, rcx         ; if value >= radix {
        jae     .LBB0_9         ; branch to error exit }
        mov     r9d, r9d        ; value = value as u32 as u64 -- unnecessary but llvm can't optimize out
        imul    rdx, rcx        ; retval *= radix
        add     rdx, r9         ; retval += value
        inc     r8
        cmp     rsi, r8
        jne     .LBB0_8

@programmerjake programmerjake marked this pull request as ready for review November 7, 2024 00:43
@programmerjake
Copy link
Member Author

here's what I got on my desktop (Ryzen 7950X):
master is on 116fc31
optimize-charto_digit is on aeffff8
Assuming I'm interpreting it right, my change is slightly faster on all to_digit benches except base 36 where it's slightly slower. I'm guessing the pattern used is simple enough that the branch predictor can learn it exactly, which is not realistic for normal hex numbers, thereby making the conditional move exactly as expensive as a conditional branch.
I ran:

for b in master optimize-charto_digit; do git switch "$b"; for i in {1..3}; do RUST_BACKTRACE=1 ./x.py bench library/core -- to_digit > "$b".txt; done; done

on optimize-charto_digit:

benchmarks:
    char::methods::bench_to_digit_radix_10  33672.73/iter +/- 127.29
    char::methods::bench_to_digit_radix_16  33518.58/iter  +/- 67.57
    char::methods::bench_to_digit_radix_2   33725.32/iter  +/- 92.59
    char::methods::bench_to_digit_radix_36  35903.44/iter +/- 126.43
    char::methods::bench_to_digit_radix_var 23992.17/iter  +/- 63.16

on master:

benchmarks:
    char::methods::bench_to_digit_radix_10  34082.78/iter  +/- 78.49
    char::methods::bench_to_digit_radix_16  34077.87/iter  +/- 61.22
    char::methods::bench_to_digit_radix_2   34084.60/iter  +/- 77.45
    char::methods::bench_to_digit_radix_36  34445.12/iter +/- 252.94
    char::methods::bench_to_digit_radix_var 25113.80/iter  +/- 36.69

@okaneco
Copy link
Contributor

okaneco commented Nov 7, 2024

Nice, it does look faster except that one case.

For radix-36, the assembly is different in this part using your original loop example.
I wouldn't expect this difference to happen but I don't know enough about how the code is optimized.
https://rust.godbolt.org/z/qq4Y1erjG

; radix-36
        cmp     r8, 35
        ja      .LBB1_4
        lea     rdx, [rdx + 8*rdx]
        lea     rdx, [r8 + 4*rdx]
; radix-16
        cmp     r8, 15
        ja      .LBB1_4
        shl     rdx, 4
        or      rdx, r8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants