optimize char::to_digit and assert radix is at least 2 #132709
Force-pushed from 91410e2 to 58e383c, then from 58e383c to aeffff8.
I wrote this comment before the last force push. The latest commit is branchless, but in my past testing that regressed the greater-than-10 radix case because the decimal fast path is removed. It would be helpful to see the radix benchmark results to confirm this is an improvement.
https://rust.godbolt.org/z/9h14szYM7
Previously written comment:
It looks like this will worsen performance for radices greater than decimal. Note the code path for converting |
Fixed, and this should be decently optimized on x86_64:
base <= 10:
; slice pointer in rdi, slice len in rsi
; loop index in r8, radix in rcx
xor edx, edx ; retval = 0
.LBB0_5:
movzx r9d, byte ptr [rdi + r8]
add r9d, -48 ; value = byte - '0'
cmp r9, rcx ; if value >= radix {
jae .LBB0_9 ; branch to error exit }
imul rdx, rcx ; retval *= radix
add rdx, r9 ; retval += value
inc r8
cmp rsi, r8
jne .LBB0_5
base > 10:
; slice pointer in rdi, slice len in rsi
; loop index in r8, radix in rcx
xor edx, edx ; retval = 0
.LBB0_8:
movzx r10d, byte ptr [rdi + r8]
mov r11d, r10d
or r11d, 32 ; lower = byte | 0x20_u32
add r11d, -97 ; temp = lower - 'a' as u32
add r11, 10 ; letter_value = temp as u64 + 10
lea r9d, [r10 - 48] ; value = byte - '0' as u32
cmp r10d, 58 ; if byte > '9' as u32 {
cmovae r9, r11 ; value = letter_value }
cmp r9, rcx ; if value >= radix {
jae .LBB0_9 ; branch to error exit }
mov r9d, r9d ; value = value as u32 as u64 -- unnecessary but llvm can't optimize out
imul rdx, rcx ; retval *= radix
add rdx, r9 ; retval += value
inc r8
cmp rsi, r8
jne .LBB0_8 |
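For readers who prefer source to assembly, here is a rough Rust sketch of what the two annotated loops above compute. It follows the assembly comments only; the names `digit_value` and `parse_radix` are hypothetical and this is not claimed to be the code in the PR. The letter path widens to `u64` before adding 10, so a wrapped subtraction stays far above any valid radix instead of overflowing back into range.

```rust
/// Hypothetical helper: maps one ASCII byte to its digit value in `radix`.
fn digit_value(byte: u8, radix: u32) -> Option<u32> {
    assert!((2..=36).contains(&radix), "radix must be in 2..=36");
    let byte = byte as u32;
    let value: u64 = if radix > 10 && byte > b'9' as u32 {
        // Letter path: fold to lowercase, then 'a'..='z' becomes 10..=35.
        let lower = byte | 0x20;
        let temp = lower.wrapping_sub(b'a' as u32);
        // Widening first means a wrapped `temp` cannot overflow here.
        temp as u64 + 10
    } else {
        // Decimal path: bytes below '0' wrap to huge values and are rejected.
        byte.wrapping_sub(b'0' as u32) as u64
    };
    if value < radix as u64 { Some(value as u32) } else { None }
}

/// Hypothetical loop matching the assembly: retval = retval * radix + value.
/// Overflow wraps, as the plain `imul`/`add` pair does.
fn parse_radix(bytes: &[u8], radix: u32) -> Option<u64> {
    let mut retval: u64 = 0;
    for &b in bytes {
        let value = digit_value(b, radix)?;
        retval = retval.wrapping_mul(radix as u64).wrapping_add(value as u64);
    }
    Some(retval)
}
```

When `radix` is a compile-time constant of 10 or less, the `radix > 10` test is false and the letter path can be dropped entirely, which is consistent with the shorter first loop.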
Here's what I got on my desktop (Ryzen 7950X):
for b in master optimize-charto_digit; do git switch "$b"; for i in {1..3}; do RUST_BACKTRACE=1 ./x.py bench library/core -- to_digit > "$b".txt; done; done
on optimize-charto_digit:
on master:
|
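For anyone who wants a rough per-radix comparison without a full `./x.py bench` run, a stable-Rust timing sketch could look like the following. This is not the benchmark in `library/core`; `CHARS`, `sum_digits` and `time_radix` are hypothetical names, and the timings from a loop like this are only indicative.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical sample input mixing digits, letters and both cases.
const CHARS: [char; 6] = ['0', '7', '9', 'a', 'F', 'z'];

/// Sums the digit values of CHARS in `radix`, repeated `iterations` times.
/// `black_box` keeps the compiler from folding the calls away.
fn sum_digits(radix: u32, iterations: u32) -> u64 {
    let mut sum = 0u64;
    for _ in 0..iterations {
        for c in CHARS {
            if let Some(d) = black_box(c).to_digit(black_box(radix)) {
                sum += d as u64;
            }
        }
    }
    sum
}

/// Returns the checksum together with the elapsed wall-clock time.
fn time_radix(radix: u32, iterations: u32) -> (u64, Duration) {
    let start = Instant::now();
    let sum = sum_digits(radix, iterations);
    (sum, start.elapsed())
}
```

Calling `time_radix` for radices such as 2, 10, 16 and 36 on both branches gives a quick sanity check; the checksum should be identical across branches even when the timings differ.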
Nice, it does look faster except for that one case. For radix-36, the assembly differs in this part, using your original loop example.
; radix-36
cmp r8, 35
ja .LBB1_4
lea rdx, [rdx + 8*rdx]
lea rdx, [r8 + 4*rdx]
; radix-16
cmp r8, 15
ja .LBB1_4
shl rdx, 4
or rdx, r8 |
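The two snippets above are the same accumulate step, `acc * radix + digit`, strength-reduced differently. A small sketch of the arithmetic, with hypothetical function names, assuming `digit` has already passed the range check:

```rust
/// radix-36 step: two `lea`s compute acc * 9, then (acc * 9) * 4 + digit.
fn step_radix36(acc: u64, digit: u64) -> u64 {
    let nine = acc.wrapping_add(acc.wrapping_mul(8)); // lea rdx, [rdx + 8*rdx]
    digit.wrapping_add(nine.wrapping_mul(4)) // lea rdx, [r8 + 4*rdx]
}

/// radix-16 step: a shift clears the low four bits, so OR acts as ADD
/// whenever digit < 16.
fn step_radix16(acc: u64, digit: u64) -> u64 {
    (acc << 4) | digit // shl rdx, 4 ; or rdx, r8
}
```

Both forms wrap on overflow, like a plain `imul` and `add` would.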
approved by t-libs: rust-lang/libs-team#475 (comment)
let me know if this needs an assembly test or similar.