Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

limb: always inline bitand #109

Merged
merged 1 commit into from
Aug 8, 2022

Conversation

haslersn
Copy link
Contributor

@haslersn haslersn commented Aug 8, 2022

Previously, rustc didn't inline Limb::bitand inside functions such as UInt::add_mod and UInt::sub_mod, when calling them from a different crate. This is probably because non-generic functions without inline attribute don't get inlined across crates (unless using link-time optimization).

Here's the assembly that shows this:

Assembly generated for U256::sub_mod before this commit
crypto_bigint::uint::sub_mod::<impl crypto_bigint::uint::UInt<_>>::sub_mod:
 push    rbp
 push    r15
 push    r14
 push    r13
 push    r12
 push    rbx
 push    rax
 mov     r14, qword, ptr, [rsi]
 sub     r14, qword, ptr, [rdx]
 mov     eax, 0
 sbb     rax, rax
 mov     r15, qword, ptr, [rsi, +, 8]
 sar     rax, 63
 sub     r15, qword, ptr, [rdx, +, 8]
 mov     ecx, 0
 sbb     rcx, rcx
 add     r15, rax
 adc     rcx, rax
 mov     r13, qword, ptr, [rsi, +, 16]
 sar     rcx, 63
 sub     r13, qword, ptr, [rdx, +, 16]
 mov     eax, 0
 sbb     rax, rax
 add     r13, rcx
 adc     rax, rcx
 mov     r12, qword, ptr, [rsi, +, 24]
 sar     rax, 63
 sub     r12, qword, ptr, [rdx, +, 24]
 mov     ebp, 0
 sbb     rbp, rbp
 mov     rbx, rdi
 add     r12, rax
 adc     rbp, rax
 mov     qword, ptr, [rdi], r14
 mov     qword, ptr, [rdi, +, 8], r15
 mov     qword, ptr, [rdi, +, 16], r13
 mov     qword, ptr, [rdi, +, 24], r12
 mov     rdi, -6292479
 mov     rsi, rbp
 call    qword, ptr, [rip, +, _ZN13crypto_bigint4limb7bit_and43_$LT$impl$u20$crypto_bigint..limb..Limb$GT$6bitand17h582db4b61f5c8596E@GOTPCREL]
 mov     rcx, qword, ptr, [rip, +, _ZN13crypto_bigint4limb7bit_and43_$LT$impl$u20$crypto_bigint..limb..Limb$GT$6bitand17h582db4b61f5c8596E@GOTPCREL]
 add     rax, r14
 mov     qword, ptr, [rbx], rax
 adc     r15, 0
 setb    r14b
 mov     rdi, -1
 mov     rsi, rbp
 call    rcx
 movzx   r14d, r14b
 add     rax, r15
 mov     qword, ptr, [rbx, +, 8], rax
 adc     r14, r13
 setb    r15b
 mov     rdi, -1
 mov     r13, rbp
 mov     rsi, rbp
 call    qword, ptr, [rip, +, _ZN13crypto_bigint4limb7bit_and43_$LT$impl$u20$crypto_bigint..limb..Limb$GT$6bitand17h582db4b61f5c8596E@GOTPCREL]
 movzx   ebp, r15b
 add     rax, r14
 mov     qword, ptr, [rbx, +, 16], rax
 adc     rbp, r12
 mov     rdi, -1
 mov     rsi, r13
 call    qword, ptr, [rip, +, _ZN13crypto_bigint4limb7bit_and43_$LT$impl$u20$crypto_bigint..limb..Limb$GT$6bitand17h582db4b61f5c8596E@GOTPCREL]
 add     rax, rbp
 mov     qword, ptr, [rbx, +, 24], rax
 add     rsp, 8
 pop     rbx
 pop     r12
 pop     r13
 pop     r14
 pop     r15
 pop     rbp
 ret
Assembly generated for U256::sub_mod after this commit
crypto_bigint::uint::sub_mod::<impl crypto_bigint::uint::UInt<_>>::sub_mod:
 xor     r10d, r10d
 mov     r8, qword, ptr, [rsi]
 sub     r8, qword, ptr, [rdx]
 mov     eax, 0
 sbb     rax, rax
 mov     r11, qword, ptr, [rsi, +, 8]
 sar     rax, 63
 sub     r11, qword, ptr, [rdx, +, 8]
 mov     ecx, 0
 sbb     rcx, rcx
 add     r11, rax
 adc     rcx, rax
 mov     r9, qword, ptr, [rsi, +, 16]
 sar     rcx, 63
 sub     r9, qword, ptr, [rdx, +, 16]
 mov     eax, 0
 sbb     rax, rax
 add     r9, rcx
 adc     rax, rcx
 mov     rsi, qword, ptr, [rsi, +, 24]
 sar     rax, 63
 sub     rsi, qword, ptr, [rdx, +, 24]
 sbb     r10, r10
 add     rsi, rax
 adc     r10, rax
 mov     rax, r10
 and     rax, -6292479
 add     rax, r8
 mov     qword, ptr, [rdi], rax
 adc     r11, 0
 setb    al
 movzx   eax, al
 add     r11, r10
 mov     qword, ptr, [rdi, +, 8], r11
 adc     rax, r9
 setb    cl
 movzx   ecx, cl
 add     rax, r10
 mov     qword, ptr, [rdi, +, 16], rax
 adc     rcx, rsi
 add     rcx, r10
 mov     qword, ptr, [rdi, +, 24], rcx
 ret

I benchmarked U256::add_mod and U256::sub_mod before and after this commit using criterion-rs on Intel Core i7-8565U @ 1.80GHz and obtained the following average times. Note that I used a const modulus known at compile-time, which enables some compiler optimizations after inlining. With a modulus known only at runtime, times might differ.

U256::add_mod U256::sub_mod
before 13.311 ns 12.820 ns
after 10.857 ns 9.6262 ns

Previously, rustc didn't inline `Limb::bitand` inside functions
such as `UInt::add_mod` and `UInt::sub_mod`, when calling them from
a different crate. This is probably because non-generic functions
without `inline` attribute don't get inlined across crates (unless
using link-time optimization).
@haslersn
Copy link
Contributor Author

haslersn commented Aug 8, 2022

Though I'm not sure why sub_mod is still so much slower than sub_mod_special from #108. Here's the assembly of sub_mod_special for comparison, using the same const modulus (just represented differently for sub_mod_special):

Assembly generated for U256::sub_mod_special from #108
crypto_bigint::uint::sub_mod::<impl crypto_bigint::uint::UInt<_>>::sub_mod_special:
 xor     r11d, r11d
 mov     r8, qword, ptr, [rsi]
 sub     r8, qword, ptr, [rdx]
 mov     r10d, 0
 sbb     r10, r10
 mov     r9, qword, ptr, [rsi, +, 8]
 sar     r10, 63
 sub     r9, qword, ptr, [rdx, +, 8]
 mov     ecx, 0
 sbb     rcx, rcx
 add     r9, r10
 adc     rcx, r10
 mov     r10, qword, ptr, [rsi, +, 16]
 sar     rcx, 63
 sub     r10, qword, ptr, [rdx, +, 16]
 mov     eax, 0
 sbb     rax, rax
 add     r10, rcx
 adc     rax, rcx
 mov     rsi, qword, ptr, [rsi, +, 24]
 sar     rax, 63
 sub     rsi, qword, ptr, [rdx, +, 24]
 mov     ecx, 0
 sbb     rcx, rcx
 add     rsi, rax
 adc     rcx, rax
 and     ecx, 6292479
 sub     r8, rcx
 sbb     r11, r11
 sar     r11, 63
 add     r9, r11
 adc     r11, 0
 sar     r11, 63
 add     r10, r11
 adc     r11, 0
 sar     r11, 63
 add     r11, rsi
 mov     qword, ptr, [rdi], r8
 mov     qword, ptr, [rdi, +, 8], r9
 mov     qword, ptr, [rdi, +, 16], r10
 mov     qword, ptr, [rdi, +, 24], r11

@tarcieri
Copy link
Member

tarcieri commented Aug 8, 2022

Thanks! We've had some complaints in the past about #[inline(always)] and its impact on compile times, but this seems like a justifiable use.

@tarcieri tarcieri merged commit 3d14612 into RustCrypto:master Aug 8, 2022
@haslersn haslersn deleted the limb-inline-bitand branch August 8, 2022 14:19
@tarcieri tarcieri mentioned this pull request Oct 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants