limb: always inline bitand #109

haslersn · 2022-08-08T14:07:41Z

Previously, rustc didn't inline Limb::bitand inside functions such as UInt::add_mod and UInt::sub_mod, when calling them from a different crate. This is probably because non-generic functions without inline attribute don't get inlined across crates (unless using link-time optimization).

Here's the assembly that shows this:

Assembly generated for U256::sub_mod before this commit

crypto_bigint::uint::sub_mod::<impl crypto_bigint::uint::UInt<_>>::sub_mod:
 push    rbp
 push    r15
 push    r14
 push    r13
 push    r12
 push    rbx
 push    rax
 mov     r14, qword, ptr, [rsi]
 sub     r14, qword, ptr, [rdx]
 mov     eax, 0
 sbb     rax, rax
 mov     r15, qword, ptr, [rsi, +, 8]
 sar     rax, 63
 sub     r15, qword, ptr, [rdx, +, 8]
 mov     ecx, 0
 sbb     rcx, rcx
 add     r15, rax
 adc     rcx, rax
 mov     r13, qword, ptr, [rsi, +, 16]
 sar     rcx, 63
 sub     r13, qword, ptr, [rdx, +, 16]
 mov     eax, 0
 sbb     rax, rax
 add     r13, rcx
 adc     rax, rcx
 mov     r12, qword, ptr, [rsi, +, 24]
 sar     rax, 63
 sub     r12, qword, ptr, [rdx, +, 24]
 mov     ebp, 0
 sbb     rbp, rbp
 mov     rbx, rdi
 add     r12, rax
 adc     rbp, rax
 mov     qword, ptr, [rdi], r14
 mov     qword, ptr, [rdi, +, 8], r15
 mov     qword, ptr, [rdi, +, 16], r13
 mov     qword, ptr, [rdi, +, 24], r12
 mov     rdi, -6292479
 mov     rsi, rbp
 call    qword, ptr, [rip, +, _ZN13crypto_bigint4limb7bit_and43_$LT$impl$u20$crypto_bigint..limb..Limb$GT$6bitand17h582db4b61f5c8596E@GOTPCREL]
 mov     rcx, qword, ptr, [rip, +, _ZN13crypto_bigint4limb7bit_and43_$LT$impl$u20$crypto_bigint..limb..Limb$GT$6bitand17h582db4b61f5c8596E@GOTPCREL]
 add     rax, r14
 mov     qword, ptr, [rbx], rax
 adc     r15, 0
 setb    r14b
 mov     rdi, -1
 mov     rsi, rbp
 call    rcx
 movzx   r14d, r14b
 add     rax, r15
 mov     qword, ptr, [rbx, +, 8], rax
 adc     r14, r13
 setb    r15b
 mov     rdi, -1
 mov     r13, rbp
 mov     rsi, rbp
 call    qword, ptr, [rip, +, _ZN13crypto_bigint4limb7bit_and43_$LT$impl$u20$crypto_bigint..limb..Limb$GT$6bitand17h582db4b61f5c8596E@GOTPCREL]
 movzx   ebp, r15b
 add     rax, r14
 mov     qword, ptr, [rbx, +, 16], rax
 adc     rbp, r12
 mov     rdi, -1
 mov     rsi, r13
 call    qword, ptr, [rip, +, _ZN13crypto_bigint4limb7bit_and43_$LT$impl$u20$crypto_bigint..limb..Limb$GT$6bitand17h582db4b61f5c8596E@GOTPCREL]
 add     rax, rbp
 mov     qword, ptr, [rbx, +, 24], rax
 add     rsp, 8
 pop     rbx
 pop     r12
 pop     r13
 pop     r14
 pop     r15
 pop     rbp
 ret

Assembly generated for U256::sub_mod after this commit

crypto_bigint::uint::sub_mod::<impl crypto_bigint::uint::UInt<_>>::sub_mod:
 xor     r10d, r10d
 mov     r8, qword, ptr, [rsi]
 sub     r8, qword, ptr, [rdx]
 mov     eax, 0
 sbb     rax, rax
 mov     r11, qword, ptr, [rsi, +, 8]
 sar     rax, 63
 sub     r11, qword, ptr, [rdx, +, 8]
 mov     ecx, 0
 sbb     rcx, rcx
 add     r11, rax
 adc     rcx, rax
 mov     r9, qword, ptr, [rsi, +, 16]
 sar     rcx, 63
 sub     r9, qword, ptr, [rdx, +, 16]
 mov     eax, 0
 sbb     rax, rax
 add     r9, rcx
 adc     rax, rcx
 mov     rsi, qword, ptr, [rsi, +, 24]
 sar     rax, 63
 sub     rsi, qword, ptr, [rdx, +, 24]
 sbb     r10, r10
 add     rsi, rax
 adc     r10, rax
 mov     rax, r10
 and     rax, -6292479
 add     rax, r8
 mov     qword, ptr, [rdi], rax
 adc     r11, 0
 setb    al
 movzx   eax, al
 add     r11, r10
 mov     qword, ptr, [rdi, +, 8], r11
 adc     rax, r9
 setb    cl
 movzx   ecx, cl
 add     rax, r10
 mov     qword, ptr, [rdi, +, 16], rax
 adc     rcx, rsi
 add     rcx, r10
 mov     qword, ptr, [rdi, +, 24], rcx
 ret

I benchmarked U256::add_mod and U256::sub_mod before and after this commit using criterion-rs on Intel Core i7-8565U @ 1.80GHz and obtained the following average times. Note that I used a const modulus known at compile-time, which enables some compiler optimizations after inlining. With a modulus known only at runtime, times might differ.

	`U256::add_mod`	`U256::sub_mod`
before	13.311 ns	12.820 ns
after	10.857 ns	9.6262 ns

Previously, rustc didn't inline `Limb::bitand` inside functions such as `UInt::add_mod` and `UInt::sub_mod`, when calling them from a different crate. This is probably because non-generic functions without `inline` attribute don't get inlined across crates (unless using link-time optimization).

haslersn · 2022-08-08T14:18:54Z

Though I'm not sure why sub_mod is still so much slower than sub_mod_special from #108. Here's the assembly of sub_mod_special for comparison, using the same const modulus (just represented differently for sub_mod_special):

Assembly generated for U256::sub_mod_special from #108

crypto_bigint::uint::sub_mod::<impl crypto_bigint::uint::UInt<_>>::sub_mod_special:
 xor     r11d, r11d
 mov     r8, qword, ptr, [rsi]
 sub     r8, qword, ptr, [rdx]
 mov     r10d, 0
 sbb     r10, r10
 mov     r9, qword, ptr, [rsi, +, 8]
 sar     r10, 63
 sub     r9, qword, ptr, [rdx, +, 8]
 mov     ecx, 0
 sbb     rcx, rcx
 add     r9, r10
 adc     rcx, r10
 mov     r10, qword, ptr, [rsi, +, 16]
 sar     rcx, 63
 sub     r10, qword, ptr, [rdx, +, 16]
 mov     eax, 0
 sbb     rax, rax
 add     r10, rcx
 adc     rax, rcx
 mov     rsi, qword, ptr, [rsi, +, 24]
 sar     rax, 63
 sub     rsi, qword, ptr, [rdx, +, 24]
 mov     ecx, 0
 sbb     rcx, rcx
 add     rsi, rax
 adc     rcx, rax
 and     ecx, 6292479
 sub     r8, rcx
 sbb     r11, r11
 sar     r11, 63
 add     r9, r11
 adc     r11, 0
 sar     r11, 63
 add     r10, r11
 adc     r11, 0
 sar     r11, 63
 add     r11, rsi
 mov     qword, ptr, [rdi], r8
 mov     qword, ptr, [rdi, +, 8], r9
 mov     qword, ptr, [rdi, +, 16], r10
 mov     qword, ptr, [rdi, +, 24], r11

tarcieri · 2022-08-08T14:18:54Z

Thanks! We've had some complaints in the past about #[inline(always)] and its impact on compile times, but this seems like a justifiable use.

haslersn force-pushed the limb-inline-bitand branch from cf5b5d6 to fa3b219 Compare August 8, 2022 14:10

tarcieri merged commit 3d14612 into RustCrypto:master Aug 8, 2022

haslersn deleted the limb-inline-bitand branch August 8, 2022 14:19

haslersn mentioned this pull request Aug 8, 2022

uint: Implement modulo operations for special moduli #108

Merged

tarcieri mentioned this pull request Oct 11, 2022

v0.4.9 #131

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

limb: always inline bitand #109

limb: always inline bitand #109

haslersn commented Aug 8, 2022 •

edited

Loading

haslersn commented Aug 8, 2022 •

edited

Loading

tarcieri commented Aug 8, 2022 •

edited

Loading

limb: always inline bitand #109

limb: always inline bitand #109

Conversation

haslersn commented Aug 8, 2022 • edited Loading

haslersn commented Aug 8, 2022 • edited Loading

tarcieri commented Aug 8, 2022 • edited Loading

haslersn commented Aug 8, 2022 •

edited

Loading

haslersn commented Aug 8, 2022 •

edited

Loading

tarcieri commented Aug 8, 2022 •

edited

Loading