Description
I recently tried to port some string searching code that uses PCMPESTRI from explicit asm! to the new stable SIMD intrinsics and noticed that it became noticeably slower (~30% on average, depending on the case).
After looking at the generated assembly to find the difference that causes it, I noticed that the Rust intrinsic is not getting inlined, which prevents further optimisations. Here is a minimal equivalent reproduction in C and Rust:
C:
#include <stddef.h>    /* size_t */
#include <nmmintrin.h> /* SSE4.2 intrinsics */

int pcmp_find(const char *needle, size_t needle_len, const char *haystack, size_t haystack_len) {
    /* Unaligned 16-byte loads; both buffers must have 16 readable bytes. */
    __m128i needle_m = _mm_loadu_si128((const __m128i *)needle);
    __m128i haystack_m = _mm_loadu_si128((const __m128i *)haystack);
    return _mm_cmpestri(needle_m, needle_len, haystack_m, haystack_len, _SIDD_CMP_EQUAL_ORDERED);
}
Rust:
use std::arch::x86_64::*;
pub unsafe fn pcmp_find(needle: *const u8, needle_len: usize, haystack: *const u8, haystack_len: usize) -> i32 {
    let needle_m = _mm_loadu_si128(needle as *const _);
    let haystack_m = _mm_loadu_si128(haystack as *const _);
    _mm_cmpestri(needle_m, needle_len as _, haystack_m, haystack_len as _, _SIDD_CMP_EQUAL_ORDERED)
}
The generated assembly for C lowers to a few simple instructions, as expected:
pcmp_find:
push ebx
mov ecx, DWORD PTR [esp+16]
mov ebx, DWORD PTR [esp+8]
mov eax, DWORD PTR [esp+12]
mov edx, DWORD PTR [esp+20]
movdqu xmm0, XMMWORD PTR [ebx]
pcmpestri xmm0, XMMWORD PTR [ecx], 12
pop ebx
mov eax, ecx
ret
Generated assembly for Rust code looks much more verbose and involves a function call:
core::coresimd::x86::sse42::_mm_cmpestri:
mov r8, rdx
movdqa xmm0, xmmword ptr [rdi]
mov eax, esi
mov edx, ecx
pcmpestri xmm0, xmmword ptr [r8], 12
mov eax, ecx
ret
example::pcmp_find:
sub rsp, 40
movups xmm0, xmmword ptr [rdi]
movups xmm1, xmmword ptr [rdx]
movaps xmmword ptr [rsp], xmm0
movaps xmmword ptr [rsp + 16], xmm1
mov rdi, rsp
lea rdx, [rsp + 16]
call core::coresimd::x86::sse42::_mm_cmpestri
add rsp, 40
ret
I see that _mm_cmpestri in Rust already has the #[inline] attribute like other SIMD intrinsics, but perhaps they should have the stronger variant, #[inline(always)], since they are always supposed to lower to simple instructions?
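A possibly related experiment, sketched under assumptions rather than confirmed for this case: LLVM generally will not inline a function compiled with a target feature into a caller that lacks that feature, so the missing inlining above may come from pcmp_find not enabling SSE4.2 itself rather than from the strength of the inline hint. The variant below puts #[target_feature(enable = "sse4.2")] on the caller. The names pcmp_find_sse42 and find_in_block are made up for this sketch, and it uses the const-generic call form of _mm_cmpestri from current std::arch instead of the trailing-argument form shown above. Whether the call actually disappears on a given toolchain would still need to be checked in the generated assembly.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Same body as `pcmp_find`, but the caller itself enables SSE4.2.
/// `pcmp_find_sse42` is a hypothetical name used only in this sketch.
///
/// Safety: the CPU must support SSE4.2 and both pointers must be
/// readable for 16 bytes.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
pub unsafe fn pcmp_find_sse42(
    needle: *const u8,
    needle_len: usize,
    haystack: *const u8,
    haystack_len: usize,
) -> i32 {
    let needle_m = _mm_loadu_si128(needle as *const __m128i);
    let haystack_m = _mm_loadu_si128(haystack as *const __m128i);
    _mm_cmpestri::<_SIDD_CMP_EQUAL_ORDERED>(
        needle_m,
        needle_len as i32,
        haystack_m,
        haystack_len as i32,
    )
}

/// Hypothetical safe wrapper for a single block of at most 16 bytes.
/// Returns the index of the first complete occurrence of `needle`.
/// Empty or oversized inputs are rejected with `None` to keep the sketch small.
pub fn find_in_block(needle: &[u8], haystack: &[u8]) -> Option<usize> {
    if needle.is_empty() || needle.len() > 16 || haystack.len() > 16 {
        return None;
    }

    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.2") {
            // Copy into zero-padded buffers so the 16-byte loads stay in bounds.
            let mut n = [0u8; 16];
            let mut h = [0u8; 16];
            n[..needle.len()].copy_from_slice(needle);
            h[..haystack.len()].copy_from_slice(haystack);

            // SAFETY: SSE4.2 was detected above and both buffers are 16 bytes long.
            let raw = unsafe {
                pcmp_find_sse42(n.as_ptr(), needle.len(), h.as_ptr(), haystack.len())
            };
            let idx = raw as usize;

            // PCMPESTRI returns 16 when nothing matches and may also report a
            // match that is cut off by the end of the block; keep complete ones only.
            return if idx + needle.len() <= haystack.len() {
                Some(idx)
            } else {
                None
            };
        }
    }

    // Portable fallback for other targets or CPUs without SSE4.2.
    haystack.windows(needle.len()).position(|w| w == needle)
}
```

For example, find_in_block(b"wor", b"hello world") should give Some(6) on either path. An alternative that needs no source changes is building the whole crate with -C target-feature=+sse4.2, at the cost of requiring SSE4.2 on every machine that runs the binary.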