SIMD intrinsics like _mm_cmpestri are not getting inlined #54353

I recently tried to port some string searching code that uses PCMPESTRI from explicit `asm!` to the new stable SIMD intrinsics and noticed that it became noticeably slower (~30% on average, depending on the case).

Looking at the generated assembly to find the difference, I noticed that the Rust intrinsic is not getting inlined, which prevents further optimisations. Here is minimal, equivalent reproduction code in C and Rust:

```c
#include <nmmintrin.h>

int pcmp_find(const char *needle, size_t needle_len, const char *haystack, size_t haystack_len) {
    __m128i needle_m = _mm_loadu_si128((void*)needle);
    __m128i haystack_m = _mm_loadu_si128((void*)haystack);

    return _mm_cmpestri(needle_m, needle_len, haystack_m, haystack_len, _SIDD_CMP_EQUAL_ORDERED);
}
```

Rust:

```rust
use std::arch::x86_64::*;

pub unsafe fn pcmp_find(needle: *const u8, needle_len: usize, haystack: *const u8, haystack_len: usize) -> i32 {
    let needle_m = _mm_loadu_si128(needle as *const _);
    let haystack_m = _mm_loadu_si128(haystack as *const _);

    _mm_cmpestri(needle_m, needle_len as _, haystack_m, haystack_len as _, _SIDD_CMP_EQUAL_ORDERED)
}
```

The generated assembly for C lowers to a few simple instructions, as expected:

```asm
pcmp_find:
        push    ebx
        mov     ecx, DWORD PTR [esp+16]
        mov     ebx, DWORD PTR [esp+8]
        mov     eax, DWORD PTR [esp+12]
        mov     edx, DWORD PTR [esp+20]
        movdqu  xmm0, XMMWORD PTR [ebx]
        pcmpestri       xmm0, XMMWORD PTR [ecx], 12
        pop     ebx
        mov     eax, ecx
        ret
```

The generated assembly for the Rust code is much more verbose and involves a function call:

```asm
core::coresimd::x86::sse42::_mm_cmpestri:
        mov     r8, rdx
        movdqa  xmm0, xmmword ptr [rdi]
        mov     eax, esi
        mov     edx, ecx
        pcmpestri       xmm0, xmmword ptr [r8], 12
        mov     eax, ecx
        ret

example::pcmp_find:
        sub     rsp, 40
        movups  xmm0, xmmword ptr [rdi]
        movups  xmm1, xmmword ptr [rdx]
        movaps  xmmword ptr [rsp], xmm0
        movaps  xmmword ptr [rsp + 16], xmm1
        mov     rdi, rsp
        lea     rdx, [rsp + 16]
        call    core::coresimd::x86::sse42::_mm_cmpestri
        add     rsp, 40
        ret
```

I see that `_mm_cmpestri` in Rust already has the `#[inline]` attribute, like the other SIMD intrinsics, but perhaps they should use the stronger `#[inline(always)]` variant, since they are always supposed to lower to simple instructions?
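
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the intrinsic is compiled with the `sse4.2` target feature enabled, while `pcmp_find` above is not, and LLVM generally does not inline a function into a caller that lacks its target features. Below is a minimal sketch of a workaround under that assumption. The names `pcmp_find_sse42` and `pcmp_find_checked` are hypothetical and exist only for illustration; the body is the same as the reproduction above, with `#[target_feature]` added to the caller and a runtime check in front of it.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Same body as `pcmp_find`, but compiled with SSE4.2 enabled so the
/// intrinsic and its caller share the same target features.
/// (Hypothetical name; the assumption is that this lets the call inline.)
///
/// Safety: both pointers must be readable for 16 bytes, and the CPU must
/// support SSE4.2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
pub unsafe fn pcmp_find_sse42(
    needle: *const u8,
    needle_len: usize,
    haystack: *const u8,
    haystack_len: usize,
) -> i32 {
    let needle_m = _mm_loadu_si128(needle as *const _);
    let haystack_m = _mm_loadu_si128(haystack as *const _);

    _mm_cmpestri(needle_m, needle_len as _, haystack_m, haystack_len as _, _SIDD_CMP_EQUAL_ORDERED)
}

/// Safe wrapper (hypothetical): takes fixed 16-byte buffers so the unaligned
/// loads stay in bounds, and returns `None` when SSE4.2 is unavailable.
pub fn pcmp_find_checked(
    needle: &[u8; 16],
    needle_len: usize,
    haystack: &[u8; 16],
    haystack_len: usize,
) -> Option<i32> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.2") {
            // The feature was detected at runtime and both buffers are 16 bytes long.
            return Some(unsafe {
                pcmp_find_sse42(needle.as_ptr(), needle_len, haystack.as_ptr(), haystack_len)
            });
        }
    }
    None
}
```

Alternatively, building the whole crate with `-C target-feature=+sse4.2` enables the feature for every function, at the cost of requiring SSE4.2 on any machine that runs the binary. Whether either approach restores the C-like codegen should be checked against the generated assembly.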
