SIMD intrinsics like _mm_cmpestri are not getting inlined #54353

Closed
@RReverser

I recently tried to port some string searching code that uses PCMPESTRI from explicit asm! to the new stable SIMD intrinsics and noticed that it became noticeably slower (~30% on average, depending on the case).

After comparing the generated assembly to find the difference that causes this, I noticed that the Rust intrinsic is not getting inlined, which prevents further optimisations. Here is minimal, equivalent reproduction code in C and Rust:

C:

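#include <stddef.h>     /* size_t */
#include <nmmintrin.h>  /* SSE4.2 intrinsics */

int pcmp_find(const char *needle, size_t needle_len, const char *haystack, size_t haystack_len) {
    /* Unaligned 16-byte loads; the caller must guarantee 16 readable bytes at each pointer. */
    __m128i needle_m = _mm_loadu_si128((const __m128i *)needle);
    __m128i haystack_m = _mm_loadu_si128((const __m128i *)haystack);

    return _mm_cmpestri(needle_m, (int)needle_len, haystack_m, (int)haystack_len, _SIDD_CMP_EQUAL_ORDERED);
}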

Rust:

use std::arch::x86_64::*;

pub unsafe fn pcmp_find(needle: *const u8, needle_len: usize, haystack: *const u8, haystack_len: usize) -> i32 {
    let needle_m = _mm_loadu_si128(needle as *const _);
    let haystack_m = _mm_loadu_si128(haystack as *const _);

    _mm_cmpestri(needle_m, needle_len as _, haystack_m, haystack_len as _, _SIDD_CMP_EQUAL_ORDERED)
}

The generated assembly for C lowers to a few simple instructions, as expected:

pcmp_find:
        push    ebx
        mov     ecx, DWORD PTR [esp+16]
        mov     ebx, DWORD PTR [esp+8]
        mov     eax, DWORD PTR [esp+12]
        mov     edx, DWORD PTR [esp+20]
        movdqu  xmm0, XMMWORD PTR [ebx]
        pcmpestri       xmm0, XMMWORD PTR [ecx], 12
        pop     ebx
        mov     eax, ecx
        ret

The generated assembly for the Rust code is much more verbose and involves a function call:

core::coresimd::x86::sse42::_mm_cmpestri:
        mov     r8, rdx
        movdqa  xmm0, xmmword ptr [rdi]
        mov     eax, esi
        mov     edx, ecx
        pcmpestri       xmm0, xmmword ptr [r8], 12
        mov     eax, ecx
        ret

example::pcmp_find:
        sub     rsp, 40
        movups  xmm0, xmmword ptr [rdi]
        movups  xmm1, xmmword ptr [rdx]
        movaps  xmmword ptr [rsp], xmm0
        movaps  xmmword ptr [rsp + 16], xmm1
        mov     rdi, rsp
        lea     rdx, [rsp + 16]
        call    core::coresimd::x86::sse42::_mm_cmpestri
        add     rsp, 40
        ret

I see that _mm_cmpestri in Rust already has the #[inline] attribute, like other SIMD intrinsics, but perhaps they should use the stronger #[inline(always)] variant, since they are always supposed to lower to simple instructions?
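
One possible explanation, which nothing in this report confirms: the intrinsics in std::arch are declared with #[target_feature(enable = "...")], and a function carrying that attribute generally cannot be inlined into a caller that was compiled without the same feature. Under that assumption, a minimal sketch of a workaround is to enable SSE4.2 on the caller itself. The names pcmp_find_sse42 and find_in_block below are hypothetical, the code assumes an x86_64 target, and it uses the const-generic form of _mm_cmpestri from the current std::arch documentation rather than the five-argument call shown above.

```rust
use std::arch::x86_64::*;

/// Hypothetical variant of `pcmp_find` that enables SSE4.2 on the caller,
/// so the caller and the intrinsic share the same target features.
///
/// Safety: both pointers must be valid for 16-byte reads and the CPU must
/// support SSE4.2.
#[target_feature(enable = "sse4.2")]
pub unsafe fn pcmp_find_sse42(
    needle: *const u8,
    needle_len: usize,
    haystack: *const u8,
    haystack_len: usize,
) -> i32 {
    unsafe {
        // Unaligned 16-byte loads, as in the original reproduction.
        let needle_m = _mm_loadu_si128(needle as *const __m128i);
        let haystack_m = _mm_loadu_si128(haystack as *const __m128i);

        _mm_cmpestri::<_SIDD_CMP_EQUAL_ORDERED>(
            needle_m,
            needle_len as i32,
            haystack_m,
            haystack_len as i32,
        )
    }
}

/// Hypothetical safe wrapper: checks for SSE4.2 at run time and searches a
/// single 16-byte block. Returns the index where a match (or a partial match
/// running off the end of the block) starts, or `None`.
pub fn find_in_block(needle: &[u8], haystack: &[u8; 16]) -> Option<usize> {
    if needle.is_empty() || needle.len() > 16 || !is_x86_feature_detected!("sse4.2") {
        return None;
    }

    // Pad the needle to 16 bytes so the 16-byte load stays in bounds.
    let mut needle_buf = [0u8; 16];
    needle_buf[..needle.len()].copy_from_slice(needle);

    // SAFETY: SSE4.2 was detected above and both buffers are 16 bytes long.
    let idx = unsafe {
        pcmp_find_sse42(needle_buf.as_ptr(), needle.len(), haystack.as_ptr(), haystack.len())
    };

    // PCMPESTRI reports 16 when no position matches.
    if idx < 16 { Some(idx as usize) } else { None }
}
```

Another option under the same assumption would be to build the whole crate with -C target-feature=+sse4.2, which avoids the attribute but drops run-time detection. Whether either approach actually produces the inlined PCMPESTRI would still need to be checked in the generated assembly.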
