Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD intrinsics like _mm_cmpestri is not getting inlined #54353

Closed
RReverser opened this issue Sep 19, 2018 · 11 comments
Closed

SIMD intrinsics like _mm_cmpestri is not getting inlined #54353

RReverser opened this issue Sep 19, 2018 · 11 comments

Comments

@RReverser
Copy link
Contributor

I recently tried to port some string searching code using PCMPESTRI from explicit asm! to new stable SIMD intrinsics and noticed that it became quite slower (~30% on average, depending on a case).

After looking into the generated assembly to find a difference that would cause it, I've noticed that Rust intrinsic is not getting inlined and so prevents further optimisations. Here goes minimal equivalent reproducible code in C and Rust:

#include <nmmintrin.h>

int pcmp_find(const char *needle, size_t needle_len, const char *haystack, size_t haystack_len) {
    __m128i needle_m = _mm_loadu_si128((void*)needle);
    __m128i haystack_m = _mm_loadu_si128((void*)haystack);

    return _mm_cmpestri(needle_m, needle_len, haystack_m, haystack_len, _SIDD_CMP_EQUAL_ORDERED);
}

Rust:

use std::arch::x86_64::*;

pub unsafe fn pcmp_find(needle: *const u8, needle_len: usize, haystack: *const u8, haystack_len: usize) -> i32 {
    let needle_m = _mm_loadu_si128(needle as *const _);
    let haystack_m = _mm_loadu_si128(haystack as *const _);

    _mm_cmpestri(needle_m, needle_len as _, haystack_m, haystack_len as _, _SIDD_CMP_EQUAL_ORDERED)
}

Generated assembly for C lowers to few simple instructions, as expected:

pcmp_find:
        push    ebx
        mov     ecx, DWORD PTR [esp+16]
        mov     ebx, DWORD PTR [esp+8]
        mov     eax, DWORD PTR [esp+12]
        mov     edx, DWORD PTR [esp+20]
        movdqu  xmm0, XMMWORD PTR [ebx]
        pcmpestri       xmm0, XMMWORD PTR [ecx], 12
        pop     ebx
        mov     eax, ecx
        ret

Generated assembly for Rust code looks much more verbose and involves a function call:

core::coresimd::x86::sse42::_mm_cmpestri:
        mov     r8, rdx
        movdqa  xmm0, xmmword ptr [rdi]
        mov     eax, esi
        mov     edx, ecx
        pcmpestri       xmm0, xmmword ptr [r8], 12
        mov     eax, ecx
        ret

example::pcmp_find:
        sub     rsp, 40
        movups  xmm0, xmmword ptr [rdi]
        movups  xmm1, xmmword ptr [rdx]
        movaps  xmmword ptr [rsp], xmm0
        movaps  xmmword ptr [rsp + 16], xmm1
        mov     rdi, rsp
        lea     rdx, [rsp + 16]
        call    core::coresimd::x86::sse42::_mm_cmpestri
        add     rsp, 40
        ret

I see that _mm_cmpestri in Rust already has #[inline] attribute like other SIMD intrinsics, but perhaps they should have stronger variant with #[inline(always)] since they are always supposed to lower to simple instructions?

@alexcrichton
Copy link
Member

Thanks for the report! This is a case where you need to manage target features in rust with the target_feature attribute. If there is a mismatch (like there is here in rust) then no inlining will happen

@RReverser
Copy link
Contributor Author

Hmm. I probably don't understand something about how that attribute works. Doesn't the intrinsic already have it? Why can't it be inlined if so?

@alexcrichton
Copy link
Member

It does indeed have it! Your definition of pcmp_find, however, does not, which prevents it from being inlined into your function.

@RReverser
Copy link
Contributor Author

RReverser commented Sep 19, 2018

Your definition of pcmp_find, however, does not, which prevents it from being inlined into your function.

What I mean, I don't understand why is it a requirement for the caller to also have it, and not just callee. If I understand you correctly, currently this means that such function (or any #[target_feature] wrapper around it) couldn't be inlined inside of

if is_x86_feature_detected!("sse4.2")) {
  pcmp_find(...);
}

context either? That seems like an unnecessary restriction since inlined instructions would be jumped over anyway in cases like this.

@alexcrichton
Copy link
Member

If function A calls function B, then B can only be inlined into A if the set of target features enabled for A and B is the same. Because the intrinsic enables a target feature, that means for inlining to work when you call the intrinsic you must also enable the target feature.

And yes, using an if statement will not allow inlining, it's not possible to inline just because a target-specific check was made at runtime beforehand.

@RReverser
Copy link
Contributor Author

RReverser commented Sep 19, 2018

Hmm, I'd still say that feels like a serious limitation and still unclear why it's needed (why not inline everything, even if target features mismatch). Looks like it currently requires "infecting" every caller up to the public API with target_feature(...) and unsafe markers just to enable cross-function optimisations.

@RReverser
Copy link
Contributor Author

Anyway, it helps, thanks. Feel free to close this issue if you don't consider such inlining as an optimisation opportunity for Rust.

RReverser added a commit to RReverser/twoway that referenced this issue Sep 19, 2018
This looks ugly due to infecting every private function in the pcmp chain, but apparently is required for inlining and does help performance: rust-lang/rust#54353 (comment)
@alexcrichton
Copy link
Member

Ok yeah in that case I'm gonna close this because this is otherwise working as intended, (albeit not very ergonomically). SIMD is pretty low-level after all!

@RReverser
Copy link
Contributor Author

SIMD is pretty low-level after all!

It is, it was just kinda sad to see how much more work it required to use stable intrinsics than asm version (which got inlined in any contexts as expected). At least now I know the way, thanks for that.

Btw, this feels like something that should be warned about in docs or maybe even with a lint, since currently there is no way to know that you're missing out on inlining opportunities without actually looking into the generated assembly.

@alexcrichton
Copy link
Member

Oh for sure yeah, improved docs about things like this are always appreciated!

@dbdr
Copy link

dbdr commented Apr 27, 2020

@RReverser. commenting on an old issue: I suspect the reason the intrinsic cannot be inlined is that it would become part of the code of the caller, and the caller is expected to work on CPUs without that instruction. Even if you test for the feature inside the caller and do not actually execute the instruction, this would still be invalid, since for instance the CPU can fetch and decode instructions before they are executed, including after conditional jumps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants