SIMD intrinsics like _mm_cmpestri is not getting inlined #54353
Comments
Thanks for the report! This is a case where you need to manage target features in rust with the |
Hmm. I probably don't understand something about how that attribute works. Doesn't the intrinsic already have it? Why can't it be inlined if so?
It does indeed have it! Your definition of |
What I mean is, I don't understand why it is a requirement for the caller to also have it, and not just the callee. If I understand you correctly, currently this means that such a function (or any if is_x86_feature_detected!("sse4.2")) {
pcmp_find(...);
} context either? That seems like an unnecessary restriction, since the inlined instructions would be jumped over anyway in cases like this.
If function A calls function B, then B can only be inlined into A if the set of target features enabled for A and B is the same. Because the intrinsic enables a target feature, that means for inlining to work when you call the intrinsic you must also enable the target feature. And yes, using an |
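To make the rule above concrete, here is a minimal sketch of the pattern, not the code from this issue. The names `sse42`, `match_in_block` and `find_any` are hypothetical, and the byte-set search is only an assumed use of `_mm_cmpestri`. Every function on the path from the dispatch point down to the intrinsic carries the same `#[target_feature(enable = "sse4.2")]` attribute, so the compiler is allowed to inline the whole chain; only the outermost public function is left without it and does the run-time check.

```rust
#[cfg(target_arch = "x86_64")]
mod sse42 {
    use std::arch::x86_64::{
        __m128i, _mm_cmpestri, _mm_loadu_si128, _SIDD_CMP_EQUAL_ANY, _SIDD_UBYTE_OPS,
    };

    // "Equal any" mode on unsigned bytes: the result is the index of the first
    // byte of the second operand that occurs in the first operand (16 if none).
    const MODE: i32 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY;

    // Leaf helper (hypothetical). It has the same target feature as the
    // intrinsic, so the intrinsic can be inlined into it.
    #[inline]
    #[target_feature(enable = "sse4.2")]
    unsafe fn match_in_block(set: __m128i, set_len: i32, block: &[u8]) -> Option<usize> {
        // Copy into a zero-padded buffer so the 16-byte load stays in bounds.
        let mut buf = [0u8; 16];
        buf[..block.len()].copy_from_slice(block);
        let idx = unsafe {
            let b = _mm_loadu_si128(buf.as_ptr() as *const __m128i);
            _mm_cmpestri::<MODE>(set, set_len, b, block.len() as i32)
        };
        let idx = idx as usize;
        if idx < block.len() { Some(idx) } else { None }
    }

    // The caller needs the attribute too; without it, `match_in_block`
    // would stay an out-of-line call. `set` must be at most 16 bytes long.
    #[target_feature(enable = "sse4.2")]
    pub unsafe fn find_any(set: &[u8], haystack: &[u8]) -> Option<usize> {
        let mut set_buf = [0u8; 16];
        set_buf[..set.len()].copy_from_slice(set);
        let set_vec = unsafe { _mm_loadu_si128(set_buf.as_ptr() as *const __m128i) };
        for (i, block) in haystack.chunks(16).enumerate() {
            let hit = unsafe { match_in_block(set_vec, set.len() as i32, block) };
            if let Some(pos) = hit {
                return Some(i * 16 + pos);
            }
        }
        None
    }
}

/// Public entry point without the attribute: it checks for SSE4.2 at run time
/// and otherwise falls back to a scalar search.
pub fn find_any(set: &[u8], haystack: &[u8]) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if set.len() <= 16 && is_x86_feature_detected!("sse4.2") {
            // SAFETY: SSE4.2 support was just confirmed on this CPU.
            return unsafe { sse42::find_any(set, haystack) };
        }
    }
    haystack.iter().position(|b| set.contains(b))
}
```

Calling `find_any(b"xy", b"hello xyz")` goes through the SSE4.2 path where available and through the scalar fallback elsewhere. The call from the public function into `sse42::find_any` is the one place where the feature sets differ, so that boundary is where an actual function call should be expected to remain.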
Hmm, I'd still say that feels like a serious limitation and still unclear why it's needed (why not inline everything, even if target features mismatch). Looks like it currently requires "infecting" every caller up to the public API with |
Anyway, it helps, thanks. Feel free to close this issue if you don't consider such inlining as an optimisation opportunity for Rust.
This looks ugly due to infecting every private function in the pcmp chain, but apparently is required for inlining and does help performance: rust-lang/rust#54353 (comment)
Ok yeah, in that case I'm gonna close this because this is otherwise working as intended (albeit not very ergonomically). SIMD is pretty low-level after all!
It is, it was just kinda sad to see how much more work it required to use stable intrinsics than the asm version (which got inlined in any context, as expected). At least now I know the way, thanks for that. Btw, this feels like something that should be warned about in docs or maybe even with a lint, since currently there is no way to know that you're missing out on inlining opportunities without actually looking into the generated assembly.
Oh for sure yeah, improved docs about things like this are always appreciated!
@RReverser, commenting on an old issue: I suspect the reason the intrinsic cannot be inlined is that it would become part of the code of the caller, and the caller is expected to work on CPUs without that instruction. Even if you test for the feature inside the caller and do not actually execute the instruction, this would still be invalid, since for instance the CPU can fetch and decode instructions before they are executed, including after conditional jumps.
I recently tried to port some string searching code using PCMPESTRI from explicit asm! to the new stable SIMD intrinsics and noticed that it became noticeably slower (~30% on average, depending on the case).
After looking into the generated assembly to find a difference that would cause it, I noticed that the Rust intrinsic is not getting inlined and so prevents further optimisations. Here is minimal equivalent reproducible code in C and Rust:
Rust:
Generated assembly for C lowers to a few simple instructions, as expected:
Generated assembly for Rust code looks much more verbose and involves a function call:
I see that _mm_cmpestri in Rust already has the #[inline] attribute like other SIMD intrinsics, but perhaps they should have the stronger variant, #[inline(always)], since they are always supposed to lower to simple instructions?