-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pclmulqdq intrinsics don't inline well across target_feature changes anymore #139029
Comments
Thank you for the report! It would be useful to bisect the regression to the specific PR using cargo-bisect-rustc to make it easier to figure out what happened and ping the relevant people. |
@rustbot label: -E-needs-bisection +S-has-bisection +A-target-feature +T-compiler +I-slow +O-x86_64 |
i also arrived at that change with (notes on how i got there)
ends up with...
|
I suspect that this is because of https://github.com/rust-lang/rust/pull/135408/files#diff-f5ccef4931d8d453a6d2cadcd1fda96f59b96dfdc73fd0543db257cb4a362020R851, but I'm not entirely sure why. @RalfJung do you have an idea here? |
That PR changed the ABI so SSE vectors are passed by-val rather than by-ptr. I would have expected that to make optimizations easier, not harder. But sadly I know very little about the later parts of the backend, so I am clueless here. Cc @nikic |
wild speculation: we were relying on some opt via mem2reg |
This probably runs afoul of the ABI type compatiblity check, see: https://github.com/llvm/llvm-project/blob/b6c0ce0bb67d822fac1e3b42461f66c261c1157c/llvm/lib/Target/X86/X86TargetTransformInfo.cpp#L6524-L6528 Basically, it's not always safe to inline a function that contains calls passing vectors if the callee enables additional target features, as this may change the way the vectors are passed. There is no issue in this particular case, because enabling the sse4.1 feature does not change the ABI for __m128i vectors. LLVM just currently doesn't implement a precise check for this. I think that for now it may be better to stick with passing vectors via memory, as we did before. |
The unfortunate part is that we only ever use vector arguments that we know are fully safe to be inlined everywhere since we are compiling for a target where the corresponding target feature is always available. So ideally we'd tell LLVM "don't worry about this argument for inlining purposes, it's fine". Basically, Rust now has it's own, IMO better fix for dealing with "ABI differences introduced by target features". The LLVM inlining check is entirely unnecessary when building Rust code. |
i'd noticed the
pclmulqdq
intrinsics incrc32fast
were notable in aperf report
of a benchmark last night. somewhat shockingly, there were functions whose body waspclmulqdq xmm0, xmm1, 17; ret
andpclmulqdq xmm0, xmm1, 0; ret
, complete with constraining callers' choice of xmm registers! after a bit of digging it seems to be a regression in nightly.the specific regression i'd started at can be reproduced with
cargo bench
in https://github.com/srijs/rust-crc32fast .cargo +1.85.1 bench
produceswhereas
cargo +nightly bench
producesafter looking at perf a bit i believe this is representative: https://rust.godbolt.org/z/8dxcE4vo1 . i'm including everything there in this issue as well.
Code
I tried this code:
I expected to see this happen (with
-C opt-level=3
):Instead, this happened (also
-C opt-level=3
):Version it worked on
1.85.1, 1.31.0, and a half dozen in between.
additionally, beta (rust version 1.86.0-beta.7 (7824ede 2025-03-22) seems good.
nightly with
-C opt-level=3 -C target-feature=+pclmul
still does great.Version with regression
in the above godbolt link, i see
--version
in therustc nightly
tab providesrustc 1.87.0-nightly (a2e63569f 2025-03-26)
. this is consistent with how i first saw this locally:rustc +nightly --version --verbose
:Related improvement along the way
adding the same
target_feature
block on the inner function sees nightly produce somewhat better-than-baseline code: https://rust.godbolt.org/z/sGrYedeaPwith
rustc +nightly -C opt-level 3
yields:whereas before the codegen was identical regardless of the
target_feature
attribute on the inner function. so at least in some cases there is a modest improvement?@rustbot modify labels: +regression-from-stable-to-nightly -regression-untriaged
The text was updated successfully, but these errors were encountered: