Result that uses niches resulting in a final size of 16 bytes emits poor LLVM IR due to ABI #97540
Comments
To be clear, it's debatable if this is a regression (if it is, it's just a performance issue), since we make no guarantee about the ABI of That said, this likely applies to all 128 bit |
This is interesting, because tuples or structs seem to have some exceptions for two usize-sized elements specifically. The |
A performance regression is a regression. And I think this is a very bad one. If I replace the I don't think this is related to niche though; |
Ah, |
Okay, I located the reason. Before #94570, something this small was passed in registers, so it optimized to the same code as a scalar pair would, even though it is not one: it is just a small aggregate passed directly. #94570 forces it to be passed indirectly, causing the perf regression. Given that the motivation of #94570 is to work around LLVM's inability to autovectorize things passed in registers, I think a fix would be to use some heuristics to determine whether a 2-usize type can be a target of autovectorization: if so, pass indirectly; otherwise, pass by value. |
Thanks for the pointer. That PR looks like a better approach to me. I think the compile-time perf regression in that PR results from excessive |
Assigning priority as discussed in the Zulip thread of the Prioritization Working Group. @rustbot label -I-prioritize +P-high +T-compiler |
This issue only dubiously shows a codegen regression (it is Not Great, to be clear, but it's not clear from this example and my knowledge of how common x86 microarchitectures work that it's Actively Awful). If it is, it does not show it is a performance impact in practical cases, as it is still very few instructions and then a simple |
An alternative would be experimenting more with this in context of other code we expect it to live in, because LLVM eagerly inlining can make problems go away, as cited here: #91447 (comment) Small functions like this, in particular, will tend to be inlined by any backend, precisely because Rust functions are not made external by default. Thus they do not have to respect the System V ABI. Or any ABI, for that matter, and they can make things up as they go along. Constant folding becomes quite practical to do. So, many functions offer seemingly terrible codegen in isolation that doesn't actually come up as a problem in actual practice. This is much more of an issue when we're mostly touching how they do parameter passing, which is precisely what inlining can make disappear, which is why I bring it up. |
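To make the inlining point concrete, here is a minimal sketch. The names `parse_len`, `parse_len_noinline`, and `sum_ok` are hypothetical and do not come from this thread. The first helper can be inlined into its caller, so its return convention never materializes; the second is forced to stay a real call, so each iteration goes through whatever return convention rustc picks for the 16-byte `Result`.

```rust
use std::hint::black_box;

// Hypothetical helper returning a 16-byte Result. It is small enough that
// the optimizer will normally inline it, so no return ABI is involved.
fn parse_len(x: usize) -> Result<usize, Box<u8>> {
    if x % 7 == 0 { Err(Box::new(7)) } else { Ok(x) }
}

// Same body, but inlining is forbidden, so every call crosses a real
// function boundary and uses the Rust ABI's return convention.
#[inline(never)]
fn parse_len_noinline(x: usize) -> Result<usize, Box<u8>> {
    if x % 7 == 0 { Err(Box::new(7)) } else { Ok(x) }
}

// Generic over the callee so that the inlinable version can actually be
// inlined (a plain `fn` pointer would hide the callee from the optimizer).
fn sum_ok(f: impl Fn(usize) -> Result<usize, Box<u8>>, n: usize) -> usize {
    (0..n).filter_map(|i| f(black_box(i)).ok()).sum()
}

fn main() {
    let inlined = sum_ok(parse_len, 1_000);
    let called = sum_ok(parse_len_noinline, 1_000);
    assert_eq!(inlined, called);
    println!("inlined = {inlined}, called = {called}");
}
```

Both loops compute the same value; any difference would only show up in the generated code or in timings, which is why measuring in the surrounding context matters more than reading one function's assembly in isolation.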
BTW, in our project, we also noticed performance degradation because of this regression (small 16B |
I would sincerely appreciate it if this was quantified in some manner or if you could give a more complex example. It's not that I don't think you are correct, it's just that unfortunately the quickest fix is to simply revert the changes that regressed this outright. But reverting the changes will simply regress the performance in other areas, and the losses there were quite significant. That's not acceptable, but this can still be fixed. We need to find something else instead as a solution, and more benchmarks or even a nuanced array of codegen examples means we are less likely to perform an overly sophomoric "fix" that could make things even worse for real code at the expense of improving academic cases. It will be fixed, but not with the quickest fix, except in the sense of making things faster. Honestly, the question I have is why it isn't e.g. returned in a single xmm register instead. It doesn't make sense to me that LLVM would drop a value that can clearly already fit in a register on the stack, ever. |
It's rustc, not LLVM, that makes the return value indirect. |
The ABI is decided by rustc here, not LLVM. If rustc tells LLVM to use an sret argument, then it must do so. Returning an i64 pair in an XMM register is also a bad idea because it requires GPR to XMM moves, even if we ignore issues related to the target-feature dependence of XMM registers. |
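The difference between the two return conventions can be modeled by hand in source code. This is only a sketch with hypothetical names (`Pair`, `make_by_value`, `make_via_out_ptr`, `read_via_out_ptr`); it imitates the shape of an sret-style return rather than showing what rustc emits.

```rust
use std::mem::MaybeUninit;

// Hypothetical 16-byte payload standing in for the types discussed above.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Pair {
    tag: u64,
    data: u64,
}

// Models a register-style return: the caller receives the two words directly.
#[inline(never)]
fn make_by_value(x: u64) -> (u64, u64) {
    (x & 1, x >> 1)
}

// Models an sret-style return: the caller reserves space and passes its
// address, and the callee stores the result through that pointer.
#[inline(never)]
fn make_via_out_ptr(out: &mut MaybeUninit<Pair>, x: u64) {
    out.write(Pair { tag: x & 1, data: x >> 1 });
}

// Caller side of the out-pointer convention.
fn read_via_out_ptr(x: u64) -> Pair {
    let mut slot = MaybeUninit::<Pair>::uninit();
    make_via_out_ptr(&mut slot, x);
    // SAFETY: `make_via_out_ptr` always initializes the slot.
    unsafe { slot.assume_init() }
}

fn main() {
    let (tag, data) = make_by_value(41);
    assert_eq!(Pair { tag, data }, read_via_out_ptr(41));
    println!("tag = {tag}, data = {data}");
}
```

The out-pointer version needs a store in the callee and a load in the caller, which is the extra memory traffic this issue is about.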
@workingjubilee Yes, we need to quantify this, but it's not that trivial. Our project is fairly big and the benchmark directly or indirectly touches 200K+ LOC of our code and likely millions of LOC from other crates, so it's not easy to quantify. I'd estimate the average performance loss at 5% across the board from 1.60 -> 1.62. We do use quite a few functions returning |
Yes, one of the smaller loops that is like what you use, if it can show an even somewhat similar regression (may need generous application of
Hm. Honestly, I guess that after the amount of noodling around I have done with LLVMIR and the way the LangRef phrases |
@workingjubilee I've tried to implement something with What we also see, and what is probably an at least equally big problem, is the handling of larger
This wastes cycles like there is no tomorrow - instead of materializing the callee future on the stack, the async functions should materialize the callee future directly in-place in the caller's future. That would save the copying (and, especially, copying of still uninitialized space!). I'm wondering if that has something to do with the I'm definitely no expert in compiler building, but if I understand it correctly, the LLVM |
Note that this also happens for Anyway, I do think it's kinda bad. Most Rust code is very result heavy. I think quantifying this will be very hard though -- For example, Rust code often looks like: https://godbolt.org/z/ETKfod3r5. The Notably, For codegen issues, something that models the CPU resources like llvm-mca may be the best bet, but it tends to be terrible at things with calls (so, as always, real measurements > artificial benchmarks). For this case, because they both have basically equivalent calls it might be fine, and MCA does seem to suggest that the
|
Hmm. It kinda sucks that we don't use higher-level information from MIR or thereabouts to annotate types with "wants vectorization" and "doesn't want vectorization". MIR is about where we know which cases of |
Represent `Result<usize, Box<T>>` as ScalarPair(i64, ptr) This allows types like `Result<usize, std::io::Error>` (and integers of differing sign, e.g. `Result<u64, i64>`) to be passed in a pair of registers instead of through memory, like `Result<u64, u64>` or `Result<Box<T>, Box<U>>` are today. Fixes rust-lang#97540. r? `@ghost`
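As a quick sanity check of which types are in scope here, the following sketch prints the sizes of the types named in that description. It assumes a 64-bit target, and `sizes` is a hypothetical helper; type sizes are not a language guarantee, so treat the numbers as an observation about current toolchains.

```rust
use std::mem::size_of;

// Sizes of the types mentioned above. On a 64-bit target each is expected
// to occupy two machine words, which is what makes a register pair possible.
fn sizes() -> [(&'static str, usize); 4] {
    [
        ("Result<u64, u64>", size_of::<Result<u64, u64>>()),
        ("Result<u64, i64>", size_of::<Result<u64, i64>>()),
        (
            "Result<Box<u8>, Box<u16>>",
            size_of::<Result<Box<u8>, Box<u16>>>(),
        ),
        (
            "Result<usize, std::io::Error>",
            size_of::<Result<usize, std::io::Error>>(),
        ),
    ]
}

fn main() {
    for (name, size) in sizes() {
        println!("{name}: {size} bytes");
    }
}
```

All four have the same size, so the difference in how they are passed comes from the ABI classification rustc assigns to the layout, not from the byte count.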
A note about #121668, which confused me a couple of times: the integer part holds the discriminant, and the ptr holds the data in both cases. |
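That description can be modeled by hand. The sketch below is an illustration only, not rustc's implementation: `RawPair`, `encode`, and `decode` are hypothetical names, and the tag values 0 and 1 are assumptions chosen for the example.

```rust
// Hand-written model of ScalarPair(i64, ptr) for Result<usize, Box<u8>>.
// This is NOT what rustc generates; it only mirrors the idea that the
// integer carries the discriminant and the pointer carries either payload.
struct RawPair {
    tag: u64,      // assumed encoding: 0 = Ok, 1 = Err
    data: *mut u8, // Ok: the usize stored as an address; Err: the Box pointer
}

fn encode(r: Result<usize, Box<u8>>) -> RawPair {
    match r {
        Ok(n) => RawPair { tag: 0, data: n as *mut u8 },
        Err(b) => RawPair { tag: 1, data: Box::into_raw(b) },
    }
}

/// # Safety
/// `p` must come from `encode` and must not have been decoded before,
/// otherwise the Box would be reconstructed from an invalid pointer.
unsafe fn decode(p: RawPair) -> Result<usize, Box<u8>> {
    match p.tag {
        0 => Ok(p.data as usize),
        // SAFETY: guaranteed by the caller, see the function contract.
        _ => Err(unsafe { Box::from_raw(p.data) }),
    }
}

fn main() {
    let ok = unsafe { decode(encode(Ok(42))) };
    let err = unsafe { decode(encode(Err(Box::new(7)))) };
    println!("{ok:?} {err:?}");
}
```

Because both payloads fit in the pointer-sized slot, the whole value travels as two scalars, which is what lets it be returned in a register pair.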
I'm looking at #121668 and trying to figure out why something like I used
Layout (ABI is ScalarPair)
Layout {
size: Size(16 bytes),
align: AbiAndPrefAlign {
abi: Align(8 bytes),
pref: Align(8 bytes),
},
abi: ScalarPair(
Initialized {
value: Pointer(
AddressSpace(
0,
),
),
valid_range: (..=0) | (1..),
},
Union {
value: Int(
I64,
false,
),
},
),
fields: Arbitrary {
offsets: [
Size(0 bytes),
],
memory_index: [
0,
],
},
largest_niche: None,
variants: Multiple {
tag: Initialized {
value: Pointer(
AddressSpace(
0,
),
),
valid_range: (..=0) | (1..),
},
tag_encoding: Niche {
untagged_variant: 0,
niche_variants: 1..=1,
niche_start: 0,
},
tag_field: 0,
variants: [
Layout {
size: Size(16 bytes),
align: AbiAndPrefAlign {
abi: Align(8 bytes),
pref: Align(8 bytes),
},
abi: ScalarPair(
Initialized {
value: Pointer(
AddressSpace(
0,
),
),
valid_range: 1..=18446744073709551615,
},
Initialized {
value: Int(
I64,
false,
),
valid_range: 0..=18446744073709551615,
},
),
fields: Arbitrary {
offsets: [
Size(0 bytes),
],
memory_index: [
0,
],
},
largest_niche: Some(
Niche {
offset: Size(0 bytes),
value: Pointer(
AddressSpace(
0,
),
),
valid_range: 1..=18446744073709551615,
},
),
variants: Single {
index: 0,
},
max_repr_align: None,
unadjusted_abi_align: Align(8 bytes),
},
Layout {
size: Size(0 bytes),
align: AbiAndPrefAlign {
abi: Align(1 bytes),
pref: Align(8 bytes),
},
abi: Aggregate {
sized: true,
},
fields: Arbitrary {
offsets: [
Size(0 bytes),
],
memory_index: [
0,
],
},
largest_niche: None,
variants: Single {
index: 1,
},
max_repr_align: None,
unadjusted_abi_align: Align(1 bytes),
},
],
},
max_repr_align: None,
unadjusted_abi_align: Align(8 bytes),
}
However, for
Layout (ABI is Aggregate)
Layout {
size: Size(16 bytes),
align: AbiAndPrefAlign {
abi: Align(8 bytes),
pref: Align(8 bytes),
},
abi: Aggregate {
sized: true,
},
fields: Arbitrary {
offsets: [
Size(0 bytes),
],
memory_index: [
0,
],
},
largest_niche: None,
variants: Multiple {
tag: Initialized {
value: Pointer(
AddressSpace(
0,
),
),
valid_range: (..=0) | (1..),
},
tag_encoding: Niche {
untagged_variant: 0,
niche_variants: 1..=1,
niche_start: 0,
},
tag_field: 0,
variants: [
Layout {
size: Size(16 bytes),
align: AbiAndPrefAlign {
abi: Align(8 bytes),
pref: Align(8 bytes),
},
abi: ScalarPair(
Initialized {
value: Pointer(
AddressSpace(
0,
),
),
valid_range: 1..=18446744073709551615,
},
Initialized {
value: Int(
I64,
false,
),
valid_range: 0..=18446744073709551615,
},
),
fields: Arbitrary {
offsets: [
Size(0 bytes),
],
memory_index: [
0,
],
},
largest_niche: Some(
Niche {
offset: Size(0 bytes),
value: Pointer(
AddressSpace(
0,
),
),
valid_range: 1..=18446744073709551615,
},
),
variants: Single {
index: 0,
},
max_repr_align: None,
unadjusted_abi_align: Align(8 bytes),
},
Layout {
size: Size(16 bytes),
align: AbiAndPrefAlign {
abi: Align(8 bytes),
pref: Align(8 bytes),
},
abi: Aggregate {
sized: true,
},
fields: Arbitrary {
offsets: [
Size(8 bytes),
],
memory_index: [
0,
],
},
largest_niche: None,
variants: Single {
index: 1,
},
max_repr_align: None,
unadjusted_abi_align: Align(8 bytes),
},
],
},
max_repr_align: None,
unadjusted_abi_align: Align(8 bytes),
}
The only main difference between these types (other than being assigned a different ABI) is the
Does anyone know if it would be possible for rust/compiler/rustc_abi/src/layout.rs Line 790 in 6a6cd65
but I'm not sure exactly how to make the change. EDIT: It seems like that
|
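The concrete types behind the two dumps are not visible in this copy of the thread, so the following sketch uses two hypothetical stand-ins that have the same shape: one whose second variant is empty (as in the first dump) and one whose second variant carries an integer (as in the second dump, where variant 1 has a field at offset 8). It assumes a 64-bit target and a toolchain recent enough to niche-fill enums with several data-carrying variants.

```rust
use std::mem::size_of;

// Hypothetical type shaped like the first dump: variant 0 holds a non-null
// pointer plus an integer, and the other variant is empty.
type EmptyOther = Option<(Box<u8>, u64)>;

// Hypothetical type shaped like the second dump: the other variant carries
// a u64 of its own, which must live next to the pointer niche.
#[allow(dead_code)]
enum PayloadOther {
    A(Box<u8>, u64),
    B(u64),
}

fn layout_sizes() -> (usize, usize) {
    (size_of::<EmptyOther>(), size_of::<PayloadOther>())
}

fn main() {
    let (empty, payload) = layout_sizes();
    println!("EmptyOther: {empty} bytes, PayloadOther: {payload} bytes");
}
```

If both report the same size, as the dumps above suggest they should, the remaining difference between them is only the ABI classification, which is the point of the question.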
The following code:
godbolt LLVM IR
godbolt ASM
Both functions should be identical, returning the 128-bit value in RAX and RDX per the sysv-64 ABI, which is the default for the environment godbolt uses.
Instead, when returning the value without specifying any ABI, the compiler chooses to return the value using an out pointer argument.
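For readers without access to the godbolt links, here is a minimal sketch of the kind of comparison described. It is not the original snippet from this report: the function names `rust_abi` and `c_abi` and their bodies are assumptions made for illustration.

```rust
use std::io;

// Default (Rust) ABI: rustc is free to choose how this 16-byte value is
// returned, and that choice is what this issue is about.
#[inline(never)]
pub fn rust_abi(fail: bool) -> Result<usize, io::Error> {
    if fail {
        Err(io::Error::from(io::ErrorKind::Other))
    } else {
        Ok(42)
    }
}

// Same body with the C ABI, where the platform's rules decide how a
// 16-byte value is returned. The lint is silenced because `Result` is
// not an FFI-safe type; this function exists only for comparison.
#[allow(improper_ctypes_definitions)]
#[inline(never)]
pub extern "C" fn c_abi(fail: bool) -> Result<usize, io::Error> {
    if fail {
        Err(io::Error::from(io::ErrorKind::Other))
    } else {
        Ok(42)
    }
}

fn main() {
    println!("{:?} {:?}", rust_abi(false), c_abi(false));
}
```

Compiling this with `rustc -O --emit=llvm-ir` and comparing the two function signatures in the output should show whether the default-ABI version takes an extra pointer argument for the return value on a given toolchain.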
Note: This affects `Result<usize, std::io::Error>`, which is a particularly useful type and affects a significant portion of `std::io`, such as `Read` and `Write`.
Meta
This has improved and regressed since 1.43.0, the first version I could find which would use 128 bits for the type.
All versions tested used the expected code generation for the `extern "C"` function.
All tests were done using https://godbolt.org with no target overrides, which in practice is an x86_64 Linux machine using the sysv-64 ABI.
..= 1.42.0: niche not used, cannot be compared
1.43.0 ..= 1.47.0: codegen like today, uses out ptr for return
1.48.0 ..= 1.60.0: returns using an LLVM `i128`, which produces the most desirable code on sysv-64
1.61.0 .. (current 2022-05-28 nightly): poor codegen that uses an out ptr for return
@rustbot label +regression-from-stable-to-stable +A-codegen