Rust should use registers more aggressively #26494
Comments
For fat pointers this has been fixed in #26411
|
On |
@arielb1 - yes, the ability to take addresses prevents doing this for all functions. But you could treat this as an optimization: make the compiler check whether a function uses the address of a reference, and if not, change its calling convention. The compiler can use the worker/wrapper technique (as in GHC) to support call sites which aren't aware of this change in calling convention, by splitting the function into two:
// Non-inlined worker, uses fast in-register calling convention
pub fn skip_whitespace_worker(start: *const u8, end: *const u8) -> (*const u8, *const u8, u64) {
...
}
// Inlined wrapper, moves reference argument to registers and then calls the worker.
pub fn skip_whitespace(v: &mut std::str::Chars) -> u64 {
// Approximately:
let (start, end, len) = skip_whitespace_worker(v.start, v.end);
v.start = start;
v.end = end;
len
} |
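The worker/wrapper split described above can be tried by hand in today's Rust. The following is a minimal sketch, not something the compiler does automatically: `Cursor` is a hypothetical stand-in for the iterator in the comment, and the `#[inline(never)]` / `#[inline(always)]` attributes are only hints chosen here to mimic the non-inlined worker and the inlined wrapper.

```rust
/// Hypothetical cursor over a byte slice (an assumption for illustration;
/// it stands in for the iterator state discussed above).
pub struct Cursor<'a> {
    pub bytes: &'a [u8],
    pub pos: usize,
}

/// Worker: takes and returns plain values instead of a `&mut` reference,
/// so the caller does not have to hand over the address of its state.
#[inline(never)]
fn skip_whitespace_worker(bytes: &[u8], mut pos: usize) -> (usize, u64) {
    let mut skipped = 0u64;
    while pos < bytes.len() && bytes[pos].is_ascii_whitespace() {
        pos += 1;
        skipped += 1;
    }
    (pos, skipped)
}

/// Wrapper: keeps the original `&mut` signature, loads the fields,
/// calls the worker, and writes the updated position back.
#[inline(always)]
pub fn skip_whitespace(c: &mut Cursor<'_>) -> u64 {
    let (pos, skipped) = skip_whitespace_worker(c.bytes, c.pos);
    c.pos = pos;
    skipped
}
```

Calling `skip_whitespace(&mut cursor)` behaves like a plain `&mut` function from the caller's point of view. Whether the worker's arguments and results actually stay in registers depends on the target and compiler version, so the generated assembly is worth checking.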
Don't do this. You will break FastISel. Fix LLVM if it's not doing the optimizations you want. |
@pcwalton what exactly would break FastISel? |
Pretty sure that since @eddyb's work on niche-filling optimisation, Cc @rust-lang/wg-codegen |
@eddyb didn't |
Yes, this appears to be fixed. The following Rust code (https://rust.godbolt.org/z/LhQGOM):
// Passing small structs by value.
pub fn parameters_by_value(v: (u64, u64)) -> u64 {
v.0 + v.1
}
// Returning small structs by value.
pub fn return_by_value() -> (u64, u64) {
(3, 4)
}
generates
example::parameters_by_value:
lea rax, [rdi + rsi]
ret
example::return_by_value:
mov eax, 3
mov edx, 4
ret
and
define i64 @parameters_by_value(i64 %v.0, i64 %v.1) unnamed_addr #0 {
start:
%0 = add i64 %v.1, %v.0
ret i64 %0
}
define { i64, i64 } @return_by_value() unnamed_addr #0 {
start:
ret { i64, i64 } { i64 3, i64 4 }
} |
Ah no, I see that this issue also suggests:
We don't do that yet: https://rust.godbolt.org/z/qnK9Ta . FWIW, maybe it is worth splitting this issue into two. One part has been closed already AFAICT. The other part (passing |
Here's another slightly more complex example, where
pub struct Stats { x: u32, y: u32, z: u32, }
pub extern "C" fn sum_c(a: &Stats, b: &Stats) -> Stats {
return Stats {x: a.x + b.x, y: a.y + b.y, z: a.z + b.z };
}
pub fn sum_rust(a: &Stats, b: &Stats) -> Stats {
return Stats {x: a.x + b.x, y: a.y + b.y, z: a.z + b.z };
} |
A related thread: |
Pass arguments up to 2*usize by value In rust-lang#77434 (comment), `@eddyb` said: > I wonder if it makes sense to limit this to returns [...] Let's do a perf run and find out. It seems the `extern "C"` ABI will pass arguments up to 2*usize in registers: https://godbolt.org/z/n8E6zc. (modified from rust-lang#26494 (comment)) r? `@nagisa`
@matklad is this fixed? Linked code is still bigger on Rust 1.48. |
@oilaba my example is fixed, the return value is no longer via stack:
The difference is that rust abi version now uses a vectorized add, which seems like a win? |
Hi all. I've traced back the cause of 'P-high' issue #85265 to two pulls that relate to this issue, namely #76986 and #79547. Basically, these commits break auto-vectorization post Rust version 1.48 in a way that is not easily worked around. In short, if we take:
pub fn case_1(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
[
a[0] + b[0],
a[1] + b[1],
a[2] + b[2],
a[3] + b[3],
]
}
1.47 yields an efficient: https://rust.godbolt.org/z/v89zoajse
example::case_1:
mov rax, rdi
movups xmm0, xmmword ptr [rsi]
movups xmm1, xmmword ptr [rdx]
addps xmm1, xmm0
movups xmmword ptr [rdi], xmm1
ret
1.48 yields an awkward: https://rust.godbolt.org/z/hqT85doqf
example::case_1:
movss xmm0, dword ptr [rdi]
movss xmm1, dword ptr [rdi + 4]
addss xmm0, dword ptr [rsi]
addss xmm1, dword ptr [rsi + 4]
movsd xmm2, qword ptr [rdi + 8]
movsd xmm3, qword ptr [rsi + 8]
addps xmm3, xmm2
movd eax, xmm0
movd ecx, xmm1
movd esi, xmm3
shufps xmm3, xmm3, 229
movd edx, xmm3
shl rdx, 32
or rdx, rsi
shl rcx, 32
or rax, rcx
ret
1.50 yields an even more awkward cascade effect: https://rust.godbolt.org/z/vvEePzGEM
example::case_1:
movd xmm0, esi
shr rsi, 32
movd xmm1, edi
shr rdi, 32
movd xmm2, edi
punpckldq xmm1, xmm2
movd xmm2, esi
movd xmm3, edx
shr rdx, 32
movd xmm4, edx
punpckldq xmm3, xmm4
movd xmm4, ecx
shr rcx, 32
addps xmm3, xmm1
addss xmm4, xmm0
movd xmm0, ecx
addss xmm0, xmm2
movd edx, xmm4
movd eax, xmm0
shl rax, 32
or rdx, rax
movd ecx, xmm3
shufps xmm3, xmm3, 229
movd eax, xmm3
shl rax, 32
or rax, rcx
ret
There are other examples in #85265. My overly simplistic analysis, I'm no expert, is here. In summary, if I rebuild Rust with the above pulls neutered, auto-vectorization appears to return to normal, as per 1.47. |
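For anyone comparing codegen locally, a by-reference variant is one thing to put next to `case_1` in Compiler Explorer. This is a hypothetical comparison aid, not a workaround proposed in the thread: `case_1_by_ref` is a made-up name, and no claim is made here about which instructions any particular compiler version emits for it.

```rust
/// The by-value function from the comment above, unchanged.
pub fn case_1(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [
        a[0] + b[0],
        a[1] + b[1],
        a[2] + b[2],
        a[3] + b[3],
    ]
}

/// Hypothetical by-reference variant: inputs and output are passed as
/// pointers to memory, so the arrays are not packed into integer registers
/// at the call boundary. Inspect the assembly to see what is generated.
pub fn case_1_by_ref(a: &[f32; 4], b: &[f32; 4], out: &mut [f32; 4]) {
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
}
```

Both functions compute the same element-wise sum, so they can be swapped in a benchmark or pasted side by side under `-C opt-level=3` to compare the output across compiler versions.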
Rust should pass more structs in registers. Consider these examples: ideally, both functions would execute entirely in registers and wouldn't touch memory:
Rust, as of a recent 1.2.0-dev nightly, is unable to pass either of these in registers (see LLVM IR and ASM below). It would be pretty safe to pass and return small structs (ones that fit into <=2 registers) in registers, and is likely to improve performance on average. This is what the System V ABI does.
It would also be nice to exploit Rust's control over aliasing, and where possible also promote reference arguments to registers, i.e. put the `u64` values in registers for the following functions:
In the `&mut` case, this would mean passing two `u64` values in registers as function parameters, and returning two `u64` values in registers as the return values (ideally we'd arrange for the parameter registers to match the return registers). Uniqueness of `&mut` makes this optimization valid, although we may have to give up on this optimization in cases such as when there are raw pointers present.
Here's a more realistic example where I've wanted Rust to do this:
This function is too large to justify inlining. I'd like the begin and end pointers of `iter` to be kept in registers across the function call.
Probably not surprising to the compiler team, but for completeness here is the LLVM IR of the above snippets, as of today's Rust (1.2.0-dev), compiled in release mode / opt-level=3:
and here is the ASM: