Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not using the byval attribute loses memcpy optimizations #103103

Closed
pcwalton opened this issue Oct 16, 2022 · 19 comments
Closed

Not using the byval attribute loses memcpy optimizations #103103

pcwalton opened this issue Oct 16, 2022 · 19 comments
Labels
A-codegen Area: Code generation C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@pcwalton
Copy link
Contributor

pcwalton commented Oct 16, 2022

Consider this code (Godbolt link):

pub struct S([i32; 16]);

#[inline(never)]
fn g(x: S) {
    println!("{}", x.0[3]);
}

#[inline(never)]
fn f(x: S) {
    g(x)
}

pub fn main() {
    f(S([0; 16]))
}

LLVM can't eliminate the memcpy between f and g, even though it should legally be able to do so. This is because memcpyopt can only forward to parameters marked byval. We don't use the byval attribute, and this seems to be by design. But we lose this optimization, which I've observed hurting codegen in several places even in hello world (core::fmt::Write::write_fmt, core::panicking::assert_failed(), <core::fmt::Arguments as core::fmt::Display>::fmt). I suspect losing this optimization hurts us all over the place.

There are two solutions I can see:

(1) Use byval for indirect arguments in the Rust ABI.

(2) Change LLVM to allow the optimization to happen for at least nocapture noalias readonly parameters. Since nocapture implies that the behavior of the callee doesn't depend on the address and noalias readonly implies that the memory is strongly immutable, this should work. We mark all indirect by-value arguments as nocapture noalias already.

I'm working on a patch for (2), but I was wondering why we can't just do (1).

@pcwalton pcwalton added the C-bug Category: This is a bug. label Oct 16, 2022
@pcwalton
Copy link
Contributor Author

Note that if we choose option (2) then there will need to be a little more work to tell LLVM that an indirectly-passed function argument was marked immutable by the programmer.

@pcwalton
Copy link
Contributor Author

pcwalton commented Oct 16, 2022

Actually, (2) is more complicated than I thought. There are actually two sub-options:

(2a) At the Rust ABI level, specify that callees are not allowed to write to indirectly-passed arguments. Currently, they can if they declare a by-val parameter mut (or it has UnsafeCell in it). With this change a callee would have to codegen a memcpy to a temporary if it writes to a by-val parameter.

(2b) Keep the ABI the same as it is today. In that case we would make LLVM responsible for adding readonly attributes itself. Generally this can only be done when building with LTO.

@pcwalton
Copy link
Contributor Author

pcwalton commented Oct 16, 2022

Here's the LLVM change. Coupled with the suggested change in Rust ABI this should let us eliminate quite a few memcpys. pcwalton/llvm-project@854b7c4

Also, I tried out (2b) and the results were disappointing: LLVM can't even tell that println is readonly in fat LTO. I think if we want to get rid of these memcpys (and we probably do, as they're a regression over C++) then we should modify the Rust ABI either to use byval or to forbid callees writing to indirectly-passed args.

@scottmcm
Copy link
Member

We don't use the byval attribute, and this seems to be by design.

I'm curious about why; any chance you have a link handy to a discussion about it?

@pcwalton
Copy link
Contributor Author

I'm curious about why; any chance you have a link handy to a discussion about it?

I don't, but a comment from @bjorn3 seemed to indicate to me that it was deliberate.

@bjorn3
Copy link
Member

bjorn3 commented Oct 16, 2022

byval means that the argument is passed at a specific stack location. This requires a copy to this stack location and is as such incompatible with unsized arguments. To solve the issue of these redundant copies I have long been advocating to use move arguments (which pass the orginal location in case of by-ref) instead of copy arguments (which copy the arguments to the stack and then pass the copy's location in case of by-ref) during MIR building where possible, but @RalfJung would like to instead remove move arguments entirely and handle unsized arguments some other way.

@pcwalton
Copy link
Contributor Author

@RalfJung Can we get some forward progress on what to do here? These copies are hurting codegen across the board. I don't think the status quo is going to be tenable long-term; the pressure for a solution will be too great.

If I understand correctly, my proposal would be for all arguments to continue to be by-copy, but the copy moves from the caller side to the callee side and as such can be optimized away if known to be immutable. (Callers can't know this because Rust function signatures don't specify mutability of by-value arguments, only definitions do.)

@bjorn3
Copy link
Member

bjorn3 commented Oct 16, 2022

Your proposal of using byval would still do the copy at call site if the argument couldn't be constructed in place at the right stack location. For example when there are multiple calls with different byval arguments.

Your proposal of making arguments immutable is incompatible with unsized arguments as those can be mutated in place and can't be copied at the callee side to make the copy at the callee side.

@pcwalton
Copy link
Contributor Author

So per the unsized arguments RFC unsized arguments are passed as &move. This implies that we're already using a different calling convention for unsized arguments. So my immutable arguments proposal could be a replacement for the cases in which we're using by-copy--the ABI for unsized arguments would remain the same as today.

@bjorn3
Copy link
Member

bjorn3 commented Oct 16, 2022

by-ref sized arguments are currently also passed &move. We just insert the extra copy at MIR building time as precaution due to unclear semantics of move arguments. Note that we need both to be abi compatible for trait object dispatch I think as is already stable through Box<dyn FnOnce()>.

@pcwalton
Copy link
Contributor Author

pcwalton commented Oct 16, 2022

What a mess. Maybe we could just mark by-ref arguments as readonly in the callee if they're declared immutable and freeze, and then rely on LTO to propagate that attribute to callers. Then memcpyopt can detect the readonly nocapture noalias combo and forward the memcpy.

@JakobDegen
Copy link
Contributor

The right solution here for the unsized case is to stop representing &move arguments in MIR with an Operand::Move. Operand::Move is also very clear about the fact that it makes a copy. This would require someone to go and actually come up with a design for a decent story for unsized locals in MIR though...

@pcwalton
Copy link
Contributor Author

pcwalton commented Oct 16, 2022

I still think it'd make sense to mark semantically-by-value but passed-by-ref-at-the-ABI-level immutable freeze arguments as readonly, since we know that, and LLVM can use it for LTO more generally than just memcpyopt.

@bjorn3
Copy link
Member

bjorn3 commented Oct 16, 2022

@JakobDegen We can easily use a separate type for call arguments to solve the problem of Operand::Move being specified as doing a copy, but that won't solve this perf issue. The problem is that we aren't using Operand::Move too often, but that we use it too little, thus causing many needless copies.

@pcwalton
Copy link
Contributor Author

I wonder what the best way to propagate deduced readonly around across crates would be. Either it would have be a per-parameter bit in the Rust metadata, or it'd have to be added to the ThinLTO summaries.

@pcwalton
Copy link
Contributor Author

pcwalton commented Oct 16, 2022

I verified that with my change to LLVM with the soundness fix suggested by @nikic on Zulip, it can do the optimization for the cases I'm looking at as long as Rust codegen can either inspect the MIR for direct callees for readonly safety or to pull that information from the crate metadata. I think we have a way forward for most cases.

It would still be better to not generate the copy at the MIR level, but at least this will fix the worst offenders in the generated code.

@RalfJung
Copy link
Member

RalfJung commented Oct 19, 2022

I'm sorry I am not familiar with LLVM byref. What new requirements do we need to impose on the MIR level to justify this attribute?

From the LangRef it sounds like using byref for Operand::Copy arguments is a no-brainer, and what is correct for Operand::Copy is certainly also correct for Operand::Move. When we use pointers to implement functions that, on the MIR level, perform a copy -- why would byval ever be wrong?

pcwalton added a commit to pcwalton/rust that referenced this issue Oct 21, 2022
…adonly` on

indirect immutable freeze by-value function parameters.

Right now, `rustc` only examines function signatures and the platform ABI when
determining the LLVM attributes to apply to parameters. This results in missed
optimizations, because there are some attributes that can be determined via
analysis of the MIR making up the function body. In particular, `readonly`
could be applied to most indirectly-passed by-value function arguments
(specifically, those that are freeze and are observed not to be mutated), but
it currently is not.

This patch introduces the machinery that allows `rustc` to determine those
attributes. It consists of a query, `deduced_param_attrs`, that, when
evaluated, analyzes the MIR of the function to determine supplementary
attributes. The results of this query for each function are written into the
crate metadata so that the deduced parameter attributes can be applied to
cross-crate functions. In this patch, we simply check the parameter for
mutations to determine whether the `readonly` attribute should be applied to
parameters that are indirect immutable freeze by-value.  More attributes could
conceivably be deduced in the future: `nocapture` and `noalias` come to mind.

Adding `readonly` to indirect function parameters where applicable enables some
potential optimizations in LLVM that are discussed in [issue 103103] and [PR
103070] around avoiding stack-to-stack memory copies that appear in functions
like `core::fmt::Write::write_fmt` and `core::panicking::assert_failed`. These
functions pass a large structure unchanged by value to a subfunction that also
doesn't mutate it. Since the structure in this case is passed as an indirect
parameter, it's a pointer from LLVM's perspective. As a result, the
intermediate copy of the structure that our codegen emits could be optimized
away by LLVM's MemCpyOptimizer if it knew that the pointer is `readonly
nocapture noalias` in both the caller and callee. We already pass `nocapture
noalias`, but we're missing `readonly`, as we can't determine whether a
by-value parameter is mutated by examining the signature in Rust. I didn't have
much success with having LLVM infer the `readonly` attribute, even with fat
LTO; it seems that deducing it at the MIR level is necessary.

No large benefits should be expected from this optimization *now*; LLVM needs
some changes (discussed in [PR 103070]) to more aggressively use the `noalias
nocapture readonly` combination in its alias analysis. I have some LLVM patches
for these optimizations and have had them looked over. With all the patches
applied locally, I enabled LLVM to remove all the `memcpy`s from the following
code:

```rust
fn main() {
    println!("Hello {}", 3);
}
```

which is a significant codegen improvement over the status quo. I expect that
if this optimization kicks in in multiple places even for such a simple
program, then it will apply to Rust code all over the place.

[issue 103103]: rust-lang#103103

[PR 103070]: rust-lang#103070
bors added a commit to rust-lang-ci/rust that referenced this issue Oct 22, 2022
Introduce deduced parameter attributes, and use them for deducing `readonly` on indirect immutable freeze by-value function parameters.

Introduce deduced parameter attributes, and use them for deducing `readonly` on
indirect immutable freeze by-value function parameters.

Right now, `rustc` only examines function signatures and the platform ABI when
determining the LLVM attributes to apply to parameters. This results in missed
optimizations, because there are some attributes that can be determined via
analysis of the MIR making up the function body. In particular, `readonly`
could be applied to most indirectly-passed by-value function arguments
(specifically, those that are freeze and are observed not to be mutated), but
it currently is not.

This patch introduces the machinery that allows `rustc` to determine those
attributes. It consists of a query, `deduced_param_attrs`, that, when
evaluated, analyzes the MIR of the function to determine supplementary
attributes. The results of this query for each function are written into the
crate metadata so that the deduced parameter attributes can be applied to
cross-crate functions. In this patch, we simply check the parameter for
mutations to determine whether the `readonly` attribute should be applied to
parameters that are indirect immutable freeze by-value.  More attributes could
conceivably be deduced in the future: `nocapture` and `noalias` come to mind.

Adding `readonly` to indirect function parameters where applicable enables some
potential optimizations in LLVM that are discussed in [issue 103103] and [PR
103070] around avoiding stack-to-stack memory copies that appear in functions
like `core::fmt::Write::write_fmt` and `core::panicking::assert_failed`. These
functions pass a large structure unchanged by value to a subfunction that also
doesn't mutate it. Since the structure in this case is passed as an indirect
parameter, it's a pointer from LLVM's perspective. As a result, the
intermediate copy of the structure that our codegen emits could be optimized
away by LLVM's MemCpyOptimizer if it knew that the pointer is `readonly
nocapture noalias` in both the caller and callee. We already pass `nocapture
noalias`, but we're missing `readonly`, as we can't determine whether a
by-value parameter is mutated by examining the signature in Rust. I didn't have
much success with having LLVM infer the `readonly` attribute, even with fat
LTO; it seems that deducing it at the MIR level is necessary.

No large benefits should be expected from this optimization *now*; LLVM needs
some changes (discussed in [PR 103070]) to more aggressively use the `noalias
nocapture readonly` combination in its alias analysis. I have some LLVM patches
for these optimizations and have had them looked over. With all the patches
applied locally, I enabled LLVM to remove all the `memcpy`s from the following
code:

```rust
fn main() {
    println!("Hello {}", 3);
}
```

which is a significant codegen improvement over the status quo. I expect that if this optimization kicks in in multiple places even for such a simple program, then it will apply to Rust code all over the place.

[issue 103103]: rust-lang#103103

[PR 103070]: rust-lang#103070
@pervognsen
Copy link

pervognsen commented Feb 2, 2023

I was pointed to this GH issue by @lqd; I don't think opening a separate issue is warranted since I believe the codegen quality problem I observed is this memcpy issue. So maybe this is useful as a test case for your work, @pcwalton: https://godbolt.org/z/bTn7bP574

With the "move-through" pattern it seems very common to do sparse updates with mut self like in with_size1, so it's surprising and not ideal that it produces worse code than with_size2. You can see that rather than writing the size once to the width/height fields, it does a full store/load/store and the last load/store pair comes from the memcpy intrinsic. (In this and similar examples inlining can usually save the day, but that could be said for most cross-function issues.)

It would be even better if this kind of move-through pattern could be lowered to the copy-free code that just does

mov [rdi], rsi
mov [rdi+8], rsi
ret

so that codegen would be no different from a &mut-based implementation. But I'm guessing there are good reasons that can't be done (and in any case is a separate issue).

@Enselic Enselic added A-codegen Area: Code generation T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed needs-triage-legacy labels Jul 30, 2024
@tmiasko
Copy link
Contributor

tmiasko commented Mar 11, 2025

Triage: The memcpy inside f is now optimized by either MIR opt (-C opt-level >= 0) or LLVM opt (-Z mir-opt-level=0 and -C opt-level >= 1).

@tmiasko tmiasko closed this as completed Mar 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

9 participants