Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rustc fails to inline trivial functions #37538

Closed
ruuda opened this issue Nov 2, 2016 · 12 comments
Closed

Rustc fails to inline trivial functions #37538

ruuda opened this issue Nov 2, 2016 · 12 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@ruuda
Copy link
Contributor

ruuda commented Nov 2, 2016

I read in multiple places that rustc generating worse code than a C++ compiler would do for an equivalent C++ program is a bug. So here we go:

Summary

Rustc fails to inline trivial functions that compile down to just a few instructions, to the point where calling convention overhead is much worse than the actual function itself.

Steps to reproduce

I tried to write a minimal example at play.rust-lang.org, but everything short that I can come up with does not suffer from this issue. So instead I am going to link the project that caused me to discover this issue:

git clone https://github.com/ruuda/claxon
git checkout 2b18a49
cargo build --release --example decode
objdump -Cd target/release/examples/decode | less
# Now search for rice_to_signed or shift_left.

Actual and expected behavior

I’ll outline some of the disassembly below:

# Code for `if shift >= 8 { 0 } else { x << shift }`.
000000000000d3a0 <claxon::input::shift_left::h07bb472717a335da>:
    d3a0:       89 f1                   mov    %esi,%ecx
    d3a2:       80 e1 07                and    $0x7,%cl
    d3a5:       40 d2 e7                shl    %cl,%dil
    d3a8:       48 83 fe 07             cmp    $0x7,%rsi
    d3ac:       76 02                   jbe    d3b0 <claxon::input::shift_left::h07bb472717a335da+0x10>
    d3ae:       31 ff                   xor    %edi,%edi
    d3b0:       89 f8                   mov    %edi,%eax
    d3b2:       c3                      retq   
    d3b3:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    d3ba:       00 00 00 
    d3bd:       0f 1f 00                nopl   (%rax)

# Code for `if x & 1 == 1 { - 1 - x / 2 } else { x / 2 }`.
# It did optimize the branch into two shifts and an xor though!
000000000000d7b0 <claxon::subframe::rice_to_signed::haed2067302c41014>:
    d7b0:       48 89 f8                mov    %rdi,%rax
    d7b3:       48 c1 e8 3f             shr    $0x3f,%rax
    d7b7:       48 01 f8                add    %rdi,%rax
    d7ba:       48 d1 f8                sar    %rax
    d7bd:       48 c1 e7 3f             shl    $0x3f,%rdi
    d7c1:       48 c1 ff 3f             sar    $0x3f,%rdi
    d7c5:       48 31 c7                xor    %rax,%rdi
    d7c8:       48 89 f8                mov    %rdi,%rax
    d7cb:       c3                      retq   
    d7cc:       0f 1f 40 00             nopl   0x0(%rax)

Note that this is not dead code, there are calls to these functions in very hot loops:

79e8:  e8 c3 5d 00 00     callq  d7b0 <claxon::subframe::rice_to_signed::haed2067302c41014>

66a8:  e8 f3 6c 00 00     callq  d3a0 <claxon::input::shift_left::h07bb472717a335da>

I would expect that functions like these would be inlined automatically, but they were not. Note that all of this code is in the same crate.

I encountered about a dozen of these during profiling, where very small functions like the ones above were showing up as hotspots. I’ve been able to speed up my program by as much as 30% just by placing a few #[inline(always)] attributes.

There are also simple getters like Block::len which are not inlined, but these are called from the example program which is a different crate, so that is working as intended I think.

Meta

rustc 1.14.0-nightly (3210fd5c2 2016-10-05)
binary: rustc
commit-hash: 3210fd5c20ffc6da420eb00e60bdc8704577fd3b
commit-date: 2016-10-05
host: x86_64-unknown-linux-gnu
release: 1.14.0-nightly
@bluss
Copy link
Member

bluss commented Nov 2, 2016

This function can not be inlined cross crate because it is not emitted in the metadata section. You should mark it #[inline] to make it /available/ for cross crate inlining if the compiler wants to. You don't notice this a lot since all generic items are emitted into metadata anyway, so they can be inlined without being explicitly marked.

Using lto also works around this issue.

/// Left shift that does not panic when shifting by the integer width.
fn shift_left(x: u8, shift: usize) -> u8 {
    debug_assert!(shift <= 8);

    // Rust panics when shifting by the integer width, so we have to treat
    // that case separately.
    if shift >= 8 { 0 } else { x << shift }
}

@ruuda
Copy link
Contributor Author

ruuda commented Nov 2, 2016

This is not about cross-crate inlining. The code that calls the function and the function itself are in the same crate.

@bluss
Copy link
Member

bluss commented Nov 2, 2016

Rustc turns it into being about cross crate inlining, and I agree that it is a bug, it should be fixed. The unit of compilation is a different crate than where shift_left is defined.

@ruuda
Copy link
Contributor Author

ruuda commented Nov 2, 2016

Ah, I think I understand.

The code that calls shift_left is generic, which automatically makes it eligible for inlining. That code is then inlined into the example program. However, because shift_left is not generic nor marked #[inline], it cannot be inlined into the calling code any more, because the calling code is now in the example program, which is a different crate.

Is that correct?

@sfackler
Copy link
Member

sfackler commented Nov 2, 2016

Yep, that's right.

@bluss bluss added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Nov 2, 2016
@gnzlbg
Copy link
Contributor

gnzlbg commented Jan 30, 2017

So isn't it the problem that shift_left is not marked inline?

I just ran into this issue because I have a crate whose public API exposes mostly 1 liner functions, and if I don't mark any of them inline, my users don't get them inlined in their code.

I would like an option to automatically make all public items of my crate available for inlining in other crates. Would that be a solution for that?

@ruuda
Copy link
Contributor Author

ruuda commented Jan 30, 2017

@gnzlbg While marking shift_left as #[inline] does resolve the symptom, the problem is deeper. shift_left is not part of the public API, it is an internal function. Generic code is made eligible for cross-crate inlining automatically, but if that code calls the internal non-generic function, that will turn into a function call when the caller is inlined into a different crate. When the caller is in the same crate, the function does get inlined. This can make a big difference in performance, which is counter-intuitive because the internal function is not accessible from other crates directly.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jan 30, 2017

I see. So probably generic code should be #[inline] by default, and anything it calls should be transitively inline as well?

@ruuda
Copy link
Contributor Author

ruuda commented Jan 30, 2017

If I understand correctly (but I am not very familiar with compiler internals), generic code is eligible for inlining, because other crates need to be able to monomorphise the code, and both inlining and monomorphisation rely on the same data. A possible solution for this issue would indeed be to transitively mark things called by public generic functions as inline. That could have unintended side effects though, such as dramatically increasing the size of rlibs.

@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 26, 2017
@steveklabnik
Copy link
Member

Triage: no changes

@workingjubilee
Copy link
Member

@saethlin Has this been addressed?

@saethlin
Copy link
Member

saethlin commented May 5, 2024

I read in multiple places that rustc generating worse code than a C++ compiler would do for an equivalent C++ program is a bug.

To be clear, I agree with this statement. I don't think it's relevant though; given that there is no reference C++ code. I suspect that a C++ compiler would have also failed to inline the equivalent function, given that Rust and C++ have basically the same compilation model. Putting these little functions in a header or making them inline would be equivalent to adding the Rust #[inline] attribute, which is not present. Instead of comparing to C++, I'd say that unless you are very well-versed in optimizer internals, it is better to make sure we have an issue tracking any imperfect optimization.

The specific failed inlinings reported here were indeed addressed by #116505; these functions are inlined away by default but if you set RUSTFLAGS=-Zcross-crate-inline-threshold=0 they reappear. Since the specific example here has been addressed, I'm going to close this issue. If people have further examples of such failed inlining, I'd love to have a new issue with examples to track that. I'm sure such examples exist, but I would prefer where possible to work from real-world examples instead of what I can cook up based on knowing how the compiler works.

I just ran into this issue because I have a crate whose public API exposes mostly 1 liner functions, and if I don't mark any of them inline, my users don't get them inlined in their code.

@gnzlbg Fixing this is the objective of the PR I linked above. Functions are made cross-crate-inlinable based on heuristics, and one of them is that the optimized MIR must not contain any calls. We do have a MIR inliner so some functions with calls get made inlinable anyway, but sometimes the MIR inliner falls over.

I would like an option to automatically make all public items of my crate available for inlining in other crates. Would that be a solution for that?

FWIW we have -Zcross-crate-inline-threshold=always if you want to try out a flag that emulates this effect (by making every function in a crate cross-crate-inlinable). I don't know how to make this only apply to one crate in a Cargo build.

@saethlin saethlin closed this as completed May 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

8 participants