
println!() prevents optimization by capturing pointers #50519

Open
df5602 opened this issue May 7, 2018 · 14 comments
Labels
A-codegen  Area: Code generation
A-LLVM  Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-enhancement  Category: An issue proposing an enhancement or a PR with one.
C-optimization  Category: An issue highlighting optimization opportunities or PRs implementing such
I-slow  Issue: Problems and improvements with respect to performance of generated code.
T-compiler  Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@df5602 commented May 7, 2018

This weekend I ran some benchmarks on some of my code. After making a seemingly insignificant code change, I noticed a small but measurable performance regression. After investigating the generated assembly, I stumbled upon a case where the compiler emits suboptimal code.

This minimal example shows the same behaviour (Playground link):

extern crate rand;

use std::f32;
use rand::Rng;

fn main() {
    let mut list = [0.0; 16];
    let mut rg = rand::thread_rng();

    // Random initialization to prevent the compiler from optimizing the whole example away
    for i in 0..list.len() {
        list[i] = rg.gen_range(0.0, 0.1);
    }

    let mut lowest = f32::INFINITY;

    for i in 0..list.len() {
        lowest = if list[i] < lowest {    // <<<<<<<<<<<<<<<
            list[i]
        } else {
            lowest
        };
    }

    println!("{}", lowest);
}

When compiling with the --release flag, the compiler generates the following instructions for the marked block:

...
minss	%xmm0, %xmm1
movss	88(%rsp), %xmm0
minss	%xmm1, %xmm0
movss	92(%rsp), %xmm1
...

However, if I replace those lines with the following:

if list[i] < lowest {
    lowest = list[i];
}

the compiler emits a strange series of float compare and jump instructions:

.LBB5_38:
	movss	92(%rsp), %xmm1
	ucomiss	%xmm1, %xmm0
	ja	.LBB5_39
...
.LBB5_42:
	movss	100(%rsp), %xmm1
	ucomiss	%xmm1, %xmm0
	ja	.LBB5_43
...
.LBB5_39:
	movss	%xmm1, 12(%rsp)
	movaps	%xmm1, %xmm0
	movss	96(%rsp), %xmm1
	ucomiss	%xmm1, %xmm0
	jbe	.LBB5_42

For comparison, both gcc and clang can optimize a similar C++ example:

#include <stdlib.h>
#include <iostream>

using namespace std;

int main() {
    float list[16];
    for(size_t i = 0; i < 16; ++i) {
        list[i] = rand();
    }

    float lowest = 1000.0f;

    for (size_t i = 0; i < 16; ++i) {
        
        /* Variant A: */
        //lowest = list[i] < lowest ? list[i] : lowest;

        /* Variant B: */
        if (list[i] < lowest) {
            lowest = list[i];
        }
    }

    cout << lowest;
}

Both compilers generate minss instructions for both variants.
(Godbolt)

I wasn't sure whether rustc or LLVM was responsible for this behaviour; however, after a quick glance at the generated LLVM IR, I lean towards rustc, since in the first case it emits fcmp and select instructions, while in the latter it generates fcmp and br.

What do you think?

@sanxiyn sanxiyn added the I-slow Issue: Problems and improvements with respect to performance of generated code. label May 19, 2018
@XAMPPRocky XAMPPRocky added C-enhancement Category: An issue proposing an enhancement or a PR with one. A-codegen Area: Code generation T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 2, 2018
@nikic (Contributor) commented Dec 15, 2018

In this reduced example, minss has been generated for both cases since Rust 1.25: https://godbolt.org/z/wz8Kmk

Replacing the return with println brings the problem back.

@nikic nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Dec 15, 2018
@nikic (Contributor) commented Dec 15, 2018

Okay, the relevant difference that println! introduces is that it takes the address of lowest. If you use println!("{}", {lowest}) instead, the issue goes away.

Taking the address prevents the conversion of lowest from an alloca into an SSA value, and that's going to inhibit lots of optimizations (including the select formation desired here).
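
For illustration, a minimal sketch of the two forms at the source level (hypothetical code, not taken from the issue; the exact codegen depends on the compiler version):

fn print_min(list: &[f32; 16]) {
    let mut lowest = f32::INFINITY;
    for &x in list {
        lowest = if x < lowest { x } else { lowest };
    }

    // Direct form: the macro takes `&lowest`, so `lowest` needs a stack slot
    // (an alloca) for the whole function and the loop above is not reduced
    // to a chain of `minss` instructions.
    // println!("{}", lowest);

    // Work-around described above: `{lowest}` is a block expression that
    // copies the value into a temporary, so only the temporary's address is
    // taken and `lowest` itself can stay an SSA value.
    println!("{}", { lowest });
}

fn main() {
    print_min(&[0.25; 16]);
}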

The good news is that this is probably not going to affect real code much, though I am concerned about cases where you have conditional debugging code that includes formatting.

Two ways this could be fixed:

  • On the LLVM side: Perform the calculation on SSA values and only write the result back into the alloca slot when the address is taken. This might not always be profitable and probably isn't even possible with the information LLVM has (e.g. it does not know that we won't modify memory through the pointer).
  • On the Rust side: Avoid taking the reference, at least in cases where the value is small.

@nikic nikic changed the title Missed optimization: Compiler sometimes emits float compare + jump instead of MINSS/MAXSS println!() prevents optimization by capturing pointers Dec 23, 2018
@nikic (Contributor) commented Dec 23, 2018

@rkruppe @nagisa @eddyb Any idea what we can do here? I think it's pretty bad that println!() (or anything else using formatting, such as logging) breaks unrelated optimizations by capturing pointers. It's enough that any part of a structure is formatted to break optimization on all of it: LLVM doesn't know which part of the structure we're actually going to use, it only sees that some pointer into it is stored.
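
A hypothetical sketch of that point (the struct and function here are invented for illustration): formatting a single field still passes a pointer into the struct's memory.

#[derive(Default)]
pub struct Stats {
    pub hits: u64,
    pub misses: u64,
}

pub fn tally(flags: &[bool]) {
    let mut stats = Stats::default();
    for &f in flags {
        if f {
            stats.hits += 1;
        } else {
            stats.misses += 1;
        }
    }
    // Only `hits` is printed, but the macro takes `&stats.hits`, a pointer
    // into the struct's stack slot; LLVM just sees that a pointer into
    // `stats` escapes, so it conservatively keeps the whole struct in memory.
    println!("hits: {}", stats.hits);
}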

It would be great if we could force a copy of the formatted value before taking the pointer, but I'm not sure how to do that on a technical level. We'd only want to do this for specific types (integers and floats), but println! is expanded long before this type information is available.

@hanna-kruppe (Contributor) commented:

If changing how println etc. are expanded is on the table, maybe this is one more reason to consider a more performance-oriented expansion that generates direct inlineable calls, as outlined by @seanmonstar here. If for example the <f32 as Debug>::fmt calls are inlined (at least enough to not need to pass a reference), would that solve this issue?

@nagisa (Member) commented Dec 23, 2018

It would be great if we could force a copy of the formatted value before taking the pointer, but I'm not sure how to do that on a technical level. We'd only want to do this for specific types (integers and floats), but println! is expanded long before this type information is available.

The formatting machinery has been specifically crafted to minimize code size rather than maximize speed (desirable for panics), which will eventually come at some cost somewhere, and that is what we are seeing here. If we can find ways to improve the println! expansion to do better without sacrificing size, that would be great...

@elichai (Contributor) commented Aug 9, 2020

Any news here? I'm seeing similar behaviour where passing expressions to println! prevents optimizations, while assigning to a variable first somewhat helps.
https://godbolt.org/z/v73bG7
Did anyone experiment with inlining the _print functions? Does it really bloat real-world code? Does it even help with the optimization problems?

@tesuji (Contributor) commented Aug 9, 2020

I don't think inlining the _print function helps, as that function depends on print_to:

pub fn _print(args: fmt::Arguments<'_>) {
    print_to(args, &LOCAL_STDOUT, stdout, "stdout");
}

@elichai (Contributor) commented Aug 9, 2020

Sorry, I meant inlining all the way down the call graph.

@oxalica (Contributor) commented Apr 13, 2021

Ran into the same issue.

In the code below, println! prevents a hot variable from being kept in a register and also prevents SIMD vectorization.
The referenced variable count is kept on the stack and written back every time it is modified.
The let count = count; work-around makes this program about 7% faster (142 ms -> 132 ms).

Link: https://godbolt.org/z/4Yxbf1dox

const N: usize = 20_000_000;

pub fn sieve(vis: &mut [bool; N + 1]) {
    let mut i = 2;
    let mut count = 0;
    while i * i <= N {
        if vis[i] {
            count += 1; // This writes to stack every time.
            let mut k = 2 * i;
            while k <= N {
                vis[k] = true;
                k += i;
            }
        }
        i += 1;
    }

    // This loop can be vectorized (SIMD) when using the work-around below.
    while i <= N {
        if vis[i] {
            count += 1;  // This writes to stack every time.
        }
        i += 1;
    }

    // let count = count; // Uncomment this line to work around this issue.
    println!("{}", count);
}
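
A possible driver for the snippet above (hypothetical, not part of the original report): the array is roughly 20 MB, so it needs to live on the heap rather than on the stack.

use std::convert::TryInto;

fn main() {
    // Heap-allocate the ~20 MB flag array, then view the boxed slice as the
    // fixed-size array the function expects.
    let mut vis = vec![false; N + 1].into_boxed_slice();
    let vis: &mut [bool; N + 1] = (&mut *vis).try_into().expect("length matches");
    sieve(vis);
}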

@elichai (Contributor) commented Apr 13, 2021

(FWIW even doing println!("{}", (||count)()); is enough to allow the compiler to optimize correctly)

@nikic (Contributor) commented Apr 13, 2021

@oxalica Interesting case! I would have expected scalar promotion in LICM to sink the store outside the loop for that example. Unfortunately it ends up running afoul of this thread-safety check: https://github.com/llvm/llvm-project/blob/64c24f493e5f4637ee193f10f469cdd2695b4ba6/llvm/lib/Transforms/Scalar/LICM.cpp#L2196-L2198 Because you only store count conditionally, there is no guaranteed write on each iteration, and because println! captures it, the pointer may leak to a different thread.
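
Roughly, the transformation that scalar promotion would like to perform, written out by hand at the source level (a hypothetical sketch, not from the issue):

pub fn count_set(flags: &[bool]) {
    let mut count: u64 = 0;
    {
        // Promoted copy: lives in a register / SSA value, never address-taken.
        let mut tmp = count;
        for &f in flags {
            if f {
                tmp += 1; // conditional update with no store to memory
            }
        }
        // Single, unconditional write-back after the loop. This is exactly
        // the store that is only safe if no other thread can observe `count`
        // during the loop, hence the capture check discussed above.
        count = tmp;
    }
    // The address of `count` escapes to the formatting machinery only here.
    println!("{}", count);
}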

@nikic (Contributor) commented Apr 13, 2021

I think scalar store promotion candidates are sufficiently rare, and the resulting optimization sufficiently valuable, that doing a loop header reachability capture check is probably worth the compile-time cost in this case.

@nikic (Contributor) commented Apr 17, 2021

https://reviews.llvm.org/D100706 for what I mentioned above. This should help any cases that involve a println after a loop.

@nikic (Contributor) commented Apr 19, 2021

Upstream commit: llvm/llvm-project@d440f9a

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023