
Noticeable performance regression since the last LLVM update. #8665

Closed
sebcrozet opened this issue Aug 21, 2013 · 6 comments · Fixed by #8700

Comments

@sebcrozet
Contributor

I've seen a huge performance regression in all my projects since #8328 landed (revision a8c3fe4). Things are 50 to 75% slower. I'm pretty sure #8328 is the cause, since when I revert the compiler to the version right before it (revision 67c954e), performance goes back to normal.

For what it's worth, the affected projects are 100% generic code and rely heavily on cross-crate inlining. They do a lot of numeric computation and array indexing. Sorry if I am a bit vague, but I cannot run my projects under valgrind because valgrind started to segfault a few days ago (perhaps since the re-enabling of jemalloc)…

I tried to come up with a small benchmark exhibiting the problem. It is not that significant, but the following already shows a noticeable performance regression:

extern mod extra;

use std::hashmap;
use extra::test::BenchHarness;

#[bench]
fn bench_insert_std(bh: &mut BenchHarness) {
    let mut m = hashmap::HashMap::with_capacity(32);

    do bh.iter {
        for i in range(0u, 500) {
            m.insert((i, i), i);
        }
    }
}

#[bench]
fn bench_insert_find_remove_std(bh: &mut BenchHarness) {
    let mut m = hashmap::HashMap::with_capacity(32);

    do bh.iter {
        for i in range(0u, 200) {
            m.insert((i, i), i);
        }

        for i in range(0u, 200) {
            assert!(*m.find(&(i, i)).unwrap() == i)
        }

        for i in range(100u, 200) {
            m.remove(&(i, i));
        }

        for i in range(100u, 200) {
            assert!(m.find(&(i, i)).is_none())
        }

        for i in range(0u, 100) {
            m.insert((i, i), i * 2);
        }

        for i in range(0u, 100) {
            assert!(*m.find(&(i, i)).unwrap() == i * 2)
        }

        for i in range(0u, 100) {
            m.remove(&(i, i));
        }

        for i in range(0u, 100) {
            assert!(m.find(&(i, i)).is_none())
        }
    }
}

fn main() {
} 

Compiled with --opt-level=3.
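(For anyone trying to reproduce this: the benchmark above is built as a test harness, so with a rustc of that era the invocation would be something like the following, where bench.rs is just a placeholder file name:

rustc --test --opt-level=3 bench.rs -o bench
./bench --bench
)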
With the (new) compiler a8c3fe4, I get:

test bench_insert_find_remove_std ... bench: 89242 ns/iter (+/- 3605)
test bench_insert_std ... bench: 46177 ns/iter (+/- 1555)

With the (old) compiler 67c954e, the results are more than 10% faster, and the asm dump is smaller too:

test bench_insert_find_remove_std ... bench: 73939 ns/iter (+/- 2872)
test bench_insert_std ... bench: 38482 ns/iter (+/- 1005)
@alexcrichton
Member

LLVM may have updated its optimization passes, and we may just need to update the order in which we run them. @thestinger, you mentioned loop vectorization as being a big win at lower optimization levels; do you know of any analysis passes that we might be missing by default?

@alexcrichton
Member

After talking with @thestinger on IRC, we've reached the conclusion that one major change we can make is reworking how the LLVM passes are run. I'm not certain that this is the source of the regression you're seeing, but it's the best candidate I can think of related to the LLVM update.

After looking into this, it turns out we do a number of things differently than clang:

  1. We only have one pass manager (a module one); clang has three: a function pass manager, a module pass manager, and a codegen pass manager. From what I can tell, this doesn't necessarily mean that we're skipping optimizations, but running them in clang's order may let the optimizations finish more quickly.
  2. We're not populating our pass managers with any target-specific analysis passes. Clang populates all of its pass managers with target-specific analyses, but we don't do anything like that.
  3. Our hard-coded list of O1/2/3 optimizations falls out of date with LLVM updates, so we shouldn't hard-code it. The list is just a duplicate of what's already in LLVM itself anyway.
  4. There's no way for us to debug which passes are run. We need an option to print them and see what's happening.
  5. Right now we're running two pass managers, but they possibly overlap with one another, and it's not clear which one is responsible for which work.

Long story short, looking into how we run LLVM passes shows that it probably needs to be reworked and simplified to use more of the existing LLVM infrastructure and to match clang more closely (see the sketch below). I'm still not certain that this is the cause of this regression, but I'll attempt to determine whether it is by making these changes.
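For reference, here is a minimal sketch of the clang-style setup described above, written in C against the LLVM C API of that era; it is an illustration only, not rustc's actual passes.rs. A PassManagerBuilder picks the recommended pipeline for a given -O level and populates both a per-function and a module-wide pass manager, instead of passes being added one by one:

#include <llvm-c/Core.h>
#include <llvm-c/Transforms/PassManagerBuilder.h>

/* Run the "recommended" LLVM pipeline for the given -O level over a module. */
static void optimize_module(LLVMModuleRef module, unsigned opt_level)
{
    /* Let LLVM decide which passes to run and in which order. */
    LLVMPassManagerBuilderRef builder = LLVMPassManagerBuilderCreate();
    LLVMPassManagerBuilderSetOptLevel(builder, opt_level);

    /* Function pass manager: cheap per-function cleanups, run first. */
    LLVMPassManagerRef fpm = LLVMCreateFunctionPassManagerForModule(module);
    LLVMPassManagerBuilderPopulateFunctionPassManager(builder, fpm);

    /* Module pass manager: inlining and other whole-module passes. */
    LLVMPassManagerRef mpm = LLVMCreatePassManager();
    LLVMPassManagerBuilderPopulateModulePassManager(builder, mpm);

    /* Run the per-function passes over every function, then the module passes. */
    LLVMInitializeFunctionPassManager(fpm);
    for (LLVMValueRef f = LLVMGetFirstFunction(module); f != NULL;
         f = LLVMGetNextFunction(f)) {
        LLVMRunFunctionPassManager(fpm, f);
    }
    LLVMFinalizeFunctionPassManager(fpm);

    LLVMRunPassManager(mpm, module);

    LLVMPassManagerBuilderDispose(builder);
    LLVMDisposePassManager(fpm);
    LLVMDisposePassManager(mpm);
}

A third, codegen-only pass manager and the target-specific data layout/analysis passes (points 1 and 2 above) would be set up separately before emitting machine code; they are omitted here to keep the sketch short.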

cc @Aatch, you were the one who recently added all the PassManager stuff to passes.rs, just wanted to make sure that none of this is alarming to you.

@sebcrozet
Contributor Author

Thanks for all those investigations!

I merged together a few files from my project to give you a better stand-alone example of the regression.
The example is big (298 lines) and mostly unreadable, but it exhibits a slowdown greater than 85% after the LLVM update. I thought it might be useful for you to have an example of a big slowdown for debugging.

Here is the example: https://gist.github.com/sebcrozet/6300848

Again, compiled with --opt-level=3.
With the compiler after the LLVM update (revision a8c3fe4) I get:

test test::bench_johnson_simplex ... bench: 191270 ns/iter (+/- 4553) 

With the compiler before the LLVM update (revision 67c954e) I get:

test test::bench_johnson_simplex ... bench: 23890 ns/iter (+/- 94) 

@alexcrichton
Member

Sadly, my suspicions were not correct. Your example is fantastic though, and I'm very perplexed about what's going on with it. I did profile your code, and I see the same ~10x slowdown with the new LLVM upgrade. What's very suspicious to me are the top functions in the profile of that code. Sorry, but I haven't yet figured out how to get profiles on OS X outside of Instruments...

Before the LLVM upgrade (today's stage0):
[screenshot: profile before the LLVM upgrade (screen shot 2013-08-22 at 9 30 44 pm)]

After #8700 (with the LLVM upgrade and refactored pass handling):
[screenshot: profile after #8700 (screen shot 2013-08-22 at 9 30 55 pm)]

The exact test I ran was the one above, but with the iteration count turned up to 1000 so it takes a bit longer. I'm very confused about why timing information shows up so heavily in one of the profiles, so I decided to run with different code.

With this code, I get the following profiles:

Before:
[screenshot: profile before (screen shot 2013-08-22 at 9 35 12 pm)]
After:
[screenshot: profile after (screen shot 2013-08-22 at 9 35 19 pm)]

I'm disturbed by the fact that there are transmutes in the profile, but there shouldn't be any timing-information weirdness going on here. I'll continue investigating later, but I wanted to post my findings here before I forget them.

@graydon
Contributor

graydon commented Aug 23, 2013

Transmute showing up means a change in inlining.

@thestinger
Contributor

I identified the root cause of this issue, so I'm closing this in favour of #8720.

bors added a commit that referenced this issue Aug 26, 2013
Previously, it was unclear whether rust was performing the "recommended set" of
optimizations that LLVM provides. This commit changes the way we run passes
to closely mirror clang, which in theory does this correctly. The
notable changes include:

* Passes are no longer explicitly added one by one. This would be difficult to
  keep up with as LLVM changes, and we aren't guaranteed to always know the best
  order in which to run passes.
* Passes are now managed by LLVM's PassManagerBuilder object. This is then used
  to populate the various pass managers run.
* We now run both a FunctionPassManager and a module-wide PassManager. This is
  what clang does, and I presume that we *may* see a speed boost from the
  module-wide passes having less work to do. I have not measured this.
* The codegen pass manager has been extracted to its own separate pass manager
  to not get mixed up with the other passes
* All pass managers now include passes for target-specific data layout and
  analysis passes

Some new features include:

* You can now print all passes being run with `-Z print-llvm-passes`
* When specifying passes via `--passes`, the passes are now appended to the
  default list of passes instead of overwriting them.
* The output of `--passes list` is now generated by LLVM instead of maintaining
  a list of passes ourselves
* Loop vectorization is turned on by default as an optimization pass and can be
  disabled with `-Z no-vectorize-loops`


All of these "copies" of clang are based off of their [source code](http://clang.llvm.org/doxygen/BackendUtil_8cpp_source.html) in case anyone is curious what my source is. I was hoping that this would fix #8665, but it does not help the performance issues found there. Hopefully it'll allow us to tweak passes or see what's going on in order to debug that problem.
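For example, the new options can be exercised like this (the flag names are exactly those listed in the commit message above; foo.rs is just a placeholder):

rustc --opt-level=3 -Z print-llvm-passes foo.rs    # print the LLVM passes being run
rustc --opt-level=3 -Z no-vectorize-loops foo.rs   # opt out of loop vectorization
rustc --passes list                                # let LLVM generate the list of available passes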
bors added a commit that referenced this issue Aug 27, 2013