
Performance regression in 1.48.0 #79246

Closed
marmeladema opened this issue Nov 20, 2020 · 19 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.
regression-from-stable-to-stable Performance or correctness regression from one stable version to another.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@marmeladema
Contributor

marmeladema commented Nov 20, 2020

Hello everyone!

We detected a performance regression in a closed-source project after upgrading Rust from version 1.47.0 to version 1.48.0.
The benchmarks we run internally use the low-level perf_event_open Linux API to measure various hardware performance counters on specific code sections.
The most stable and reliable counter we care about is the number of retired instructions, corresponding to instructions:u in classic perf output.

On certain datasets, the regressions are up to +16% in the number of instructions executed.
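
For illustration, here is a minimal sketch of this kind of measurement using the perf-event crate (the crate choice is only an assumption for the example; the real harness is closed-source):

use perf_event::Builder;
use perf_event::events::Hardware;

fn main() -> std::io::Result<()> {
    // Count retired instructions for the current process; with kernel-side
    // events excluded this roughly corresponds to instructions:u in perf.
    let mut counter = Builder::new().kind(Hardware::INSTRUCTIONS).build()?;

    counter.enable()?;
    let result: u64 = (0..1_000_000u64).sum(); // code section under test
    counter.disable()?;

    println!("retired instructions: {} (result = {})", counter.read()?, result);
    Ok(())
}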

I used cargo-bisect-rustc as instructed on https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/Performance.20regressions.20in.201.2E48.2E0 in order to search for the commit that introduced the regression:

searched nightlies: from nightly-2020-09-03 to nightly-2020-11-20
regressed nightly: nightly-2020-09-04
searched commits: from 80fc9b0 to 0d0f6b1
regressed commit: 0d0f6b1

bisected with cargo-bisect-rustc v0.6.0

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --script=./bisect.sh 

The bisection points to #70793

I was able to reproduce the regression on two different machines.
I will also try to provide a minimized code example to reproduce the issue but that will take some time as the code being measured is quite big.

If anyone else has noticed this regression, any help with tools, ideas, or anything else that could help reproduce the issue would be greatly appreciated!

EDIT:
MCVE at https://github.com/marmeladema/bitvec-perf-regression

@Mark-Simulacrum Mark-Simulacrum added the regression-from-stable-to-stable Performance or correctness regression from one stable version to another. label Nov 20, 2020
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Nov 20, 2020
@Mark-Simulacrum Mark-Simulacrum added I-slow Issue: Problems and improvements with respect to performance of generated code. E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Nov 20, 2020
@Mark-Simulacrum
Member

I suspect a good next step would be to try to narrow down the code being regressed -- maybe with something like a cachegrind diff -- to attempt to isolate the problem, and perhaps propose a fix.

@JulianKnodt
Contributor

JulianKnodt commented Nov 21, 2020

Just curious, are you using LTO when you notice the regression? I was looking at the code in the PR and there are a lot of small methods which aren't marked with #[inline], which from my experience would cause slowdown.

Edit: after reading that you're specifically looking at instructions:u, I'm less certain, since I'd imagine that more inlining would lead to more instructions retired, so I'm not sure that explains it.

@marmeladema
Contributor Author

Just curious, are you using LTO when you notice the regression? I was looking at the code in the PR and there are a lot of small methods which aren't marked with #[inline], which from my experience would cause slowdown.

Yes, LTO is enabled for release builds, which is what is being measured here.

@the8472
Member

the8472 commented Nov 21, 2020

If it helps narrow down which code to test: The changes should only affect code using into_iter() on Vec, Box<[]> or BinaryHeap and then collecting into one of those types again, optionally involving some of the iterator adapters from the core crate. Anything else would be unintended.

The primary goal of that PR is to reduce malloc/free traffic by reusing allocations where possible. I only measured with the system allocator. Do you use that or a custom one?

Part of that optimization also involves switching from loops over next() to try_fold(). That might affect vectorization. But most of the microbenchmarks that I ran under perf seemed to vectorize fine, even with iterators involving several adapters.

There also are a bunch of smaller tweaks relating to allocations in various Vec iteration/collection methods such as changing a memcpy between allocations to an internal memmove.
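
For illustration, here is a minimal standalone sketch of the pattern the specialization targets (not code from the affected project):

fn main() {
    let v: Vec<u32> = (0..1_000).collect();
    let before = v.as_ptr();

    // into_iter() on a Vec, some core iterator adapters, then collect() back
    // into a Vec with a compatible element layout: this is the shape that can
    // now reuse the source allocation instead of allocating a new one.
    let doubled: Vec<u32> = v.into_iter().map(|x| x.wrapping_mul(2)).collect();

    // Usually prints `true` on toolchains with the in-place specialization;
    // it is an optimization, not a guarantee.
    println!("allocation reused: {}", doubled.as_ptr() == before);
}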

I was looking at the code in the PR and there are a lot of small methods which aren't marked with #[inline]

AFAIU, inline annotations are only needed on non-generic methods, and the SourceIter methods are all generic.
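
To illustrate the distinction (a generic sketch, not code from the PR): a generic function is instantiated, and therefore inlinable, in the downstream crate even without an attribute, whereas a small non-generic helper needs #[inline] (or LTO) to be inlined across crates.

// Inlinable cross-crate without #[inline]: it is monomorphized in the caller's crate.
pub fn first_or<T: Copy>(xs: &[T], fallback: T) -> T {
    xs.first().copied().unwrap_or(fallback)
}

// Without #[inline] (or LTO), callers in other crates get an out-of-line call.
#[inline]
pub fn word_mask(bit: usize) -> usize {
    1 << (bit & 63)
}

fn main() {
    println!("{} {}", first_or(&[1, 2, 3], 0), word_mask(7));
}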

@the8472
Member

the8472 commented Nov 21, 2020

The changes should only affect code using into_iter() on Vec, Box<[]> or BinaryHeap and then collecting into one of those types again

Small correction, it may also affect some iterators involving Vec.into_iter() and zip without collecting into one of the above types again.

@marmeladema
Contributor Author

If it helps narrow down which code to test: The changes should only affect code using into_iter() on Vec, Box<[]> or BinaryHeap and then collecting into one of those types again, optionally involving some of the iterator adapters from the core crate. Anything else would be unintended.

Ok, thanks! That is valuable information. FYI, the measured code makes some (maybe heavy?) use of Vec/Box<[]> and iterates/collects from those, but there is no usage of BinaryHeap.

@marmeladema
Contributor Author

I continued my investigation and did a perf diff between two runs of the crate, comparing stable 1.47.0 to 1.48.0:

  • with rust 1.47.0, most of the time is spent in:
bitvec::vec::iter::<impl core::iter::traits::collect::Extend<bool> for bitvec::vec::BitVec<O,T>>::extend
  • whereas with rust 1.48.0, most of the time is spent in:
bitvec::vec::api::<impl bitvec::vec::BitVec<O,T>>::push

This led me to focus on the bitvec crate itself, and more specifically on the BitVec::extend method, which seems to be the one severely impacted by the regression.
The bitvec version being tested was 0.17.4; after updating to version 0.19.4, I could not trigger the regression anymore.

I managed to provide a self-contained and (relatively) minimized example at https://github.com/marmeladema/bitvec-perf-regression
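
For readers skimming the thread, the hot loop boils down to something like the following hypothetical reduction (assuming bitvec 0.17.x; this is not the actual MCVE, see the repository above):

use bitvec::prelude::*;

fn main() {
    let flags: Vec<bool> = (0..100_000).map(|i| i % 3 == 0).collect();

    // Extend<bool> for BitVec pushes one bit at a time, which is where the
    // perf diff shows the extra retired instructions on 1.48.0.
    let mut bits: BitVec<Lsb0, usize> = BitVec::new();
    bits.extend(flags.iter().copied());

    println!("collected {} bits", bits.len());
}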

I ran the bisection again and it failed with:

********************************************************************************
Regression in nightly-2020-09-02
********************************************************************************

fetching https://static.rust-lang.org/dist/2020-09-01/channel-rust-nightly-git-commit-hash.txt
ERROR: Tarball not found at https://static.rust-lang.org/dist/2020-09-01/channel-rust-nightly-git-commit-hash.txt

but this time it seems to point to nightly-2020-09-02 instead of nightly-2020-09-04...
I am not sure exactly which one is the culprit.

I have also bisected the bitvec crate itself, and the regression disappears at ferrilab/bitvec@f821122, which is basically a rewrite of the crate.

@marmeladema
Contributor Author

I managed to manually bisect rustc, and it seems the bitvec regression was introduced by commit 1d22f75, which corresponds to the merge of #76030.

@marmeladema
Contributor Author

marmeladema commented Nov 22, 2020

The regression still triggers with the latest nightly rustc 1.50.0-nightly (da3846948 2020-11-21) and I can confirm that reverting #76030:

patch
diff --git a/compiler/rustc_codegen_llvm/src/type_of.rs b/compiler/rustc_codegen_llvm/src/type_of.rs
index 8ea4768f77d..0876907e119 100644
--- a/compiler/rustc_codegen_llvm/src/type_of.rs
+++ b/compiler/rustc_codegen_llvm/src/type_of.rs
@@ -40,9 +40,7 @@ fn uncached_llvm_type<'a, 'tcx>(
         // FIXME(eddyb) producing readable type names for trait objects can result
         // in problematically distinct types due to HRTB and subtyping (see #47638).
         // ty::Dynamic(..) |
-        ty::Adt(..) | ty::Closure(..) | ty::Foreign(..) | ty::Generator(..) | ty::Str
-            if !cx.sess().fewer_names() =>
-        {
+        ty::Adt(..) | ty::Closure(..) | ty::Foreign(..) | ty::Generator(..) | ty::Str => {
             let mut name = with_no_trimmed_paths(|| layout.ty.to_string());
             if let (&ty::Adt(def, _), &Variants::Single { index }) =
                 (layout.ty.kind(), &layout.variants)
@@ -58,12 +56,6 @@ fn uncached_llvm_type<'a, 'tcx>(
             }
             Some(name)
         }
-        ty::Adt(..) => {
-            // If `Some` is returned then a named struct is created in LLVM. Name collisions are
-            // avoided by LLVM (with increasing suffixes). If rustc doesn't generate names then that
-            // can improve perf.
-            Some(String::new())
-        }
         _ => None,
     };
 

fixes the regression.

I am not sure what to do from there.

cc @eddyb @davidtwco

@eddyb
Member

eddyb commented Nov 23, 2020

Ugh, the choice of names shouldn't cause perf changes. At a glance, this looks like an LLVM bug.

I was going to suggest testing with -Z fewer-names=off, but that doesn't actually do anything (without --emit=llvm-ir it's forced to "fewer names"; we should maybe fix that).
And you tested the revert anyway, so that's confirmation enough.

@marmeladema if you have the time, one thing you could try is "bisecting" for the name (i.e. choosing between Some(name) and Some(String::new()) based on some property, such as name.len() or name.contains("...") etc.). That way, the exact type that needs a name to avoid the perf regression could presumably be determined.

Then you pretty much want to diff the output of -Z no-parallel-llvm -C llvm-args=-print-after-all runs with and without names for that one type, and see where optimizations diverge. Or you can start with that kind of diff, if you can figure out a way to ignore most of the differences from having named vs numbered types.

@marmeladema
Contributor Author

marmeladema commented Nov 23, 2020

I have managed to isolate the precise name that needs to be generated to not cause the performance regression:

  • bitvec::prelude::BitVec (exactly)

@marmeladema
Contributor Author

I have tried to decompile BitVec::extend with Ghidra, but it's the first time I've used that tool and I am not really sure of what I did, nor whether it's really useful 😄 The only difference I can see in the decompiled code is an additional endless loop when unwrapping the Option<bool>:

--- bitvec.extend.slow.decompiled.c     2020-11-23 23:19:02.971052349 +0000
+++ bitvec.extend.fast.decompiled.c     2020-11-23 23:10:45.442938517 +0000
@@ -99,11 +99,13 @@
   do {
     bVar2 = *param_2;
     if (bVar2 == 2) {
-                    /* WARNING: Subroutine does not return */
       core::panicking::panic
                 (
                  "called `Option::unwrap()` on a `None`valuelibrary/std/src/panicking.rslibrary/std/src/env.rsfailed to get environmentvariable ``: already mutablyborrowedlibrary/std/src/sys_common/thread_info.rsassertion failed:c.borrow().is_none()Rust panics must be rethrownRust cannot catch foreignexceptionslibrary/std/src/sys/unix/rwlock.rs<unnamed>note: run with`RUST_BACKTRACE=1` environment variable to display a backtrace\n\' panicked at\'\', rwlock maximum reader count exceededrwlock read lock would result indeadlockthread panicked while panicking. aborting.\nthread panicked whileprocessing panic.aborting.\n.debug_library/std/src/../../backtrace/src/symbolize/gimli/elf.rs/home/adema/code/rust/library/alloc/src/vec.rslibrary/std/src/path.rsinternal error:entered unreachable code"
                  ,0x2b,&PTR_UINT_0014f3c8);
+      do {
+        invalidInstructionException();
+      } while( true );
     }
     pointer::BitPtr<T>::into_bitslice(*param_1,param_1[1]);
     pointer::BitPtr<T>::into_bitslice(*param_1);

Does that help at all, or am I not looking in the right place?

@the8472
Member

the8472 commented Nov 23, 2020

Are you sure you're looking at the right method? You previously said most of the time is spent in bitvec::vec::api::<impl bitvec::vec::BitVec<O,T>>::push, not extend.

@marmeladema
Contributor Author

I decompiled this one because that's actually the method that is called in my reproduction code. push shows up in the perf diff, but only because it must be called at some point during extend. At least that's how I understand it, but again, I might be wrong.
@the8472 have you had a chance to look at the MCVE yet?

@bjorn3 bjorn3 removed the E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example label Nov 24, 2020
@pnkfelix pnkfelix added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Nov 24, 2020
@apiraino apiraino added I-prioritize Issue: Indicates that prioritization has been requested for this issue. I-nominated labels Dec 16, 2020
@spastorino
Member

@tmiasko
Contributor

tmiasko commented Sep 28, 2021

LLVM bug report https://bugs.llvm.org/show_bug.cgi?id=51667 and proposed patch https://reviews.llvm.org/D109294.

@marmeladema
Contributor Author

@tmiasko I see the LLVM bug has been resolved. Has our LLVM fork been updated with the fix? If so, should this be closed?

@nikic
Contributor

nikic commented Jan 8, 2022

@marmeladema No, the fix will be pulled in with the LLVM 14 upgrade.

@marmeladema
Contributor Author

Closing this issue now that we are at LLVM 15.
