
Investigate: dependency on lto="fat" and codegen-units=1 #339

Closed
Robbepop opened this issue Jan 18, 2022 · 26 comments
Labels
help wanted (Extra attention is needed)

Comments

@Robbepop
Member

Robbepop commented Jan 18, 2022

Currently wasmi_v1 is performing pretty well compared to the old wasmi on the following profile:

[profile.release]
lto = "fat"
codegen-units = 1

However, for the default profile, which is

[profile.release]
lto = false
codegen-units = 16

we can see a slowdown of factor ×4.21, which is pretty terrible.

For more information and stats see this GitHub Gist.

We need to investigate the reasons for this particularly bad slowdown under the default profile configuration.
Usually you see performance differences in the range of 10–15%, not 400–500%.

The Rust source files implementing the wasmi_v1 engine can be found here:

  • The entry points and orchestration of the wasmi_v1 engine:
  • The instruction execution of the wasmi_v1 engine:
@Robbepop added the help wanted and wasmi-v1 labels Jan 18, 2022
@ggwpez

ggwpez commented Jan 18, 2022

I'm currently looking into the same flags for Substrate paritytech/substrate#10608.
Please let me know if you find something beyond lto=fat and cu=1.

Your tiny_keccak test seems to produce results much faster than the 8hr of Substrate benchmarks 😆

@Robbepop
Member Author

Robbepop commented Jan 18, 2022

Glad I am not alone on this.
So far I have used cargo-flamegraph to collect some flamegraph data and tried to make sense of the information it gave me. There is just one strange thing that I currently cannot explain, but maybe that's just a lack of knowledge about the tool itself.

Please also keep me updated if you find something!

Your tiny_keccak test seems to produce results much faster than the 8hr of Substrate benchmarks 😆

How bad are the slowdowns for Substrate without those flags?

@ggwpez

ggwpez commented Jan 18, 2022

Glad I am not alone on this.
So far I have used cargo-flamegraph to collect some flamegraph data and tried to make sense of the information it gave me.

Yes, great! Maybe we can share some more knowledge about this, since I am new to profiling in Rust.

How bad are the slowdowns for Substrate without those flags?

In Substrate it is the other way around: we currently do not run the benchmarks with any flags 🙈
Relative speedup for the balances pallet is documented here.

Also looking at other pallets: a 100% speedup (2×) is the highest that I saw consistently.
But these are benchmarks of just one function, so not that much room for improvement, I guess.

PS: paritytech/polkadot#4311 claims 200-400% ish.

@Robbepop
Member Author

PS: paritytech/polkadot#4311 claims 200-400% ish.

Well, that's exactly what we see in wasmi_v1 ...

Looking forward to your updates. 😅

@Robbepop
Member Author

Robbepop commented Jan 18, 2022

Slowdown concerning configurations without codegen-units=1 might be (partially) affected by rust-lang/rust#46139.

@ggwpez

ggwpez commented Jan 18, 2022

One more thing: Did you try RUSTFLAGS='-C target-cpu=native' yet?
It can only be defined as an env variable, not in the Cargo.toml AFAIK.

@Robbepop
Member Author

One more thing: Did you try RUSTFLAGS='-C target-cpu=native' yet?
It can only be defined as an env variable, not in the Cargo.toml AFAIK.

I just tried it out, but there was no significant performance difference. A few benchmarks even showed slightly worse performance.

@Robbepop
Member Author

Demonstrates how bad it really is:
https://gist.github.com/Robbepop/ffa340bdfc306856a98e21597091f6f0

@Robbepop
Member Author

@athei also ran the benchmarks on a different CPU, using the stable Rust compiler version 1.58.

Core i9-9900K @ 5GHz
rustc 1.58-stable

Results

Default Profile

compile_and_validate/v1        time: [7.7698 ms 7.7948 ms 7.8208 ms]
instantiate/v1                 time: [144.96 us 147.68 us 150.42 us]
execute/tiny_keccak/v1         time: [3.4168 ms 3.4222 ms 3.4277 ms]
execute/rev_complement/v1      time: [4.4750 ms 4.4810 ms 4.4869 ms]
execute/regex_redux/v1         time: [4.2505 ms 4.2580 ms 4.2659 ms]
execute/count_until/v1         time: [4.5702 ms 4.5844 ms 4.5991 ms]
execute/factorial_recursive/v1 time: [3.5012 us 3.5085 us 3.5164 us]
execute/factorial_optimized/v1 time: [1.7917 us 1.8038 us 1.8161 us]
execute/recursive_ok/v1        time: [865.95 us 870.12 us 874.17 us]
execute/recursive_trap/v1      time: [74.352 us 74.633 us 74.977 us]
execute/host_calls/v1          time: [103.87 us 104.21 us 104.60 us]

Custom Profile

compile_and_validate/v1        time: [5.7856 ms 5.8243 ms 5.8628 ms]
instantiate/v1                 time: [127.04 us 130.94 us 134.55 us]
execute/tiny_keccak/v1         time: [741.29 us 745.45 us 749.46 us]
execute/rev_complement/v1      time: [925.63 us 929.02 us 932.65 us]
execute/regex_redux/v1         time: [1.0749 ms 1.0759 ms 1.0770 ms]
execute/count_until/v1         time: [1.2899 ms 1.2915 ms 1.2933 ms]
execute/factorial_recursive/v1 time: [1.2258 us 1.2276 us 1.2297 us]
execute/factorial_optimized/v1 time: [568.03 ns 568.82 ns 569.64 ns]
execute/recursive_ok/v1        time: [349.49 us 350.52 us 351.57 us]
execute/recursive_trap/v1      time: [30.064 us 30.093 us 30.125 us]
execute/host_calls/v1          time: [46.170 us 46.236 us 46.316 us]

We see more or less the same slowdowns between both profiles.
In conclusion, the Rust compiler version and target CPU do not matter much here.

@Robbepop
Member Author

Robbepop commented Jan 19, 2022

I tried out custom per-package Cargo profiles for our Wasm coremark benchmarks between wasmi_v0, wasmi_v1, Wasm3, and Wasmtime, which can be found here.

On this profile

[profile.release]
lto = "fat"
codegen-units = 1

wasmi_v1 achieved a score of over 500 on my CPU, whereas on

[profile.release]
lto = "fat"

[profile.release.package.wasmi_v1]
codegen-units = 1

we achieved only a score of roughly 300, which is the same score we achieve with this profile configuration:

[profile.release]
lto = "fat"

So it is as if Cargo's package-specific profile settings are simply not applied.

@Robbepop
Member Author

Robbepop commented Jan 19, 2022

Some cargo flamegraph runs show very strange graphs for the slow case.

The fast run shows a flamegraph as it is to be expected by the developer of the new wasmi engine (me).
The slow run shows strange artifacts in the graph, such as Engine::execute_func at the head of all lines above the first Engine::execute_func stack. I am confused as to how this can even exist, since there is no way Engine::execute_func could call itself recursively. It may be an optimization artifact, but then again we do not see this in the fast run (as expected).

Slow

[profile.bench]
lto = false
codegen-units = 16

(flamegraph screenshot)

Fast

[profile.bench]
lto = "fat"
codegen-units = 1

(flamegraph screenshot)

@ggwpez

ggwpez commented Jan 19, 2022

In conclusion Rust compiler version used and target CPU does not matter too much here.

Thanks for the check!

So it is as if Cargo's package specific profile settings are simply not applied.

I wondered if cargo actually pushes all the config flags onto its dependencies, or if they all use their own release profile.
Something like:

[profile.release.package."*"]
lto = "fat"
codegen-units = 1

Hopefully it makes no difference.

@bjorn3

bjorn3 commented Jan 19, 2022

Profile settings specified by dependencies aren't applied. Only profile settings defined in the workspace root are respected.
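To make this rule concrete, here is a sketch of where such an override has to live, namely in the root Cargo.toml of the *consuming* workspace (the package name is illustrative; note that per-package overrides may set codegen-units but not lto):

```toml
# Root Cargo.toml of the consuming workspace — the only manifest
# whose [profile.*] sections Cargo respects.
[profile.release]
lto = "fat"

# Per-package override: `codegen-units` may be set here,
# but `lto` itself cannot be overridden per package.
[profile.release.package.wasmi_v1]
codegen-units = 1
```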

@Robbepop
Member Author

Robbepop commented Jan 19, 2022

@bjorn3 Thanks a lot for your comment!

I tried adding

[profile.bench]
lto = "fat"

[profile.release.package.wasmi_v1]
codegen-units = 1

[profile.bench.package.wasmi_v1]
codegen-units = 1

to the root Cargo.toml of the wasmi workspace.
However, when running the benchmarks I saw that Cargo did not even bother to recompile anything using the new config, and the benchmarks also showed that codegen-units = 1 was not applied at all.

I also tried to apply

[profile.release.package.wasmi_v1]
codegen-units = 1

[profile.bench.package.wasmi_v1]
codegen-units = 1

to the wasmi_v1 crate's Cargo.toml, but with the same effect.

I hope/guess/think there is a misunderstanding on my side.

@bjorn3

bjorn3 commented Jan 19, 2022

Are you using wasmi as a dependency of your own project? If so, the profile settings will need to be in the workspace root of your own project.

@Robbepop
Member Author

The wasmi_v1 crate is the one with the strange performance.
The wasmi crate is the root of the workspace and depends on wasmi_v1.
As stated above, I tried both: putting the [profile.release.package.wasmi_v1] settings into the root Cargo.toml of the wasmi crate as well as into the wasmi_v1 Cargo.toml directly.
What would you suggest?
I am slightly confused to be honest.

@bjorn3

bjorn3 commented Jan 19, 2022

The Cargo.toml of the wasmi crate already defines lto = "fat" and codegen-units = 1 for the bench profile: https://github.com/paritytech/wasmi/blob/70faf3410c4f0017f70c59041fa6a1d608562969/Cargo.toml#L61-L63

@Robbepop
Member Author

The Cargo.toml of the wasmi crate already defines lto = "fat" and codegen-units = 1 for the bench profile:

https://github.com/paritytech/wasmi/blob/70faf3410c4f0017f70c59041fa6a1d608562969/Cargo.toml#L61-L63

Yes, that is true.
On my local branch I removed codegen-units = 1 from it, since I placed it in the .package.wasmi_v1 profile settings.
Unfortunately, it is not possible to do the same with the lto setting.

@tavianator

A lot of the difference seems to be inlining. This patch already improves performance by 40% for me in the default profile:

diff --git a/wasmi_v1/src/engine/bytecode/utils.rs b/wasmi_v1/src/engine/bytecode/utils.rs
index 6ff10399..ce115b07 100644
--- a/wasmi_v1/src/engine/bytecode/utils.rs
+++ b/wasmi_v1/src/engine/bytecode/utils.rs
@@ -89,6 +89,7 @@ impl Target {
 pub struct FuncIdx(u32);
 
 impl From<u32> for FuncIdx {
+    #[inline]
     fn from(index: u32) -> Self {
         Self(index)
     }
@@ -96,6 +97,7 @@ impl From<u32> for FuncIdx {
 
 impl FuncIdx {
     /// Returns the inner `u32` index.
+    #[inline]
     pub fn into_inner(self) -> u32 {
         self.0
     }
@@ -107,6 +109,7 @@ impl FuncIdx {
 pub struct SignatureIdx(u32);
 
 impl From<u32> for SignatureIdx {
+    #[inline]
     fn from(index: u32) -> Self {
         Self(index)
     }
@@ -114,6 +117,7 @@ impl From<u32> for SignatureIdx {
 
 impl SignatureIdx {
     /// Returns the inner `u32` index.
+    #[inline]
     pub fn into_inner(self) -> u32 {
         self.0
     }
@@ -129,6 +133,7 @@ impl SignatureIdx {
 pub struct LocalIdx(u32);
 
 impl From<u32> for LocalIdx {
+    #[inline]
     fn from(index: u32) -> Self {
         Self(index)
     }
@@ -136,6 +141,7 @@ impl From<u32> for LocalIdx {
 
 impl LocalIdx {
     /// Returns the inner `u32` index.
+    #[inline]
     pub fn into_inner(self) -> u32 {
         self.0
     }
@@ -153,6 +159,7 @@ impl LocalIdx {
 pub struct GlobalIdx(u32);
 
 impl From<u32> for GlobalIdx {
+    #[inline]
     fn from(index: u32) -> Self {
         Self(index)
     }
@@ -160,6 +167,7 @@ impl From<u32> for GlobalIdx {
 
 impl GlobalIdx {
     /// Returns the inner `u32` index.
+    #[inline]
     pub fn into_inner(self) -> u32 {
         self.0
     }
diff --git a/wasmi_v1/src/engine/exec_context.rs b/wasmi_v1/src/engine/exec_context.rs
index 999e5fd9..9790150f 100644
--- a/wasmi_v1/src/engine/exec_context.rs
+++ b/wasmi_v1/src/engine/exec_context.rs
@@ -828,6 +828,7 @@ where
         self.execute_load_extend::<i8, i32>(offset)
     }
 
+    #[inline]
     fn visit_i32_load_u8(&mut self, offset: Offset) -> Self::Outcome {
         self.execute_load_extend::<u8, i32>(offset)
     }
@@ -880,6 +881,7 @@ where
         self.execute_store::<F64>(offset)
     }
 
+    #[inline]
     fn visit_i32_store_8(&mut self, offset: Offset) -> Self::Outcome {
         self.execute_store_wrap::<i32, i8>(offset)
     }
diff --git a/wasmi_v1/src/engine/value_stack.rs b/wasmi_v1/src/engine/value_stack.rs
index 40cc15dd..2d38c7eb 100644
--- a/wasmi_v1/src/engine/value_stack.rs
+++ b/wasmi_v1/src/engine/value_stack.rs
@@ -26,11 +26,13 @@ pub struct StackEntry(u64);
 
 impl StackEntry {
     /// Returns the underlying bits of the [`StackEntry`].
+    #[inline]
     pub fn to_bits(self) -> u64 {
         self.0
     }
 
     /// Converts the untyped [`StackEntry`] value into a typed [`Value`].
+    #[inline]
     pub fn with_type(self, value_type: ValueType) -> Value {
         match value_type {
             ValueType::I32 => Value::I32(<_>::from_stack_entry(self)),
@@ -42,6 +44,7 @@ impl StackEntry {
 }
 
 impl From<Value> for StackEntry {
+    #[inline]
     fn from(value: Value) -> Self {
         match value {
             Value::I32(value) => value.into(),
@@ -73,12 +76,14 @@ macro_rules! impl_from_stack_entry_integer {
        ($($t:ty),* $(,)?) =>   {
                $(
                        impl FromStackEntry for $t {
+                               #[inline]
                                fn from_stack_entry(entry: StackEntry) -> Self {
                                        entry.to_bits() as _
                                }
                        }
 
                        impl From<$t> for StackEntry {
+                               #[inline]
                                fn from(value: $t) -> Self {
                                        Self(value as _)
                                }
@@ -92,12 +97,14 @@ macro_rules! impl_from_stack_entry_float {
        ($($t:ty),*) => {
                $(
                        impl FromStackEntry for $t {
+                               #[inline]
                                fn from_stack_entry(entry: StackEntry) -> Self {
                                        Self::from_bits(entry.to_bits() as _)
                                }
                        }
 
                        impl From<$t> for StackEntry {
+                               #[inline]
                                fn from(value: $t) -> Self {
                                        Self(value.to_bits() as _)
                                }
@@ -108,12 +115,14 @@ macro_rules! impl_from_stack_entry_float {
 impl_from_stack_entry_float!(f32, f64, F32, F64);
 
 impl From<bool> for StackEntry {
+    #[inline]
     fn from(value: bool) -> Self {
         Self(value as _)
     }
 }
 
 impl FromStackEntry for bool {
+    #[inline]
     fn from_stack_entry(entry: StackEntry) -> Self {
         entry.to_bits() != 0
     }
@@ -259,6 +268,7 @@ impl ValueStack {
     /// # Note
     ///
     /// This has the same effect as [`ValueStack::peek`]`(0)`.
+    #[inline]
     pub fn last(&self) -> StackEntry {
         self.entries[self.stack_ptr - 1]
     }
@@ -268,6 +278,7 @@ impl ValueStack {
     /// # Note
     ///
     /// This has the same effect as [`ValueStack::peek`]`(0)`.
+    #[inline]
     pub fn last_mut(&mut self) -> &mut StackEntry {
         &mut self.entries[self.stack_ptr - 1]
     }
@@ -277,6 +288,7 @@ impl ValueStack {
     /// # Note
     ///
     /// Given a `depth` of 0 has the same effect as [`ValueStack::last`].
+    #[inline]
     pub fn peek(&self, depth: usize) -> StackEntry {
         self.entries[self.stack_ptr - depth - 1]
     }
@@ -286,6 +298,7 @@ impl ValueStack {
     /// # Note
     ///
     /// Given a `depth` of 0 has the same effect as [`ValueStack::last_mut`].
+    #[inline]
     pub fn peek_mut(&mut self, depth: usize) -> &mut StackEntry {
         &mut self.entries[self.stack_ptr - depth - 1]
     }
@@ -296,6 +309,7 @@ impl ValueStack {
     ///
     /// This operation heavily relies on the prior validation of
     /// the executed WebAssembly bytecode for correctness.
+    #[inline]
     pub fn pop(&mut self) -> StackEntry {
         self.stack_ptr -= 1;
         self.entries[self.stack_ptr]

@Robbepop
Member Author

Robbepop commented Jan 27, 2022

A lot of the difference seems to be inlining. This patch already improves performance by 40% for me in the default profile:

Hi @tavianator and thanks a lot for your research & report!
It is great to have those numbers in this issue!

I should have mentioned that I also tried annotating literally all functions in the workspace with #[inline] earlier, just to see what happens, and got similar results to yours.

I was still not very happy about the outcome, for 3 reasons:

  • The old wasmi version does not depend that much on #[inline] annotations at all, and especially for the definitions in wasmi_core that are shared by both the old and the new wasmi engine, this is strange to me.
  • The performance is still like 80% worse than what we get with the best profile settings, and 80% is still a ton for performance-critical components. So I opted to leave out the massive amount of #[inline] annotations. It would be nice to know exactly which of the #[inline] annotations are actually needed and which are just superfluous, although this experiment would be extremely frustrating and time consuming.
  • I conclude from the flamegraphs posted above that the internal interpreter hot path (instruction dispatch and execution) is especially impacted by failed optimizations. Therefore the question I put my focus on is: what causes the compiler to fail at optimizing the most important part of the new engine? While those #[inline] annotations surely help, they do not really affect the optimizations of this critical hot path.

I am not at all sure how we want to go ahead with these #[inline] experiments. I am against putting so many #[inline] annotations into the codebase, with probably many of them being unnecessary. However, I also see that if we want to reach the best performance, we should definitely add the #[inline] annotations that are indeed required. @athei: any ideas on your side concerning this?
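To make the hot-path concern concrete, here is a minimal, hypothetical sketch (not the actual wasmi code) of an interpreter dispatch loop whose cost is dominated by whether tiny helpers like `push`/`pop` get inlined into the loop body:

```rust
// Hypothetical sketch of a stack-based interpreter dispatch loop.
// In a multi-codegen-unit build without LTO, helpers like these may
// remain out-of-line calls unless marked `#[inline]`, which is exactly
// the kind of per-instruction overhead discussed in this thread.

#[derive(Copy, Clone)]
enum Instr {
    Push(i64),
    Add,
    Halt,
}

struct Stack(Vec<i64>);

impl Stack {
    #[inline]
    fn push(&mut self, v: i64) {
        self.0.push(v);
    }

    #[inline]
    fn pop(&mut self) -> i64 {
        // Validated bytecode never underflows the stack.
        self.0.pop().expect("stack underflow")
    }
}

fn execute(code: &[Instr]) -> i64 {
    let mut stack = Stack(Vec::new());
    // The dispatch loop: one match arm per instruction, each arm calling
    // into the tiny stack helpers above.
    for instr in code {
        match instr {
            Instr::Push(v) => stack.push(*v),
            Instr::Add => {
                let (rhs, lhs) = (stack.pop(), stack.pop());
                stack.push(lhs + rhs);
            }
            Instr::Halt => break,
        }
    }
    stack.pop()
}

fn main() {
    let code = [Instr::Push(2), Instr::Push(40), Instr::Add, Instr::Halt];
    println!("{}", execute(&code)); // prints "42"
}
```

The names and instruction set here are invented for illustration; the real engine dispatches far more opcodes, which is why a failed inlining decision in this loop multiplies across every executed instruction.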

@athei
Collaborator

athei commented Jan 27, 2022

We will build the runtime with cgu=1, so we will be fine and don't need those. I don't think sprinkling #[inline] everywhere is a sane approach. The optimizer makes the right decisions if we use cgu=1 and lto="fat".

@Robbepop
Member Author

Robbepop commented Jan 27, 2022

We will build the runtime with cgu=1, so we will be fine and don't need those. I don't think sprinkling #[inline] everywhere is a sane approach. The optimizer makes the right decisions if we use cgu=1 and lto="fat".

We can even see that sprinkling #[inline] all over even pessimizes the best case performance: #347 (comment)

@tavianator

Right, I agree that sprinkling #[inline] on everything is a bad idea. I tried to add it to just the functions that showed up in a profile of the slow configuration but not the fast one, and that looked like they made sense.

I think especially the ones in core are useful. I don't think Rust will do any cross-crate inlining at all in the default configuration.

@bjorn3

bjorn3 commented Jan 27, 2022

I don't think Rust will do any cross-crate inlining at all in the default configuration.

It does for all functions marked as #[inline] and for any function with -Clto. #[inline] makes rustc codegen the function in every codegen unit that uses it, thus allowing LLVM to inline it. -Clto allows importing bitcode from foreign modules (thin lto) or merges the bitcode into a single module (fat lto), both of which make every function potentially available to be inlined into every codegen unit.
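As a small illustration of this rule, consider a helper type like the `StackEntry` from the patch above, imagined to live in a dependency crate such as wasmi_core (this standalone sketch is illustrative, not the actual source):

```rust
// Sketch: imagine this type lives in a dependency crate. Without `-C lto`,
// a downstream crate can only inline `from`/`to_bits` if they are marked
// `#[inline]`, because that is what makes rustc emit their bodies into
// every codegen unit that uses them, so LLVM can inline the calls.

#[derive(Copy, Clone, PartialEq, Debug)]
pub struct StackEntry(u64);

impl StackEntry {
    /// Returns the underlying bits of the entry.
    #[inline]
    pub fn to_bits(self) -> u64 {
        self.0
    }
}

impl From<u64> for StackEntry {
    #[inline]
    fn from(bits: u64) -> Self {
        Self(bits)
    }
}

fn main() {
    // In a hot interpreter loop these conversions happen per instruction,
    // so an out-of-line call here is a real, repeated cost.
    let entry = StackEntry::from(42);
    println!("{}", entry.to_bits()); // prints "42"
}
```

Without the annotations, the same code compiles fine but each conversion in a non-LTO downstream build can stay a function call, which matches the profile differences reported earlier in the thread.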

@tavianator

Right, sorry, that's what I meant. No cross-crate inlining without non-local LTO or #[inline], which is why I think the annotations in core might matter if non-LTO or local-LTO performance is important.

@Robbepop changed the title Investigate: dependency on lto="fat" and codegen-units=1 Sep 2, 2022
@Robbepop removed the wasmi-v1 label Oct 25, 2022
@Robbepop
Member Author

It is unlikely that this bottomless cask of pitfalls and footguns can ever be resolved. The only actual solution for controlling the codegen of wasmi is Rust tail calls, which are in the making ... and that work is stalled at the moment. So focus should be put on getting Rust tail calls back on track.
