Investigate: dependency on lto="fat" and codegen-units=1
#339
Comments
I'm currently looking into the same flags for Substrate paritytech/substrate#10608. Your |
Glad I am not alone on this. Please also keep me updated if you find something!
How bad are the slowdowns for Substrate without those flags? |
Yes great! Maybe we can share some more knowledge about this, since I am new to profiling in rust.
In Substrate it is the other way around, we currently do not run the benchmarks with any flags 🙈 Also looking at other pallets; 100% speedup (2x) is the highest that I saw consistently. PS: paritytech/polkadot#4311 claims 200-400% ish. |
Well that's absolutely the same that we see in Looking forward to your updates. 😅 |
Slowdown concerning configurations without |
One more thing: Did you try |
I just tried it out but there was no significant performance difference. A few benchmarks even showed slightly worse performance. |
Demonstrates how bad it really is: |
@athei Also ran the benchmarks on a different CPU and using the stable Rust compiler version 1.58.
Results
Default Profile
Custom Profile
We see more or less the same slowdowns between both profiles. |
I tried to use custom package Cargo profiles for our Wasm
On this profile:
[profile.release]
lto = "fat"
codegen-units = 1
[profile.release]
lto = "fat"
[profile.release.package.wasim_v1]
codegen-units = 1
We achieved only a score of roughly 300, which is the same score we achieve with this profile configuration:
[profile.release]
lto = "fat"
So it is as if Cargo's package-specific profile settings are simply not applied. |
Some The fast run shows a flamegraph as it is to be expected by the developer of the new
Slow
[profile.bench]
lto = false
codegen-units = 16
Fast
[profile.bench]
lto = "fat"
codegen-units = 1 |
Thanks for the check!
I wondered if Cargo is actually pushing all the config flags on its dependencies or if they all use their own
[profile.release.package."*"]
lto = "fat"
codegen-units = 1
Hopefully makes no difference. |
Profile settings specified by dependencies aren't applied. Only profile settings defined in the workspace root are respected. |
@bjorn3 Thanks a lot for your comment! I tried adding
[profile.bench]
lto = "fat"
[profile.release.package.wasmi_v1]
codegen-units = 1
[profile.bench.package.wasmi_v1]
codegen-units = 1
To the root I also tried to apply
[profile.release.package.wasmi_v1]
codegen-units = 1
[profile.bench.package.wasmi_v1]
codegen-units = 1
To the I hope/guess/think there is a misunderstanding on my side. |
Are you using wasmi as dependency of your own project? If so it will need to be in the workspace root of your own project. |
The |
The |
Yes that is true. |
A lot of the difference seems to be inlining. This patch already improves performance by 40% for me in the default profile:
diff --git a/wasmi_v1/src/engine/bytecode/utils.rs b/wasmi_v1/src/engine/bytecode/utils.rs
index 6ff10399..ce115b07 100644
--- a/wasmi_v1/src/engine/bytecode/utils.rs
+++ b/wasmi_v1/src/engine/bytecode/utils.rs
@@ -89,6 +89,7 @@ impl Target {
pub struct FuncIdx(u32);
impl From<u32> for FuncIdx {
+ #[inline]
fn from(index: u32) -> Self {
Self(index)
}
@@ -96,6 +97,7 @@ impl From<u32> for FuncIdx {
impl FuncIdx {
/// Returns the inner `u32` index.
+ #[inline]
pub fn into_inner(self) -> u32 {
self.0
}
@@ -107,6 +109,7 @@ impl FuncIdx {
pub struct SignatureIdx(u32);
impl From<u32> for SignatureIdx {
+ #[inline]
fn from(index: u32) -> Self {
Self(index)
}
@@ -114,6 +117,7 @@ impl From<u32> for SignatureIdx {
impl SignatureIdx {
/// Returns the inner `u32` index.
+ #[inline]
pub fn into_inner(self) -> u32 {
self.0
}
@@ -129,6 +133,7 @@ impl SignatureIdx {
pub struct LocalIdx(u32);
impl From<u32> for LocalIdx {
+ #[inline]
fn from(index: u32) -> Self {
Self(index)
}
@@ -136,6 +141,7 @@ impl From<u32> for LocalIdx {
impl LocalIdx {
/// Returns the inner `u32` index.
+ #[inline]
pub fn into_inner(self) -> u32 {
self.0
}
@@ -153,6 +159,7 @@ impl LocalIdx {
pub struct GlobalIdx(u32);
impl From<u32> for GlobalIdx {
+ #[inline]
fn from(index: u32) -> Self {
Self(index)
}
@@ -160,6 +167,7 @@ impl From<u32> for GlobalIdx {
impl GlobalIdx {
/// Returns the inner `u32` index.
+ #[inline]
pub fn into_inner(self) -> u32 {
self.0
}
diff --git a/wasmi_v1/src/engine/exec_context.rs b/wasmi_v1/src/engine/exec_context.rs
index 999e5fd9..9790150f 100644
--- a/wasmi_v1/src/engine/exec_context.rs
+++ b/wasmi_v1/src/engine/exec_context.rs
@@ -828,6 +828,7 @@ where
self.execute_load_extend::<i8, i32>(offset)
}
+ #[inline]
fn visit_i32_load_u8(&mut self, offset: Offset) -> Self::Outcome {
self.execute_load_extend::<u8, i32>(offset)
}
@@ -880,6 +881,7 @@ where
self.execute_store::<F64>(offset)
}
+ #[inline]
fn visit_i32_store_8(&mut self, offset: Offset) -> Self::Outcome {
self.execute_store_wrap::<i32, i8>(offset)
}
diff --git a/wasmi_v1/src/engine/value_stack.rs b/wasmi_v1/src/engine/value_stack.rs
index 40cc15dd..2d38c7eb 100644
--- a/wasmi_v1/src/engine/value_stack.rs
+++ b/wasmi_v1/src/engine/value_stack.rs
@@ -26,11 +26,13 @@ pub struct StackEntry(u64);
impl StackEntry {
/// Returns the underlying bits of the [`StackEntry`].
+ #[inline]
pub fn to_bits(self) -> u64 {
self.0
}
/// Converts the untyped [`StackEntry`] value into a typed [`Value`].
+ #[inline]
pub fn with_type(self, value_type: ValueType) -> Value {
match value_type {
ValueType::I32 => Value::I32(<_>::from_stack_entry(self)),
@@ -42,6 +44,7 @@ impl StackEntry {
}
impl From<Value> for StackEntry {
+ #[inline]
fn from(value: Value) -> Self {
match value {
Value::I32(value) => value.into(),
@@ -73,12 +76,14 @@ macro_rules! impl_from_stack_entry_integer {
($($t:ty),* $(,)?) => {
$(
impl FromStackEntry for $t {
+ #[inline]
fn from_stack_entry(entry: StackEntry) -> Self {
entry.to_bits() as _
}
}
impl From<$t> for StackEntry {
+ #[inline]
fn from(value: $t) -> Self {
Self(value as _)
}
@@ -92,12 +97,14 @@ macro_rules! impl_from_stack_entry_float {
($($t:ty),*) => {
$(
impl FromStackEntry for $t {
+ #[inline]
fn from_stack_entry(entry: StackEntry) -> Self {
Self::from_bits(entry.to_bits() as _)
}
}
impl From<$t> for StackEntry {
+ #[inline]
fn from(value: $t) -> Self {
Self(value.to_bits() as _)
}
@@ -108,12 +115,14 @@ macro_rules! impl_from_stack_entry_float {
impl_from_stack_entry_float!(f32, f64, F32, F64);
impl From<bool> for StackEntry {
+ #[inline]
fn from(value: bool) -> Self {
Self(value as _)
}
}
impl FromStackEntry for bool {
+ #[inline]
fn from_stack_entry(entry: StackEntry) -> Self {
entry.to_bits() != 0
}
@@ -259,6 +268,7 @@ impl ValueStack {
/// # Note
///
/// This has the same effect as [`ValueStack::peek`]`(0)`.
+ #[inline]
pub fn last(&self) -> StackEntry {
self.entries[self.stack_ptr - 1]
}
@@ -268,6 +278,7 @@ impl ValueStack {
/// # Note
///
/// This has the same effect as [`ValueStack::peek`]`(0)`.
+ #[inline]
pub fn last_mut(&mut self) -> &mut StackEntry {
&mut self.entries[self.stack_ptr - 1]
}
@@ -277,6 +288,7 @@ impl ValueStack {
/// # Note
///
/// Given a `depth` of 0 has the same effect as [`ValueStack::last`].
+ #[inline]
pub fn peek(&self, depth: usize) -> StackEntry {
self.entries[self.stack_ptr - depth - 1]
}
@@ -286,6 +298,7 @@ impl ValueStack {
/// # Note
///
/// Given a `depth` of 0 has the same effect as [`ValueStack::last_mut`].
+ #[inline]
pub fn peek_mut(&mut self, depth: usize) -> &mut StackEntry {
&mut self.entries[self.stack_ptr - depth - 1]
}
@@ -296,6 +309,7 @@ impl ValueStack {
///
/// This operation heavily relies on the prior validation of
/// the executed WebAssembly bytecode for correctness.
+ #[inline]
pub fn pop(&mut self) -> StackEntry {
self.stack_ptr -= 1;
self.entries[self.stack_ptr] |
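To make the shape of this patch easier to see outside of diff form, here is a minimal, self-contained sketch of the kind of tiny accessor it annotates. The names `StackEntry` and `ValueStack` mirror the patch, but the bodies, the `new` and `push` helpers, and the fixed capacity are assumptions made for illustration, not the actual `wasmi_v1` code.

```rust
/// Untyped 64-bit stack slot (illustrative stand-in for the type in the patch).
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub struct StackEntry(u64);

impl StackEntry {
    /// Returns the underlying bits.
    #[inline]
    pub fn to_bits(self) -> u64 {
        self.0
    }
}

impl From<i32> for StackEntry {
    /// Sign-extends the `i32` and stores the resulting bit pattern.
    #[inline]
    fn from(value: i32) -> Self {
        Self(value as u64)
    }
}

/// Simplified value stack; `new` and `push` are hypothetical helpers.
pub struct ValueStack {
    entries: Vec<StackEntry>,
    stack_ptr: usize,
}

impl ValueStack {
    /// Creates a stack with a fixed number of zeroed slots.
    pub fn new(capacity: usize) -> Self {
        Self {
            entries: vec![StackEntry(0); capacity],
            stack_ptr: 0,
        }
    }

    /// Writes `entry` to the top slot and bumps the stack pointer.
    #[inline]
    pub fn push(&mut self, entry: StackEntry) {
        self.entries[self.stack_ptr] = entry;
        self.stack_ptr += 1;
    }

    /// Returns the top entry without removing it.
    #[inline]
    pub fn last(&self) -> StackEntry {
        self.entries[self.stack_ptr - 1]
    }

    /// Removes and returns the top entry.
    #[inline]
    pub fn pop(&mut self) -> StackEntry {
        self.stack_ptr -= 1;
        self.entries[self.stack_ptr]
    }
}
```

Each of these functions is only a line or two, so an out-of-line call can plausibly cost more than the work itself. That is presumably why the interpreter loop is so sensitive to whether they get inlined.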
Hi @tavianator and thanks a lot for your research & report! I should have mentioned that I also tried out earlier to annotate literally all functions in the workspace with I was still not very happy about the outcome due to 3 reasons:
I am not at all sure how we want to go ahead with these |
We will build the runtime with |
We can even see that sprinkling |
Right, I agree that sprinkling I think especially the ones in |
It does for all functions marked as |
Right sorry, that's what I meant. No cross-crate inlining without non-local LTO or |
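As a rough illustration of the distinction discussed here, the sketch below contrasts three cases. All function names are hypothetical, and the comments describe how rustc has generally behaved. Exact inlining decisions depend on the compiler version and optimization settings, and the attributes are hints rather than guarantees.

```rust
/// Non-generic function: normally compiled once, in the crate that defines it.
/// `#[inline]` additionally makes the body available to downstream crates,
/// so they may inline it even without LTO.
#[inline]
pub fn wrapping_inc(value: u32) -> u32 {
    value.wrapping_add(1)
}

/// Generic function: instantiated (monomorphized) in the crate that uses it,
/// so its body is visible to that crate's optimizer without any annotation.
pub fn largest<T: PartialOrd + Copy>(a: T, b: T) -> T {
    if a > b {
        a
    } else {
        b
    }
}

/// `#[inline(never)]` asks the compiler to keep a rarely used path out of line,
/// which can help keep hot loops small.
#[inline(never)]
pub fn cold_path(code: u32) -> String {
    format!("trap code {}", code)
}
```

With `lto = "fat"` the whole dependency graph is optimized together, which is one reason the annotations matter much less under that profile.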
It is unlikely that this bottomless cask of pitfalls and footguns can ever be resolved. The only actual solution to controlling codegen of |
Currently wasmi_v1 is performing pretty well compared to the old wasmi on the following profile:
However, for the default profile which is
We can see a slowdown of factor x4.21, which is pretty terrible.
For more information and stats see this GitHub Gist.
We need to investigate the reasons for this particularly bad slowdown under default profile configuration.
Usually you can see performance differences in the range of 10-15% but not in the range of 400-500%.
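Since this thread mixes factors ("x4.21", "2x") with percentages ("100% speedup", "10-15%"), a small sketch of how the two relate may help. The function names and the timings used in the usage note are hypothetical; only the x4.21 factor and the "100% speedup (2x)" pairing come from the discussion.

```rust
/// Slowdown factor of a slow run relative to a fast run (same time unit for both).
pub fn slowdown_factor(fast: f64, slow: f64) -> f64 {
    slow / fast
}

/// Extra time in percent implied by a slowdown factor (x2.0 means 100% more time).
pub fn percent_slower(factor: f64) -> f64 {
    (factor - 1.0) * 100.0
}

/// Factor implied by a speedup given in percent (100% speedup means x2.0).
pub fn factor_from_speedup_percent(percent: f64) -> f64 {
    1.0 + percent / 100.0
}
```

For example, with made-up timings of 100 ns and 421 ns, `slowdown_factor(100.0, 421.0)` is about 4.21, and `percent_slower` of that is about 321. In other words, the slow run takes roughly 421% of the fast run's time, or about 321% more time.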
The Rust source files implementing the wasmi_v1 engine can be found here:
wasmi_v1 engine:
wasmi_v1 engine: