Looking at
So I ran our basic benchmarks; those are mostly unchanged, all good. But then I ran a small subset of our pallet benches, and it seems like there are some huge performance regressions here. (These numbers are from the newest run.)

The full table of all of the benches I ran: Benchmark results

Here are the top regressions: Benchmark results

So normal pallets are ~5% slower, while the contracts pallet is around ~50% slower. This is very worrying. Assuming I haven't screwed something up while benchmarking (this was just a quick back-of-the-envelope run), merging this might tank the performance of chains like e.g. Moonbeam.
Okay, that sounds bad. Maybe we should try the bench bot and look at its results to see whether it comes to a similar conclusion.
/cmd queue -c bench-bot $ pallet dev pallet_contracts

@bkchr https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1775806 was started for your command

@bkchr Command
Not great but also not horrible either: SWC,
Did you take the component ranges into account?
Nice tool! So, what are we going to do? I assume the wasmtime guys will not accept reverting their commit that makes our code slower? 😬
There are quite a lot of these migrations to this
My comparisons were based on the raw times from the
I agree! @ggwpez How exactly are the "Old [Time]" and "New [Time]" columns calculated there? Are those based on raw times?
I'm sure they won't, but they do care about performance. A regression like this is surely bound to affect other people too. So what I think we should do is to make a minimal reproduction that clearly demonstrates the problem (port the benchmark into a standalone WASM blob and make a minimal standalone host which calls into it), and post an issue on their issue tracker.
I could do this but... I just said the other day I don't want to get distracted anymore before clearing up my current TODO list, so... if there are any volunteers, feel free to take a stab, and if not then I'll do it... (:
It looks like the regression is
Yeah, for raw WASM performance it makes sense not to look at the post-processed weights.
It uses the weight formulas from the Rust weights.rs files by parsing the code, which shows that these weights were probably outdated, since the number of storage accesses changed.
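For context, the formulas in those weights.rs files are linear in the benchmark components: a constant base cost plus a slope per component. A simplified, hypothetical sketch of the shape of such a generated function (the function name, the component `r`, and all constants here are made up for illustration):

```rust
// Hypothetical sketch of the shape of an autogenerated weight function
// from a weights.rs file: a constant base cost plus a per-component slope.
// All names and numbers are made up for illustration.
fn seal_caller(r: u64) -> u64 {
    // Base weight of the operation.
    let base: u64 = 246_000_000;
    // Additional weight per repetition `r` of the benchmarked operation.
    let per_r: u64 = 71_000_000;
    base.saturating_add(per_r.saturating_mul(r))
}

fn main() {
    // Evaluating the formula at a component value gives the final weight.
    println!("weight for r = 10: {}", seal_caller(10));
}
```

A tool that parses these formulas can evaluate old and new versions at the same component values, which is why stale formulas (e.g. after the number of storage accesses changed) would skew the comparison.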
Sounds like we should regenerate these weights on master to have a fair comparison? |
Looks like there are no volunteers after all, so I'll do it. (: |
That doesn't look to be the case from what I can see. I've hacked up a benchmark across many wasmtime versions, and here are the results I got for coremark (coremark scores, so higher is better):
So there might be a slight performance regression between 0.38 and 0.40, but it's nothing major. I'll try something heavier. Maybe the wasmi benches from here.
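The cross-version comparison above boils down to timing the same workload under each wasmtime version and comparing scores. A stdlib-only sketch of such a harness (the `workload` function is a made-up stand-in for the real benchmark body, e.g. a coremark run inside a wasmtime instance):

```rust
use std::time::Instant;

// Made-up stand-in for the real workload; only the timing scaffolding
// around it matters for the comparison.
fn workload() -> u64 {
    (0..1_000_000u64).fold(0, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

fn main() {
    // Take the best of several runs to reduce measurement noise.
    let best = (0..5)
        .map(|_| {
            let start = Instant::now();
            // black_box keeps the optimizer from deleting the workload.
            std::hint::black_box(workload());
            start.elapsed()
        })
        .min()
        .expect("at least one run");
    println!("best of 5: {:?}", best);
}
```

Taking the minimum over repeated runs is a common choice for this kind of quick comparison, since it filters out one-sided noise from scheduling and caches.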
I still don't know why this happens, but I've just discovered something funny. Apparently doing this in the Cargo.toml:

```diff
- sp-io = { version = "6.0.0", default-features = false, path = "../../primitives/io" }
+ sp-io = { version = "6.0.0", default-features = false, path = "../../primitives/io", features = ["improved_panic_error_reporting"] }
```

I've run this multiple times to be sure; the results are consistent. It just doesn't really make much sense, as it only enables the feature from this PR of mine from a while back, so my guess is that it affects the optimizations somehow in a weird way? It would explain why I can't reproduce the issue in a minimal example - I've ported the contracts benchmark into a minimal example but I see no regression there. So I guess I'll either have to go the other way around - start with a full substrate build and just start deleting code until I end up with a minimal reproduction, or (and this is probably a better idea) bisect
Your conclusion seems plausible. Optimization behavior can be erratic. This is further supported by the fact that you can't reproduce the regression when benchmarking wasmi directly. |
I am not aware of how that could possibly affect optimizations. However, 0.40 implemented a custom stack walking mechanism which is supposed to make those things faster (although I did not dig into this current problem enough to be confident).
Okay, this is one of the weirdest heisenbugs I've ever seen. I did bisect it. So, what I can conclude right now:
I still don't know exactly why this happens, but I can already see it's (as usual) probably going to be something very stupid.
@koute maybe there is something like the ahash compile-time randomness? I.e. something that gets sourced at compile time and whose value changes based on how you compile it? Or that may change random code that doesn't really seem to be related?
Yeah, that's one of the things I'm checking. Although it seems like the behavior is consistent between compilations (if I compile the same code with exactly the same flags I get the same behavior), so if it's due to the randomness it's not using a truly random seed.
Should be fine to merge after making sure the rustix version we use is aligned with what wasmtime uses. (IIRC they should be the same now, so we shouldn't need to have two of them anymore.)
FWIW, we have #10707 for tracking the non-MVP features on the executor
frame/contracts/src/weights.rs (Outdated)

```diff
@@ -18,22 +18,22 @@
 //! Autogenerated weights for pallet_contracts
 //!
 //! THIS FILE WAS AUTO-GENERATED USING THE SUBSTRATE BENCHMARK CLI VERSION 4.0.0-dev
-//! DATE: 2022-06-22, STEPS: `50`, REPEAT: 20, LOW RANGE: `[]`, HIGH RANGE: `[]`
+//! DATE: 2022-08-25, STEPS: `50`, REPEAT: 20, LOW RANGE: `[]`, HIGH RANGE: `[]`
```
Do we need that file updated? Why this, and why not other benchmarks?
Because we only had the bot run for this.
primitives/wasm-interface/Cargo.toml (Outdated)

```diff
@@ -18,7 +18,7 @@
 codec = { package = "parity-scale-codec", version = "3.0.0", default-features =
 impl-trait-for-tuples = "0.2.2"
 log = { version = "0.4.17", optional = true }
 wasmi = { version = "0.9.1", optional = true }
-wasmtime = { version = "0.38.0", default-features = false, optional = true }
+wasmtime = { version = "0.40.0", default-features = false, optional = true }
```
Nit: make it 0.40.1, as in the executor's Cargo.toml?
Thanks for tracking this down :)
bot merge

Waiting for commit status.

Merge cancelled due to error. Error: Statuses failed for 4a86689
@athei so should I regenerate the weights again here?
After further investigation I have some final notes regarding the slowdown we saw here. Let me quickly walk you through it.

Details

The `set_bytes` routine looks like this:

```rust
#[inline(always)]
pub unsafe fn set_bytes(mut s: *mut u8, c: u8, mut n: usize) {
    #[inline(always)]
    pub unsafe fn set_bytes_bytes(mut s: *mut u8, c: u8, n: usize) {
        let end = s.add(n);
        while s < end {
            *s = c;
            s = s.add(1);
        }
    }

    #[inline(always)]
    pub unsafe fn set_bytes_words(s: *mut u8, c: u8, n: usize) {
        let mut broadcast = c as usize;
        let mut bits = 8;
        while bits < WORD_SIZE * 8 {
            broadcast |= broadcast << bits;
            bits *= 2;
        }

        let mut s_usize = s as *mut usize;
        let end = s.add(n) as *mut usize;

        while s_usize < end {
            *s_usize = broadcast;
            s_usize = s_usize.add(1);
        }
    }

    if likely(n >= WORD_COPY_THRESHOLD) {
        // Align s
        // Because of n >= 2 * WORD_SIZE, dst_misalignment < n
        let misalignment = (s as usize).wrapping_neg() & WORD_MASK;
        set_bytes_bytes(s, c, misalignment);
        s = s.add(misalignment);
        n -= misalignment;

        let n_words = n & !WORD_MASK;
        set_bytes_words(s, c, n_words);
        s = s.add(n_words);
        n -= n_words;
    }

    set_bytes_bytes(s, c, n);
}
```

Now, you can just ignore most of this code as it's irrelevant to the performance; it's there to handle things like misaligned pointers or very small buffers. What we're most interested in is this little inner loop inside of `set_bytes_words`:

```rust
while s_usize < end {
    *s_usize = broadcast;
    s_usize = s_usize.add(1);
}
```

What it does is simply fill the memory in a loop, one word-sized chunk at a time.

Now, here's how our friend wasmtime compiled this loop:

```asm
loop:
  41 8b c2          mov %r10d,%eax              ; moves the pointer (well, offset relative to the linear memory to be exact) to the buffer into another register
  44 89 5c 06 00    mov %r11d,0x0(%rsi,%rax,1)  ; writes 4 bytes into our buffer (rsi is the linear memory base pointer and this instruction adds it together with our pointer from rax/eax before dereferencing)
  41 83 c2 04       add $0x4,%r10d              ; move the pointer forward by 4 bytes
  45 39 ca          cmp %r9d,%r10d              ; is the pointer equal to the end pointer, that is - have we reached the end?
  0f 82 eb ff ff ff jb loop                     ; if not, repeat the loop
```

So now let me show this very same snippet a few more times, but at different offsets in memory:

(1) Offset of 0 (fast execution)
(2) Offset of 1 (slow execution)
...all of the offsets in-between here are also slow...
(3) Offset of 13 (still slow execution)
...all of the offsets in-between here are also slow...
(4) Offset of 20 (still slow execution)
(5) Offset of 21 (fast execution again!)

The first number is the address of the instruction, the second number is the cache line, and the third number (the one with a '+') is the offset within that cache line. Noticed anything? Yes! In every slow case the execution of this loop straddles two cache lines! As long as all of the code fits within a single cache line (64 bytes) everything is fast, but if it doesn't then it slows down dramatically.

Conclusion

Depending on the phase of the moon (that is, on how the surrounding code happens to shift this hot loop around in memory) the loop either fits within a single cache line or straddles two, and that alone decides whether execution is fast or slow. Personally I'd classify this as a bug.
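As a footnote, the cache-line arithmetic behind this analysis is easy to reproduce. A minimal sketch, assuming 64-byte cache lines and taking the 21-byte loop length (3 + 5 + 4 + 3 + 6 bytes) from the disassembly above; whether a straddle actually translates into a slowdown of course depends on the CPU's fetch/decode behavior:

```rust
// Minimal sketch of the cache-line arithmetic from the analysis above,
// assuming 64-byte cache lines.
const CACHE_LINE_SIZE: usize = 64;
// Length of the hot loop in the disassembly: 3 + 5 + 4 + 3 + 6 bytes.
const LOOP_LEN: usize = 21;

/// Which cache line a given address falls into.
fn cache_line(addr: usize) -> usize {
    addr / CACHE_LINE_SIZE
}

/// True if the byte range [start, start + len) spans more than one cache line.
fn straddles(start: usize, len: usize) -> bool {
    len > 0 && cache_line(start) != cache_line(start + len - 1)
}

fn main() {
    // A loop starting exactly on a cache line boundary fits entirely within it...
    assert!(!straddles(0x1000, LOOP_LEN));
    // ...while one starting near the end of a line spills into the next one.
    assert!(straddles(0x1000 + 50, LOOP_LEN));
    println!("ok");
}
```

Running a check like this over the addresses in a disassembly dump is enough to flag which placements of a hot loop cross a cache line boundary.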
Yes. I assume you resolved the conflict by choosing what is currently on master.
/cmd queue -c bench-bot $ pallet dev pallet_contracts

@bkchr https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1818801 was started for your command

@bkchr Command
bot merge

Waiting for commit status.

Merge cancelled due to error. Error: Statuses failed for f8770aa
* Upgrade wasmtime to 0.40.0
* ".git/.scripts/bench-bot.sh" pallet dev pallet_contracts
* Update in the other Cargo.toml
* ".git/.scripts/bench-bot.sh" pallet dev pallet_contracts

Co-authored-by: command-bot <>
Co-authored-by: Alexander Theißen <alex.theissen@me.com>
Closes: #12073