# Document and improve wasmer performance for contracts #10298
Intermediate results so far. tldr: wasmer is ~twice as slow at instantiating modules and ~twice as fast at executing calls when compared to wasmi.

## The Setup

I disabled the embedded executor, so both wasmi and wasmer were executed through the sandbox API from the Substrate client. That way the control flow is as close as possible between the two backends, so we can compare the results directly. I executed the node using the cli below, but for wasmi execution I hacked `client/executor/wasmtime/src/host.rs` like this:

```diff
 pub fn new(allocator: FreeingBumpHeapAllocator) -> Self {
     HostState {
         sandbox_store: SandboxStore(Some(Box::new(sandbox::Store::new(
-            sandbox::SandboxBackend::TryWasmer,
+            sandbox::SandboxBackend::Wasmi,
         )))),
```
I also tried to measure the whole lifetime of the call.

## The Results

These two files are part of the node log during dispatch of the `Flipper::flip` transaction. Important lines are surrounded by newlines. Best viewed side by side.

https://gist.github.com/0x7CFE/b7db97b07620aca2fa860c2296604cd0#file-wasmi2-txt

## Result Interpretation

Here I show notable lines from the files above with my commentary. See the whole log for the full picture. To simplify reading I removed irrelevant info and formatted the time data with spaces.

During the module initialization phase we clearly see that wasmi is way ahead of wasmer, being almost twice as fast. This is not surprising at all, since wasmer needs to perform the computationally expensive task of compiling wasm to native machine code, whereas wasmi just transpiles the instructions into a format better suited for immediate execution:
However, once the modules are initialized, the situation is very much reversed. Here's an example of handling a host function:
The topmost invoke (near the end of the log file) is also very much in favor of wasmer:
Again, I should note that these measurements are based on the execution of a single transaction, so please take them with a grain of salt. Still, they cover a few dozen invocations and, I think, show pretty convincingly that wasmer is good at executing stuff but really falls behind during the instantiation phase.

So, what gives? Unfortunately, our current setup is pretty straightforward: we instantiate the module right before making the call:

```rust
// Instantiate the instance from the instrumented module code and invoke the contract entrypoint.
let mut runtime = Runtime::new(ext, input_data, memory);
let result = sp_sandbox::default_executor::Instance::new(&code, &imports, &mut runtime)
    .and_then(|mut instance| instance.invoke(function.identifier(), &[], &mut runtime));
```

So, even though wasmer is able to execute code fast, it makes no practical difference in the current setup. This is especially visible in the smart contract scenario, where the code is small and often contains only a tiny number of instructions.

## Thoughts and Ideas
## Further Work
This makes me a bit sad. Is there an issue in substrate for this? I believe others have struggled with this as well in the past (cc @Imod7 perhaps?).
I don't know how representative this call is, but imo it'd be good to repeat the benchmark with a few other function calls just to ensure that the overall conclusion is not overly skewed by the specific code in `Flipper::flip`.
This seems like the most pressing question. Is it easy to measure the actual compilation time vs compiler startup?
Well, it's the simplest contract call I can imagine, so I essentially wanted to draw a baseline here. It could easily happen that for the simplest of calls wasmi is faster because it exits early. However, what we see is that even on such a trivial method, wasmer still executes faster than wasmi. Also, don't get me wrong: even such a tiny method is not free. On the other hand, we surely need to measure something like ERC20 for a real-world scenario, but I expect it to be mostly the same: slower init, faster execution.
That's a question on its own. P.S.: I've added some more entries to the thoughts and ideas section.
This suggests that we can rule out our theory that it is due to slow execution of host functions by wasmer. It is all about the startup, which was the most likely scenario anyway. Transformation into machine code takes time. I still would like to see this repeated with an ink! ERC20 transfer, though. Also, having this somehow integrated into substrate or some other repo so it is easily repeatable would be very nice. I don't want all your work to go to waste only to have to redo it the next time we need it. I think it would be nice to have this as some kind of regression test (doesn't need to be completely automated).
I think that this is the next question that you should investigate. I guess this would mean profiling wasmer itself. All other thoughts about cutting down compile times are useless if we don't know whether that is actually the bottleneck.
There is no other point in time we could possibly instantiate the module. We don't know which code to instantiate until someone makes a call (very different from the runtime itself). We can't do some basic LRU cache of modules if you are thinking about that: The cache can be gamed (warm/cold can be different for different validators). Also, good luck putting weights on that.
Caching machine code is problematic because it is platform dependent and the cache lives within the runtime. If we could introduce an IR to wasmer singlepass that is trivially translatable to all supported machines, this could be a target for caching. Probably some basic register machine byte code. SSA is not possible because there is no O(1) transformation from wasm. On the other hand, not having an IR is kind of the point of a single pass compiler: it reduces compilation times by leaving out this step. But having two O(1) transformations could be worth it for the sake of supporting caching.
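To make the platform-dependence concrete, here is a minimal sketch of what machine-code caching could look like, assuming the wasmer 2.x `Module::serialize`/`Module::deserialize` API. The cache directory and keying are hypothetical, and of course a runtime-hosted cache couldn't touch disk like this; the point is only that the artifact is native code, so a real key would also have to include the target triple and compiler version:

```rust
use std::{fs, path::PathBuf};
use wasmer::{Module, Store};

// Hypothetical cache layout: one file per code hash. A real key must also
// encode the target triple and wasmer version, since the serialized artifact
// is native machine code (the platform dependence discussed above).
fn load_or_compile(store: &Store, code_hash: &str, wasm: &[u8]) -> anyhow::Result<Module> {
    let path = PathBuf::from("/tmp/wasm-cache").join(code_hash);
    if let Ok(bytes) = fs::read(&path) {
        // SAFETY: only sound if the artifact was produced by the same wasmer
        // build on the same machine; loading a foreign artifact is undefined.
        if let Ok(module) = unsafe { Module::deserialize(store, &bytes) } {
            return Ok(module);
        }
    }
    let module = Module::new(store, wasm)?; // full singlepass compilation
    fs::create_dir_all("/tmp/wasm-cache")?;
    fs::write(&path, module.serialize()?)?;
    Ok(module)
}
```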
Having a dynamic recompiler that only compiles functions on entry (or even smaller chunks) could cut down on compile times if that is really what makes it expensive. The context switch between interpretation and compilation could pessimise performance, though. You would still pass in the whole module. I think this is essentially the same as your Hotspot idea.
Ok, so I've made some measurements that I'm now confident of. Please note that I redid the tests using manual measurement, since I discovered that for some reason the original numbers could not be trusted.

Long story short: the performance is comparable, with the exception of module instantiation, which is considerably slower on Wasmer. Everything else is more or less on par.

The tests were performed on the following branch: https://github.com/paritytech/substrate/commits/dk-tracing-executor. Here are links to recorded Wasmi and Wasmer sessions.
Here are the data grepped from the logs, grouped into per-call tables.
All things combined make me think that the bottleneck is not in the execution engine but in the way we dispatch wasm calls. JIT shines when it's able to fully utilize modern super-scalar hardware, especially speculative execution and cache prefetch. In our case it's impossible to do so because execution keeps jumping back and forth between the substrate core and wasm. Of course, it's just my speculation, but I can't imagine other reasons why an interpreter and a compiler show similar results.

Interestingly, Wasmer is generally faster when doing the calls themselves. If only we could cache instantiated modules, then we'd probably be able to squeeze some performance out of Wasmer. Still, I don't think it's practical given that the performance win is marginal.

P.S.: I did another test session where I removed all logs.
I don't know how you come to this conclusion, assuming that you mean imported function calls by "wasm calls". The values you measured don't obviously support it.
Again, this doesn't necessarily follow from the numbers observed. Compilation performance can be the culprit.
From what I am seeing here, I think the most likely scenario is the obvious one: faster execution of the compiled code cannot recuperate the big up-front cost of code compilation. This is especially true if just a small part of the code is executed.
If we assume that compilation is the culprit, then caching would make all the difference.

## Thoughts and Ideas

You asked me to comment on them. I will do this here. But in general I think this issue should only be about assessing the status quo. All of those ideas are massive undertakings and shouldn't be tackled on a hunch. First, we need to know what is going on. Otherwise we could be optimizing at the completely wrong end. We need to verify our assumption.
Yes, you are absolutely right. Maybe I failed to express my point, but my conclusions are mostly the same.
My understanding is that compiled code should be an order of magnitude faster, which is definitely not the case, as we see in the measurements.
Correct me if I am wrong: your measurements always measure compilation + execution. This means that even if the execution is much faster, it could still be offset by some very slow compilation. So you can't conclude from that that the execution isn't a magnitude faster. For all we know it could be. As a matter of fact, it is suggested by the synthetic instruction benchmarks I posted in the top post.
Same speed is completely fine there, assuming that the rest of the code executes much faster. If imported function calls are roughly the same speed, that should still yield much better overall performance. But we can't observe any speedup, and I still think that compilation costs are the most likely explanation: it is the most obvious one and doesn't contradict any measurement.
I don't get why you think it should be that. I mean it could be, but compilation seems to be more likely. You could test this assumption by writing two contract calls where one does many more host function calls and the other much more arithmetic instructions, stack access, etc.
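For illustration, such an A/B pair could look roughly like this in ink! (a hypothetical sketch, not one of the contracts benchmarked in this thread; `block_number` is picked here only as a cheap host function to hammer):

```rust
#![cfg_attr(not(feature = "std"), no_std)]

#[ink::contract]
mod ab_test {
    #[ink(storage)]
    pub struct AbTest {}

    impl AbTest {
        #[ink(constructor)]
        pub fn new() -> Self {
            Self {}
        }

        /// Dominated by host function calls: every `block_number()` call
        /// crosses the wasm <-> host boundary.
        #[ink(message)]
        pub fn host_heavy(&self, n: u32) -> u32 {
            let mut acc = 0u32;
            for _ in 0..n {
                acc = acc.wrapping_add(self.env().block_number());
            }
            acc
        }

        /// Dominated by plain wasm instructions: no host calls in the loop,
        /// so a compiling backend should shine here if execution speed is
        /// really the differentiator.
        #[ink(message)]
        pub fn compute_heavy(&self, n: u32) -> u32 {
            let mut acc = 1u32;
            for i in 1..=n {
                acc = acc.wrapping_mul(i).wrapping_add(i);
            }
            acc
        }
    }
}
```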
I can be wrong too, but my current understanding is that wasmer does the full compilation during module creation, which takes the wasm bytecode and a store. So, effectively, everything related to native code generation is done during module creation. Everything else deals with the already compiled native module. I have done some quick checks and indeed, the data backs up my theory:
Note that the majority of the time is spent during module creation, where the compilation takes place. Also note that this procedure is pure in the sense that it depends on nothing but the wasm blob. If only we could cache it, then all subsequent module instantiations would happen in the microsecond range. Even if we cached the compiled module for just one block, we would already be able to shave off some time. Even in my test scenario the module gets compiled 7 times.
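A minimal way to check this split against wasmer directly, outside of Substrate (a sketch assuming the wasmer 2.x API; the file name and exported function are placeholders, and a real contract would need its host functions wired up instead of the empty import object):

```rust
use std::time::Instant;
use wasmer::{imports, Instance, Module, Store, Universal};
use wasmer_compiler_singlepass::Singlepass;

fn main() -> anyhow::Result<()> {
    // Placeholder path; works as-is only for a module with no imports.
    let wasm = std::fs::read("contract.wasm")?;
    let store = Store::new(&Universal::new(Singlepass::new()).engine());

    let t = Instant::now();
    let module = Module::new(&store, &wasm)?; // singlepass compilation happens here
    println!("Module::new (compile): {:?}", t.elapsed());

    let t = Instant::now();
    let instance = Instance::new(&module, &imports! {})?; // linking only, no codegen
    println!("Instance::new: {:?}", t.elapsed());

    let t = Instant::now();
    instance.exports.get_function("call")?.call(&[])?;
    println!("invoke: {:?}", t.elapsed());
    Ok(())
}
```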
Okay, I got confused because I didn't understand what is shown in the tables. We cleared that up over Element:
Given this information it is indeed curious why, even with all the code compiled, wasmer doesn't outperform wasmi by a huge margin. Compilation is factored out already. Wasmer looks really bad: it takes a long time to create the module and then doesn't perform faster at execution. But why would it be the host function calls? You measured them as roughly equal in time. I guess the A/B test I mentioned would be the next step: comparing a call with many vs. few host function calls.
I made some more tests. Deploying the ERC20 contract makes roughly 1600 host function calls, which take about 13.4 ms in total, leaving only ≈2.5 ms for everything else. In such a scenario, even if wasmer performed 1000 times faster than wasmi, it could only affect those remaining ≈2.5 ms. Even in the ideal case the deployment would take no less than 13.4 ms + ε. On the other hand, for other calls it's not that bad. Overall I think that host function overhead explains why wasmer does not show a 10x speedup in all tests. Yet, that alone cannot explain why wasmer is sometimes slower even when we factor out the compilation.
1600 host function calls for a single deploy of an ERC20 contract? That seems too much. Or is it a batch of calls?
Agreed. It does explain why it isn't able to outperform it by a huge margin, but not why it is slower, unless the host functions were much slower on wasmer. Maybe the slowness isn't correctly captured in your measurements because the context switch makes the following non-host-function instructions slower (cold caches due to a lot of code being shuffled around). I think you already mentioned this theory somewhere.
Unfortunately not. It's for a single deployment. I also think that's a bit too much, but I don't know the contract internals. The wasmi log shows the same story, so it's definitely not Wasmer's fault. Also, the call that takes 22.064872 ms in the table above makes even more, 2244 host function calls that take 8.4 ms in total. So it's not specific to deploy alone.
Yes, I think that is the case, not to mention that the massive log output introduces delays of its own. I think now we need to aggregate data over many invocations, with and without host function calls, to say for sure.
I assume it is the gas metering then. It generates one host function call for every basic block. I think what we can learn from that is that adding disk caching to wasmer (with an IR language) + implementing paritytech/wasm-instrument#11 could make wasmer competitive.
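For context, the injection looks roughly like this with the wasm-instrument crate (a hedged sketch; the exact `gas_metering` names and signatures are assumptions about its 0.x API, so check the crate docs before relying on this):

```rust
use wasm_instrument::{
    gas_metering::{self, ConstantCostRules},
    parity_wasm::{self, elements::Module},
};

fn add_gas_metering(code: &[u8]) -> anyhow::Result<Vec<u8>> {
    let module: Module = parity_wasm::deserialize_buffer(code)?;
    // `inject` prepends a call to the imported `env.gas` host function to
    // every basic block: this is where the ~1600 host calls per deploy
    // observed above would come from. Costs here (1 per instruction, free
    // memory.grow) are illustrative only.
    let metered = gas_metering::inject(module, &ConstantCostRules::new(1, 0), "env")
        .map_err(|_| anyhow::anyhow!("gas metering injection failed"))?;
    Ok(parity_wasm::serialize(metered)?)
}
```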
To confirm that assumption, @0x7CFE can disable gas metering and do one more measurement. Also, to avoid the influence of debug logs, you can measure the total time of the execution and the time to init the executor; the difference is the time of call execution.
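A minimal sketch of that methodology (the executor functions below are placeholders standing in for the real Substrate plumbing, not actual APIs):

```rust
use std::time::Instant;

struct Executor;
fn init_executor() -> Executor { Executor } // placeholder for the real setup
fn execute_call(_executor: &Executor) { /* dispatch the contract call here */ }

fn main() {
    let total = Instant::now();

    let init = Instant::now();
    let executor = init_executor();
    let init_time = init.elapsed();

    execute_call(&executor);
    let total_time = total.elapsed();

    // No per-call logging inside the hot path; the subtraction isolates
    // execution time from setup time.
    println!("init: {:?}, execution: {:?}", init_time, total_time - init_time);
}
```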
I no longer work on this.
Yes, I also thought of this. Redoing this with gas metering disabled would be interesting to confirm the theory.
No one is working on this anymore. It was de-prioritized because we are going with in-runtime wasmi for now, which will get some performance updates soon.
With this PR here: #12501 we can probably close this?
Yea. I don't think wasmer is happening any time soon.
## Motivation
Currently, we use an embedded wasmi (compiled to wasm as part of the runtime) to execute contracts. We want to switch to the wasmer singlepass backend. However, in order to make an informed decision we need to document the status quo.
We have two benchmark approaches to assess the performance of contract execution:
## Synthetic Benchmarks
This is a suite of runtime benchmarks that are used to assign a weight value to every host function available to contracts and to each supported wasm instruction. These benchmarks are defined in this file. You can have a look at #10266 to find out how to run these benchmarks with the benchbot and what the current diff to wasmi is.
Insights from those benchmarks:

- The `seal_caller` benchmark is dominated by host function call overhead.
- The `call` benchmark is dominated by plain instruction execution, where wasmer comes out ahead even including its compilation overhead.

## End to end benchmarks
The end to end benchmarks are also runtime benchmarks, but they are flagged as `extra` so they are not included in the emitted weight file. They are just for manual execution and inspection. Instead of using procedurally generated fixtures like the other benchmarks, they use real world contracts written in ink! or solang. For now there are only two: benchmarking a `transfer` on an ink! ERC20 contract and on a solang ERC20 contract.

You can look into #10268 for how to run these benchmarks. You will also find results for running those with wasmi and wasmer.
By looking at those benchmark results we can learn that wasmer is 50% slower on the ink! example and 78% slower on the solang example.
## Conclusion
Instruction performance of wasmer is better than wasmi's (even including wasmer's compilation overhead). However, in an end-to-end benchmark wasmi is still faster. We need to find out why this is and resolve the issue if possible.
## Goal
The goal of this issue is to find out why those end-to-end benchmarks report results that are not in favor of wasmer. If the reason is some resolvable inefficiency in either wasmer or the wasmer integration, it should be fixed and the new results should be reported here. Those fixes should then be upstreamed to substrate and wasmer.
If the conclusion is that we cannot outperform interpretation for typical smart contract workloads, that is also a valid result. It needs to be backed up with additional end-to-end benchmarks, though. Maybe there are some workloads where compilation outperforms interpretation (loop heavy contracts) but not all. A solution where we offer two execution engines might be something to think about then.
## Steps