Optimize calling a WebAssembly function #2757

alexcrichton · 2021-03-23T18:41:46Z

This commit implements a few optimizations, mainly inlining, that should
improve the performance of calling a WebAssembly function. This code
path can be quite hot depending on the embedding case and we hadn't
really put much effort into optimizing the nitty gritty.

The predominant optimization here is adding `#[inline]` to trivial
functions so performance is improved without having to compile with LTO.
Another optimization is to call `lazy_per_thread_init` when traps are
initialized per-thread (when a `Store` is created) rather than each time
a function is called. The next optimization is to change the unwind
reason in the `CallThreadState` to `MaybeUninit` to avoid extra checks
in the default case about whether we need to drop its variants (since in
the happy path we never need to drop it). The final optimization is to
skip a few checks when `async` support is disabled for a small speed
boost.
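To make the `MaybeUninit` point concrete, here is a minimal sketch rather than the actual Wasmtime code. `CallState`, `UnwindReason`, `record`, and `finish` are hypothetical names invented for illustration, and the explicit `unwinding` flag is an assumption about how the initialized state might be tracked. The idea is that a `MaybeUninit<T>` field has no drop glue, so the happy path never inspects the unwind reason at all.

```rust
use std::any::Any;
use std::mem::MaybeUninit;

/// Hypothetical stand-in for the reason a call into wasm unwound.
pub enum UnwindReason {
    Trap(String),
    Panic(Box<dyn Any + Send>),
}

/// Hypothetical per-call state; not the real `CallThreadState`.
pub struct CallState {
    unwinding: bool,
    // Only initialized when `unwinding` is true. `MaybeUninit` never drops
    // its contents, so dropping `CallState` needs no variant check.
    reason: MaybeUninit<UnwindReason>,
}

impl CallState {
    // `#[inline]` lets this trivial constructor be inlined into other
    // crates without relying on LTO.
    #[inline]
    pub fn new() -> Self {
        CallState { unwinding: false, reason: MaybeUninit::uninit() }
    }

    /// Slow path: record why the call is unwinding.
    pub fn record(&mut self, reason: UnwindReason) {
        assert!(!self.unwinding, "reason already recorded");
        self.reason.write(reason);
        self.unwinding = true;
    }

    /// Consume the state. The happy path is a single flag test.
    /// Note: a recorded reason leaks if the state is dropped without
    /// calling `finish`; a real implementation must account for that.
    #[inline]
    pub fn finish(self) -> Result<(), UnwindReason> {
        if self.unwinding {
            // SAFETY: `record` initialized `reason` before setting the flag.
            Err(unsafe { self.reason.assume_init_read() })
        } else {
            Ok(())
        }
    }
}
```

With an `Option<UnwindReason>` field instead, the compiler would have to emit drop glue that checks the discriminant whenever the state goes out of scope, which is the extra check the description refers to.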

In a small benchmark where wasmtime calls a simple wasm function, the
overhead on my macOS computer dropped from 110ns to 86ns, roughly a 22%
decrease. The macOS overhead is still largely dominated by the global
lock acquisition and hash table management for traps right now, but I
suspect the Linux overhead is much better (should be on the order of
~30ns or so).
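For readers who want to reproduce this kind of measurement, the following is a rough sketch of a per-call timing harness using only the standard library. It is not the benchmark used for the numbers above. `ns_per_call` and `percent_decrease` are hypothetical helpers, and the closure passed in is a placeholder for whatever invokes the wasm function in a given embedding.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Average wall-clock nanoseconds per invocation of `f` over `iters` calls.
/// `f` is a placeholder for the call under test.
pub fn ns_per_call<F: FnMut() -> u64>(iters: u32, mut f: F) -> f64 {
    // Warm up caches and any lazy initialization before timing.
    for _ in 0..1_000 {
        black_box(f());
    }
    let start = Instant::now();
    for _ in 0..iters {
        // `black_box` keeps the optimizer from discarding the call.
        black_box(f());
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

/// Relative reduction from `before` to `after`, in percent.
pub fn percent_decrease(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}
```

Calling `percent_decrease(110.0, 86.0)` gives the relative reduction for the two figures quoted above. Timings at this scale are noisy, so a dedicated benchmarking harness is usually a better choice for anything beyond a quick check.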

We still have a long way to go to compete with SpiderMonkey, which, in
testing, seems to have ~6ns overhead in calling the same wasm function on
my computer.

fitzgen

Nice!


github-actions · 2021-03-23T19:46:59Z

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.



* Combine stack-based cleanups for faster wasm calls

  This commit is an extension of #2757 where the goal is to optimize entry into WebAssembly. Currently wasmtime has two stack-based cleanups when entering wasm, one for the externref activation table and another for stack limits getting reset. This commit fuses these two cleanups together into one and moves some code around so that closures capture fewer variables, which speeds up calls into wasm a bit more. Overall this drops the execution time from 88ns to 80ns locally for me.

  This also updates the atomic orderings when updating the stack limit from `SeqCst` to `Relaxed`. While `SeqCst` is a reasonable starting point, `Relaxed` should be safe here since we're not using the atomics to actually protect any memory; they're simply receiving signals from other threads.

* Determine whether a pc is wasm via a global map

  The macOS implementation of traps recently changed to using mach ports for handlers instead of signal handlers. This means that a previously relied-upon invariant, that each thread fixes its own trap, was broken. The macOS implementation worked around this by maintaining a global map from thread id to thread-local information.

  This global map is quite slow though. It involves taking a lock and updating a hash map on all calls into WebAssembly. In my local testing this accounts for >70% of the overhead of calling into WebAssembly on macOS. Naturally it'd be great to remove this!

  This commit fixes this issue and removes the global lock/map that is updated on all calls into WebAssembly. The fix is to maintain a global map of wasm modules and their trap addresses in the `wasmtime` crate. Doing so is relatively simple since we're already tracking this information at the `Store` level.

  Once we've got a global map, the macOS implementation can use it from a foreign thread and everything works out.

  Locally this brings the overhead, on macOS specifically, of calling into wasm from 80ns to ~20ns.

* Fix compiles

* Review comments
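As an illustration of the global-map idea, here is a minimal sketch of a registry that answers "is this program counter inside wasm code?" from any thread. It is not the `wasmtime` crate's implementation. `CodeRegistry` and its methods are hypothetical, and storing only address ranges (rather than per-module trap metadata) is a simplifying assumption. The map is written only when code is loaded or unloaded, so ordinary calls into wasm never take the lock.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

/// Hypothetical registry of address ranges that hold compiled wasm code.
/// Maps each range's start address to its exclusive end address.
pub struct CodeRegistry {
    ranges: RwLock<BTreeMap<usize, usize>>,
}

impl CodeRegistry {
    pub fn new() -> Self {
        CodeRegistry { ranges: RwLock::new(BTreeMap::new()) }
    }

    /// Called when a module's code is loaded (rare), not on every call.
    pub fn register(&self, start: usize, end: usize) {
        assert!(start < end, "empty or inverted code range");
        self.ranges.write().unwrap().insert(start, end);
    }

    /// Called when a module's code is unloaded.
    pub fn unregister(&self, start: usize) {
        self.ranges.write().unwrap().remove(&start);
    }

    /// Usable from any thread, for example a dedicated trap-handling thread.
    pub fn is_wasm_pc(&self, pc: usize) -> bool {
        let ranges = self.ranges.read().unwrap();
        // Find the range with the greatest start address that is <= pc,
        // then check that pc falls before that range's end.
        match ranges.range(..=pc).next_back() {
            Some((_, &end)) => pc < end,
            None => false,
        }
    }
}
```

A process-wide instance could live in a `static`; the sketch keeps it as a plain value so it is easy to exercise in isolation. Assumes registered ranges do not overlap.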


* Bring back per-thread lazy initialization

  Platforms Wasmtime supports may have per-thread initialization that needs to run before WebAssembly. For example Unix needs to set up a sigaltstack and macOS needs to set up mach ports. In #2757 this per-thread setup was moved out of the invocation of a wasm function, relying on the lack of `Send` for `Store` to initialize the thread at `Store` creation time and never worry about it later.

  This conflicted with [wasmtime's desired multithreading story](#2812) so a new [`Store::notify_switched_thread` was added](#2822) to explicitly indicate a `Store` has moved to another thread (if it unsafely did so).

  It turns out though that it's not always easy to determine when a `Store` moves to a new thread. For example the Go bindings for Wasmtime are generally unaware when a goroutine switches OS threads. This led to bytecodealliance/wasmtime-go#74 where a SIGILL was left uncaught, making it appear that traps aren't working properly.

  This commit revisits the decision in #2757 and moves per-thread initialization back into the path of calling into WebAssembly. This differs from the original approach, though, in that there's still only one TLS access on the path of calling into WebAssembly, whereas previously initialization was a separate access. This allows us to get the speed benefits of #2757 as well as the flexibility benefits of not having to explicitly move a store between threads. With this new ability this commit deletes the recently added `Store::notify_switched_thread` method since it's no longer necessary.

* Fix a test compiling
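The "one TLS access" approach can be sketched as follows. This is an illustration under assumptions, not Wasmtime's code. `ThreadState`, `enter_wasm`, `platform_thread_init`, and `INIT_COUNT` are hypothetical names, and the `depth` counter stands in for whatever per-call state the entry path already stores in thread-local storage. The point is that the "has this thread been initialized?" check rides along with the TLS access that entry performs anyway, so no separate lookup is needed.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts how often the (hypothetical) per-thread setup ran.
pub static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

#[derive(Clone, Copy)]
struct ThreadState {
    initialized: bool,
    depth: usize, // stand-in for the per-call state entry already tracks
}

thread_local! {
    // One TLS slot holds both the init flag and the per-call state.
    static STATE: Cell<ThreadState> =
        Cell::new(ThreadState { initialized: false, depth: 0 });
}

/// Placeholder for platform setup such as a sigaltstack or mach ports.
fn platform_thread_init() {
    INIT_COUNT.fetch_add(1, Ordering::Relaxed);
}

/// Hypothetical entry point wrapping a call into wasm.
/// Not panic-safe; a real implementation would restore state in a guard.
pub fn enter_wasm<R>(f: impl FnOnce() -> R) -> R {
    // Single TLS access on entry: check the flag and update call state.
    let prev = STATE.with(|s| {
        let prev = s.get();
        if !prev.initialized {
            platform_thread_init();
        }
        s.set(ThreadState { initialized: true, depth: prev.depth + 1 });
        prev
    });
    let result = f();
    // Restore the previous call state on the way out.
    STATE.with(|s| s.set(ThreadState { initialized: true, depth: prev.depth }));
    result
}
```

Because the flag lives in thread-local storage, a store that migrates to another OS thread (as with goroutines) triggers initialization on its first call from that thread without any explicit notification.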


fitzgen approved these changes Mar 23, 2021


alexcrichton force-pushed the fast-call branch from e8340dd to b690a15 on March 23, 2021 19:11

github-actions bot added the wasmtime:api label (Related to the API of the `wasmtime` crate itself) Mar 23, 2021

alexcrichton merged commit c95971a into bytecodealliance:main Mar 23, 2021

alexcrichton deleted the fast-call branch March 23, 2021 20:22

alexcrichton mentioned this pull request Mar 23, 2021

More optimizations for calling into WebAssembly #2759

Merged

alexcrichton mentioned this pull request Apr 28, 2021

Bring back per-thread lazy initialization #2863

Merged
