Improve process spawning performance #37

bkolobara · 2021-04-10T15:20:12Z

Lunatic encourages program architectures where it's common to spawn many short-lived processes (e.g. a process per HTTP request). For this to work, the process spawning overhead needs to stay low. I ran some benchmarks on my MacBook:

With the Wasmtime backend:

wasmtime instance creation                                                                             
                        time:   [26.164 us 26.284 us 26.424 us]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

lunatic instance creation                                                                            
                        time:   [321.65 us 323.69 us 326.78 us]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe

With the Wasmer backend:

wasmer instance creation                                                                             
                        time:   [23.603 us 23.727 us 23.863 us]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

lunatic instance creation                                                                            
                        time:   [216.54 us 217.95 us 219.62 us]
                        change: [-32.116% -30.953% -29.620%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Ideally we want the instance creation time to be in the single-digit microsecond range, matching Erlang. There are many improvements we can make to get there.

As we keep adding features, the instance creation time has been getting worse, mostly because every new host function increases the Linker creation time. The good news is that a recent addition to Wasmtime makes it possible to define "global" host functions, so we can completely skip the step of adding all host functions to the instance linker each time we spawn a process.
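To make the difference concrete, here is a minimal sketch of per-spawn registration versus a table that is built once and shared. It does not use the Wasmtime API: `Linker`, `Instance`, `HostFn` and both `spawn_*` functions are hypothetical stand-ins built only on the standard library, assumed here for illustration.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for a host function; real host functions
// have much richer signatures.
type HostFn = fn(i64) -> i64;

fn double(x: i64) -> i64 {
    x * 2
}

fn negate(x: i64) -> i64 {
    -x
}

// Hypothetical stand-in for a linker: a name -> host function table.
#[derive(Default)]
struct Linker {
    funcs: HashMap<&'static str, HostFn>,
}

impl Linker {
    fn define(&mut self, name: &'static str, f: HostFn) {
        self.funcs.insert(name, f);
    }
}

// Registers every host function. The cost grows with each function added.
fn build_linker() -> Linker {
    let mut linker = Linker::default();
    linker.define("double", double);
    linker.define("negate", negate);
    linker
}

struct Instance {
    linker: Arc<Linker>,
}

impl Instance {
    fn call(&self, name: &str, arg: i64) -> Option<i64> {
        self.linker.funcs.get(name).map(|f| f(arg))
    }
}

// Per-spawn registration: the whole table is rebuilt for every instance.
fn spawn_with_fresh_linker() -> Instance {
    Instance { linker: Arc::new(build_linker()) }
}

// Shared registration: the table is built once and each spawn only
// increments a reference count.
fn spawn_with_shared_linker(shared: &Arc<Linker>) -> Instance {
    Instance { linker: Arc::clone(shared) }
}
```

With the shared variant, the spawn cost no longer depends on how many host functions exist, which is the property the "global" host functions are expected to provide.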

Another recent Wasmtime addition allows us to reuse and pool resources. We could create a pool of other resources too, like AsyncWormhole stacks. As the Wasm code can't observe the "real" stack, it would even be safe to reuse it between instances without clearing it first.
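The stack pooling idea could look roughly like the sketch below. `Stack` and `StackPool` are hypothetical names, and a plain `Vec<u8>` stands in for a real coroutine stack; this is not the AsyncWormhole or Wasmtime API.

```rust
// Hypothetical stand-in for a coroutine stack.
struct Stack {
    bytes: Vec<u8>,
}

struct StackPool {
    free: Vec<Stack>,
    stack_size: usize,
    // Number of fresh allocations made so far (kept for illustration).
    allocations: usize,
}

impl StackPool {
    fn new(stack_size: usize) -> Self {
        StackPool { free: Vec::new(), stack_size, allocations: 0 }
    }

    // Hands out a pooled stack if one is available, otherwise allocates.
    fn acquire(&mut self) -> Stack {
        match self.free.pop() {
            // Reused as is: the old contents are deliberately not cleared.
            Some(stack) => stack,
            None => {
                self.allocations += 1;
                Stack { bytes: vec![0u8; self.stack_size] }
            }
        }
    }

    // Returns a stack to the pool once its instance has finished.
    fn release(&mut self, stack: Stack) {
        self.free.push(stack);
    }
}
```

Skipping the clearing step on reuse is only acceptable under the assumption stated above, namely that guest code has no way to read the native stack.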

Even though both of these optimisations are Wasmtime-specific, I believe that Wasmer is going to add similar functionality in the future. I will open separate issues for both of these approaches and keep this as a tracking issue for further ideas and discussions around spawning performance.

bkolobara · 2021-07-29T14:34:17Z

This has improved a bit with the rewrite, but there should be a lot of room for further improvement:

spawn process           time:   [117.11 us 117.39 us 117.72 us]                          
                        change: [-44.351% -40.109% -36.021%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

jgarvin · 2022-12-14T23:12:36Z

OS and hardware are going to make things vary, but isn't 100us much greater than the normal amount of time to spawn a full OS thread? I haven't benchmarked this myself, but others have reported times under 20us. According to the numbers here, wouldn't that make even bare wasmtime/wasmer instance creation, without lunatic involved, slower than spawning a native thread? I don't have a mental model of what the wasm runtimes need to do, but I find this surprising.
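For anyone who wants to check the thread figure on their own machine, a rough sketch using only the Rust standard library follows. The function name `mean_thread_spawn_time` is made up for this example, and it times spawn plus join, so it overstates the cost of the spawn alone.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Spawns and joins `n` no-op threads one after another and returns the
// mean wall-clock time per spawn + join. `n` must be greater than zero.
fn mean_thread_spawn_time(n: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..n {
        thread::spawn(|| {}).join().expect("thread panicked");
    }
    start.elapsed() / n
}
```

Calling `mean_thread_spawn_time(10_000)` in a release build gives a ballpark number only; a proper benchmark harness would be needed for a comparison with the figures in this thread.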

bkolobara · 2022-12-15T08:30:15Z

At the moment it's significantly slower than spawning an OS thread, but the amount of work done is also significantly higher.

A Wasm instance gets a completely fresh heap memory, which means that all the static strings compiled into the binary need to be copied into the newly created heap. memcpy is fast, but I believe this is currently the biggest performance hit. There are also some mmap allocations (which are very slow on macOS) used to eliminate bounds checks, and they slow things down further. Each instance also holds onto file descriptors, TCP connections and other resources. Threads don't have individual resources that need to be set up; all threads inside a process share one table. That's also why threads start up much faster than operating system processes: the heavy lifting is already done. The amount of work done when spawning a Wasm instance is much more comparable to spawning an operating system process.

The good news is that we might be able to reduce the amount of memory copied by sharing more of the static memory. And I don't see a real blocker that would keep us from getting close to, or even beating, thread spawning speed. One big advantage we have is that once the Wasm instances are spawned, scheduling them in user space is much cheaper than having the OS schedule threads.
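As a rough model of the copy-versus-share trade-off, the sketch below compares an instance that copies the static data into its own heap with one that only takes a reference to it. `Module`, `CopiedInstance` and `SharedInstance` are hypothetical types, not Lunatic or Wasmtime code. Real linear memory is writable by the guest, so actual sharing would presumably need something like copy-on-write pages, which this sketch does not model.

```rust
use std::sync::Arc;

// Hypothetical model of a compiled module and its static data
// (e.g. string literals).
struct Module {
    static_data: Arc<[u8]>,
}

// Copying model: every instance gets its own copy of the static data.
struct CopiedInstance {
    heap: Vec<u8>,
}

// Sharing model: the static data is referenced, and only the private
// heap belongs to the instance.
struct SharedInstance {
    static_data: Arc<[u8]>,
    heap: Vec<u8>,
}

impl Module {
    fn new(data: &[u8]) -> Self {
        Module { static_data: Arc::from(data) }
    }

    // Spawn cost is proportional to the size of the static data.
    fn spawn_copied(&self) -> CopiedInstance {
        CopiedInstance { heap: self.static_data.to_vec() }
    }

    // Spawn cost is independent of the size of the static data.
    fn spawn_shared(&self) -> SharedInstance {
        SharedInstance {
            static_data: Arc::clone(&self.static_data),
            heap: Vec::new(),
        }
    }
}
```

In the shared variant, spawning touches a reference count instead of every byte of static data, which is where the potential saving would come from.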

bkolobara mentioned this issue Apr 10, 2021

Use Wasmtime's shared host functions to improve process spawning performance #38

Closed