Wasmtime horizontal scaling results in poor performance #4637
Comments
Thanks for the report! Could you share the Wasmtime version you're using as well as the specific …
That being said, what you're almost certainly running into is limits on concurrently running …

One improvement that you can make to your benchmark is to use the pooling allocator instead of the on-demand allocator. Otherwise, though, improvements need to come from the kernel itself (assuming you're running Linux, which would also be great to specify in the issue description). Historically the kernel has prototyped a …

Finally, it depends on your use case whether this comes up much in practice. Typically the actual work performed by a wasm module dwarfs the cost of the memory mappings, since wasm modules live for longer than "return 1". If that's the case then this microbenchmark isn't necessarily indicative of performance and it's recommended to use a more realistic workload for your use case. If this sort of short-running computation is common for your embedding, however, then multithreaded execution will currently have the impact that you're measuring here.
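For anyone following along, a minimal sketch of opting into the pooling allocator looks roughly like this. It assumes a recent `wasmtime` release where `PoolingAllocationConfig` is the configuration type (older releases spell the pooling configuration differently) and leaves all pool limits at their defaults:

```rust
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

fn main() {
    // Reserve a pool of instance/memory slots up front instead of creating
    // fresh mappings for every instantiation (the on-demand default).
    let pooling = PoolingAllocationConfig::default();

    let mut config = Config::new();
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));

    let engine = Engine::new(&config).expect("failed to create engine");
    // ... compile modules and instantiate against `engine` as usual ...
    let _ = engine;
}
```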
Thanks!! I got it.
Yes, without that call the old data would be preserved when reusing a memory region as linear memory for the wasm module.
Ah ok, that's a good confirmation that … In the meantime, though, this is something that basically just needs to be taken into account when capacity planning with Wasmtime. If you're willing, assistance in getting a patch such as that into the Linux kernel (or investigating other routes of handling this bottleneck) would be greatly appreciated!
Oh sorry, and to confirm the …
OK. I will think about how to solve it. Thanks a lot!!
I'm wondering if the following two solutions might be acceptable? …
As I understand it the IPI is the bottleneck, not the syscall overhead. Every time madvise is run, all cores running the wasmtime process have to be temporarily interrupted and suspended while the page table is being modified. This scales really badly. io_uring can't do anything against this.
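For reference, this is roughly the call under discussion: resetting a reused linear-memory region back to zero pages with `madvise(MADV_DONTNEED)`. This is only an illustrative sketch using the `libc` crate on Linux, not Wasmtime's actual code:

```rust
use std::io::Error;

/// Drop the backing pages of a reused memory region so the next access sees
/// zeroed memory. It is this page-table modification that forces an IPI to
/// every core currently running a thread of the process.
unsafe fn reset_region(base: *mut u8, len: usize) -> std::io::Result<()> {
    if libc::madvise(base as *mut libc::c_void, len, libc::MADV_DONTNEED) != 0 {
        return Err(Error::last_os_error());
    }
    Ok(())
}
```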
Thanks for the explanation! I also checked the kernel code and found in func …
I might be misunderstanding, but isn't that what the pooling allocator does?
The pool inside Wasmtime allocates a big chunk of memory once, in a single call, and manages it by maintaining …
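Roughly the shape being described, as an illustrative sketch only (a single up-front anonymous mapping carved into fixed-size slots handed out by index; the slot size and count are made up, and this is not Wasmtime's actual implementation):

```rust
use std::ptr;

const SLOT_SIZE: usize = 4 * 1024 * 1024; // hypothetical per-instance slot size
const SLOT_COUNT: usize = 1000;           // hypothetical pool capacity

struct Pool {
    base: *mut u8,
    free: Vec<usize>, // indices of slots that are currently unused
}

impl Pool {
    fn new() -> Pool {
        // One large anonymous mapping reserved up front, in a single call.
        let len = SLOT_SIZE * SLOT_COUNT;
        let base = unsafe {
            libc::mmap(
                ptr::null_mut(),
                len,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            )
        };
        assert_ne!(base, libc::MAP_FAILED);
        Pool { base: base as *mut u8, free: (0..SLOT_COUNT).collect() }
    }

    fn alloc(&mut self) -> Option<*mut u8> {
        // Handing out a slot is just popping an index; no new mapping is needed.
        let idx = self.free.pop()?;
        Some(unsafe { self.base.add(idx * SLOT_SIZE) })
    }

    fn dealloc(&mut self, slot: *mut u8) {
        let idx = (slot as usize - self.base as usize) / SLOT_SIZE;
        // The slot still has to be zeroed for its next user, and this madvise
        // is where the IPI cost discussed above comes back in.
        unsafe {
            libc::madvise(slot as *mut libc::c_void, SLOT_SIZE, libc::MADV_DONTNEED);
        }
        self.free.push(idx);
    }
}
```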
While we haven't directly experimented with this, I think, as you've discovered, we took a look and concluded it probably wouldn't help.
This we have actually experimented with historically. Our conclusion was that it did not actually help all that much. We found that the cost of an …

Our testing with …
Thank you for sharing past experiments. But it still confuses me that advising multiple regions with …
You can't merge the regions to madvise - they're not contiguous.
I mean maybe we can make the indices as contiguous as possible, and then do madvise on the biggest contiguous region when allocating. I don't know if this will work.
IIRC we were just as confused as you are. I remember, though, reorganizing the pooling allocator to put the table/memory/instance allocations all next to each other so each instance's index identified one large contiguous region of memory. In the benchmarks for this repository, parallel instantiation actually got worse than the current performance on …

As for kernel improvements, that's never something we've looked into beyond …
I finished my experiment and the result shows the same conclusion.
Under 1 core, the modified version has slightly better performance (the madvise batch is 1000 times the original); but under 3 cores it becomes worse.
Also, I changed the stack pool size from 1000 to bigger values, …
Always good to have independent confirmation, thanks for doing that! As for why …
Thanks for the explanation. But I guess the IPI cost (or whatever the bottleneck is) is not linear in the number of syscalls but in the memory ranges, since doing madvise over twice as much memory takes a little more than twice as long. It seems there's no easy way to save the madvise cost and avoid its bottleneck for general usage...
In that case you could reuse the …
I see. There are 3 callers for …
Using …
The …

Otherwise, though, this is a known issue and isn't something where there are any easy wins on the table that we know of which haven't been applied yet. PRs and more discussion around this are of course always welcome, but I'm going to go ahead and close this since there's not a whole lot more that can be done at this time.
When I try to use tokio to scale wasmtime horizontally, I find that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, the total number of executions is almost the same. This is my performance data: …
This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .
And the wasm code is just a function that returns 1.
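A minimal sketch of the kind of benchmark being described (assuming the `wasmtime` and `anyhow` crates): several OS threads each repeatedly instantiate a trivial module whose export returns 1 and count how many calls complete. The export name, thread count, and iteration count below are made up for illustration and are not taken from the linked playground code, which uses tokio:

```rust
use std::sync::Arc;
use std::thread;
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // A wasm module whose only export just returns 1.
    let module = Arc::new(Module::new(
        &engine,
        r#"(module (func (export "run") (result i32) i32.const 1))"#,
    )?);

    let threads: Vec<_> = (0..4)
        .map(|_| {
            let engine = engine.clone();
            let module = Arc::clone(&module);
            thread::spawn(move || {
                let mut count = 0u64;
                for _ in 0..100_000 {
                    // Each iteration creates a fresh store and instance; this is
                    // where the memory-mapping work discussed above dominates.
                    let mut store = Store::new(&engine, ());
                    let instance = Instance::new(&mut store, &module, &[]).unwrap();
                    let run = instance
                        .get_typed_func::<(), i32>(&mut store, "run")
                        .unwrap();
                    count += run.call(&mut store, ()).unwrap() as u64;
                }
                count
            })
        })
        .collect();

    let total: u64 = threads.into_iter().map(|t| t.join().unwrap()).sum();
    println!("total executions: {total}");
    Ok(())
}
```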