Wasmtime horizontal scaling results in poor performance #4637
Comments
Thanks for the report! Could you share the Wasmtime version you're using as well as the specific …
That being said, what you're almost certainly running into is limits on concurrently running …

One improvement that you can make to your benchmark is to use the pooling allocator instead of the on-demand allocator. Otherwise, though, improvements need to come from the kernel itself (assuming you're running Linux, which would also be great to specify in the issue description). Historically the kernel has prototyped a …

Finally, it depends on your use case whether this comes up much in practice. Typically the actual work performed by a wasm module dwarfs the cost of the memory mappings, since wasm modules live for longer than "return 1". If that's the case then this microbenchmark isn't necessarily indicative of performance and it's recommended to use a more realistic workload for your use case. If this sort of short-running computation is common for your embedding, however, then multithreaded execution will currently have the impact that you're measuring here.
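For anyone following along, a minimal sketch of opting into the pooling allocator looks roughly like this. It assumes a recent `wasmtime` release where `PoolingAllocationConfig` is the configuration type (older releases spell the pooling configuration differently) and leaves all pool limits at their defaults:

```rust
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

fn main() {
    // Reserve a pool of instance/memory slots up front instead of creating
    // fresh mappings for every instantiation (the on-demand default).
    let pooling = PoolingAllocationConfig::default();

    let mut config = Config::new();
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));

    let engine = Engine::new(&config).expect("failed to create engine");
    // ... compile modules and instantiate against `engine` as usual ...
    let _ = engine;
}
```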
Thanks!! I got it.
Yes, without that call the old data would be preserved when reusing a memory region as linear memory for the wasm module.
Ah ok, that's a good confirmation that … In the meantime, though, this is something that basically just needs to be taken into account when capacity planning with Wasmtime. If you're willing, assistance in getting a patch such as that into the Linux kernel (or investigating other routes of handling this bottleneck) would be greatly appreciated!
Oh sorry, and to confirm the …
OK. I will think about how to solve it. Thanks a lot!!
I'm wondering if the following two solutions might be acceptable? …
As I understand it the IPI is the bottleneck, not the syscall overhead. Every time madvise is run, all cores running the wasmtime process have to be temporarily interrupted and suspended while the page table is being modified. This scales really badly. io_uring can't do anything against this.
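For reference, this is roughly the call under discussion: resetting a reused linear-memory region back to zero pages with `madvise(MADV_DONTNEED)`. This is only an illustrative sketch using the `libc` crate on Linux, not Wasmtime's actual code:

```rust
use std::io::Error;

/// Drop the backing pages of a reused memory region so the next access sees
/// zeroed memory. It is this page-table modification that forces an IPI to
/// every core currently running a thread of the process.
unsafe fn reset_region(base: *mut u8, len: usize) -> std::io::Result<()> {
    if libc::madvise(base as *mut libc::c_void, len, libc::MADV_DONTNEED) != 0 {
        return Err(Error::last_os_error());
    }
    Ok(())
}
```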
Thanks for the explanation! I also checked the kernel code and found in func …
I might be misunderstanding, but isn't that what the pooling allocator does?
The pool inside Wasmtime allocates a big chunk of memory once, in a single call, and manages it by maintaining …
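Roughly the shape being described, as an illustrative sketch only (a single up-front anonymous mapping carved into fixed-size slots handed out by index; the slot size and count are made up, and this is not Wasmtime's actual implementation):

```rust
use std::ptr;

const SLOT_SIZE: usize = 4 * 1024 * 1024; // hypothetical per-instance slot size
const SLOT_COUNT: usize = 1000;           // hypothetical pool capacity

struct Pool {
    base: *mut u8,
    free: Vec<usize>, // indices of slots that are currently unused
}

impl Pool {
    fn new() -> Pool {
        // One large anonymous mapping reserved up front, in a single call.
        let len = SLOT_SIZE * SLOT_COUNT;
        let base = unsafe {
            libc::mmap(
                ptr::null_mut(),
                len,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            )
        };
        assert_ne!(base, libc::MAP_FAILED);
        Pool { base: base as *mut u8, free: (0..SLOT_COUNT).collect() }
    }

    fn alloc(&mut self) -> Option<*mut u8> {
        // Handing out a slot is just popping an index; no new mapping is needed.
        let idx = self.free.pop()?;
        Some(unsafe { self.base.add(idx * SLOT_SIZE) })
    }

    fn dealloc(&mut self, slot: *mut u8) {
        let idx = (slot as usize - self.base as usize) / SLOT_SIZE;
        // The slot still has to be zeroed for its next user, and this madvise
        // is where the IPI cost discussed above comes back in.
        unsafe {
            libc::madvise(slot as *mut libc::c_void, SLOT_SIZE, libc::MADV_DONTNEED);
        }
        self.free.push(idx);
    }
}
```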
While we haven't directly experimented with this, I think, as you've discovered, we took a look and concluded it probably wouldn't help.
This we have actually experimented with historically. Our conclusion was that it did not actually help all that much. We found that the cost of an …

Our testing with …
Thank you for sharing past experiments. But it still confuses me that advising multiple regions with …
You can't merge the regions to madvise - they're not contiguous.
I mean maybe we can make the indices as contiguous as possible, and then do madvise on the biggest contiguous region when allocating. I don't know if this will work.
IIRC we were just as confused as you are. I remember, though, reorganizing the pooling allocator to put the table/memory/instance allocations all next to each other so each instance's index identified one large contiguous region of memory. In the benchmarks for this repository, parallel instantiation actually got worse than the current performance on …

As for kernel improvements, that's never something we've looked into beyond …
I finished my experiment and the result shows the same conclusion.
Under 1 core, the modified version has slightly better performance (the madvise batch is 1000 times the original); but under 3 cores it becomes worse.
Also, I changed the stack pool size from 1000 to bigger values, …
Always good to have independent confirmation, thanks for doing that! As for why …
Thanks for the explanation. But I guess the IPI cost (or whatever the bottleneck is) is not linear in the number of syscalls but in the memory ranges, since doing madvise over twice as much memory takes a little more than twice as long. It seems there's no easy way to save the madvise cost and avoid its bottleneck for general usage...
In that case you could reuse the …
I see. There are 3 callers for …
Using …
The …

Otherwise, though, this is a known issue and isn't something where there are any easy wins on the table that we know of which haven't been applied yet. PRs and more discussion around this are of course always welcome, but I'm going to go ahead and close this since there's not a whole lot more that can be done at this time.
When I try to use tokio to scale wasmtime horizontally, I find that wasmtime performance drops significantly. It looks like there are some shared resources inside. And no matter how many threads I open, the total number of executions is almost the same. This is my performance data: …
This is my test code: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=e9fc13a7d68b80afdc07076482a5b787 .
And the wasm code is just a function that returns 1.
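A minimal sketch of the kind of benchmark being described (assuming the `wasmtime` and `anyhow` crates): several OS threads each repeatedly instantiate a trivial module whose export returns 1 and count how many calls complete. The export name, thread count, and iteration count below are made up for illustration and are not taken from the linked playground code, which uses tokio:

```rust
use std::sync::Arc;
use std::thread;
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // A wasm module whose only export just returns 1.
    let module = Arc::new(Module::new(
        &engine,
        r#"(module (func (export "run") (result i32) i32.const 1))"#,
    )?);

    let threads: Vec<_> = (0..4)
        .map(|_| {
            let engine = engine.clone();
            let module = Arc::clone(&module);
            thread::spawn(move || {
                let mut count = 0u64;
                for _ in 0..100_000 {
                    // Each iteration creates a fresh store and instance; this is
                    // where the memory-mapping work discussed above dominates.
                    let mut store = Store::new(&engine, ());
                    let instance = Instance::new(&mut store, &module, &[]).unwrap();
                    let run = instance
                        .get_typed_func::<(), i32>(&mut store, "run")
                        .unwrap();
                    count += run.call(&mut store, ()).unwrap() as u64;
                }
                count
            })
        })
        .collect();

    let total: u64 = threads.into_iter().map(|t| t.join().unwrap()).sum();
    println!("total executions: {total}");
    Ok(())
}
```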