Performance bottleneck when creating Instances / growing memory in parallel #2563
jcaesar changed the title from "Performance bottleneck when creating Instances / growing memory" to "Performance bottleneck when creating Instances / growing memory in parallel" on Sep 10, 2021
For context: wasmtime seems to have given up on this problem: bytecodealliance/wasmtime#4637 (comment)
Motivation
I'd like to invoke user-defined functionality in a stream data processing setting. To ensure that there are no odd influences between subsequent (but usually unrelated) stream events, I'd like to create a new Instance for every event I process.
Also, I'd like to process loads of messages, possibly millions per second (in a cluster).
So I need lots of instances per second.
Problem
Wasmer instance creation doesn't benefit from parallelization: 1 thread gives 111 k instances per second, but 2 threads only reach 80 k/s, and more threads make the situation even worse.
(The numbers are from a Ryzen 7 3700X 8-Core, Linux 5.13.13-arch1-1. My production machines will be larger…)
After a bit of benchmarking and experimenting, I concluded that the instance memory must be at fault.
If an instance has accessible memory, creating it triggers an mmap/mprotect call (and dropping it triggers the corresponding munmap); look for the __GI_m* calls in the flamegraph, right below the [unknown] stacks. The following experiment confirms that these calls are to blame.
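As a rough illustration of why this per-instance mmap/munmap traffic serializes, here is a minimal sketch (my own code, not wasmer's; Linux-only, with hardcoded x86-64 constants) that maps, touches, and unmaps an anonymous region per "instance" and measures the aggregate rate across threads:

```rust
use std::os::raw::{c_int, c_void};
use std::time::Instant;

// Raw libc bindings so the sketch needs no external crates (Linux x86-64).
extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: c_int, flags: c_int,
            fd: c_int, offset: i64) -> *mut c_void;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}

const PROT_READ: c_int = 0x1;
const PROT_WRITE: c_int = 0x2;
const MAP_PRIVATE: c_int = 0x02;
const MAP_ANONYMOUS: c_int = 0x20;
const LEN: usize = 64 * 1024; // stand-in for an instance's accessible memory

/// One simulated instance lifetime: map a region, touch it, unmap it.
fn fake_instance() {
    unsafe {
        let p = mmap(std::ptr::null_mut(), LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert_ne!(p as isize, -1, "mmap failed");
        *(p as *mut u8) = 1; // force the kernel to back at least one page
        assert_eq!(munmap(p, LEN), 0);
    }
}

/// Aggregate fake-instance creation rate across `threads` threads.
fn rate(threads: usize, iters: usize) -> f64 {
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| std::thread::spawn(move || {
            for _ in 0..iters { fake_instance(); }
        }))
        .collect();
    for h in handles { h.join().unwrap(); }
    (threads * iters) as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    for t in [1, 2, 4] {
        println!("{} thread(s): {:.0} fake instances/s", t, rate(t, 20_000));
    }
}
```

My understanding is that the contention comes from the per-process address-space lock and TLB shootdowns on munmap, so the per-thread rate should drop as threads are added, mirroring the wasmer numbers above.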
Proposed solution?
I have experimented with avoiding the mmap calls by reusing `wasmer_vm::Mmap`s. This can be done without modifying wasmer (though with some code duplication). It has the desired effect of letting instance creation scale near-linearly with the number of cores (e.g. 1 thread: 115 k/s, 2 threads: 233 k/s).
The problem is that, to prevent one instance from seeing memory left behind by another, each `Mmap` needs to be zeroed out up to its accessible size before reuse. That is only cheaper than a fresh mmap for memories up to about 10 pages (in the single-threaded case).
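To see why zeroing only wins for small memories, here is a heap-based sketch of the tradeoff (buffer sizes and loop counts are arbitrary, and `Vec` stands in for `wasmer_vm::Mmap`): the reuse path pays an explicit memset over the accessible size, while a fresh zeroed allocation can get zero pages from the allocator or kernel without one.

```rust
use std::time::Instant;

const PAGE: usize = 4096;

/// Reuse path: zero the buffer up to its accessible size before handing
/// it to the next instance.
fn reuse_zeroed(buf: &mut [u8]) {
    buf.fill(0);
}

/// Fresh path: a brand-new zeroed allocation (typically calloc-backed).
fn fresh(n: usize) -> Vec<u8> {
    vec![0u8; n]
}

fn main() {
    for pages in [1, 10, 64] {
        let n = pages * PAGE;

        let mut reused = vec![1u8; n]; // dirty buffer from a previous "instance"
        let t = Instant::now();
        for _ in 0..200 { reuse_zeroed(&mut reused); }
        let zero_t = t.elapsed();

        let t = Instant::now();
        let mut keep = Vec::new(); // keep allocations alive so none is recycled
        for _ in 0..200 { keep.push(fresh(n)); }
        let fresh_t = t.elapsed();

        println!("{:3} pages: zeroing {:?} vs fresh alloc {:?}", pages, zero_t, fresh_t);
        assert!(reused.iter().all(|&b| b == 0));
    }
}
```

This is only an analogy for the mmap case (the kernel hands out zero pages lazily, which a heap benchmark can't fully capture), but it shows the same shape: an explicit memset grows linearly with the accessible size, so past a small number of pages the "free" zero pages win.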
So, I'm looking for alternatives.
Alternatives
At a single thread, even with one mprotect/mmap/munmap per instance memory allocation, I can get about 100 k instances per second. That's not awesome, but probably enough for most of my use cases: I can have a single thread create all the instances and shove them through an MPMC channel. (If in doubt, I can use more, smaller machines. Cloud and all.)
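The single-creator pipeline could look roughly like this (all names are mine; std has no MPMC channel, so a Mutex-wrapped mpsc Receiver stands in for something like crossbeam-channel, and a plain struct stands in for a wasmer Instance):

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for a freshly created wasmer Instance.
struct FakeInstance(u64);

/// One creator thread feeds `events` instances to `workers` consumers;
/// returns how many events were processed in total.
fn run_pipeline(events: u64, workers: usize) -> u64 {
    let (tx, rx) = mpsc::channel::<FakeInstance>();
    let rx = Arc::new(Mutex::new(rx));

    // Single creator: all the expensive mmap/munmap traffic stays on one thread.
    let creator = thread::spawn(move || {
        for i in 0..events {
            tx.send(FakeInstance(i)).unwrap();
        }
        // `tx` is dropped here, closing the channel.
    });

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut handled = 0u64;
                loop {
                    // Hold the lock only long enough to pull one instance.
                    let msg = rx.lock().unwrap().recv();
                    match msg {
                        Ok(inst) => {
                            let _ = inst.0; // "process one stream event"
                            handled += 1;
                        }
                        Err(_) => break, // channel closed and drained
                    }
                }
                handled
            })
        })
        .collect();

    creator.join().unwrap();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let total = run_pipeline(10_000, 4);
    println!("processed {} events", total);
}
```

The point of the design is that the channel decouples the serialized creation rate (~100 k/s) from the parallel event processing, as long as processing an event costs more than creating an instance.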
I'm also considering munmapping chunks at the beginning of a `Mmap` once an instance has used them. That would free the memory and save the zeroing or re-mmapping, but I'm not sure it's much cheaper, since munmap still messes with the TLB(?).
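The "retire the used prefix" idea is at least mechanically possible: munmap accepts any page-aligned subrange, so unmapping the consumed head of a mapping leaves the tail valid and zero-filled. A Linux-only sketch with my own hardcoded constants (not wasmer code):

```rust
use std::os::raw::{c_int, c_void};

// Raw libc bindings (Linux x86-64 constants).
extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: c_int, flags: c_int,
            fd: c_int, offset: i64) -> *mut c_void;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}

const PROT_READ: c_int = 0x1;
const PROT_WRITE: c_int = 0x2;
const MAP_PRIVATE: c_int = 0x02;
const MAP_ANONYMOUS: c_int = 0x20;
const CHUNK: usize = 64 * 1024; // page-aligned "per-instance" slice

/// Map 4 chunks, consume and unmap the first, verify the rest still works.
fn partial_unmap_demo() -> bool {
    unsafe {
        let base = mmap(std::ptr::null_mut(), 4 * CHUNK, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if base as isize == -1 { return false; }

        *(base as *mut u8) = 42; // the first "instance" dirties its chunk
        if munmap(base, CHUNK) != 0 { return false; } // retire the used prefix

        // The remainder of the mapping is still valid and zero-filled.
        let next = (base as usize + CHUNK) as *mut u8;
        let was_zero = *next == 0;
        *next = 7;
        let ok = was_zero && *next == 7;

        munmap(next as *mut c_void, 3 * CHUNK);
        ok
    }
}

fn main() {
    assert!(partial_unmap_demo());
    println!("partial munmap ok");
}
```

Whether it is actually cheaper is the open question above: the partial munmap returns the dirty pages without an explicit memset, but it still has to invalidate TLB entries for the unmapped range, so it may share the scaling problem of the full munmap.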