Motivation
I'd like to invoke user-defined functionality in a stream data processing setting. To ensure that there are no odd influences between subsequent (but usually unrelated) stream events, I'd like to create a new Instance for every event I process.
Also, I'd like to process loads of messages, possibly millions per second (in a cluster).
So I need lots of instances per second.
Problem
Wasmer instance creation doesn't benefit from parallelization: 1 thread gives 111k instances per second, but 2 threads only reach 80k/s.
More threads make the situation even worse.
(The numbers are from a Ryzen 7 3700X 8-Core, Linux 5.13.13-arch1-1. My production machines will be larger…)
After a bit of benchmarking and experimenting, I arrived at the conclusion that the instance memory must be at fault.
If an instance has accessible memory, creating it makes a call to mmap/mprotect, with the corresponding munmap call on drop (look for the __GI_m* calls in the flamegraph, right below the stacks of [unknown]). The following experiment shows that these calls are at fault.
Proposed solution?
I have experimented with avoiding the mmap calls by reusing wasmer_vm::Mmaps: instead of being dropped, they are returned to a thread-local pool. This can be done without modifying wasmer, at the cost of some code duplication.
This has the desired effect of letting instance creation scale near-linearly to the number of cores. (e.g. 1 thread: 115 k/s, 2 threads: 233 k/s).
The problem is that, to prevent one instance from seeing memory left behind by another, reused Mmaps need to be zeroed out up to the accessible size. In the single-threaded case, this zeroing is only cheaper than a fresh mmap for memories up to roughly 10 pages.
So, I'm looking for alternatives.
Alternatives
At a single thread, even with one mprotect/mmap/munmap per instance memory allocation, I can get about 100k Instances per second. That's not awesome, but probably enough for most of my use cases. I can just have a single thread create all the instances and shove them through an MPMC channel to the workers. (If in doubt, I can use more, smaller machines. Cloud and all.)
I'm also considering munmapping chunks at the beginning of a Mmap that an instance has already used. That would free the memory and save the zeroing or fresh mmap, but I'm not sure it's much cheaper, since munmap still invalidates TLB entries(?).
jcaesar changed the title from "Performance bottleneck when creating Instances / growing memory" to "Performance bottleneck when creating Instances / growing memory in parallel" on Sep 10, 2021.