Wasmtime throughput low on Linux systems #8034
First off: are you enabling the pooling allocator? (E.g. ...) That should help here, and so will disabling virtual memory-based bounds checks and replacing them with explicit bounds checks (...).
I don't think we can make hard guarantees about this unless you disable virtual memory-based bounds checks completely, because the performance bottleneck for concurrent Wasm guests is the kernel's virtual memory subsystem. Even with the pooling allocator, we are bottlenecked on ...
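For anyone reaching for these settings through the embedder API rather than the CLI, here is a minimal sketch of both suggestions (this assumes wasmtime's Rust `Config` API as of the 18.x series and an `anyhow` dependency; the specific pool sizes and knobs worth using will depend on your workload):

```rust
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

fn main() -> anyhow::Result<()> {
    let mut config = Config::new();

    // Enable the pooling allocator: instance slots are reserved up front and
    // reused, avoiding repeated mmap/munmap traffic on every instantiation.
    let pooling = PoolingAllocationConfig::default();
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));

    // Optionally trade virtual-memory-based bounds checks for explicit ones:
    // a 0-byte "static" maximum plus 0-byte guard regions forces the compiler
    // to emit explicit bounds checks instead of relying on guard pages.
    config.static_memory_maximum_size(0);
    config.static_memory_guard_size(0);
    config.dynamic_memory_guard_size(0);

    let _engine = Engine::new(&config)?;
    Ok(())
}
```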
For example, here are the results I get on my ~6-year-old ThinkPad (4 cores / 8 hyperthreads) running Linux:
(I'd expect the delta between the second and third configuration to be even greater on machines with more cores.)
Ah, you also have to pass ...
@thomastaylor312 could you tell us more about your hardware on the Linux side? The reason I ask is that "cloud VM" is pretty vague -- it could be an older microarchitecture, or with oversubscribed cores, or something else. Without more specific details, I'm not sure why it's a given that RPS should be higher on Linux on hardware A vs. macOS on hardware B. Also a possible experiment: are you able to run a Linux VM on your M1 Max hardware (even better, native Linux such as Asahi, but a VM in e.g. UTM is fine too), and test Wasmtime there? That would tell us a lot about how the raw CPU power actually compares.
I apologize if this is piling on a bit at this point, but I wanted to make the same comment as @fitzgen: this is probably the ...
This is perhaps an argument that we should turn on the pooling allocator by default for the ... Also, as mentioned in #4637, there are various knobs to help with the overhead of virtual memory here. They aren't applicable in all cases, though; for example ...
(also, thank you for such a detailed report!)
@cfallin For sure! This is a GCP machine running on an AMD Rome series processor. The exact instance size is an n2d-standard-16. Also, to call it out again, I did try this on a local Linux server running a slightly older Intel processor with similar results. I'm working on trying out some of the suggested options and will report back soon.
Here are my numbers:
So that seems to line up with what you were seeing, @fitzgen. What I'm a little unsure about are the tradeoffs involved here. I think I wouldn't want ... Also, are there any tradeoffs around using ...?
There are two performance "figures of merit": the instantiation speed and the runtime speed (how fast the Wasm executes once instantiation completes). The first two lines (no pooling and pooling) keep exactly the same generated code, so there's no runtime slowdown; the pooling allocator speedup is a pure win on instantiation speed (due to the virtual-memory tricks). (Reality-is-complicated footnote: there may be some effects in the margins with page-fault latency that do actually affect runtime depending on the virtual memory strategy, but those effects should be much smaller than the explicit-vs.-static bounds check difference.)
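(To make that distinction concrete, below is a rough, hypothetical timing harness, assuming the Rust embedding API and an `anyhow` dependency, that isolates instantiation speed by pre-linking a trivial module once and then instantiating it in a loop; runtime speed would be timed separately by calling into the resulting instances.)

```rust
use std::time::Instant;
use wasmtime::{Engine, Instance, Linker, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // A trivial module; a real measurement would use the module under test.
    let module = Module::new(&engine, r#"(module (memory 1) (func (export "run")))"#)?;

    // Resolve imports once so the loop below measures instantiation only,
    // not linking or compilation.
    let linker: Linker<()> = Linker::new(&engine);
    let pre = linker.instantiate_pre(&module)?;

    let iters = 10_000;
    let start = Instant::now();
    for _ in 0..iters {
        let mut store = Store::new(&engine, ());
        let _instance: Instance = pre.instantiate(&mut store)?;
    }
    let elapsed = start.elapsed();
    println!("{:.1} instantiations/sec", iters as f64 / elapsed.as_secs_f64());
    Ok(())
}
```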
Yep, I actually just tried pooling + no memory CoW and that looks good. My remaining concern is how all the different pooling allocator knobs can affect runtime/memory usage.
@thomastaylor312 the best answer is usually "try it and see", since tradeoffs can cross different inflection points depending on your particular workload. As mentioned above, runtime (that is, execution speed of the Wasm) is unaffected by the pooling allocator. Explicit bounds checks are the only option that modifies the generated code. Memory usage (resident set size) might increase slightly if the "no CoW" option is set, because the pooling allocator keeps private pages around for each slot. (I don't remember if it retains a "warmed up" slot for a given image to reuse it, though; @alexcrichton do you remember?) CoW is more memory-efficient because it can leverage shared read-only mappings of the heap image that aren't modified by the guest.
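(For reference, the "no CoW" knob discussed here is a single toggle on `Config` in the embedder API; a minimal sketch, again assuming the Rust API around 18.x and an `anyhow` dependency:)

```rust
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

fn main() -> anyhow::Result<()> {
    let mut config = Config::new();
    // Keep the pooling allocator for fast instantiation...
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(
        PoolingAllocationConfig::default(),
    ));
    // ...but initialize linear memories by copying data segments instead of
    // mapping the compiled heap image copy-on-write. Generated code is
    // unchanged; only how memory is initialized (and therefore how pages are
    // shared across instances) differs.
    config.memory_init_cow(false);

    let _engine = Engine::new(&config)?;
    Ok(())
}
```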
Ok, between that and the docs that already exist for the pooling allocator, I think I should be good. I'll go ahead and close this, but hopefully this issue can be helpful to others who might be optimizing things. Thanks for the help, all!
Inspired by discussion on bytecodealliance#8034
I was inspired to summarize some of the bits and bobs here in #8038 to hopefully help future readers as well.
...
This surprises me! I would not expect disabling CoW to provide much benefit when explicit bounds checks are still enabled. If anything I'd expect it to get a bit slower (like what I measured above). I say this because even if you disable copy-on-write we still use ...
I don't think this will actually be the case, because when a slot is deallocated we ... Now for allocated slots, as you've pointed out, having 1000 instances with CoW should have a lower RSS than 1000 instances without CoW, because read-only pages will be shared in the CoW case.
Test Case
This is the Wasm file, zipped up in order to upload to GH. It is from https://github.com/sunfishcode/hello-wasi-http.git
hello_wasi_http.wasm.zip
Steps to Reproduce
Try these steps on a Linux machine and on a macOS machine (preferably close to the same size):
1. Run `wasmtime serve` (no additional flags)
2. Run `hey -z 10s -c 100 http://localhost:8080/`
Expected Results
I expect the number of requests/second to be the same or greater on Linux than it is on macOS.
Actual Results
On my Mac (OS details below), which was running a bunch of other applications, I get around 20k req/s.
On Linux (OS details below), I get around 4.3k req/s.
Versions and Environment
Wasmtime version or commit: 18.0.2
Mac
Operating system: Sonoma 14.3.1
Architecture: M1 Max (8 performance cores, 2 efficiency cores) and 64 GB of memory
Linux
Operating system: Debian Bookworm (6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux)
Architecture: AMD64 (16 cores) and 64 GB of memory
This was run on a cloud VM, but I also tested on an Ubuntu 20.04 amd64 server running at my house with similar performance.
Extra Info
On the Linux server, I double-checked that my file descriptor limit had been raised, and I also observed that the wasmtime processes were almost constantly in an uninterruptible sleep state throughout the test (which could mean nothing). I also ran a similar test with wasmCloud and Spin, which both use wasmtime, and saw a similar drop in numbers between macOS and Linux. For reference, I also did some smoke tests with normal server traffic (with Caddy and with NATS), and all of them easily reached the 100k+ range. So this definitely seems like something on the wasmtime side.
I did see #4637, and that does explain some of the horizontal scaling issues, but I didn't expect such a drastic difference between macOS and Linux.