
Wasmtime throughput low on linux systems #8034

Closed · thomastaylor312 opened this issue Feb 29, 2024 · 13 comments
Labels: bug (Incorrect behavior in the current implementation that needs fixing)

Comments

@thomastaylor312

Test Case

This is the Wasm file, zipped up so it can be uploaded to GitHub. It is from https://github.com/sunfishcode/hello-wasi-http.git
hello_wasi_http.wasm.zip

Steps to Reproduce

Try these steps on a Linux machine and a macOS machine (preferably of a similar size):

  1. Run the component with wasmtime serve (no additional flags)
  2. Run hey -z 10s -c 100 http://localhost:8080/

Expected Results

I expect the number of requests/second to be the same or greater on Linux than on macOS.

Actual Results

On my Mac (OS details below), which was running a bunch of other applications, I get around 20k req/s.
On Linux (OS details below), I get around 4.3k req/s.

Versions and Environment

Wasmtime version or commit: 18.0.2

Mac
Operating system: Sonoma 14.3.1

Architecture: M1 Max (8 performance cores, 2 efficiency cores) and 64 GB of memory

Linux
Operating system: Debian Bookworm (6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux)

Architecture: AMD64 (16 cores) and 64 GB of memory

This was run on a cloud VM, but I also tested on an Ubuntu 20.04 amd64 server running at my house with similar performance.

Extra Info

On the Linux server, I double-checked that my file descriptor limit had been raised, and I also observed that the wasmtime processes were almost constantly in an uninterruptible sleep state throughout the test (which could mean nothing). I also ran a similar test with wasmCloud and Spin, which both use Wasmtime, and saw a similar drop in numbers between Mac and Linux. For reference, I also did some smoke tests with normal server traffic (one test with Caddy and one with NATS), and all of them easily got into the 100k+ req/s range. So this definitely seems like something on the Wasmtime side.

I did see #4637, which explains some of the horizontal scaling issues, but I didn't expect such a drastic difference between Mac and Linux.

thomastaylor312 added the bug label Feb 29, 2024
@fitzgen (Member) commented Feb 29, 2024

First off: are you enabling the pooling allocator? E.g. -O pooling-allocator on the CLI. Enabling or disabling the pooling allocator will greatly affect requests/second.

So will disabling virtual memory-based bounds checks and replacing them with explicit bounds checks (-O static-memory-maximum-size=0), which should increase requests/second for short-lived Wasm but will slow down Wasm execution by ~1.5x.

I expect the number of requests/second to be the same or greater on Linux than on macOS

I don't think we can make hard guarantees about this unless you disable virtual memory-based bounds checks completely, because the performance bottleneck for concurrent Wasm guests is the kernel's virtual memory subsystem. Even with the pooling allocator, we are bottlenecked on madvise-style operations and their associated IPIs. Without the pooling allocator, you're essentially benchmarking concurrent mmap.
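
For embedders that configure Wasmtime through the crate rather than the CLI (wasmCloud, Spin, etc.), roughly the same two settings might look like the sketch below. This is only a sketch: the method names are assumptions against the wasmtime 18.x Config/PoolingAllocationConfig API, so check the docs for your version.

    use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

    fn main() -> anyhow::Result<()> {
        let mut config = Config::new();

        // Use the pooling instance allocator instead of the default
        // on-demand (mmap-per-instance) allocator.
        config.allocation_strategy(InstanceAllocationStrategy::Pooling(
            PoolingAllocationConfig::default(),
        ));

        // Analogue of `-O static-memory-maximum-size=0`: compile explicit
        // bounds checks instead of relying on virtual-memory guard regions.
        config.static_memory_maximum_size(0);

        let _engine = Engine::new(&config)?;
        Ok(())
    }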

@fitzgen (Member) commented Feb 29, 2024

For example, here are the results I get on my ~6-year-old ThinkPad (4 cores / 8 hyperthreads) running Linux:

No pooling allocator:                        5983.7146 requests/second
Pooling allocator:                          34980.6398 requests/second
Pooling allocator + explicit bounds checks: 35368.5013 requests/second

(I'd expect the delta between the second and third configuration to be even greater on machines with more cores)

@fitzgen (Member) commented Feb 29, 2024

Ah, you also have to pass -O memory-init-cow=n to get rid of all the virtual memory interaction here. Once I do that I get the following results:

No pooling allocator:                                        5983.7146 requests/second
Pooling allocator:                                          34980.6398 requests/second
Pooling allocator + explicit bounds checks:                 35368.5013 requests/second
Pooling allocator + explicit bounds checks + no memory CoW: 45451.2630 requests/second

@cfallin (Member) commented Feb 29, 2024

@thomastaylor312 could you tell us more about your hardware on the Linux side? The reason I ask is that "cloud VM" is pretty vague -- it could be an older microarchitecture, or with oversubscribed cores, or something else. Without more specific details, I'm not sure why it's a given that RPS should be higher on Linux on hardware A vs. macOS on hardware B.

Also a possible experiment: are you able to run a Linux VM on your M1 Max hardware (even better, native Linux such as Asahi, but a VM in e.g. UTM is fine too), and test Wasmtime there? That would tell us a lot about how the raw CPU power actually compares.

@alexcrichton (Member)

I apologize if this is piling on a bit at this point, but I wanted to say the same as @fitzgen: this is probably -O pooling-allocator vs. not. Locally the difference I see is:

  • wasmtime serve - 5.5k rps
  • wasmtime serve -O pooling-allocator - 236k rps

This is perhaps an argument that we should turn on the pooling allocator by default for the wasmtime serve command!

Also, as mentioned in #4637, there are various knobs to help with the overhead of virtual memory here. They're not always applicable in all cases; for example, wasmtime serve -O pooling-allocator,memory-init-cow=n,static-memory-maximum-size=0,pooling-memory-keep-resident=$((10<<20)),pooling-table-keep-resident=$((10<<20)) yields 193k rps for me locally. There are zero interactions with virtual memory in the steady state, but wasmtime spends 70% of its time in memset resetting memory between HTTP requests.
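
For reference, a hedged sketch of roughly the same combination applied through the embedding API rather than the CLI: the keep-resident setters here are assumptions inferred from the CLI option names (linear_memory_keep_resident / table_keep_resident in recent PoolingAllocationConfig versions), so verify them against your wasmtime release.

    use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

    fn main() -> anyhow::Result<()> {
        let mut pooling = PoolingAllocationConfig::default();
        // Analogues of -O pooling-memory-keep-resident / pooling-table-keep-resident:
        // keep up to 10 MiB per slot resident (reset with memset) instead of
        // handing it back to the kernel with madvise between instantiations.
        pooling.linear_memory_keep_resident(10 << 20);
        pooling.table_keep_resident(10 << 20);

        let mut config = Config::new();
        config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));
        config.memory_init_cow(false);        // -O memory-init-cow=n
        config.static_memory_maximum_size(0); // -O static-memory-maximum-size=0

        let _engine = Engine::new(&config)?;
        Ok(())
    }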

@alexcrichton (Member)

(also thank you for such a detailed report!)

@thomastaylor312 (Author) commented Mar 1, 2024

@cfallin For sure! This is a GCP machine running on an AMD Rome series processor; the exact instance size is an n2d-standard-16. Also, to call out again, I did try this on a local Linux server running a slightly older Intel processor with similar results. I'm working on trying out some of the suggested options and will report back soon.

@thomastaylor312 (Author)

Here are my numbers:

No pooling allocator:                                        4714.3026 requests/second
Pooling allocator:                                          42696.3823 requests/second
Pooling allocator + explicit bounds checks:                 42603.7692 requests/second
Pooling allocator + explicit bounds checks + no memory CoW: 55162.2862 requests/second

So that seems to line up with what you were seeing, @fitzgen. What I'm a little unsure about are the tradeoffs involved here. I don't think I want static-memory-maximum-size=0, since I don't want to slow down execution of longer-lived components. I did already try out the pooling allocator in wasmCloud and saw the benefits, but all of the options were a little confusing with respect to what the memory footprint will be. I wasn't sure how to set all of those values to use the right amount of memory on any given machine. I was starting to make some guesses but wasn't entirely certain.

Also, are there any tradeoffs around using memory-init-cow=n with the pooling allocator?

@cfallin (Member) commented Mar 1, 2024

There are two performance "figures of merit": instantiation speed and runtime speed (how fast the Wasm executes once instantiation completes). The first two lines (no pooling and pooling) use exactly the same generated code, so there's no runtime slowdown; the pooling allocator's speedup is a pure win on instantiation speed (due to the virtual-memory tricks). memory-init-cow again has to do with instantiation speed and doesn't alter runtime. One other configuration you haven't run yet that might be interesting to try is pooling + no memory CoW (but without explicit bounds checks): that should give fully optimal generated code and no runtime slowdown.

(Reality-is-complicated footnote: there may be some effects in the margins with page fault latency that do actually affect runtime depending on the virtual memory strategy, but those effects should be much smaller than the difference between explicit and virtual-memory-based bounds checks.)

@thomastaylor312 (Author) commented Mar 1, 2024

Yep, I actually just tried pooling + no memory CoW, and that looks good. My remaining concern is how all the different pooling allocator knobs affect runtime/memory usage.

@cfallin (Member) commented Mar 1, 2024

runtime/memory usage

@thomastaylor312 the best answer is usually "try it and see", since tradeoffs can cross different inflection points depending on your particular workload.

As mentioned above, runtime (that is, execution speed of the Wasm) is unaffected by the pooling allocator. Explicit bounds checks are the only option here that modifies the generated code.

Memory usage (resident set size) might increase slightly if the "no CoW" option is set, because the pooling allocator keeps private pages around for each slot. (I don't remember if it retains a "warmed up" slot for a given image to reuse it though, @alexcrichton do you remember?) CoW is more memory-efficient because it can leverage shared read-only mappings of the heap image that aren't modified by the guest.
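
On the sizing question from earlier, the knobs that bound the pool's footprint are roughly the ones sketched below. Again a hedge: the names and values assume the 18.x PoolingAllocationConfig, the counts are hypothetical, and the per-slot cap on how large each linear memory may grow has been renamed across versions, so it is deliberately left out here.

    use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

    fn main() -> anyhow::Result<()> {
        let mut pooling = PoolingAllocationConfig::default();

        // Upper bounds on concurrency: the pool reserves a slot per instance,
        // memory, and table up front, so these counts (times the per-slot
        // limits) bound the pool's address-space reservation and worst-case RSS.
        pooling.total_core_instances(1_000); // hypothetical values
        pooling.total_memories(1_000);
        pooling.total_tables(1_000);

        // How much of a slot stays resident (and is zeroed with memset) on
        // reuse, versus being released back to the kernel with madvise.
        pooling.linear_memory_keep_resident(1 << 20);

        let mut config = Config::new();
        config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));
        let _engine = Engine::new(&config)?;
        Ok(())
    }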

@thomastaylor312 (Author)

Ok, between that and the docs that already exist on the pooling allocator, I think I'm good. I'll go ahead and close this, but hopefully this issue can be helpful to others who might be optimizing things. Thanks for the help, all!

@alexcrichton (Member)

I was inspired to summarize some of the bits and bobs here in #8038 to hopefully help future readers as well.

that might be interesting to try, is pooling + no memory CoW

...

Yep I actually just tried the pooling + no memory CoW and that looks good

This surprises me! I would not expect disabling CoW to provide much benefit when explicit bounds checks are still enabled. If anything I'd expect it to get a bit slower (like what I measured above).

I say this because even if you disable copy-on-write, we still use madvise to clear memory (regardless of whether bounds checks are enabled or not), which involves IPIs, and IPIs don't scale well. This might be a case of you running into something Chris has pointed out in the past: with CoW, a first read of a page faults in a read-only mapping, but a later write to the same page triggers the copy plus an IPI to clear the old mapping. This means that CoW, while beneficial for large heap images because it removes startup cost, may be less beneficial over time if pages are read-then-written, since that causes even more IPIs (which aren't scalable) over time.

Memory usage (resident set size) might increase slightly if the "no CoW" option is set, because the pooling allocator keeps private pages around for each slot.

I don't think this will actually be the case, because when a slot is deallocated we madvise-reset the entire linear memory, which should release all memory back to the kernel, so with and without CoW the RSS should be the same for deallocated slots.

Now for allocated slots, as you've pointed out, 1000 instances with CoW should have a lower RSS than 1000 instances without CoW, because read-only pages are shared in the CoW case.

github-merge-queue bot pushed a commit that referenced this issue Mar 1, 2024
…8038)

* Add commentary on advantages/disadvantages of the pooling allocator

Inspired by discussion on #8034

* Add `memory_init_cow` commentary on disadvantages

* Fix doc link