
Wasmtime throughput low on linux systems #8034

Closed · thomastaylor312 opened this issue Feb 29, 2024 · 13 comments
Labels: bug (Incorrect behavior in the current implementation that needs fixing)

Comments

@thomastaylor312

Test Case

This is the Wasm file, zipped up so it can be uploaded to GitHub. It is from https://github.com/sunfishcode/hello-wasi-http.git
hello_wasi_http.wasm.zip

Steps to Reproduce

Try these steps on a Linux machine and a macOS machine (preferably of a similar size):

  1. Run the component with wasmtime serve (no additional flags)
  2. Run hey -z 10s -c 100 http://localhost:8080/

Expected Results

I expect the number of requests/second to be the same or greater on Linux than on macOS.

Actual Results

On my Mac (OS details below), which was running a bunch of other applications, I get around 20k req/s.
On Linux (OS details below), I get around 4.3k req/s.

Versions and Environment

Wasmtime version or commit: 18.0.2

Mac
Operating system: Sonoma 14.3.1

Architecture: M1 Max (8 performance cores, 2 efficiency cores) and 64 GB of memory

Linux
Operating system: Debian Bookworm (6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux)

Architecture: AMD64 (16 cores) and 64 GB of memory

This was run on a cloud VM, but I also tested on an Ubuntu 20.04 amd64 server running at my house with similar performance.

Extra Info

On the Linux server, I double-checked that my file descriptor limit had been raised, and I also observed that the wasmtime processes were almost constantly in an uninterruptible sleep state throughout the test (which could mean nothing). I also ran a similar test with wasmCloud and Spin, which both use Wasmtime, and saw a similar drop in numbers between Mac and Linux. For reference, I also did some smoke tests with normal server traffic (one test with Caddy and one with NATS), and all of them easily got into the 100k+ req/s range. So this definitely seems like something on the Wasmtime side.

I did see #4637, which explains some of the horizontal scaling issues, but I didn't expect such a drastic difference between Mac and Linux.

thomastaylor312 added the bug label Feb 29, 2024
@fitzgen (Member) commented Feb 29, 2024

First off: are you enabling the pooling allocator? E.g. -O pooling-allocator on the CLI. Enabling or disabling the pooling allocator will greatly affect requests/second.

So will disabling virtual memory-based bounds checks and replacing them with explicit bounds checks (-O static-memory-maximum-size=0), which should increase requests/second for short-lived Wasm but will slow down Wasm execution by ~1.5x.

I expect the number of requests/second to be the same or greater on Linux than on macOS

I don't think we can make hard guarantees about this unless you disable virtual memory-based bounds checks completely, because the performance bottleneck for concurrent Wasm guests is the kernel's virtual memory subsystem. Even with the pooling allocator, we are bottlenecked on madvise-style operations and their associated IPIs. Without the pooling allocator, you're essentially benchmarking concurrent mmap.
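
For embedders that configure Wasmtime through the crate rather than the CLI (wasmCloud, Spin, etc.), roughly the same two settings might look like the sketch below. This is only a sketch: the method names are assumptions against the wasmtime 18.x Config/PoolingAllocationConfig API, so check the docs for your version.

    use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

    fn main() -> anyhow::Result<()> {
        let mut config = Config::new();

        // Use the pooling instance allocator instead of the default
        // on-demand (mmap-per-instance) allocator.
        config.allocation_strategy(InstanceAllocationStrategy::Pooling(
            PoolingAllocationConfig::default(),
        ));

        // Analogue of `-O static-memory-maximum-size=0`: compile explicit
        // bounds checks instead of relying on virtual-memory guard regions.
        config.static_memory_maximum_size(0);

        let _engine = Engine::new(&config)?;
        Ok(())
    }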

@fitzgen (Member) commented Feb 29, 2024

For example, here are the results I get on my ~6-year-old ThinkPad (4 cores / 8 hyperthreads) running Linux:

No pooling allocator:                        5983.7146 requests/second
Pooling allocator:                          34980.6398 requests/second
Pooling allocator + explicit bounds checks: 35368.5013 requests/second

(I'd expect the delta between the second and third configuration to be even greater on machines with more cores)

@fitzgen (Member) commented Feb 29, 2024

Ah, you also have to pass -O memory-init-cow=n to get rid of all the virtual memory interaction here. Once I do that I get the following results:

No pooling allocator:                                        5983.7146 requests/second
Pooling allocator:                                          34980.6398 requests/second
Pooling allocator + explicit bounds checks:                 35368.5013 requests/second
Pooling allocator + explicit bounds checks + no memory CoW: 45451.2630 requests/second

@cfallin (Member) commented Feb 29, 2024

@thomastaylor312 could you tell us more about your hardware on the Linux side? The reason I ask is that "cloud VM" is pretty vague -- it could be an older microarchitecture, or with oversubscribed cores, or something else. Without more specific details, I'm not sure why it's a given that RPS should be higher on Linux on hardware A vs. macOS on hardware B.

Also a possible experiment: are you able to run a Linux VM on your M1 Max hardware (even better, native Linux such as Asahi, but a VM in e.g. UTM is fine too), and test Wasmtime there? That would tell us a lot about how the raw CPU power actually compares.

@alexcrichton (Member)

I apologize if this is piling on a bit at this point, but I wanted to say the same as @fitzgen: this is probably -O pooling-allocator vs. not. Locally the difference I see is:

  • wasmtime serve - 5.5k rps
  • wasmtime serve -O pooling-allocator - 236k rps

This is perhaps an argument that we should turn on the pooling allocator by default for the wasmtime serve command!

Also, as mentioned in #4637, there are various knobs to help with the overhead of virtual memory here. They're not always applicable in all cases; for example, wasmtime serve -O pooling-allocator,memory-init-cow=n,static-memory-maximum-size=0,pooling-memory-keep-resident=$((10<<20)),pooling-table-keep-resident=$((10<<20)) yields 193k rps for me locally. There are zero interactions with virtual memory in the steady state, but wasmtime spends 70% of its time in memset resetting memory between HTTP requests.
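
For reference, a hedged sketch of roughly the same combination applied through the embedding API rather than the CLI: the keep-resident setters here are assumptions inferred from the CLI option names (linear_memory_keep_resident / table_keep_resident in recent PoolingAllocationConfig versions), so verify them against your wasmtime release.

    use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

    fn main() -> anyhow::Result<()> {
        let mut pooling = PoolingAllocationConfig::default();
        // Analogues of -O pooling-memory-keep-resident / pooling-table-keep-resident:
        // keep up to 10 MiB per slot resident (reset with memset) instead of
        // handing it back to the kernel with madvise between instantiations.
        pooling.linear_memory_keep_resident(10 << 20);
        pooling.table_keep_resident(10 << 20);

        let mut config = Config::new();
        config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));
        config.memory_init_cow(false);        // -O memory-init-cow=n
        config.static_memory_maximum_size(0); // -O static-memory-maximum-size=0

        let _engine = Engine::new(&config)?;
        Ok(())
    }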

@alexcrichton (Member)

(also thank you for such a detailed report!)

@thomastaylor312 (Author) commented Mar 1, 2024

@cfallin For sure! This is a GCP machine running on an AMD Rome series processor; the exact instance size is an n2d-standard-16. Also, to call out again, I did try this on a local Linux server running a slightly older Intel processor with similar results. I'm working on trying out some of the suggested options and will report back soon.

@thomastaylor312 (Author)

Here are my numbers:

No pooling allocator:                                        4714.3026 requests/second
Pooling allocator:                                          42696.3823 requests/second
Pooling allocator + explicit bounds checks:                 42603.7692 requests/second
Pooling allocator + explicit bounds checks + no memory CoW: 55162.2862 requests/second

So that seems to line up with what you were seeing, @fitzgen. What I'm a little unsure about are the tradeoffs involved here. I don't think I want static-memory-maximum-size=0, since I don't want to slow down execution of longer-lived components. I did already try out the pooling allocator in wasmCloud and saw the benefits, but all of the options were a little confusing with respect to what the memory footprint will be. I wasn't sure how to set all of those values to use the right amount of memory on any given machine. I was starting to make some guesses but wasn't entirely certain.

Also, are there any tradeoffs around using memory-init-cow=n with the pooling allocator?

@cfallin (Member) commented Mar 1, 2024

There are two performance "figures of merit": instantiation speed and runtime speed (how fast the Wasm executes once instantiation completes). The first two lines (no pooling and pooling) use exactly the same generated code, so there's no runtime slowdown; the pooling allocator's speedup is a pure win on instantiation speed (due to the virtual-memory tricks). memory-init-cow again has to do with instantiation speed and doesn't alter runtime. One other configuration you haven't run yet that might be interesting to try is pooling + no memory CoW (but without explicit bounds checks): that should give fully optimal generated code and no runtime slowdown.

(Reality-is-complicated footnote: there may be some effects in the margins with page fault latency that do actually affect runtime depending on the virtual memory strategy, but those effects should be much smaller than the difference between explicit and virtual-memory-based bounds checks.)

@thomastaylor312 (Author) commented Mar 1, 2024

Yep, I actually just tried pooling + no memory CoW, and that looks good. My remaining concern is how all the different pooling allocator knobs affect runtime/memory usage.

@cfallin (Member) commented Mar 1, 2024

runtime/memory usage

@thomastaylor312 the best answer is usually "try it and see", since tradeoffs can cross different inflection points depending on your particular workload.

As mentioned above, runtime (that is, execution speed of the Wasm) is unaffected by the pooling allocator. Explicit bounds checks are the only option here that modifies the generated code.

Memory usage (resident set size) might increase slightly if the "no CoW" option is set, because the pooling allocator keeps private pages around for each slot. (I don't remember if it retains a "warmed up" slot for a given image to reuse it though, @alexcrichton do you remember?) CoW is more memory-efficient because it can leverage shared read-only mappings of the heap image that aren't modified by the guest.
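
On the sizing question from earlier, the knobs that bound the pool's footprint are roughly the ones sketched below. Again a hedge: the names and values assume the 18.x PoolingAllocationConfig, the counts are hypothetical, and the per-slot cap on how large each linear memory may grow has been renamed across versions, so it is deliberately left out here.

    use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

    fn main() -> anyhow::Result<()> {
        let mut pooling = PoolingAllocationConfig::default();

        // Upper bounds on concurrency: the pool reserves a slot per instance,
        // memory, and table up front, so these counts (times the per-slot
        // limits) bound the pool's address-space reservation and worst-case RSS.
        pooling.total_core_instances(1_000); // hypothetical values
        pooling.total_memories(1_000);
        pooling.total_tables(1_000);

        // How much of a slot stays resident (and is zeroed with memset) on
        // reuse, versus being released back to the kernel with madvise.
        pooling.linear_memory_keep_resident(1 << 20);

        let mut config = Config::new();
        config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));
        let _engine = Engine::new(&config)?;
        Ok(())
    }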

@thomastaylor312 (Author)

Ok, between that and the docs that already exist on the pooling allocator, I think I'm good. I'll go ahead and close this, but hopefully this issue can be helpful to others who might be optimizing things. Thanks for the help, all!

@alexcrichton (Member)

I was inspired to summarize some of the bits and bobs here in #8038 to hopefully help future readers as well.

that might be interesting to try, is pooling + no memory CoW

...

Yep I actually just tried the pooling + no memory CoW and that looks good

This surprises me! I would not expect disabling CoW to provide much benefit when explicit bounds checks are still enabled. If anything I'd expect it to get a bit slower (like what I measured above).

I say this because even if you disable copy-on-write, we still use madvise to clear memory (regardless of whether bounds checks are enabled or not), which involves IPIs, and IPIs don't scale well. This might be a case of you running into something Chris has pointed out in the past: with CoW, a first read of a page faults in a read-only mapping, but a later write to the same page triggers the copy plus an IPI to clear the old mapping. This means that CoW, while beneficial for large heap images because it removes startup cost, may be less beneficial over time if pages are read-then-written, since that causes even more IPIs (which aren't scalable) over time.

Memory usage (resident set size) might increase slightly if the "no CoW" option is set, because the pooling allocator keeps private pages around for each slot.

I don't think this will actually be the case, because when a slot is deallocated we madvise-reset the entire linear memory, which should release all memory back to the kernel, so with and without CoW the RSS should be the same for deallocated slots.

Now for allocated slots, as you've pointed out, 1000 instances with CoW should have a lower RSS than 1000 instances without CoW, because read-only pages are shared in the CoW case.

github-merge-queue bot pushed a commit that referenced this issue Mar 1, 2024
…8038)

* Add commentary on advantages/disadvantages of the pooling allocator

Inspired by discussion on #8034

* Add `memory_init_cow` commentary on disadvantages

* Fix doc link