
Implement lazy funcref table and anyfunc initialization. #3733

Merged
cfallin merged 1 commit into bytecodealliance:main from cfallin:lazy-anyfuncs on Feb 9, 2022

Conversation

@cfallin (Member) commented on Jan 27, 2022:

During instance initialization, we build two sorts of arrays eagerly:

  • We create an "anyfunc" (a VMCallerCheckedAnyfunc) for every function
    in an instance.

  • We initialize every element of a funcref table with an initializer to
    a pointer to one of these anyfuncs.

Most instances will not touch (via call_indirect or table.get) all
funcref table elements. And most anyfuncs will never be referenced,
because most functions are never placed in tables or used with
ref.func. Thus, both of these initialization tasks are quite wasteful.
Profiling shows that a significant fraction of the remaining
instance-initialization time after our other recent optimizations is
going into these two tasks.

This PR implements two basic ideas:

  • The anyfunc array can be lazily initialized as long as we retain the
    information needed to do so. A zero in the func-ptr part of the tuple
    means "uninitalized"; a null-check and slowpath does the
    initialization whenever we take a pointer to an anyfunc.

  • A funcref table can be lazily initialized as long as we retain a link
    to its corresponding instance and function index for each element. A
    zero in a table element means "uninitialized", and a slowpath does the
    initialization.

The use of all-zeroes to mean "uninitialized" means that we can use fast
memory clearing techniques, like madvise(DONTNEED) on Linux or just
freshly-mmap'd anonymous memory, to get to the initial state without
a lot of memory writes.

Funcref tables are a little tricky because funcrefs can be null. We need
to distinguish "element was initially non-null, but user stored explicit
null later" from "element never touched" (ie the lazy init should not
blow away an explicitly stored null). We solve this by stealing the LSB
from every funcref (anyfunc pointer): when the LSB is set, the funcref
is initialized and we don't hit the lazy-init slowpath. We insert the
bit on storing to the table and mask it off after loading.
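
As a concrete illustration of that tagging scheme, here is a minimal Rust sketch. The struct is only a stub standing in for wasmtime's real `VMCallerCheckedAnyfunc`, and the helper names are invented for this example; the actual table code differs in detail.

```rust
/// Stub standing in for wasmtime's `VMCallerCheckedAnyfunc` (illustration only).
#[repr(C)]
pub struct VMCallerCheckedAnyfunc {
    pub func_ptr: *const u8,
    pub type_index: u32,
    pub vmctx: *mut u8,
}

/// LSB of a table element: set once the slot has been initialized.
/// Anyfunc pointers are well aligned, so the low bit is always free.
const FUNCREF_INIT_BIT: usize = 1;

/// Tag a funcref (which may be a genuine null) before storing it in a table
/// slot; the set LSB records "this element has been initialized".
fn tag_for_store(ptr: *mut VMCallerCheckedAnyfunc) -> usize {
    (ptr as usize) | FUNCREF_INIT_BIT
}

/// Read a raw table slot. `None` means "never initialized": the caller must
/// take the lazy-init slowpath. Otherwise the tag bit is masked off, and the
/// result may still be a null funcref that the user stored explicitly.
fn untag_on_load(raw: usize) -> Option<*mut VMCallerCheckedAnyfunc> {
    if raw == 0 {
        None
    } else {
        Some((raw & !FUNCREF_INIT_BIT) as *mut VMCallerCheckedAnyfunc)
    }
}
```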

Performance effect on instantiation in the on-demand allocator (pooling
allocator effect should be similar as the table-init path is the same):

(before)

sequential/default/spidermonkey.wasm
                        time:   [71.886 us 72.012 us 72.133 us]

(after)

sequential/default/spidermonkey.wasm
                        time:   [22.243 us 22.256 us 22.270 us]
                        change: [-69.117% -69.060% -69.000%] (p = 0.00 < 0.05)
                        Performance has improved.

So, 72µs to 22µs, or a 69% reduction.

@cfallin (Member, Author) commented on Jan 27, 2022:

Benchmarking: the delta is hard to see with memory initialization taking most of the time (pre-#3697); a few percent improvement in the mean with spidermonkey.wasm but Criterion says not statistically significant.

However, in #3697 I measured approximately a factor-of-two improvement in the initialize_vmcontext hotpath (which was most of instantiation, in a scenario where we have a memory technique to reuse mappings).

@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Jan 27, 2022

@alexcrichton (Member) commented:

Thanks for splitting this out! I've mentioned a few things here in other contexts but I think it's good to get them all in one place. The tl;dr is that I don't personally feel like this is the best approach, but I think we can still get the same wins.

Due to a number of implementation details I don't think that this PR is really getting all that much of a benefit from laziness. The anyfunc array is used as storage for a module's *mut VMCallerCheckedAnyfunc which is used as the representation of a funcref. This is subsequently used by things like ref.func, tables, and exports. Today the anyfunc array is the size of the module's index space of functions, but this is actually much larger than necessary because we only use it for those items, and not all functions make their way into ref.func, tables, or exports. This PR also doesn't lazily initialize tables, only the anyfunc array, which means we still initialize all anyfuncs used as part of tables.

Adding all that together we still pay the full cost of anyfunc initialization for all table entries when a module is instantiated. The cost of initializing exports is done lazily and most modules nowadays don't have ref.func instructions so there's not much cost there. Overall this means that this PR is improving the amount of work done, but I think it's doing it in a bit of a roundabout way. Effectively we're immediately initializing a static subset of the anyfunc array on initialization and we're never actually initializing anything else after that (modulo exports but that's such a small set I don't think it really matters all that much).

Given all that, I personally think that a better implementation would be to avoid the laziness complications and instead just go ahead and keep doing what we do today, only over the same set of functions that this PR is doing it for. Basically I think we should shrink the anyfunc array to the size of the escaped_funcs array. That I think means that the same magnitude of work would be done on instantiation and there wouldn't be any need to deal with laziness anywhere.


As a few other side things:

  • For "benchmarking" this I think it might be good to take a module with 1k functions and a table with 100 of those functions and instantiate that. That should get memory out of the picture and should help focus on just the time it takes to deal with all the anyfuncs.
  • For the race comment, I'm under the impression that because the writes are non-atomic that any concurrent access to get_caller_checked_anyfunc would result in a data race. I don't think we have concurrent accesses today though so this is probably fine, but if we were to stick with this PR I'd prefer to either make get_caller_checked_anyfunc unsafe or make it take &mut self.

Ok now all that being said, one thing you mentioned the other day which I'm coming around to is that long-term I think it would be best to lazily initialize tables instead of just this anyfunc array. With CoW and uffd we're already effectively lazily initializing memory (just relying on the kernel to do so), and I think it would be best to do that for tables as well. I don't really know how we'd do this off the top of my head though.

Alternatively we could try to figure out how to make the table initialization much more "static" where work is done at compile time instead of instantiation time. I had one idea over here but that's sort of half-baked and only shrinks VMCallerCheckedAnyfunc, it doesn't offload too much to compile time. Overall doing something at compile-time I think would also be better than the lazy approach when we can.

@cfallin (Member, Author) commented on Jan 27, 2022:

@alexcrichton thanks for your thoughts here!

Top-level: I agree that in practice, eagerly initializing a smaller array-of-anyfuncs which is only "anyfuncs referenced in tables or by ref.func" is more-or-less equivalent to this PR's lazy approach. (In the presence of many ref.funcs referring to functions that are not in tables, this PR should still win a bit, I think.)

The main attraction I had to this approach was simplicity and small impact -- indexing via escaped_funcs indices is definitely workable too but implies a bit more indirection.

The other thing about a pervasively lazy approach is that it paves the way for more gains later, as I think you're getting at with your second-to-last paragraph. If we're lazy about requesting anyfunc pointers (ie not initializing table elements until accessed), then the laziness in filling in the pointed-to anyfuncs finally pays off. And I think the change to get to this point is pretty simple -- just a null check when loading from the table and a slowpath that calls get_caller_checked_anyfunc from inside a libcall, then initializes the table with the result.
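
A rough sketch of what that slowpath could look like on the runtime side, reusing the stub `VMCallerCheckedAnyfunc` type and tag bit from the earlier example; the function names here are illustrative, not wasmtime's actual API:

```rust
/// Illustrative lazy `table.get`: a zero slot means "uninitialized", so we
/// call back into the runtime (the "libcall"), build or fetch the anyfunc
/// for the function this slot was statically declared to hold, and backfill
/// the slot with the tag bit set so later accesses stay on the fast path.
fn lazy_table_get(
    elements: &mut [usize],
    index: usize,
    make_anyfunc_for_slot: &mut dyn FnMut(usize) -> *mut VMCallerCheckedAnyfunc,
) -> *mut VMCallerCheckedAnyfunc {
    let raw = elements[index];
    if raw != 0 {
        // Fast path: already initialized; strip the tag bit.
        return (raw & !FUNCREF_INIT_BIT) as *mut VMCallerCheckedAnyfunc;
    }
    // Slow path: construct the anyfunc (this is where something like
    // get_caller_checked_anyfunc would be invoked in the real runtime).
    let anyfunc = make_anyfunc_for_slot(index);
    elements[index] = (anyfunc as usize) | FUNCREF_INIT_BIT;
    anyfunc
}
```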

Of course we could do both, and that has benefits too -- both shrinking the size of the vmcontext, and maybe doing things eagerly at first, does not preclude us from lazily initializing later...

A specific note too:

For the race comment, I'm under the impression that because the writes are non-atomic that any concurrent access to get_caller_checked_anyfunc would result in a data race.

Yes, exactly, the goal is to take advantage of a benign race (and come out with correct results regardless). Maybe to make the compiler happy we need to do this all through atomics and Ordering::Relaxed, as long as the bitmap update compiles down to something without the `lock` prefix on x86-64; the `lock or` was the major hotspot in my tests otherwise.

The comment is attempting to give a safety argument why this is still OK in the presence of concurrent accesses; we should do whatever we need to at the language semantics level but I think I've convinced myself that the resulting machine code / access pattern is sound :-)

@cfallin (Member, Author) commented on Jan 28, 2022:

@alexcrichton I've reformulated the benign-racy init path now to use atomics for the bitmap accesses at least. There is still a (very intentional!) race, in that the bit-set is not done with an atomic fetch_or but rather by just storing the loaded value with the new bit set, and no compare-exchange or anything like that. This is to avoid any pipeline serializations, bus locking, etc in the hot-path. The comment explains why this is safe; the tl;dr is that it depends on the fact that all initializations write the same values (idempotent writes) so spurious re-initializations are acceptable.

@alexcrichton (Member) commented:

I would personally not say that this PR is simple with a small impact, due to a few consequences:

  • There's a new InstanceAllocationInfo structure to juggle which should theoretically be managed with other module-level resources like Arc<Module> but isn't. This means while it logically has the same lifetime as other metadata it practically doesn't because it was developed at a different time.
  • This inflates the size of VMContext with extra flag bits for whether types are initialized.
  • This, as currently written, tries to deal with a race that never actually happens in practice. I don't believe that the get_caller_checked_anyfunc function is ever actually called concurrently (in that Wasmtime's APIs I don't think provide any means of doing so) so the extra synchronization/comments aren't actually necessary.

Overall I continue to feel that this is a fair bit of complication, especially with InstanceAllocationInfo, for what ends up being not a huge benefit. This current lazy strategy really isn't all that lazy, it basically immediately initializes everything used for tables/exports and then never initializes anything else in the anyfunc array, so the laziness is never actually leveraged (it just makes initial initialization slower since the flags are all checked and written).

I'm less certain about the idea of paving the way for gains later as well. I understand how this is sort of theoretically applicable but we're talking about nanoseconds-per-element "laziness" which feels like too small of a granularity to manage to me. If we get around to implementing a fully-lazily-initialized table approach (which I'm not entirely sure how we'd do that) it seems like it might be at a more bulk-level instead of at a per-element level. I would also expect that whatever laziness mechanism that would use would subsume this one since there's not really much need for cascading laziness.

Naively I would expect that using escaped_funcs would be workable to implement. It seems like we'd assign entries in escaped_funcs an index (which would probably be a side table stored in Module) and then ref.func would use this index to get to the anyfunc array as well as table element initialization going through this (perhaps pre-computing something else in Module to avoid too many levels of array accesses during initialization).
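
A hypothetical shape for that side table, with types and field names invented for illustration (wasmtime's actual `Module` layout differs):

```rust
use std::collections::HashMap;

/// Index into a module's full function index space.
#[derive(Copy, Clone, PartialEq, Eq, Hash)]
pub struct FuncIndex(pub u32);

/// Compact index into the per-instance anyfunc array, which would only have
/// slots for functions that "escape" (tables, exports, ref.func).
#[derive(Copy, Clone)]
pub struct AnyfuncIndex(pub u32);

/// Hypothetical compile-time side table stored alongside the module.
#[derive(Default)]
pub struct EscapedFuncs {
    map: HashMap<FuncIndex, AnyfuncIndex>,
}

impl EscapedFuncs {
    /// Record an escaping function, assigning it the next compact slot.
    pub fn insert(&mut self, f: FuncIndex) -> AnyfuncIndex {
        let next = AnyfuncIndex(self.map.len() as u32);
        *self.map.entry(f).or_insert(next)
    }

    /// `None` means the function never escapes and needs no anyfunc at all.
    pub fn anyfunc_slot(&self, f: FuncIndex) -> Option<AnyfuncIndex> {
        self.map.get(&f).copied()
    }
}
```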

I've reformulated the benign-racy init path now to use atomics for the bitmap accesses at least

I believe this is still a theoretical data race that Rust disallows. Despite the machine code working just fine at the LLVM layer any concurrent unsynchronized writes are a data race and undefined behavior, so any concurrent writes would have to be atomic. (again though I don't think it's worthwhile trying to fix this because I don't think this overall laziness approach is what we want either)

@cfallin (Member, Author) commented on Jan 28, 2022:

@alexcrichton thanks for your comments!

I'm less certain about the idea of paving the way for gains later as well. I understand how this is sort of theoretically applicable but we're talking about nanoseconds-per-element "laziness" which feels like too small of a granularity to manage to me. If we get around to implementing a fully-lazily-initialized table approach (which I'm not entirely sure how we'd do that) it seems like it might be at a more bulk-level instead of at a per-element level. I would also expect that whatever laziness mechanism that would use would subsume this one since there's not really much need for cascading laziness.

So, a few things:

  1. I think the per-element lazy initialization is going to be an important part of any lazy table init. As I think @fitzgen mentioned earlier as well, there's not necessarily any spatial correlation between uses of adjacent table elements. So imagine I have 1000 funcs in a table, and I pull out index 553 to return to you, and never touch anything else. Do we initialize the whole table at that point? Or a block of 100 surrounding the accessed index? The larger the "bulk init", the greater the waste -- accessing index 553 probably doesn't mean I'm more likely to access 554 or 552 in the future than any other index.

  2. Given that, I think that "cascading laziness" is not really the right way to characterize this change, plus a lazy-table-init change on top. It's more like this change is a necessary implementation piece of the latter. If the top layer is lazy (at the whole-table, or table-chunk, or individual-element level) but we don't have a way of avoiding setting up the anyfuncs the top layer points to, then we haven't carried through the work savings.

  3. Re: "I'm not entirely sure how we'd do that" (lazy table init), do you have any thoughts on the scheme I proposed, with a null check on each table.get that branches to a slowpath libcall?

  4. Reducing to the list of escaped-funcs only in the anyfuncs table is definitely possible, and I'm happy to go that way if you'd prefer for now. But re: above, I think it doesn't really do anything to prepare us for a lazy-table-init world. This change was designed to sort of push us in that direction, and avoid the need for side-tables and indirections.

Otherwise, simplicity is subjective, I guess :-) The factoring of state here is I think going to be necessary for most approaches that initialize this general state after instantiation -- we probably can't get away from having e.g. signature info at libcall time.

I won't argue the concurrency bit here is complex -- maybe the better answer is to put together an actual lazy-table-init implementation, and then get rid of the initialization bitmap and go back to the zero-func-ptr-means-uninitialized design I had earlier.

Maybe it would help if I "drew the rest of the owl" here and actually sketched lazy table initialization as well? This would perhaps make the picture a bit clearer how all of this could work together, at least as I'm imagining it.

@alexcrichton (Member) commented:

Ah so my thinking of a more bulk-style initialization is me being inspired by what we're doing for memories where we fault in by page rather than all at once. I'm sort of naively assuming that some level of granularity would also be similar on tables, but byte access patterns may not really have any correlation to wasm-table-access-patterns so it may just be a flawed assumption on my part.

Also I think I understand now more how something like this would be required for full laziness. I'm still somewhat skeptical that it would need to look precisely like this but I see how my thinking of cascading laziness being unnecessary is flawed.

And finally sorry I meant to respond to the table point but I forgot! That does sound workable but I think we still need to be careful. For example if an element segment initializes an imported table I don't think that we can skip that. Additionally we'd have to be careful to implement Table::get properly in the embedding API for lazily-initialized tables (I don't think that all the access to do the lazy initialization is trivially there today).


I think that my preference at this point is to largely let performance guide us. The performance win in this PR is to not initialize anyfuncs for everything that never needs an anyfunc. There's still some unrealized performance wins though:

  • Fully lazily initializing anyfuncs (requires lazy tables since table init today force-initializes the anyfuncs)
  • Shrinking the vmcontext to only have anyfuncs for what's necessary

Given that this is all performance work effectively I think I'd prefer to be guided by numbers. I would be curious to benchmark this implementation here vs one that shrinks the anyfunc array like I've been thinking. The performance difference may come out in the wash but given how small the numbers are here I could imagine it being significant. I think we could then use a similar benchmark for measuring a fully-lazy-table approach.

Overall I'm still hesitant to commit to design work which will largely only come to fruition in the future when the future is still somewhat uncertain (e.g. the precise design of a fully-lazy-table approach). In that sense I'd prefer to get the win we'll always want in any case, shrinking the VMContext, here and work on the laziness afterwards. (or doing it all at once is fine, but I'd prefer to not be in an inbetween state where we still have a big VMContext and some of it is initialized at instantiation time)

@cfallin (Member, Author) commented on Jan 29, 2022:

I believe this is still a theoretical data race that Rust disallows. Despite the machine code working just fine at the LLVM layer any concurrent unsynchronized writes are a data race and undefined behavior, so any concurrent writes would have to be atomic. (again though I don't think it's worthwhile trying to fix this because I don't think this overall laziness approach is what we want either)

I was almost done with my Friday, then a thought occurred to me -- we can just make the anyfunc fields atomics as well and do relaxed-ordering loads/stores. This satisfies the language's memory model; now all racing accesses are to atomics, and there is a Release-to-Acquire edge on the bitmap update that ensures the relaxed stores during init are seen by relaxed loads during normal use. This part at least should be fully theoretically sound now, I think.
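
A sketch of that ordering scheme follows. This is illustration only, not the code that landed (the PR later dropped the atomics entirely in favor of `&mut self`); the type and function names are made up for the example.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Illustrative lazily-initialized anyfunc slot: fields are atomics so that
/// racing, idempotent initializations are not UB under Rust's memory model.
pub struct LazySlot {
    func_ptr: AtomicUsize,
    type_index: AtomicUsize,
}

/// Ensure `slot` is initialized, publishing it via one bit of `bitmap_word`.
/// All initializers write identical values, so losing a race is harmless.
pub fn ensure_init(bitmap_word: &AtomicU64, bit: u32, slot: &LazySlot,
                   func_ptr: usize, type_index: usize) {
    // Acquire pairs with the Release below: if we observe the bit, we also
    // observe the Relaxed field stores made before it was set.
    if bitmap_word.load(Ordering::Acquire) & (1 << bit) != 0 {
        return;
    }
    slot.func_ptr.store(func_ptr, Ordering::Relaxed);
    slot.type_index.store(type_index, Ordering::Relaxed);
    // Deliberately a plain load + store rather than `fetch_or`, to avoid a
    // locked read-modify-write on the hot path. A racing set of some other
    // bit in this word can be lost, which only causes a spurious,
    // idempotent re-initialization later.
    let word = bitmap_word.load(Ordering::Relaxed);
    bitmap_word.store(word | (1 << bit), Ordering::Release);
}
```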

I'll do the rest of the lazy-table init as it is in my head on Monday, just to sketch it and be able to measure, I think. This work will be compatible with / complementary to any work that reduces the number of anyfuncs as well...

@alexcrichton (Member) commented:

Er sorry but to reiterate again, I do not think anything should be atomic here. This method is statically not possible to call concurrently (or so I believe) so I do not think we should be "working around" the compiler in this case. Instead I think the method should change to &mut self or otherwise be unsafe and indicate that callers must be &mut self or otherwise prevent concurrent calls.

@cfallin (Member, Author) commented on Jan 31, 2022:

Ah, OK; I had been assuming that concurrent calls actually are possible via the "get an export" path. Within wasmtime_runtime at least, that's all via &self methods. But I see that at the wasmtime layer this takes a &mut Store (or AsContextMut or whatnot) so we actually do have mutual exclusion. I agree that given this, just using "unsafe" is way simpler.

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:wasm labels Feb 3, 2022
@cfallin cfallin changed the title Implement lazy VMCallerCheckedAnyfunc initialization. Implement lazy funcref table and anyfunc initialization. Feb 3, 2022
@cfallin (Member, Author) commented on Feb 3, 2022:

I've implemented lazy table initialization now (mostly; there's one odd test failure with linking I need to resolve). It could probably be cleaned up somewhat, but here's one initial datapoint, with spidermonkey.wasm:

(before)

sequential/default/spidermonkey.wasm
                        time:   [71.886 us 72.012 us 72.133 us]

(after)

sequential/default/spidermonkey.wasm
                        time:   [22.243 us 22.256 us 22.270 us]
                        change: [-69.117% -69.060% -69.000%] (p = 0.00 < 0.05)
                        Performance has improved.

So, 72µs to 22µs, or a 69% reduction.

@alexcrichton (Member) left a review comment:

I've left some initial comments below, but the biggest one is that I don't think this is lazy at least in the way I thought it would be lazy. Instantiating a module locally still seems to apply all element segment initializers on instantiation rather than lazily computing them after start-up.

Otherwise though I feel that it's best to apply a refactoring either before this PR or in this PR about all the little Arc allocations that are popping up. I think they should all be housed in one "unit" to avoid having to piecemeal hand everything from the wasmtime crate to the wasmtime_runtime crate as separate Arc allocations.

Overall though this approach seems pretty reasonable to me. The codegen changes in func_environ.rs all look as I'd expect from a lazily-initialized scheme.

As a final note though I feel like this is significant enough that we should really shrink the size of the VMCallerCheckedAnyfunc array either before this PR or during this PR. Especially if we're still memset-ing the entire array to 0 during instantiation for the on-demand allocator then shrinking this array seems like it will be quite beneficial because it's less memory to initialize. If you'd like though I can work on this change independently.

/// here to fields that should be initialized to zero. This allows the
/// caller to use an efficient means of zeroing memory (such as using
/// anonymous-mmap'd zero pages) if available, rather than falling
/// back onto a memset (or the manual equivalent) here.
@alexcrichton (Member) commented on the doc-comment excerpt above (from the diff):

Right now I think that this may be detrimental to the on-demand allocator because it either does a too-large memset right now for the entire VMContext or it will need to call calloc, which also does a too-large memset.

An alternative design, though, would be to pass a flag into this function whether the memory is known to be zero'd. That way we could conditionally zero out the array of VMCallerCheckedAnyfunc contents based on whether it was pre-zeroed already, making the pooling allocator efficiently use madvise while the on-demand allocator still initializes as little memory as possible.

After saying this, though, I think it may actually be best to take this incrementally? What do you think about using memset here to zero memory and benchmarking later how much faster using madvise is for the pooling allocator? I'm a little worried about the madvise traffic being increased for the pooling allocator since we already know it's a source of slow tlb-shootdowns, and I'm afraid that any cost of that will be hidden by the other performance gains in this PR so we can't accurately measure the strategy of memset-on-initialize vs madvise-to-zero.

@cfallin (Member, Author) replied:

Modified to take a prezeroed flag as suggested -- good idea!
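
For reference, a minimal sketch of what such a `prezeroed` flag amounts to. The real `initialize_vmcontext` has a very different signature; this only illustrates the conditional zeroing of the anyfunc region under the zero-means-uninitialized scheme being discussed at this point, and the function name is invented.

```rust
/// Illustration only. `prezeroed` indicates the allocator already handed us
/// zeroed memory (freshly mmap'd anonymous pages, or madvise(DONTNEED)'d
/// pooling-allocator slots), so the explicit memset can be skipped.
///
/// # Safety
/// `anyfunc_region` must point to a writable region of `len_bytes` bytes.
pub unsafe fn zero_anyfunc_region(anyfunc_region: *mut u8, len_bytes: usize, prezeroed: bool) {
    if !prezeroed {
        // On-demand allocator path: contents are arbitrary, so establish
        // the all-zero ("uninitialized") state with a memset.
        std::ptr::write_bytes(anyfunc_region, 0, len_bytes);
    }
    // Otherwise: nothing to do, the memory is already all zeroes.
}
```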

I'm a little hesitant to settle on a memset-only design even for the pooling allocator, as my earlier measurements with a large module (spidermonkey.wasm) showed that that became really quite hot; that's why I had gone to the initialized-bitmap instead (1 bit vs 24 bytes to zero --> 192x less). But I will benchmark this to be sure on Monday.

Alternatively, I think that if we move to a single-slot-per-instance design for the pooling allocator, such that we have one madvise for everything (memories, tables), stretching that range to include a few extra pages for the Instance and VMContext should be no big deal (it will share one mmap-lock acquire and one set of IPIs), so I suspect that madvise-cost may not have to be a huge consideration in the future. Of course getting there implies removing uffd first, since it depends on the separate-pools design, so we would have to make that decision first.

@cfallin (Member, Author) commented on Feb 3, 2022:

@alexcrichton (and anyone else watching) argh, sorry about that, I pushed my changes prior to last night's measurements to a private save-the-work-in-progress fork that I keep, but not here; just pushed now. Regardless the review comments are useful and I'll keep refining this!

@cfallin cfallin force-pushed the lazy-anyfuncs branch 3 times, most recently from 352f990 to fede8a8 Compare February 4, 2022 02:55
@alexcrichton (Member) commented:

For the benchmark numbers, I posted #3775, which I think will tweak the benchmark to better serve measuring this. Running that locally for a large module I'm seeing the expected exponential slowdown for the pooling allocator, and also as expected the mmap allocator is behaving better here due to fewer IPIs needed (as it's all contended on the VMA lock).

Could you try benchmarking with that?

@cfallin (Member, Author) commented on Feb 7, 2022:

Sure; here's a benchmarking run using your updated instantiation benchmark. I include baselines for default and pooling, but my local changes don't affect default (the mmap allocator) -- in pooling.rs, I remove the call to decommit_instance_pages and pass false to prezeroed on initialize_vmcontext instead, and that's it.

================================================================================
With madvise() to flash-zero Instance/VMContext:
================================================================================

sequential/default/empty.wat
                        time:   [1.1723 us 1.1752 us 1.1778 us]
sequential/pooling/empty.wat
                        time:   [2.4567 us 2.4608 us 2.4652 us]

parallel/default/empty.wat: with 1 background thread
                        time:   [1.1596 us 1.1628 us 1.1666 us]
parallel/default/empty.wat: with 16 background threads
                        time:   [20.918 us 21.125 us 21.307 us]
parallel/pooling/empty.wat: with 1 background thread
                        time:   [2.4203 us 2.4224 us 2.4249 us]
parallel/pooling/empty.wat: with 16 background threads
                        time:   [25.422 us 25.860 us 26.295 us]

sequential/default/small_memory.wat
                        time:   [5.5100 us 5.5230 us 5.5361 us]
sequential/pooling/small_memory.wat
                        time:   [3.0336 us 3.0356 us 3.0376 us]

parallel/default/small_memory.wat: with 1 background thread
                        time:   [5.3376 us 5.3488 us 5.3614 us]
parallel/default/small_memory.wat: with 16 background threads
                        time:   [95.562 us 96.023 us 96.485 us]
parallel/pooling/small_memory.wat: with 1 background thread
                        time:   [3.0222 us 3.0270 us 3.0316 us]
parallel/pooling/small_memory.wat: with 16 background threads
                        time:   [172.62 us 175.12 us 177.54 us]

sequential/default/data_segments.wat
                        time:   [6.3065 us 6.3094 us 6.3120 us]
sequential/pooling/data_segments.wat
                        time:   [2.8083 us 2.8116 us 2.8147 us]

parallel/default/data_segments.wat: with 1 background thread
                        time:   [6.0155 us 6.0290 us 6.0407 us]
parallel/default/data_segments.wat: with 16 background threads
                        time:   [141.99 us 142.58 us 143.27 us]
parallel/pooling/data_segments.wat: with 1 background thread
                        time:   [2.7353 us 2.7533 us 2.7687 us]
parallel/pooling/data_segments.wat: with 16 background threads
                        time:   [40.435 us 41.035 us 41.644 us]

sequential/default/wasi.wasm
                        time:   [6.6997 us 6.7118 us 6.7264 us]
sequential/pooling/wasi.wasm
                        time:   [3.7074 us 3.7099 us 3.7129 us]

parallel/default/wasi.wasm: with 1 background thread
                        time:   [6.7795 us 6.7964 us 6.8131 us]
parallel/default/wasi.wasm: with 16 background threads
                        time:   [154.73 us 155.36 us 156.03 us]
parallel/pooling/wasi.wasm: with 1 background thread
                        time:   [3.8161 us 3.8195 us 3.8231 us]
parallel/pooling/wasi.wasm: with 16 background threads
                        time:   [60.806 us 61.850 us 62.927 us]

sequential/default/spidermonkey.wasm
                        time:   [15.974 us 15.983 us 15.993 us]
sequential/pooling/spidermonkey.wasm
                        time:   [6.0185 us 6.0215 us 6.0248 us]

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [16.189 us 16.201 us 16.215 us]
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [165.91 us 167.16 us 168.51 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [5.9293 us 5.9348 us 5.9403 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [55.862 us 57.049 us 58.373 us]

================================================================================
With explicit memset:
  ("default"-policy was not changed, so excluding)
================================================================================

sequential/pooling/empty.wat
                        time:   [1.2088 us 1.2111 us 1.2136 us]
                        change: [-51.445% -51.270% -51.113%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/pooling/empty.wat: with 1 background thread
                        time:   [1.1838 us 1.1853 us 1.1869 us]
                        change: [-50.952% -50.778% -50.613%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/empty.wat: with 16 background threads
                        time:   [21.178 us 21.353 us 21.515 us]
                        change: [-18.533% -17.405% -16.183%] (p = 0.00 < 0.05)
                        Performance has improved.

sequential/pooling/small_memory.wat
                        time:   [1.7699 us 1.7719 us 1.7740 us]
                        change: [-41.479% -41.392% -41.293%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/pooling/small_memory.wat: with 1 background thread
                        time:   [1.7915 us 1.7930 us 1.7945 us]
                        change: [-40.593% -40.487% -40.383%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/small_memory.wat: with 16 background threads
                        time:   [65.378 us 66.775 us 68.080 us]
                        change: [-62.639% -61.536% -60.394%] (p = 0.00 < 0.05)
                        Performance has improved.

sequential/pooling/data_segments.wat
                        time:   [1.5276 us 1.5288 us 1.5301 us]
                        change: [-45.622% -45.551% -45.475%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/pooling/data_segments.wat: with 1 background thread
                        time:   [1.5664 us 1.5716 us 1.5776 us]
                        change: [-42.192% -41.806% -41.407%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/data_segments.wat: with 16 background threads
                        time:   [30.378 us 30.951 us 31.554 us]
                        change: [-24.975% -23.479% -21.963%] (p = 0.00 < 0.05)
                        Performance has improved.

sequential/pooling/wasi.wasm
                        time:   [2.0274 us 2.0401 us 2.0519 us]
                        change: [-46.488% -46.206% -45.904%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/pooling/wasi.wasm: with 1 background thread
                        time:   [1.9543 us 1.9559 us 1.9579 us]
                        change: [-48.881% -48.790% -48.700%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/wasi.wasm: with 16 background threads
                        time:   [49.629 us 50.532 us 51.431 us]
                        change: [-21.128% -19.627% -18.124%] (p = 0.00 < 0.05)
                        Performance has improved.

sequential/pooling/spidermonkey.wasm
                        time:   [12.469 us 12.556 us 12.635 us]
                        change: [+108.02% +109.03% +110.10%] (p = 0.00 < 0.05)
                        Performance has regressed.

parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [12.523 us 12.562 us 12.595 us]
                        change: [+109.29% +110.20% +111.03%] (p = 0.00 < 0.05)
                        Performance has regressed.
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [254.13 us 277.81 us 304.77 us]
                        change: [+410.73% +449.63% +496.28%] (p = 0.00 < 0.05)
                        Performance has regressed.

So, switching to an explicit memset, we see performance improvements in all cases except the large SpiderMonkey module (31k functions in my local build), where using explicit memset is 5.49x slower (+449%).

Either way, it's clear to me that we'll feel some pain on the low end (madvise) or high end (memset), so the ultimate design probably has to be to incorporate this flash-clearing into an madvise we're already doing (single-madvise idea). I can implement that as soon as we confirm that we want to remove uffd. I guess the question is just which we settle on in the meantime :-)

@cfallin (Member, Author) commented on Feb 7, 2022:

Hmm, actually, the other option is to bring back the bitmap approach from the first versions of this PR above. I think the atomics scared everyone off but by now I've propagated &mut self everywhere it needs to be, so this should be pretty straightforward. Then we can always memset the bitmap, but that's relatively small and should always be cheaper than an madvise, even on SpiderMonkey...

@cfallin cfallin force-pushed the lazy-anyfuncs branch 2 times, most recently from d46057a to b9e160e Compare February 7, 2022 23:57
@alexcrichton (Member) commented:

I'm a bit perplexed with the numbers I'm seeing locally. Using this small program I measured that the difference in memset-vs-madvise to clear 16 pages of memory is:

threads    memset    madvise
1          691ns     234ns
4          752ns     7995ns
8          766ns     15991ns

This is what I'd expect which is that memset has a relatively constant cost where madvise gets worse as you add more threads since there's more IPIs. I don't understand why memset gets slightly worse when you add more threads, however. (these are all threads operating on disjoint pages of memory).

So it doesn't really make sense to me that using memset in your benchmarks actually makes things slower rather than faster.
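
The linked micro-benchmark isn't reproduced here, but a minimal single-threaded sketch of the same comparison might look like the following (Linux-only, using the `libc` crate; the actual measurement additionally runs this on several threads with disjoint pages, which is what exposes the TLB-shootdown IPIs):

```rust
// Sketch only: compare zeroing 16 pages with memset vs. madvise(DONTNEED).
use std::time::Instant;

fn main() {
    const LEN: usize = 16 * 4096;
    unsafe {
        let ptr = libc::mmap(
            std::ptr::null_mut(),
            LEN,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        assert_ne!(ptr, libc::MAP_FAILED);

        // Touch the pages so they are actually mapped before measuring.
        std::ptr::write_bytes(ptr as *mut u8, 1, LEN);

        let t = Instant::now();
        std::ptr::write_bytes(ptr as *mut u8, 0, LEN); // memset path
        println!("memset:  {:?}", t.elapsed());

        std::ptr::write_bytes(ptr as *mut u8, 1, LEN); // dirty the pages again

        let t = Instant::now();
        assert_eq!(libc::madvise(ptr, LEN, libc::MADV_DONTNEED), 0); // madvise path
        println!("madvise: {:?}", t.elapsed());

        libc::munmap(ptr, LEN);
    }
}
```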

I don't have the spidermonkey.wasm you're testing with, but a random "big module" that I had on hand was rustpython.wasm. This has ~12k functions and a ~6k-element table. I ran the instantiation benchmark with this PR as-is via:

$ cargo bench --features memfd --bench instantiation 'parallel.*pool.*rustpy'

Afterwards I applied this patch, which I believe avoids madvise for instance pages entirely and uses memset instead. Using the same benchmarking command, the timings I got for the two approaches are:

threads    madvise    memset
1          16us       8us
2          113us      15us
3          174us      40us
4          234us      44us

This is more in line with what I am expecting. We're seeing that madvise is atrocious for concurrent performance, so I'm highly surprised that you're seeing madvise as faster and I'm seeing memset as faster. I'm working on the arm64 server that we have but I don't think x86_64 vs aarch64 would explain this difference.

I've confirmed with perf that the hottest symbol in the madvise build is tlb_flush_mmu (as expected), and the hottest symbol in the memset build is memset (also as expected).

Can you try to reproduce these results locally? If not I think we need to figure out what the difference is?

@cfallin (Member, Author) commented on Feb 8, 2022:

I'm working on the arm64 server that we have but I don't think x86_64 vs aarch64 would explain this difference.

Ah, I think it would actually, or more precisely the server specs will: the aarch64 machine we have is 128 cores, so an "IPI to every core" is exceedingly expensive. For reference I'm doing tests on my Ryzen 3900X, 12 cores; big but not "never do a broadcast to all cores or you will suffer" big.

It seems to me that past a certain tradeoff point, which varies based on the system, madvise is faster to flash-zero sparsely accessed memory. (In the limit, the IPI is a fixed cost, flushing the whole TLB is a fixed cost, and actually zeroing the page tables is faster than zeroing whole pages.) Perhaps on arm2-ci that point is where we're zeroing 1MB, or 10MB, or 100MB; it would be interesting to sweep your experiment in that axis.

In any case, the numbers are the numbers; I suspect if I ssh'd to the arm64 machine and ran with your wasm module, I'd replicate yours, and if you ssh'd to my workstation and ran with my wasm module, you'd replicate mine. The interesting bits are the differences in platform configuration and workload I think.

For reference, my spidermonkey.wasm (gzipped) has 31894 functions, so the anyfunc array is 765 kilobytes that need to be zeroed. Three-quarters of a megabyte! We can definitely do better than that I think. (Possibly-exported functions, as you've suggested before -- I still need to build this -- comes to 7420 functions, or 178 KB; better but still too much.)

Given all of that, I do have a proposed direction that I think will solve all of the above issues: I believe that an initialization-bitmap can help us here. If we keep a bitmap to indicate when an anyfunc is not initialized, we need only 3992 bytes (499 u64s) for spidermonkey.wasm. This seems unambiguously better with no downsides, but please do let me know if you disagree :-) I'd be happy to split that part off into a separate PR with its own speedup measurements, leaving this PR with a memset-only approach (which should not be the final word for anyone running SpiderMonkey).
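
For concreteness, the arithmetic behind those numbers, using the 24-byte per-anyfunc size mentioned earlier in this thread:

```rust
// Back-of-envelope for spidermonkey.wasm as quoted above.
const NUM_FUNCS: usize = 31_894;
const ANYFUNC_SIZE: usize = 24;                           // bytes per VMCallerCheckedAnyfunc
const EAGER_ZERO_BYTES: usize = NUM_FUNCS * ANYFUNC_SIZE; // 765_456 bytes, ~765 KB
const BITMAP_WORDS: usize = (NUM_FUNCS + 63) / 64;        // 499 u64 words
const BITMAP_BYTES: usize = BITMAP_WORDS * 8;             // 3_992 bytes
```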

@cfallin (Member, Author) commented on Feb 8, 2022:

I was curious to quantify the difference between our systems a bit more so:

Using this small program I measured that the difference in memset-vs-madvise to clear 16 pages of memory is:

On a 12-core system I get:

64KiB zeroing:

threads    madvise    memset
1          136ns      470ns
2          1.16µs     582ns
4          3.79µs     625ns
8          6µs        646ns

1 MiB zeroing:

threads    madvise    memset
1          363ns      10.635µs
2          1.724µs    11.438µs
4          3.857µs    11.694µs
8          5.554µs    12.308µs

So on my system at least, madvise is a clear loss at 64KiB, but is a win in all cases for 1 MiB. (Wildly enough, it becomes more of a win at higher thread counts, because the total cache footprint of all threads touching their memory exceeds my LLC size.) I haven't bisected the breakeven point between those two.

I'm also kind of flabbergasted at the cost of madvise on arm2-ci; for 64KiB, 8 threads, you see 16 µs while I see 646 ns, about 25 times faster (!). EDIT: 6 µs, about 2.6x faster (table columns are hard, sorry).

So given that my canonical "big module" requires close to a megabyte of VMContext zeroing, and given that we're testing on systems with wildly different TLB-shootdown cost, I'm not surprised at all that we've been led to different conclusions! The variance of workloads and systems "in the field" is all the more reason to not have to zero data at all, IMHO, by using a bitmap :-)

@alexcrichton (Member) commented:

Hm there's still more I'd like to dig into performance-wise here, but I don't want to over-rotate and dedicate this whole thread to a few lines of code that are pretty inconsequential. Additionally I was thinking last night and concluded "why even zero at all?" For the anyfunc array I don't think there's actually any need to zero since it's not accessed in a loop really right now. For example table elements are sort of a next layer of cache so if you hammer on table elements any computation to create the anyfunc is cached at that layer. Otherwise the only other uses of anyfuncs are exports (cached inherently as you typically pull out the export and don't pull it out again-and-again) and as ref.func which isn't really used by modules today. "Always compute anyfunc constructors" may make ref.func slower one day but it seems like we could cross that bridge when we get there.

Effectively I think we could get away with doing nothing to the anyfunc array on instantiation. When an anyfunc is asked for, and every time it's asked for, we construct the anyfunc into the appropriate slot and return it. These aren't ever used concurrently, as we've seen, so there's no need to worry about concurrent writes and it's fine to pave over what's previously there with the same content.
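
A sketch of that "always (re)construct into the slot" shape, with simplified, self-contained types; the real implementation lives on wasmtime_runtime's Instance and differs in detail, and the field and method names here are invented:

```rust
/// Simplified stand-in for wasmtime's `VMCallerCheckedAnyfunc`.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct VMCallerCheckedAnyfunc {
    pub func_ptr: *const u8,
    pub type_index: u32,
    pub vmctx: *mut u8,
}

pub struct Instance {
    anyfuncs: Vec<VMCallerCheckedAnyfunc>, // one slot per function index
    func_ptrs: Vec<*const u8>,             // resolved native code pointers
    sig_ids: Vec<u32>,                      // shared signature ids
    vmctx: *mut u8,
}

impl Instance {
    /// Construct the anyfunc into its slot every time it is requested; the
    /// write is idempotent, and `&mut self` rules out concurrent callers.
    pub fn get_caller_checked_anyfunc(&mut self, index: usize) -> *mut VMCallerCheckedAnyfunc {
        self.anyfuncs[index] = VMCallerCheckedAnyfunc {
            func_ptr: self.func_ptrs[index],
            type_index: self.sig_ids[index],
            vmctx: self.vmctx,
        };
        &mut self.anyfuncs[index] as *mut VMCallerCheckedAnyfunc
    }
}
```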

In the future when we have the instance/table/memory all adjacent in the pooling allocator we may as well zero out the memory with one madvise since it's pretty easy to do (and we're already required to do it for memories/tables). For now though until that happens it should be fine to leave the contents undefined until it's used. We could also implement a sort of "hardening" at some point to use calloc to alloc a VMContext in the on-demand allocator one day, but that doesn't need to be how it's done for now either.

If that seems reasonable then I think it's fine to shelve performance things for later since there's no longer a question of how to zero since we wouldn't need to zero at all.

@cfallin (Member, Author) commented on Feb 8, 2022:

I was thinking last night and concluded "why even zero at all?" For the anyfunc array I don't think there's actually any need to zero since it's not accessed in a loop really right now. For example table elements are sort of a next layer of cache so if you hammer on table elements any computation to create the anyfunc is cached at that layer. Otherwise the only other uses of anyfuncs are exports (cached inherently as you typically pull out the export and don't pull it out again-and-again) and as ref.func which isn't really used by modules today. "Always compute anyfunc constructors" may make ref.func slower one day but it seems like we could cross that bridge when we get there.

Huh, that is a really interesting idea; I very much like the simplicity of it! Thanks!

I can do this, then file an issue to record the "maybe an is-initialized bitmap, or maybe madvise-zero anyfuncs along with the other bits" ideas for if/when we eventually come across ref.func used in the wild in a hotpath.

@cfallin (Member, Author) commented on Feb 9, 2022:

@alexcrichton I've done all of the refactors you suggested (wasmtime::ModuleInner implementing runtime trait directly, no separate Arc-held thing; and moving table init precomputation to compilation in wasmtime_environ) -- things are a lot cleaner now, thanks!

Tomorrow I'll squash this down and look at the complete diff with fresh eyes, and clean it up -- undoubtedly some small diffs have snuck in from attempting and undoing bits that I'll want to remove. I'll do a final round of before/after benchmarking as well to provide a good top-level summary of this PR's effects.

@cfallin cfallin force-pushed the lazy-anyfuncs branch 3 times, most recently from f91380d to c8cfec5 Compare February 9, 2022 05:55
@cfallin (Member, Author) commented on Feb 9, 2022:

OK, I think this is ready for the next (hopefully final? happy to keep going of course) round of review.

Here's a benchmark (in-tree benches/instantiation.rs, using above spidermonkey.wasm, memfd enabled, {1, 16} threads):

BEFORE:

sequential/default/spidermonkey.wasm
                        time:   [68.569 us 68.696 us 68.856 us]
sequential/pooling/spidermonkey.wasm
                        time:   [69.406 us 69.435 us 69.465 us]

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [69.444 us 69.470 us 69.497 us]
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [183.72 us 184.31 us 184.89 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [69.018 us 69.070 us 69.136 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [326.81 us 337.32 us 347.01 us]

WITH THIS PR:

sequential/default/spidermonkey.wasm
                        time:   [6.7821 us 6.8096 us 6.8397 us]
                        change: [-90.245% -90.193% -90.142%] (p = 0.00 < 0.05)
                        Performance has improved.
sequential/pooling/spidermonkey.wasm
                        time:   [3.0410 us 3.0558 us 3.0724 us]
                        change: [-95.566% -95.552% -95.537%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [7.2643 us 7.2689 us 7.2735 us]
                        change: [-89.541% -89.533% -89.525%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [147.36 us 148.99 us 150.74 us]
                        change: [-18.997% -18.081% -17.285%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [3.1009 us 3.1021 us 3.1033 us]
                        change: [-95.517% -95.511% -95.506%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [49.449 us 50.475 us 51.540 us]
                        change: [-85.423% -84.964% -84.465%] (p = 0.00 < 0.05)
                        Performance has improved.

tl;dr: 80-95% faster, or 69µs -> 7µs (1-threaded, on-demand) / 337µs -> 50µs (16-threaded, pooling).

@alexcrichton (Member) left a review comment:

This all looks great to me, thanks again for being patient! I've got a number of stylistic nits below but nothing major really.

One thing I did remember though is that we should probably have at least a smoke benchmark showing that performance with something that uses call_indirect isn't tanking. I don't think it's necessary to have a super rigorous benchmark but I think it'd be good to get a number or two about the runtime of modules in addition to the instantiation time.

That being said the instantiation numbers you're getting are very impressive! It's not every day that we get to take instantiation and make it that much faster. FWIW I was running this locally as well and forgot to turn on memfd, but it meant that I could measure the impact of instantiating a "big module" and it dropped from 2ms to 5us, meaning it's now 400x faster or 99.75% faster after memfd + lazy table init. Certainly nothing to sneeze at!

The message of the squashed commit:
During instance initialization, we build two sorts of arrays eagerly:

- We create an "anyfunc" (a `VMCallerCheckedAnyfunc`) for every function
  in an instance.

- We initialize every element of a funcref table with an initializer to
  a pointer to one of these anyfuncs.

Most instances will not touch (via call_indirect or table.get) all
funcref table elements. And most anyfuncs will never be referenced,
because most functions are never placed in tables or used with
`ref.func`. Thus, both of these initialization tasks are quite wasteful.
Profiling shows that a significant fraction of the remaining
instance-initialization time after our other recent optimizations is
going into these two tasks.

This PR implements two basic ideas:

- The anyfunc array can be lazily initialized as long as we retain the
  information needed to do so. For now, in this PR, we just recreate the
  anyfunc whenever a pointer is taken to it, because doing so is fast
  enough; in the future we could keep some state to know whether the
  anyfunc has been written yet and skip this work if redundant.

  This technique allows us to leave the anyfunc array as uninitialized
  memory, which can be a significant savings. Filling it with
  initialized anyfuncs is very expensive, but even zeroing it is
  expensive: e.g. in a large module, it can be >500KB.

- A funcref table can be lazily initialized as long as we retain a link
  to its corresponding instance and function index for each element. A
  zero in a table element means "uninitialized", and a slowpath does the
  initialization.

Funcref tables are a little tricky because funcrefs can be null. We need
to distinguish "element was initially non-null, but user stored explicit
null later" from "element never touched" (ie the lazy init should not
blow away an explicitly stored null). We solve this by stealing the LSB
from every funcref (anyfunc pointer): when the LSB is set, the funcref
is initialized and we don't hit the lazy-init slowpath. We insert the
bit on storing to the table and mask it off after loading.

We do have to set up a precomputed array of `FuncIndex`s for the table
in order for this to work. We do this as part of the module compilation.

This PR also refactors the way that the runtime crate gains access to
information computed during module compilation.

Performance effect measured with in-tree benches/instantiation.rs, using
SpiderMonkey built for WASI, and with memfd enabled:

```
BEFORE:

sequential/default/spidermonkey.wasm
                        time:   [68.569 us 68.696 us 68.856 us]
sequential/pooling/spidermonkey.wasm
                        time:   [69.406 us 69.435 us 69.465 us]

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [69.444 us 69.470 us 69.497 us]
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [183.72 us 184.31 us 184.89 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [69.018 us 69.070 us 69.136 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [326.81 us 337.32 us 347.01 us]

WITH THIS PR:

sequential/default/spidermonkey.wasm
                        time:   [6.7821 us 6.8096 us 6.8397 us]
                        change: [-90.245% -90.193% -90.142%] (p = 0.00 < 0.05)
                        Performance has improved.
sequential/pooling/spidermonkey.wasm
                        time:   [3.0410 us 3.0558 us 3.0724 us]
                        change: [-95.566% -95.552% -95.537%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [7.2643 us 7.2689 us 7.2735 us]
                        change: [-89.541% -89.533% -89.525%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [147.36 us 148.99 us 150.74 us]
                        change: [-18.997% -18.081% -17.285%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [3.1009 us 3.1021 us 3.1033 us]
                        change: [-95.517% -95.511% -95.506%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [49.449 us 50.475 us 51.540 us]
                        change: [-85.423% -84.964% -84.465%] (p = 0.00 < 0.05)
                        Performance has improved.
```

So an improvement of something like 80-95% for a very large module (7420
functions in its one funcref table, 31928 functions total).
@cfallin (Member, Author) commented on Feb 9, 2022:

Alright, everything addressed and CI green -- time to land this. Thanks again @alexcrichton for all the feedback!

@cfallin cfallin merged commit 39a52ce into bytecodealliance:main Feb 9, 2022
@cfallin cfallin deleted the lazy-anyfuncs branch February 9, 2022 21:56
mpardesh pushed a commit to avanhatt/wasmtime that referenced this pull request Mar 17, 2022