Implement lazy funcref table and anyfunc initialization. #3733
Conversation
Force-pushed from 1c67bf0 to 1573c88
Benchmarking: the delta is hard to see with memory initialization taking most of the time (pre-#3697); a few percent improvement in the mean. However, in #3697 I measured approximately a factor-of-two improvement in […]
Force-pushed from 1573c88 to f45f689
Thanks for splitting this out! I've mentioned a few things here in other contexts but I think it's good to get them all in one place. The tl;dr is that I don't personally feel like this is the best approach, but I think we can still get the same wins.

Due to a number of implementation details I don't think that this PR is really getting all that much of a benefit from laziness. The anyfunc array is used as storage for a module's […] Adding all that together, we still pay the full cost of anyfunc initialization for all table entries when a module is instantiated. The cost of initializing exports is done lazily, and most modules nowadays don't have […]

Given all that, I personally think that a better implementation would be to avoid the laziness complications and instead just go ahead and keep doing what we do today, only over the same set of functions that this PR is doing it for. Basically I think we should shrink the anyfunc array to the size of the […]

As a few other side things:
OK, now all that being said, one thing you mentioned the other day which I'm coming around to is that long-term I think it would be best to lazily initialize tables instead of just this anyfunc array. With CoW and uffd we're already effectively lazily initializing memory (just relying on the kernel to do so), and I think it would be best to do that for tables as well. I don't really know how we'd do this off the top of my head, though. Alternatively we could try to figure out how to make the table initialization much more "static", where work is done at compile time instead of instantiation time. I had one idea over here, but that's sort of half-baked and only shrinks […]
Force-pushed from f45f689 to 4103e81
@alexcrichton thanks for your thoughts here! Top-level: I agree that in practice, eagerly initializing a smaller array-of-anyfuncs which is only "anyfuncs referenced in tables or by ref.func" is more-or-less equivalent to this PR's lazy approach. (In the presence of many […])

The main attraction I had to this approach was simplicity and small impact -- indexing via […]

The other thing about a pervasively lazy approach is that it paves the way for more gains later, as I think you're getting at with your second-to-last paragraph. If we're lazy about requesting anyfunc pointers (i.e. not initializing table elements until accessed), then the laziness in filling in the pointed-to anyfuncs finally pays off. And I think the change to get to this point is pretty simple -- just a null check when loading from the table and a slowpath that calls […]

Of course we could do both, and that has benefits too -- both shrinking the size of the vmcontext, and maybe doing things eagerly at first, do not preclude us from lazily initializing later...

A specific note too:
Yes, exactly, the goal is to take advantage of a benign race (and come out with correct results regardless). Maybe to make the compiler happy we need to do this all through atomics and […] The comment is attempting to give a safety argument why this is still OK in the presence of concurrent accesses; we should do whatever we need to at the language-semantics level, but I think I've convinced myself that the resulting machine code / access pattern is sound :-)
Force-pushed from 4103e81 to 40ce633
@alexcrichton I've reformulated the benign-racy init path now to use atomics for the bitmap accesses at least. There is still a (very intentional!) race, in that the bit-set is not done with an atomic […]
I would personally not say that this PR is simple with a small impact, due to a few consequences:
Overall I continue to feel that this is a fair bit of complication, especially with […]

I'm less certain about the idea of paving the way for gains later as well. I understand how this is sort of theoretically applicable, but we're talking about nanoseconds-per-element "laziness", which feels like too small a granularity to manage to me. If we get around to implementing a fully-lazily-initialized table approach (which I'm not entirely sure how we'd do), it seems like it might be at a more bulk level instead of at a per-element level. I would also expect that whatever laziness mechanism that would use would subsume this one, since there's not really much need for cascading laziness. Naively I would expect that using […]
I believe this is still a theoretical data race that Rust disallows. Even though the machine code works just fine, at the LLVM layer any concurrent unsynchronized writes are a data race and undefined behavior, so any concurrent writes would have to be atomic. (Again, though, I don't think it's worthwhile trying to fix this, because I don't think this overall laziness approach is what we want either.)
@alexcrichton thanks for your comments!
So, a few things:
Otherwise, simplicity is subjective, I guess :-) The factoring of state here is, I think, going to be necessary for most approaches that initialize this general state after instantiation -- we probably can't get away from having e.g. signature info at libcall time. I won't dispute that the concurrency bit here is complex -- maybe the better answer is to put together an actual lazy-table-init implementation, and then get rid of the initialization bitmap and go back to the zero-func-ptr-means-uninitialized design I had earlier.

Maybe it would help if I "drew the rest of the owl" here and actually sketched lazy table initialization as well? This would perhaps make it a bit clearer how all of this could work together, at least as I'm imagining it.
Ah, so my thinking of a more bulk-style initialization was inspired by what we're doing for memories, where we fault in by page rather than all at once. I'm sort of naively assuming that some level of granularity would also be similar on tables, but byte access patterns may not really have any correlation to wasm-table-access patterns, so it may just be a flawed assumption on my part. Also I think I understand now more how something like this would be required for full laziness. I'm still somewhat skeptical that it would need to look precisely like this, but I see how my thinking of cascading laziness being unnecessary is flawed.

And finally, sorry, I meant to respond to the table point but I forgot! That does sound workable, but I think we still need to be careful. For example, if an element segment initializes an imported table, I don't think that we can skip that. Additionally we'd have to be careful to implement […]

I think that my preference at this point is to largely let performance guide us. The performance win in this PR is to not initialize anyfuncs for everything that never needs an anyfunc. There's still some unrealized performance wins though:
Given that this is all performance work, effectively, I think I'd prefer to be guided by numbers. I would be curious to benchmark this implementation here vs one that shrinks the anyfunc array like I've been thinking. The performance difference may come out in the wash, but given how small the numbers are here I could imagine it being significant. I think we could then use a similar benchmark for measuring a fully-lazy-table approach.

Overall I'm still hesitant to commit to design work which will largely only come to fruition in the future when the future is still somewhat uncertain (e.g. the precise design of a fully-lazy-table approach). In that sense I'd prefer to get the win we'll always want in any case, shrinking the VMContext, here, and work on the laziness afterwards. (Or doing it all at once is fine, but I'd prefer not to be in an in-between state where we still have a big VMContext and some of it is initialized at instantiation time.)
Force-pushed from 40ce633 to 8965c2e
I was almost done with my Friday when a thought occurred to me -- we can just make the anyfunc fields atomics as well and do relaxed-ordering loads/stores. This satisfies the language's memory model: now all racing accesses are to atomics, and there is a Release-to-Acquire edge on the bitmap update that ensures the relaxed stores during init are seen by relaxed loads during normal use. This part at least should be fully theoretically sound now, I think.

I'll do the rest of the lazy-table init as it is in my head on Monday, just to sketch it and be able to measure, I think. This work will be compatible with / complementary to any work that reduces the number of anyfuncs as well...
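For illustration, here is a minimal, self-contained sketch of the scheme described above -- relaxed atomic fields published by a Release bit-set and consumed after an Acquire bit-test. The names and placeholder values are hypothetical, not the PR's actual types:

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

// Hypothetical stand-ins for the anyfunc fields and the init bitmap.
struct LazyAnyfunc {
    func_ptr: AtomicUsize,
    type_index: AtomicUsize,
}

struct Anyfuncs {
    bitmap: Vec<AtomicU64>,
    funcs: Vec<LazyAnyfunc>,
}

impl Anyfuncs {
    fn get(&self, i: usize) -> (usize, usize) {
        let (word, bit) = (i / 64, 1u64 << (i % 64));
        // Acquire: if we observe the bit as set, we also observe the
        // Relaxed field stores that happened before the Release below.
        if (self.bitmap[word].load(Ordering::Acquire) & bit) == 0 {
            // Benign race: concurrent initializers write identical values,
            // and all racing accesses go through atomics.
            self.funcs[i].func_ptr.store(0xdead_beef, Ordering::Relaxed);
            self.funcs[i].type_index.store(7, Ordering::Relaxed);
            // Release publishes the field stores along with the bit.
            self.bitmap[word].fetch_or(bit, Ordering::Release);
        }
        (
            self.funcs[i].func_ptr.load(Ordering::Relaxed),
            self.funcs[i].type_index.load(Ordering::Relaxed),
        )
    }
}
```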
Er, sorry, but to reiterate again: I do not think anything should be atomic here. This method is statically not possible to call concurrently (or so I believe), so I do not think we should be "working around" the compiler in this case. Instead I think the method should change to […]
Ah, OK; I had been assuming that concurrent calls actually are possible via the "get an export" path. Within […]
Force-pushed from 8965c2e to 067eb47
I've implemented lazy table initialization now (mostly; there's one odd test failure with linking that I need to resolve). It could probably be cleaned up somewhat, but here's one initial datapoint, with spidermonkey.wasm:
So, 72µs to 22µs, or a 69% reduction.
I've left some initial comments below, but the biggest one is that I don't think this is lazy at least in the way I thought it would be lazy. Instantiating a module locally still seems to apply all element segment initializers on instantiation rather than lazily computing them after start-up.
Otherwise though I feel that it's best to apply a refactoring, either before this PR or in this PR, about all the little `Arc` allocations that are popping up. I think they should all be housed in one "unit" to avoid having to piecemeal hand everything from the `wasmtime` crate to the `wasmtime_runtime` crate as separate `Arc` allocations.
Overall though this approach seems pretty reasonable to me. The codegen changes in `func_environ.rs` all look as I'd expect from a lazily-initialized scheme.
As a final note, though, I feel like this is significant enough that we should really shrink the size of the `VMCallerCheckedAnyfunc` array either before this PR or during this PR. Especially if we're still `memset`-ing the entire array to 0 during instantiation for the on-demand allocator, shrinking this array seems like it will be quite beneficial because it's less memory to initialize. If you'd like, though, I can work on this change independently.
/// here to fields that should be initialized to zero. This allows the
/// caller to use an efficient means of zeroing memory (such as using
/// anonymous-mmap'd zero pages) if available, rather than falling
/// back onto a memset (or the manual equivalent) here.
Right now I think that this may be detrimental to the on-demand allocator, because it either does a too-large `memset` for the entire `VMContext` or it will need to call `calloc`, which also does a too-large `memset`.
An alternative design, though, would be to pass a flag into this function indicating whether the memory is known to be zeroed. That way we could conditionally zero out the array of `VMCallerCheckedAnyfunc` contents based on whether it was pre-zeroed already, making the pooling allocator efficiently use `madvise` while the on-demand allocator still initializes as little memory as possible.
After saying this, though, I think it may actually be best to take this incrementally? What do you think about using `memset` here to zero memory and benchmarking later how much faster using `madvise` is for the pooling allocator? I'm a little worried about the `madvise` traffic being increased for the pooling allocator, since we already know it's a source of slow TLB shootdowns, and I'm afraid that any cost of that will be hidden by the other performance gains in this PR, so we can't accurately measure the strategy of `memset`-on-initialize vs `madvise`-to-zero.
Modified to take a `prezeroed` flag as suggested -- good idea!
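For concreteness, a sketch of what the `prezeroed` flag buys (hypothetical signature and names; the real function carries more state):

```rust
/// Zero the region of the VMContext that the code relies on being zero --
/// but only when the allocator couldn't hand us already-zeroed memory.
///
/// Hypothetical sketch: `base`/`len` stand in for the anyfunc-array region.
unsafe fn initialize_anyfunc_region(base: *mut u8, len: usize, prezeroed: bool) {
    if !prezeroed {
        // On-demand allocator: arbitrary heap bytes, so memset exactly the
        // region we rely on being zero, and nothing more.
        unsafe { std::ptr::write_bytes(base, 0, len) };
    }
    // Pooling allocator: the pages came from madvise(DONTNEED) or a fresh
    // anonymous mmap and are already zero, so the memset is skipped.
}
```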
I'm a little hesitant to settle on a `memset`-only design even for the pooling allocator, as my earlier measurements with a large module (spidermonkey.wasm) showed that it became really quite hot; that's why I had gone to the initialized-bitmap instead (1 bit vs 24 bytes to zero --> 192x less). But I will benchmark this to be sure on Monday.
Alternatively, I think that if we move to a single-slot-per-instance design for the pooling allocator, such that we have one `madvise` for everything (memories, tables), stretching that range to include a few extra pages for the Instance and VMContext should be no big deal (it will share one mmap-lock acquire and one set of IPIs), so I suspect that madvise cost may not have to be a huge consideration in the future. Of course, getting there implies removing uffd first, since it depends on the separate-pools design, so we would have to make that decision first.
Force-pushed from 067eb47 to 1a0ae24
@alexcrichton (and anyone else watching) argh, sorry about that: I pushed my changes prior to last night's measurements to a private save-the-work-in-progress fork that I keep, but not here; just pushed now. Regardless, the review comments are useful and I'll keep refining this!
Force-pushed from 352f990 to fede8a8
For the benchmark numbers, I posted #3775, which I think will tweak the benchmark to better serve measuring this. Running that locally for a large module, I'm seeing the expected exponential slowdown for the pooling allocator, and, also as expected, the mmap allocator is behaving better here due to fewer IPIs being needed (as it's all contended on the VMA lock). Could you try benchmarking with that?
Sure; here's a benchmarking run using your updated […]

So, switching to an explicit memset, we see performance improvements in all cases except the large SpiderMonkey module (31k functions in my local build), where using explicit memset is 5.49x slower (+449%). Either way, it's clear to me that we'll feel some pain on the low end (madvise) or high end (memset), so the ultimate design probably has to incorporate this flash-clearing into a madvise we're already doing (the single-madvise idea). I can implement that as soon as we confirm that we want to remove uffd. I guess the question is just which we settle on in the meantime :-)
Hmm, actually, the other option is to bring back the bitmap approach from the first versions of this PR above. I think the atomics scared everyone off, but by now I've propagated […]
Force-pushed from d46057a to b9e160e
I'm a bit perplexed by the numbers I'm seeing locally. Using this small program I measured that the difference in memset-vs-madvise to clear 16 pages of memory is: […]

This is what I'd expect, which is that memset has a relatively constant cost where […] So it doesn't really make sense to me that using […] I don't have the […]
Afterwards I applied this patch, which I believe avoids madvise for instance pages entirely and uses memset instead. Using the same benchmarking command, the timings I got for the two approaches are:
This is more in line with what I am expecting. We're seeing that madvise is atrocious for concurrent performance, so I'm highly surprised that you're seeing madvise as faster while I'm seeing memset as faster. I'm working on the arm64 server that we have, but I don't think x86_64 vs aarch64 would explain this difference. I've confirmed with […] Can you try to reproduce these results locally? If not, I think we need to figure out what the difference is.
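For reference, here is a rough Rust rendering of that kind of memset-vs-madvise microbenchmark (assumes Linux and the `libc` crate; this is not the actual program used above):

```rust
use std::time::Instant;

fn main() {
    const PAGES: usize = 16;
    let len = PAGES * 4096;
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED);
    let iters = 10_000u32;

    // memset variant: dirty the pages, then zero them with a plain memset.
    let start = Instant::now();
    for _ in 0..iters {
        unsafe {
            std::ptr::write_bytes(ptr as *mut u8, 1, len); // dirty
            std::ptr::write_bytes(ptr as *mut u8, 0, len); // zero
        }
    }
    println!("memset:  {:?}/iter", start.elapsed() / iters);

    // madvise variant: dirty the pages, then hand them back to the kernel;
    // the next write faults in fresh zero pages (that cost is measured too).
    let start = Instant::now();
    for _ in 0..iters {
        unsafe {
            std::ptr::write_bytes(ptr as *mut u8, 1, len); // dirty
            assert_eq!(libc::madvise(ptr, len, libc::MADV_DONTNEED), 0);
        }
    }
    println!("madvise: {:?}/iter", start.elapsed() / iters);
}
```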
Force-pushed from ee9c7a6 to 7fd001f
Ah, I think it would actually, or more precisely the server specs will: the aarch64 machine we have is 128 cores, so an "IPI to every core" is exceedingly expensive. For reference, I'm doing tests on my Ryzen 3900X, 12 cores; big, but not "never do a broadcast to all cores or you will suffer" big. It seems to me that past a certain tradeoff point, which varies based on the system, madvise is faster to flash-zero sparsely accessed memory. (In the limit, the IPI is a fixed cost, flushing the whole TLB is a fixed cost, and actually zeroing the page tables is faster than zeroing whole pages.) Perhaps on arm2-ci that point is where we're zeroing 1MB, or 10MB, or 100MB; it would be interesting to sweep your experiment in that axis.

In any case, the numbers are the numbers; I suspect if I ssh'd to the arm64 machine and ran with your wasm module, I'd replicate yours, and if you ssh'd to my workstation and ran with my wasm module, you'd replicate mine. The interesting bits are the differences in platform configuration and workload, I think. For reference, my spidermonkey.wasm (gzipped) has 31894 functions, so the anyfunc array is 765 kilobytes that need to be zeroed. Three-quarters of a megabyte! We can definitely do better than that, I think. (Possibly-exported functions, as you've suggested before -- I still need to build this -- comes to 7420 functions, or 178 KB; better but still too much.)

Given all of that, I do have a proposed direction that I think will solve all of the above issues: I believe that an initialization-bitmap can help us here. If we keep a bitmap to indicate when an anyfunc is not initialized, we need only 3992 bytes (499 […]
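For concreteness, the arithmetic behind those bitmap numbers (a back-of-envelope sketch):

```rust
// One "is-initialized" bit per function, rounded up to whole u64 words.
fn bitmap_bytes(num_funcs: usize) -> usize {
    ((num_funcs + 63) / 64) * 8
}

fn main() {
    // spidermonkey.wasm: 31894 functions.
    assert_eq!(bitmap_bytes(31894), 3992); // 499 u64 words to zero
    assert_eq!(31894 * 24, 765_456);       // vs ~765 KB of 24-byte anyfuncs
}
```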
I was curious to quantify the difference between our systems a bit more, so:
On a 12-core system I get: 64KiB zeroing:
1 MiB zeroing:
So on my system at least, madvise is a clear loss at 64 KiB, but is a win in all cases for 1 MiB. (Wildly enough, it becomes more of a win at higher thread counts, because the total cache footprint of all threads touching their memory exceeds my LLC size.) I haven't bisected the break-even point between those two. I'm also kind of flabbergasted at the cost of madvise on arm2-ci; for 64 KiB, 8 threads, you see 16 µs while I see […]

So given that my canonical "big module" requires close to a megabyte of VMContext zeroing, and given that we're testing on systems with wildly different TLB-shootdown costs, I'm not surprised at all that we've been led to different conclusions! The variance of workloads and systems "in the field" is all the more reason to not have to zero data at all, IMHO, by using a bitmap :-)
Hm, there's still more I'd like to dig into performance-wise here, but I don't want to over-rotate and dedicate this whole thread to a few lines of code that are pretty inconsequential. Additionally, I was thinking last night and concluded: why even zero at all? For the anyfunc array I don't think there's actually any need to zero, since it's not really accessed in a loop right now. For example, table elements are sort of a next layer of cache, so if you hammer on table elements, any computation to create the anyfunc is cached at that layer. Otherwise the only other uses of anyfuncs are exports (cached inherently, as you typically pull out the export and don't pull it out again and again) and as […]

Effectively, I think we could get away with doing nothing to the anyfunc array on instantiation. When an anyfunc is asked for, and every time it's asked for, we construct the anyfunc into the appropriate slot and return it. These aren't ever used concurrently, as we've seen, so there's no need to worry about concurrent writes, and it's fine to pave over what's previously there with the same content. In the future when we have the instance/table/memory all adjacent in the pooling allocator, we may as well zero out the memory with one […]

If that seems reasonable, then I think it's fine to shelve performance things for later, since there's no longer a question of how to zero: we wouldn't need to zero at all.
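A minimal sketch of that "construct it every time" shape (the names are stand-ins, not the actual wasmtime internals):

```rust
// Stand-in for wasmtime's real anyfunc struct.
struct VMCallerCheckedAnyfunc {
    func_ptr: *const u8,
    type_index: u32,
    vmctx: *mut u8,
}

struct Instance {
    // In the real VMContext this is raw, never-zeroed memory; a Vec of
    // garbage-tolerant slots stands in for it here.
    anyfuncs: Vec<VMCallerCheckedAnyfunc>,
}

impl Instance {
    // &mut self: the borrow checker guarantees no concurrent callers, so
    // no atomics or bitmap are needed.
    fn get_caller_checked_anyfunc(&mut self, index: usize) -> *const VMCallerCheckedAnyfunc {
        // Unconditionally (re)write the slot: paving over garbage or a
        // previous identical write is always correct, since the computed
        // contents are the same every time.
        self.anyfuncs[index] = VMCallerCheckedAnyfunc {
            func_ptr: self.lookup_func_ptr(index),
            type_index: self.lookup_type_index(index),
            vmctx: std::ptr::null_mut(), // would point at this instance's vmctx
        };
        &self.anyfuncs[index]
    }

    // Hypothetical lookups into compiled-module metadata.
    fn lookup_func_ptr(&self, _index: usize) -> *const u8 { std::ptr::null() }
    fn lookup_type_index(&self, _index: usize) -> u32 { 0 }
}
```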
Huh, that is a really interesting idea; I very much like the simplicity of it! Thanks! I can do this, then file an issue to record the "maybe an is-initialized bitmap, or maybe madvise-zero anyfuncs along with the other bits" ideas for if/when we do eventually come across […]
Force-pushed from a31b1f8 to f3aa888
@alexcrichton I've done all of the refactors you suggested ([…]). Tomorrow I'll squash this down and look at the complete diff with fresh eyes, and clean it up -- undoubtedly some small diffs have snuck in from attempting and undoing bits that I'll want to remove. I'll do a final round of before/after benchmarking as well, to provide a good top-level summary of this PR's effects.
Force-pushed from f91380d to c8cfec5
OK, I think this is ready for the next (hopefully final? happy to keep going of course) round of review. Here's a benchmark (in-tree benches/instantiation.rs, using above spidermonkey.wasm, memfd enabled, {1, 16} threads):
tl;dr: 80-95% faster, or 69µs -> 7µs (1-threaded, on-demand) / 337µs -> 50µs (16-threaded, pooling).
This all looks great to me, thanks again for being patient! I've got a number of stylistic nits below but nothing major really.
One thing I did remember, though, is that we should probably have at least a smoke benchmark showing that performance with something that uses `call_indirect` isn't tanking. I don't think it's necessary to have a super rigorous benchmark, but I think it'd be good to get a number or two about the runtime of modules in addition to the instantiation time.
That being said, the instantiation numbers you're getting are very impressive! It's not every day that we get to take instantiation and make it that much faster. FWIW I was running this locally as well and forgot to turn on memfd, but it meant that I could measure the impact of instantiating a "big module", and it dropped from 2ms to 5µs, meaning it's now 400x faster, or 99.75% faster, after memfd + lazy table init. Certainly nothing to sneeze at!
Force-pushed from e7ed67e to c2939e6
During instance initialization, we build two sorts of arrays eagerly:

- We create an "anyfunc" (a `VMCallerCheckedAnyfunc`) for every function in an instance.
- We initialize every element of a funcref table with an initializer to a pointer to one of these anyfuncs.

Most instances will not touch (via call_indirect or table.get) all funcref table elements. And most anyfuncs will never be referenced, because most functions are never placed in tables or used with `ref.func`. Thus, both of these initialization tasks are quite wasteful. Profiling shows that a significant fraction of the remaining instance-initialization time after our other recent optimizations is going into these two tasks.

This PR implements two basic ideas:

- The anyfunc array can be lazily initialized as long as we retain the information needed to do so. For now, in this PR, we just recreate the anyfunc whenever a pointer is taken to it, because doing so is fast enough; in the future we could keep some state to know whether the anyfunc has been written yet and skip this work if redundant. This technique allows us to leave the anyfunc array as uninitialized memory, which can be a significant savings. Filling it with initialized anyfuncs is very expensive, but even zeroing it is expensive: e.g. in a large module, it can be >500KB.
- A funcref table can be lazily initialized as long as we retain a link to its corresponding instance and function index for each element. A zero in a table element means "uninitialized", and a slowpath does the initialization.

Funcref tables are a little tricky because funcrefs can be null. We need to distinguish "element was initially non-null, but user stored explicit null later" from "element never touched" (i.e. the lazy init should not blow away an explicitly stored null). We solve this by stealing the LSB from every funcref (anyfunc pointer): when the LSB is set, the funcref is initialized and we don't hit the lazy-init slowpath. We insert the bit on storing to the table and mask it off after loading. We do have to set up a precomputed array of `FuncIndex`s for the table in order for this to work; we do this as part of the module compilation.

This PR also refactors the way that the runtime crate gains access to information computed during module compilation.

Performance effect measured with in-tree benches/instantiation.rs, using SpiderMonkey built for WASI, and with memfd enabled:

```
BEFORE:

sequential/default/spidermonkey.wasm
        time:   [68.569 us 68.696 us 68.856 us]
sequential/pooling/spidermonkey.wasm
        time:   [69.406 us 69.435 us 69.465 us]

parallel/default/spidermonkey.wasm: with 1 background thread
        time:   [69.444 us 69.470 us 69.497 us]
parallel/default/spidermonkey.wasm: with 16 background threads
        time:   [183.72 us 184.31 us 184.89 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
        time:   [69.018 us 69.070 us 69.136 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
        time:   [326.81 us 337.32 us 347.01 us]

WITH THIS PR:

sequential/default/spidermonkey.wasm
        time:   [6.7821 us 6.8096 us 6.8397 us]
        change: [-90.245% -90.193% -90.142%] (p = 0.00 < 0.05)
        Performance has improved.
sequential/pooling/spidermonkey.wasm
        time:   [3.0410 us 3.0558 us 3.0724 us]
        change: [-95.566% -95.552% -95.537%] (p = 0.00 < 0.05)
        Performance has improved.

parallel/default/spidermonkey.wasm: with 1 background thread
        time:   [7.2643 us 7.2689 us 7.2735 us]
        change: [-89.541% -89.533% -89.525%] (p = 0.00 < 0.05)
        Performance has improved.
parallel/default/spidermonkey.wasm: with 16 background threads
        time:   [147.36 us 148.99 us 150.74 us]
        change: [-18.997% -18.081% -17.285%] (p = 0.00 < 0.05)
        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 1 background thread
        time:   [3.1009 us 3.1021 us 3.1033 us]
        change: [-95.517% -95.511% -95.506%] (p = 0.00 < 0.05)
        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 16 background threads
        time:   [49.449 us 50.475 us 51.540 us]
        change: [-85.423% -84.964% -84.465%] (p = 0.00 < 0.05)
        Performance has improved.
```

So an improvement of something like 80-95% for a very large module (7420 functions in its one funcref table, 31928 functions total).
Force-pushed from c2939e6 to c841cbe
Alright, everything addressed and CI green -- time to land this. Thanks again @alexcrichton for all the feedback!
During instance initialization, we build two sorts of arrays eagerly:
- We create an "anyfunc" (a `VMCallerCheckedAnyfunc`) for every function in an instance.
- We initialize every element of a funcref table with an initializer to a pointer to one of these anyfuncs.
Most instances will not touch (via call_indirect or table.get) all
funcref table elements. And most anyfuncs will never be referenced,
because most functions are never placed in tables or used with `ref.func`.
Thus, both of these initialization tasks are quite wasteful.

Profiling shows that a significant fraction of the remaining
instance-initialization time after our other recent optimizations is
going into these two tasks.
This PR implements two basic ideas:
- The anyfunc array can be lazily initialized as long as we retain the information needed to do so. A zero in the func-ptr part of the tuple means "uninitialized"; a null-check and slowpath does the initialization whenever we take a pointer to an anyfunc.
- A funcref table can be lazily initialized as long as we retain a link to its corresponding instance and function index for each element. A zero in a table element means "uninitialized", and a slowpath does the initialization.
The use of all-zeroes to mean "uninitialized" means that we can use fast
memory clearing techniques, like madvise(DONTNEED) on Linux or just
freshly-mmap'd anonymous memory, to get to the initial state without
a lot of memory writes.
Funcref tables are a little tricky because funcrefs can be null. We need
to distinguish "element was initially non-null, but user stored explicit
null later" from "element never touched" (ie the lazy init should not
blow away an explicitly stored null). We solve this by stealing the LSB
from every funcref (anyfunc pointer): when the LSB is set, the funcref
is initialized and we don't hit the lazy-init slowpath. We insert the
bit on storing to the table and mask it off after loading.
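A sketch of that tagging scheme (assuming, as the alignment of the anyfunc struct guarantees, that the pointer's low bit is always free; names are illustrative, not the actual wasmtime code):

```rust
struct VMCallerCheckedAnyfunc { _fields: [usize; 3] } // stand-in

const FUNCREF_INIT_BIT: usize = 1;

// On table.set (or the lazy-init slowpath): set the LSB so the element
// reads as "initialized". A tagged null (the value 1) encodes "user
// explicitly stored null", distinct from the all-zeroes "never touched".
fn tag(funcref: *const VMCallerCheckedAnyfunc) -> usize {
    (funcref as usize) | FUNCREF_INIT_BIT
}

// On table.get / call_indirect: zero means "never touched" (run lazy init);
// otherwise mask the bit off to recover the real (possibly null) pointer.
fn untag(elem: usize) -> Option<*const VMCallerCheckedAnyfunc> {
    if elem == 0 {
        None // uninitialized: caller takes the lazy table-init slowpath
    } else {
        Some((elem & !FUNCREF_INIT_BIT) as *const VMCallerCheckedAnyfunc)
    }
}
```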
Performance effect on instantiation in the on-demand allocator (pooling
allocator effect should be similar as the table-init path is the same):
So, 72µs to 22µs, or a 69% reduction.