Implement wasmtime GC tracing without libunwind #2459

cfallin · 2020-11-30T20:31:11Z

In wasmtime, GC tracing of reftype pointers currently works by using libunwind to iterate over stack frames, fetching a stackmap for each relevant PC and finding the stack slots with live pointers.

This works perfectly fine, but is potentially slower than we would like, because libunwind relies on DWARF info to understand stack frames. It is also more complex -- it relies on DWARF generation and interpretation to be correct -- which increases risk a little because GC-tracing bugs can lead to various security issues.

In contrast, many other high-performance JITs use explicit data structures of some sort on the stack so that tracing stack roots boils down to walking a linked list of some sort. For example, SpiderMonkey has a strict JIT-frame discipline allowing fast iteration (different from the system ABI), and V8 indirects object references through InstanceHandles that link themselves into a list on the thread context.

We should look into designing a mechanism that maintains a stack of frames reachable from the vmctx and walkable without any metadata (aside from the stackmaps). Two options that come to mind are:

Option 1: Shadow Stack of (SP, Stackmap) Tuples

Maintain a shadow stack (with top and limit pointers in vmctx) of (stackmap, SP) tuples. On function entry, allocate a tuple. At every safepoint, ensure that the stack pointer and stackmap for that safepoint are up-to-date in the tuple. Walking the stack for GC roots then simply requires (i) looping over these tuples, and (ii) tracing references at offsets indicated by the stackmap.

Some advantages of this scheme are:

It is a relatively small delta from today's implementation. It requires inserting code at prologue and epilogue/returns to alloc/dealloc the tuple, and at every safepoint to store the stackmap pointer and SP value.

Some disadvantages of this scheme are:

The stackmap pointer would have to be indirected somehow; we cannot bake the raw pointer value into the code if the code is cached on disk.
It will slightly increase memory traffic at every safepoint, as in addition to the spills of all references inserted by the register allocator, we have to store stackmap and SP pointers.

Option 2: Shadow Stack of References

Maintain a shadow stack that actually stores spilled references. On function entry, bounds-check that there is enough shadow-stack space for the maximal live-set of reftyped values at any safepoint in the function (this is statically known). At any safepoint, push all live reftyped values to the shadow stack; after the safepoint, restore them (if we implement a moving GC that may edit pointers) or bulk-pop them by bumping the top pointer.

Some advantages of this scheme are:

It is independent of reftypes support in Cranelift and regalloc.rs; in other words, it is very simple and easy to verify. While we are pretty confident in the reftypes implementation at least in the backtracking allocator at this point, less complexity is always good; and it lowers the bar for adopting other register allocators in the future, if we choose to do that.
Tracing will be as fast as possible; we literally provide the GC with a &[PointerT] (slice of contiguous live pointers). This is even better than walking a potentially sparse stackmap looking for set bits.
It has no additional memory traffic relative to the status quo (what we do today) unless a reftyped value was already spilled; we are simply replacing regalloc spills with stores to our shadow stack.

A disadvantage of this scheme is:

It adds a little memory traffic when a reftyped value was already spilled at a safepoint: it will be loaded from the spillslot and then pushed onto the shadow stack. Note that this does not need to be explicitly handled (shadow-stack code is inserted before regalloc, so regalloc will just Do The Right Thing and reload the spilled value), but it is suboptimal.

I tentatively prefer Option 2, but I can see both options as viable. Thoughts?

cc @fitzgen

The text was updated successfully, but these errors were encountered:

cfallin · 2020-11-30T20:36:59Z

One advantage of both of these schemes that I missed above is:

There is zero overhead for any frame that has no reference-typed values. Such frames simply don't appear in the shadow stack.

Thus this will likely be an additional performance improvement (on top of faster stack-walking) in code that only sparsely uses reference types.

fitzgen · 2020-11-30T20:46:52Z

I tentatively prefer Option 2, but I can see both options as viable. Thoughts?

Agreed.

We can even use a guard page to avoid explicit capacity checks for option 2, as long as we always push in the right order.

alexcrichton · 2020-11-30T22:56:18Z

The other purpose of dwarf-based unwinding is that we can get a full stack trace on traps, but I think most developers generally expect traps to be slow-ish and not used in performance-sensitive code. In that sense if the main concern here is gc of references and correctness we can focus on just solving that one problem rather than a more general problem of reconstructing the call stack. This is what both proposed solutions basically do (assuming in the first the push is elided if the frame has no stack map), but figured I'd point it out!

One possible way to optimize your first thought would be to first assign a constant number to all stack maps (perhaps function-local too, so when compiling a function we know what number a stack map gets). The linked list entry would then contain a previous pointer, the function's global number, the stack pointer (which presumably doesn't change during function execution), and a slot for the local number within a function. Then when a safepoint happens we'd update the local number slot but that's it, ideally keeping traffic pretty low? (and this would also avoid baking in pointers and it'd be up to the stack walker to translate indices to actual stack maps). This could also presumably all be skipped for functions having no reference types.

I would personally lean towards as little complexity as possible, though, so option 2 sounds pretty good to me as well. One extra possible cost though is that the base pointer of the shadow stack to push to would need to be recalculated after each function call in case the function call relocated it. That may be unconditional new memory traffic we can't get away from?

cfallin · 2020-11-30T23:05:49Z

The other purpose of dwarf-based unwinding is that we can get a full stack trace on traps

Indeed, for maximal clarity I should point out that I wouldn't expect libunwind usage or DWARF info to go away at all; rather, this is just decoupling GC, as you say. Though removing one dependency on the DWARF info might allow us to omit it if GC was the only reason it was generated (i.e., user does not care about debugging or trap callstacks), so could save some memory e.g. in production settings.

[guard page] ... [reload shadow-stack top after return]

Combining these two thoughts led me to: wouldn't it be nice to just build this on the regular stack, where these problems are already solved ... which I think could actually be possible (let's call this "Option 2.5"): push refs on the real stack at any safepoint, in a contiguous block; push the block size; link this chunk into a list of chunks by reading vmctx.ref_chain_head, pushing that, and setting vmctx.ref_chain_head to this chunk. In other words, give up the contiguous-slice-of-live-pointers for linked-list-of-contiguous-chunks, but essentially alloca() those chunks.

I'm indifferent on Option 2.0 vs. Option 2.5 (adds complexity in one place, removes it from another); just thought I would add this to the mix.

alexcrichton · 2020-12-01T15:30:27Z

That's actually a good point that having a switch to turn off dwarf unwinding info would be quite nice!

Anyway back to this isssue, I like the idea of using an alloca-equivalent rather than having to manage two stacks, it should help make the prologue a bit cheaper ideally and is one less stack to manage.

cfallin · 2022-07-28T22:57:45Z

@fitzgen now that #4431 is merged (🎉 ) is this issue subsumed as well? I.e. I wonder if GC tracing works with unwinding info omitted?

cfallin · 2022-07-28T22:58:36Z

(the other part of this issue is I guess discussing ways to avoid the need for stackmaps entirely, but the toplevel problem description is basically just "make do without libunwind")

fitzgen · 2022-08-01T17:32:36Z

Yes, this issue should be resolved, but I'll leave it open until I can verify whether we can disable dynamically registering unwind info without breaking perf or what have you.

fitzgen · 2022-08-04T16:34:52Z

I'm actually going to close this as complete, and handle the unwind info and all that in #4554

fitzgen added the wasm-proposal:reference-types Issues related to the WebAssembly Reference Types proposal label Nov 30, 2020

cfallin mentioned this issue Jan 9, 2021

Detailed debug-info (DWARF) support in new backends (initially x64). #2565

Merged

cfallin mentioned this issue Feb 23, 2021

newBE: value_labels_ranges is very slow #2681

Closed

fitzgen mentioned this issue Apr 6, 2021

Can't unwind wasm frames on aarch64-darwin with system libunwind #2808

Open

cfallin mentioned this issue Oct 11, 2021

Implementation strategy for the Exception Handling proposal #3427

Open

cfallin mentioned this issue Mar 14, 2022

Fold setjmp into Cranelift-generated entry trampolines to WebAssembly #3927

Open

cfallin mentioned this issue Mar 31, 2022

When to remove support for disabling a feature? #3969

Open

fitzgen closed this as completed Aug 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement wasmtime GC tracing without libunwind #2459

Implement wasmtime GC tracing without libunwind #2459

cfallin commented Nov 30, 2020 •

edited

Loading

cfallin commented Nov 30, 2020

fitzgen commented Nov 30, 2020

alexcrichton commented Nov 30, 2020

cfallin commented Nov 30, 2020

alexcrichton commented Dec 1, 2020

cfallin commented Jul 28, 2022

cfallin commented Jul 28, 2022

fitzgen commented Aug 1, 2022

fitzgen commented Aug 4, 2022

Implement wasmtime GC tracing without libunwind #2459

Implement wasmtime GC tracing without libunwind #2459

Comments

cfallin commented Nov 30, 2020 • edited Loading

Option 1: Shadow Stack of (SP, Stackmap) Tuples

Option 2: Shadow Stack of References

cfallin commented Nov 30, 2020

fitzgen commented Nov 30, 2020

alexcrichton commented Nov 30, 2020

cfallin commented Nov 30, 2020

alexcrichton commented Dec 1, 2020

cfallin commented Jul 28, 2022

cfallin commented Jul 28, 2022

fitzgen commented Aug 1, 2022

fitzgen commented Aug 4, 2022

cfallin commented Nov 30, 2020 •

edited

Loading