Wasm tests: add typed-funcref test showing example of desirable optimizations. #8158

cfallin · 2024-03-17T03:29:16Z

In order to have fast IC (inline cache) chains in AOT-compiled dynamic language Wasms, it would be great if we could make the "call to a typed funcref at a constant table index" pattern fast.

This use-case was discussed at the most recent Wasmtime biweekly and @jameysharp is working on some optimizations; the intent of this PR is to provide a concrete test-case whose blessed output we can see improve over time.

In particular, the following opts are still desirable:

With the use of non-nullable typed funcrefs, there shouldn't be a null check (there currently is, as noted by a comment in the code due to lack of type information at the right spot).
With the use of a constant table size and a constant index to the table.get, we should be able to load from the table without a bounds-check or any Spectre masking.

Other further optimizations for this pattern might be possible if we rearrange the table and function-reference data structures, and the lazy-initialization scheme thereof, but the above should be agnostic to that.

@jameysharp

…izations. In order to have fast IC (inline cache) chains in AOT-compiled dynamic language Wasms, it would be great if we could make the "call to a typed funcref at a constant table index" pattern fast. This use-case was discussed at the most recent Wasmtime biweekly and @jameysharp is working on some optimizations; the intent of this PR is to provide a concrete test-case whose blessed output we can see improve over time. In particular, the following opts are still desirable: - With the use of non-nullable typed funcrefs, there shouldn't be a null check (there currently is, as noted by a comment in the code due to lack of type information at the right spot). - With the use of a constant table size and a constant index to the `table.get`, we should be able to load from the table without a bounds-check or any Spectre masking. Other further optimizations for this pattern might be possible if we rearrange the table and function-reference data structures, and the lazy-initialization scheme thereof, but the above should be agnostic to that.

jameysharp

"Chris," I whine to myself, "stop working on a Saturday evening," as I approve your PR at 10pm

Thanks for this! It's definitely the kind of thing I was hoping for.

This example module initializes element zero of the table to point to a function, then calls through elements one and two. What is that supposed to do? Or am I misreading it?

alexcrichton · 2024-03-17T16:14:46Z

(table $ic-sites 100 100 (ref $ic-stub) (ref.func $ic1))

This example module initializes element zero of the table to point to a function

Oh this syntax is actually initializing the entire table with (ref.func $ic1), it's new syntax/binary stuff introduced as part of the function-references proposal (basically folded into the gc proposal)

Somewhat orthogonal question as well: I forget if I asked this in the past, but is the reason to use a typed table instead of typed globals to get the lazy initialization? If I change the above test to use two typed function globals it ends up codegenning (today)

function u0:1(i64 vmctx, i64, i32, i32, i32, i32) -> i32 fast {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned readonly gv0+8
    gv2 = load.i64 notrap aligned gv1
    gv3 = vmctx
    sig0 = (i64 vmctx, i64, i32, i32, i32, i32) -> i32 fast
    sig1 = (i64 vmctx, i32 uext, i32 uext) -> i32 uext system_v
    sig2 = (i64 vmctx, i32 uext) -> i32 uext system_v
    stack_limit = gv2

                                block0(v0: i64, v1: i64, v2: i32, v3: i32, v4: i32, v5: i32):
                                    v8 -> v0
                                    v14 -> v0
@003b                               v9 = load.i64 notrap aligned table v0+80
@003d                               brif v9, block3, block2

                                block2 cold:
@003d                               trap null_reference

                                block3:
@003d                               v10 = load.i64 notrap aligned readonly v9+16
@003d                               v11 = load.i64 notrap aligned readonly v9+32
@003d                               v12 = call_indirect sig0, v10(v11, v0, v2, v3, v4, v5)
@004c                               v15 = load.i64 notrap aligned table v0+96
@004e                               brif v15, block5, block4

                                block4 cold:
@004e                               trap null_reference

                                block5:
@004e                               v16 = load.i64 notrap aligned readonly v15+16
@004e                               v17 = load.i64 notrap aligned readonly v15+32
@004e                               v18 = call_indirect sig0, v16(v17, v0, v2, v3, v4, v5)
@0057                               jump block1

                                block1:
@0052                               v19 = iadd.i32 v18, v12
                                    v6 -> v19
@0057                               return v19
}

which is pretty good given that we've got #5291 in our pocket

jameysharp · 2024-03-17T16:45:08Z

Thanks Alex, now I understand!

I'd forgotten about #5291; we just added a different use of memflags to indicate trap code so we have precedent now. But besides that, if I understood Chris correctly, this example is supposed to be typed as a non-nullable function reference. So if we thread that type information through correctly then we should be able to elide the null-check even without #5291, right?

cfallin · 2024-03-17T17:09:42Z

Indeed, this initializes everything then calls ICs 1 and 2, which were meant to be arbitrary -- I'll add some comments to make it clearer!

is the reason to use a typed table instead of typed globals to get the lazy initialization?

not quite; I realized once I progressed past "IC caller" logic to "IC update" logic that while AOT-compiling can produce bodies that have statically different code locations per IC head, the update logic is shared (polymorphic over IC index) and there's no "set global N to V" instruction. I could potentially finagle the IC-stub ABI to return a new funcref and always update in the statically-unique callsite sequence or something, but that's a lot of overhead if these are frequent; IMHO this is what tables are made for :-)

alexcrichton · 2024-03-17T18:03:06Z

Ah so the call_ref instructions are always done with a constant index, but the table.set instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that all call_ref is done with a constant index or otherwise this codegen wouldn't be representative.

So if we thread that type information through correctly then we should be able to elide the null-check even without #5291, right?

True!

Thinking a bit more on this, if the table.get plus call_ref pair were replaced with call_indirect we wouldn't have to solve the problem of propagating validator type information into lowering. In such a situation we could inspect the table's type and notice that it's non-nullable and skip the null check. I think it should be the case that table.get plus call_ref should be the same codegen as call_indirect at least ...

alexcrichton · 2024-03-17T18:31:56Z

This has inspired me to go off and do #8159 to handle the null check

cfallin · 2024-03-17T20:09:34Z

Ah so the call_ref instructions are always done with a constant index, but the table.set instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that all call_ref is done with a constant index or otherwise this codegen wouldn't be representative.

Right; the hope/aspiration is that that constant index can trickle through a carefully-placed series of optimizations so this turns into something closer to the good codegen we see with globals.

(Two other points that came to mind later re: use of tables: scalability is another factor -- the max of 100k (?) globals is a real limit if we use one for every IC site in a large program; and also, your point about lazy init, wherein we don't lazy-init globals and that'd be a big regression of instantiation latency on said ~100k-IC programs. The flipside is that we have the lazy-init dynamic checks on table load...)

call_indirect instead

Ah! I hadn't realized that call_indirect also participated in the "typed funcref" proposal; for some reason I had it pegged as "old dynamic sig check" territory. All the better if it lets us apply the optimizations more easily; I'll update this test to use it.

Thinking a bit more about lazy init: the overall behavior we want is that for a very large program with many ICs, we have fast instantiation and, once ICs are warmed up, fast IC invocation with as few dynamic checks as possible.

IC sites always start as linked to the "fallback IC"; my planned use for this "fast IC head" mechanism was to have a conditional at every IC site testing whether the IC-stub struct is a fallback IC, invoking a traditional (C++) function pointer if so, and then (hitting a weval intrinsic that leads to) doing this typed-funcref thing if not. The upshot of that is that we always have some slowpath action (attaching an IC) before we do the typed funcref invocation.

I guess what I'm getting at is: could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value? Then we do the usual init-the-anyfunc-before-you-take-its-address thing at the ref.func in the IC-attach path, and the just-instantiated state is a bunch of zeroes (kindly provided by mmap/madvise), and the invocation is load/load/call-reg. That's another new optimization but a special case for lazy-init instead. Stated succinctly: if a table's default funcref is null, don't do lazy-init, and make the 0 bit-pattern mean null rather than non-initialized-pointer-to-anyfunc. (EDIT: and this combines with #8159 to eliminate the null-check branch, to be clear.)

…e table. This is based on discussion in bytecodealliance#8158: - We can use `call_indirect` rather than `table.get` + `call_ref`, even on typed funcrefs. TIL; updated the test! - As noted in bytecodealliance#8160, if we use a nullable typed funcref table instead (and given that we know we'll initialize a particular slot before use on the application side, so we won't actually call a null ref), and if we have a null-ref default value, we should be able to avoid the lazy table-init mechanism entirely. (Ignore the part where this module doesn't actually have any update logic that would set non-null refs anywhere; it's a compile-test, not a runtest!) Once bytecodealliance#8159 is merged and bytecodealliance#8160 is implemented, we should see zero branches in this test.

This is based on discussion in bytecodealliance#8158: as noted in bytecodealliance#8160, if we use a nullable typed funcref table instead (and given that we know we'll initialize a particular slot before use on the application side, so we won't actually call a null ref), and if we have a null-ref default value, we should be able to avoid the lazy table-init mechanism entirely. (Ignore the part where this module doesn't actually have any update logic that would set non-null refs anywhere; it's a compile-test, not a runtest!) Once bytecodealliance#8159 is merged and bytecodealliance#8160 is implemented, we should see zero branches in this test.

alexcrichton · 2024-03-18T14:49:40Z

Makes sense! It's always possible we can use the same optimization techniques on globals as well as tables, nothing saying we have to keep everything as-is for example. If tables work then there's no need to change though.

could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value

I like this idea!

This is based on discussion in #8158: as noted in #8160, if we use a nullable typed funcref table instead (and given that we know we'll initialize a particular slot before use on the application side, so we won't actually call a null ref), and if we have a null-ref default value, we should be able to avoid the lazy table-init mechanism entirely. (Ignore the part where this module doesn't actually have any update logic that would set non-null refs anywhere; it's a compile-test, not a runtest!) Once #8159 is merged and #8160 is implemented, we should see zero branches in this test.

cfallin requested a review from jameysharp March 17, 2024 03:29

cfallin requested a review from a team as a code owner March 17, 2024 03:29

jameysharp approved these changes Mar 17, 2024

View reviewed changes

Add comments to clarify typed funcrefs usage.

16212d0

cfallin enabled auto-merge March 17, 2024 17:16

cfallin added this pull request to the merge queue Mar 17, 2024

Merged via the queue into bytecodealliance:main with commit 7fa71a4 Mar 17, 2024
19 checks passed

cfallin deleted the typed-funcref-ics-test branch March 17, 2024 17:56

cfallin mentioned this pull request Mar 17, 2024

Wasmtime: avoid table lazy-init runtime checks on nullable funcref tables with null default #8160

Open

cfallin mentioned this pull request Mar 17, 2024

Wasm ICs / typed-funcrefs test: switch to nullable table. #8161

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wasm tests: add typed-funcref test showing example of desirable optimizations. #8158

Wasm tests: add typed-funcref test showing example of desirable optimizations. #8158

cfallin commented Mar 17, 2024

jameysharp left a comment

alexcrichton commented Mar 17, 2024

jameysharp commented Mar 17, 2024

cfallin commented Mar 17, 2024

alexcrichton commented Mar 17, 2024

alexcrichton commented Mar 17, 2024

cfallin commented Mar 17, 2024 •

edited

Loading

alexcrichton commented Mar 18, 2024

Wasm tests: add typed-funcref test showing example of desirable optimizations. #8158

Wasm tests: add typed-funcref test showing example of desirable optimizations. #8158

Conversation

cfallin commented Mar 17, 2024

jameysharp left a comment

Choose a reason for hiding this comment

alexcrichton commented Mar 17, 2024

jameysharp commented Mar 17, 2024

cfallin commented Mar 17, 2024

alexcrichton commented Mar 17, 2024

alexcrichton commented Mar 17, 2024

cfallin commented Mar 17, 2024 • edited Loading

alexcrichton commented Mar 18, 2024

cfallin commented Mar 17, 2024 •

edited

Loading