Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wasm tests: add typed-funcref test showing example of desirable optimizations. #8158

Merged
merged 2 commits into from
Mar 17, 2024

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Mar 17, 2024

In order to have fast IC (inline cache) chains in AOT-compiled dynamic language Wasms, it would be great if we could make the "call to a typed funcref at a constant table index" pattern fast.

This use-case was discussed at the most recent Wasmtime biweekly and @jameysharp is working on some optimizations; the intent of this PR is to provide a concrete test-case whose blessed output we can see improve over time.

In particular, the following opts are still desirable:

  • With the use of non-nullable typed funcrefs, there shouldn't be a null check (there currently is, as noted by a comment in the code due to lack of type information at the right spot).
  • With the use of a constant table size and a constant index to the table.get, we should be able to load from the table without a bounds-check or any Spectre masking.

Other further optimizations for this pattern might be possible if we rearrange the table and function-reference data structures, and the lazy-initialization scheme thereof, but the above should be agnostic to that.

…izations.

In order to have fast IC (inline cache) chains in AOT-compiled dynamic
language Wasms, it would be great if we could make the "call to a typed
funcref at a constant table index" pattern fast.

This use-case was discussed at the most recent Wasmtime biweekly and
@jameysharp is working on some optimizations; the intent of this PR is
to provide a concrete test-case whose blessed output we can see improve
over time.

In particular, the following opts are still desirable:

- With the use of non-nullable typed funcrefs, there shouldn't be a null
  check (there currently is, as noted by a comment in the code due to
  lack of type information at the right spot).
- With the use of a constant table size and a constant index to the
  `table.get`, we should be able to load from the table without a
  bounds-check or any Spectre masking.

Other further optimizations for this pattern might be possible if we
rearrange the table and function-reference data structures, and the
lazy-initialization scheme thereof, but the above should be agnostic to
that.
@cfallin cfallin requested a review from jameysharp March 17, 2024 03:29
@cfallin cfallin requested a review from a team as a code owner March 17, 2024 03:29
Copy link
Contributor

@jameysharp jameysharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Chris," I whine to myself, "stop working on a Saturday evening," as I approve your PR at 10pm

Thanks for this! It's definitely the kind of thing I was hoping for.

This example module initializes element zero of the table to point to a function, then calls through elements one and two. What is that supposed to do? Or am I misreading it?

@alexcrichton
Copy link
Member

(table $ic-sites 100 100 (ref $ic-stub) (ref.func $ic1))

This example module initializes element zero of the table to point to a function

Oh this syntax is actually initializing the entire table with (ref.func $ic1), it's new syntax/binary stuff introduced as part of the function-references proposal (basically folded into the gc proposal)


Somewhat orthogonal question as well: I forget if I asked this in the past, but is the reason to use a typed table instead of typed globals to get the lazy initialization? If I change the above test to use two typed function globals it ends up codegenning (today)

function u0:1(i64 vmctx, i64, i32, i32, i32, i32) -> i32 fast {
    gv0 = vmctx
    gv1 = load.i64 notrap aligned readonly gv0+8
    gv2 = load.i64 notrap aligned gv1
    gv3 = vmctx
    sig0 = (i64 vmctx, i64, i32, i32, i32, i32) -> i32 fast
    sig1 = (i64 vmctx, i32 uext, i32 uext) -> i32 uext system_v
    sig2 = (i64 vmctx, i32 uext) -> i32 uext system_v
    stack_limit = gv2

                                block0(v0: i64, v1: i64, v2: i32, v3: i32, v4: i32, v5: i32):
                                    v8 -> v0
                                    v14 -> v0
@003b                               v9 = load.i64 notrap aligned table v0+80
@003d                               brif v9, block3, block2

                                block2 cold:
@003d                               trap null_reference

                                block3:
@003d                               v10 = load.i64 notrap aligned readonly v9+16
@003d                               v11 = load.i64 notrap aligned readonly v9+32
@003d                               v12 = call_indirect sig0, v10(v11, v0, v2, v3, v4, v5)
@004c                               v15 = load.i64 notrap aligned table v0+96
@004e                               brif v15, block5, block4

                                block4 cold:
@004e                               trap null_reference

                                block5:
@004e                               v16 = load.i64 notrap aligned readonly v15+16
@004e                               v17 = load.i64 notrap aligned readonly v15+32
@004e                               v18 = call_indirect sig0, v16(v17, v0, v2, v3, v4, v5)
@0057                               jump block1

                                block1:
@0052                               v19 = iadd.i32 v18, v12
                                    v6 -> v19
@0057                               return v19
}

which is pretty good given that we've got #5291 in our pocket

@jameysharp
Copy link
Contributor

Thanks Alex, now I understand!

I'd forgotten about #5291; we just added a different use of memflags to indicate trap code so we have precedent now. But besides that, if I understood Chris correctly, this example is supposed to be typed as a non-nullable function reference. So if we thread that type information through correctly then we should be able to elide the null-check even without #5291, right?

@cfallin
Copy link
Member Author

cfallin commented Mar 17, 2024

Indeed, this initializes everything then calls ICs 1 and 2, which were meant to be arbitrary -- I'll add some comments to make it clearer!

is the reason to use a typed table instead of typed globals to get the lazy initialization?

not quite; I realized once I progressed past "IC caller" logic to "IC update" logic that while AOT-compiling can produce bodies that have statically different code locations per IC head, the update logic is shared (polymorphic over IC index) and there's no "set global N to V" instruction. I could potentially finagle the IC-stub ABI to return a new funcref and always update in the statically-unique callsite sequence or something, but that's a lot of overhead if these are frequent; IMHO this is what tables are made for :-)

@cfallin cfallin enabled auto-merge March 17, 2024 17:16
@cfallin cfallin added this pull request to the merge queue Mar 17, 2024
Merged via the queue into bytecodealliance:main with commit 7fa71a4 Mar 17, 2024
19 checks passed
@cfallin cfallin deleted the typed-funcref-ics-test branch March 17, 2024 17:56
@alexcrichton
Copy link
Member

Ah so the call_ref instructions are always done with a constant index, but the table.set instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that all call_ref is done with a constant index or otherwise this codegen wouldn't be representative.

So if we thread that type information through correctly then we should be able to elide the null-check even without #5291, right?

True!

Thinking a bit more on this, if the table.get plus call_ref pair were replaced with call_indirect we wouldn't have to solve the problem of propagating validator type information into lowering. In such a situation we could inspect the table's type and notice that it's non-nullable and skip the null check. I think it should be the case that table.get plus call_ref should be the same codegen as call_indirect at least ...

@alexcrichton
Copy link
Member

This has inspired me to go off and do #8159 to handle the null check

@cfallin
Copy link
Member Author

cfallin commented Mar 17, 2024

Ah so the call_ref instructions are always done with a constant index, but the table.set instructions are done with a variable index? (if I'm understanding you right). I mostly wanted to confirm that all call_ref is done with a constant index or otherwise this codegen wouldn't be representative.

Right; the hope/aspiration is that that constant index can trickle through a carefully-placed series of optimizations so this turns into something closer to the good codegen we see with globals.

(Two other points that came to mind later re: use of tables: scalability is another factor -- the max of 100k (?) globals is a real limit if we use one for every IC site in a large program; and also, your point about lazy init, wherein we don't lazy-init globals and that'd be a big regression of instantiation latency on said ~100k-IC programs. The flipside is that we have the lazy-init dynamic checks on table load...)

call_indirect instead

Ah! I hadn't realized that call_indirect also participated in the "typed funcref" proposal; for some reason I had it pegged as "old dynamic sig check" territory. All the better if it lets us apply the optimizations more easily; I'll update this test to use it.


Thinking a bit more about lazy init: the overall behavior we want is that for a very large program with many ICs, we have fast instantiation and, once ICs are warmed up, fast IC invocation with as few dynamic checks as possible.

IC sites always start as linked to the "fallback IC"; my planned use for this "fast IC head" mechanism was to have a conditional at every IC site testing whether the IC-stub struct is a fallback IC, invoking a traditional (C++) function pointer if so, and then (hitting a weval intrinsic that leads to) doing this typed-funcref thing if not. The upshot of that is that we always have some slowpath action (attaching an IC) before we do the typed funcref invocation.

I guess what I'm getting at is: could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value? Then we do the usual init-the-anyfunc-before-you-take-its-address thing at the ref.func in the IC-attach path, and the just-instantiated state is a bunch of zeroes (kindly provided by mmap/madvise), and the invocation is load/load/call-reg. That's another new optimization but a special case for lazy-init instead. Stated succinctly: if a table's default funcref is null, don't do lazy-init, and make the 0 bit-pattern mean null rather than non-initialized-pointer-to-anyfunc. (EDIT: and this combines with #8159 to eliminate the null-check branch, to be clear.)

cfallin added a commit to cfallin/wasmtime that referenced this pull request Mar 17, 2024
…e table.

This is based on discussion in bytecodealliance#8158:

- We can use `call_indirect` rather than `table.get` + `call_ref`, even
  on typed funcrefs. TIL; updated the test!

- As noted in bytecodealliance#8160, if we use a nullable typed funcref table instead
  (and given that we know we'll initialize a particular slot before use
  on the application side, so we won't actually call a null ref), and if
  we have a null-ref default value, we should be able to avoid the lazy
  table-init mechanism entirely.

(Ignore the part where this module doesn't actually have any update
logic that would set non-null refs anywhere; it's a compile-test, not a
runtest!)

Once bytecodealliance#8159 is merged and bytecodealliance#8160 is implemented, we should see zero
branches in this test.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Mar 17, 2024
This is based on discussion in bytecodealliance#8158: as noted in bytecodealliance#8160, if we use a
nullable typed funcref table instead (and given that we know we'll
initialize a particular slot before use on the application side, so we
won't actually call a null ref), and if we have a null-ref default
value, we should be able to avoid the lazy table-init mechanism
entirely.

(Ignore the part where this module doesn't actually have any update
logic that would set non-null refs anywhere; it's a compile-test, not a
runtest!)

Once bytecodealliance#8159 is merged and bytecodealliance#8160 is implemented, we should see zero
branches in this test.
@alexcrichton
Copy link
Member

Makes sense! It's always possible we can use the same optimization techniques on globals as well as tables, nothing saying we have to keep everything as-is for example. If tables work then there's no need to change though.

could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value

I like this idea!

github-merge-queue bot pushed a commit that referenced this pull request Mar 18, 2024
This is based on discussion in #8158: as noted in #8160, if we use a
nullable typed funcref table instead (and given that we know we'll
initialize a particular slot before use on the application side, so we
won't actually call a null ref), and if we have a null-ref default
value, we should be able to avoid the lazy table-init mechanism
entirely.

(Ignore the part where this module doesn't actually have any update
logic that would set non-null refs anywhere; it's a compile-test, not a
runtest!)

Once #8159 is merged and #8160 is implemented, we should see zero
branches in this test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants