[Wasm RyuJIT] Initial writeup on the calling convention #122988

AndyAyersMS · 2026-01-07T20:07:05Z

Describe the calling convention used by R2R (and perhaps someday jitted) Wasm code.

AndyAyersMS · 2026-01-07T20:08:57Z

PTAL @dotnet/wasm-contrib... this is a first draft so please comment / help fix.

Copilot

Pull request overview

This PR adds comprehensive documentation for the WebAssembly calling convention used by R2R (Ready-to-Run) and JIT-compiled managed code in the CLR. The documentation describes how the runtime interoperates with WebAssembly's stack model, calling sequences, and garbage collection integration.

Key Changes:

- Adds a new "Web Assembly ABI (R2R and JIT)" section to the CLR ABI documentation
- Documents stack layout, argument passing conventions, prolog/epilog behavior, and calling sequences
- Explains GC reference handling at call sites and the portable entry point mechanism

AaronRobinsonMSFT · 2026-01-07T20:24:59Z

docs/design/coreclr/botr/clr-abi.md

+
+The prolog will increment the stack pointer, home any arguments that are stored on the linear stack, and zero initialize slots on the linear stack as appropriate. It will establish a frame pointer if one is needed.
+
+It will also save a frame descriptor onto the stack, for use during GC and EH. For methods with EH or GC, a slot on the linear stack will be reserved for a "virtual IP" that will index into the EH and GC info to provide within-method information and allow external code to walk the managed stack frames.


Frame descriptor is GCInfo, yes?

It will also refer to the EH info, so a bit more general.

Don't we need the stack frames to be linked together in order to support walking managed frames? From the sp value in the current frame, shouldn't we be able to obtain the sp of the parent frame in order to get the descriptor information, etc.? I guess the plan would be to fetch the previous sp from sp[-1]? Would there be methods where sp can be dynamically incremented, in which case this wouldn't work?

Given the last frame's base address, the plan is that the frames are self-descriptive.

To get the base address (using the scheme above where sp can diverge from $__stack_pointer), we rely on the fact that stack walks can only happen once the R2R code has called back into native code (either helper methods or the interpreter). These calls are passed sp as an argument and save it to the global $__stack_pointer, and perhaps to some other global or similar, for easy access by the unwinder.

The frame descriptor will be at a known offset from this saved sp (likely 0), and the size of the frame will be stored in the descriptor, so the external code can compute the address of the parent frame that way, e.g. parent_sp = sp + sp[0].frameSize.

For dynamic-sized frames a copy of the prior sp can likewise be stored (at some other known offset from sp) to provide the necessary chaining. If the frame grows then this value can be re-established to reflect the new size. Or we can equivalently store the total frame size.

If we follow Katelyn's proposal of keeping $__stack_pointer in sync for all managed methods then there's a bit less ceremony required, but from there the unwinding proceeds the same way.
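As a rough illustration of the fixed-size case, the walk from a saved sp could look like the Python model below. The frame layout (a 32-bit descriptor id stored at sp[0]), the `FrameDescriptor` record, and the method names are all assumptions made for this sketch; the real descriptor would also refer to GC and EH info.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameDescriptor:
    """Hypothetical per-method record; only the frame size is modeled here."""
    method: str
    frame_size: int  # bytes; fixed-size frames only


def push_frame(memory, sp, descriptor_id, descriptors):
    """Mimic a prolog: move sp down by the frame size, store the descriptor id at sp[0]."""
    new_sp = sp - descriptors[descriptor_id].frame_size
    struct.pack_into("<I", memory, new_sp, descriptor_id)
    return new_sp


def walk_frames(memory, sp, stack_base, descriptors):
    """Yield (sp, descriptor) pairs from the innermost frame outwards."""
    while sp < stack_base:
        (descriptor_id,) = struct.unpack_from("<I", memory, sp)
        descriptor = descriptors[descriptor_id]
        yield sp, descriptor
        sp += descriptor.frame_size  # parent_sp = sp + sp[0].frameSize


# Three nested calls on a 1 KiB model of the linear stack (grows downwards).
descriptors = [
    FrameDescriptor("Main", 32),
    FrameDescriptor("Helper", 16),
    FrameDescriptor("Inner", 48),
]
memory = bytearray(1024)
stack_base = len(memory)
sp = stack_base
for descriptor_id in (0, 1, 2):
    sp = push_frame(memory, sp, descriptor_id, descriptors)

walked = [(frame_sp, d.method) for frame_sp, d in walk_frames(memory, sp, stack_base, descriptors)]
print(walked)
```

A dynamic-sized frame would need the extra chaining slot mentioned above; this model does not cover it.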

kg · 2026-01-07T21:21:49Z

docs/design/coreclr/botr/clr-abi.md

+
+## Incoming argument ABI
+
+The linear stack pointer is the first argument to all methods. At a native->managed transition it is the value of the `$__stack_pointer` global. This global is not updated within managed code, but is updated at managed->native boundaries. Within the method the stack pointer always points at the bottom (lowest address) of the stack; generally this is a fixed offset from the value the stack pointer held on entry, except in methods that can do dynamic allocation.


This means helper calls all become managed->native boundaries that require a stack pointer update at start and end, right? Is that a problem?

There is a code size cost for each helper call... it could be amortized with a custom wrapper for each helper that just does the $__stack_pointer maintenance.
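To make the amortization idea concrete, here is a small Python model. `LinearStack`, `wrap_helper`, and `native_alloc` are hypothetical names invented for the sketch. The point is only that the global update lives once in the wrapper rather than at every call site; the model updates the global on entry only.

```python
class LinearStack:
    """Stands in for the module-level $__stack_pointer global."""

    def __init__(self, top):
        self.stack_pointer = top


def wrap_helper(stack, native_helper):
    """Build a wrapper with the managed convention (sp first) around a native helper."""

    def wrapper(sp, *args):
        stack.stack_pointer = sp  # the single place this helper's global.set happens
        return native_helper(*args)  # native code does not take sp as an argument

    return wrapper


stack = LinearStack(0x10000)


def native_alloc(size):
    """Hypothetical helper; reports the stack pointer that native code would observe."""
    return {"size": size, "observed_sp": stack.stack_pointer}


alloc = wrap_helper(stack, native_alloc)
result = alloc(0xFF00, 24)  # a managed call site only passes its local sp
print(result)
```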

kg · 2026-01-07T21:30:08Z

docs/design/coreclr/botr/clr-abi.md

+
+## Outgoing call ABI
+
+For direct managed calls, Wasm uses the Portable Entry Point feature to facilitate smooth interop with interpreted code. This means all managed calls are made indirectly, and the portable entry point is also passed as the last argument.


In the long run will we optimize this for cases where we know both the caller and callee were crossgen'd? I'm fine with not specifying that yet though.

Yes we can optimize that case. Though in general there is no guarantee the runtime will use R2R compiled callee code.

On Wasm this may be less of an issue because the cases where R2R method bodies end up being disqualified may not be possible.

> Yes we can optimize that case

Yes.

> Though in general there is no guarantee the runtime will use R2R compiled callee code.

The fixups for the caller would have to verify that the directly called method is going to use R2R too. If the fixup fails, R2R code for the caller would have to be rejected as well.

(We do something similar for ReJIT. If there is a ReJIT request for a method that got inlined, all methods that inlined it must be invalidated as well.)

kg · 2026-01-07T23:36:35Z

This is probably a good time to raise that I think we shouldn't pass the stack pointer in an argument.
Reasons for it:

- `global.get 0` is 6 bytes when linking, because the index has to be encoded in 5 bytes so it can be patched by the linker, IIRC. (We're not linking, though!)
- Loading a local might be faster than loading a global in wasm (I don't know how to verify this though)
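For reference, the 2-byte versus 6-byte difference falls out of LEB128 index encoding: one opcode byte plus either a minimal index or a 5-byte padded index that a linker can patch in place. A small Python sketch follows; the helper names are made up for illustration, and 0x23 is the `global.get` opcode.

```python
GLOBAL_GET = 0x23  # opcode for global.get in the Wasm binary format


def uleb128(value: int) -> bytes:
    """Minimal unsigned LEB128 encoding."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def uleb128_padded(value: int, width: int = 5) -> bytes:
    """Fixed-width unsigned LEB128, the form used for patchable 32-bit indices."""
    if not 0 <= value < 1 << (7 * width):
        raise ValueError("value does not fit in the requested width")
    out = bytearray()
    for i in range(width):
        byte = value & 0x7F
        value >>= 7
        out.append(byte | 0x80 if i < width - 1 else byte)
    return bytes(out)


def global_get(index: int, relocatable: bool = False) -> bytes:
    """Encode `global.get index`, optionally with a padded (patchable) index."""
    encode = uleb128_padded if relocatable else uleb128
    return bytes([GLOBAL_GET]) + encode(index)


print(len(global_get(0)), len(global_get(0, relocatable=True)))
```

With minimal encoding, any global index below 128 keeps the instruction at 2 bytes.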

Reasons against it:

- Code size goes up because we need to copy the stack pointer into and out of the global at very many locations
- More room for bugs caused by the stack pointer getting out of sync with the local
- The extra argument makes it more likely that actual arguments won't occupy argument registers once our wasm is jitted/aot'd

Given that we're not linking I view the sp argument thing as a Potential Optimization and I feel like it's premature. Single or Yowl may have a good reason why we should still do it though based on their experiences.

I believe what we would do instead is export the stack_pointer from the runtime's wasm module (this may already be the default behavior for emscripten, to enable dynamic linking?) then grab the WebAssembly.Global for the stack_pointer and import it into every r2r module we load. Then all our code can manipulate the linear stack the same way clang generated code does.

AndyAyersMS · 2026-01-08T00:14:06Z

> we shouldn't pass the stack pointer in an argument.

Following on this, could we also use a small index global to pass the portable entry point? Then we would not need the interpreter->managed stub to invoke the method with an extra unused argument.

SingleAccretion · 2026-01-08T00:31:30Z

> Following on this, could we also use a small index global to pass the portable entry point? Then we would not need the interpreter->managed stub to invoke the method with an extra unused argument.

It doesn't seem necessary to optimize for the stub case? Since stubs will be a very small proportion of the overall code, and every managed callsite will need to be made at least 2 bytes bigger (for global.set). In addition to the inflexibilities of hardcoding global indices.

yowl · 2026-01-08T00:36:20Z

How will a global.set work when threads are a thing?

kg · 2026-01-08T00:39:11Z

> How will a global.set work when threads are a thing?

Aren't globals functionally thread-local in wasm? Is it different for wasi?

SingleAccretion · 2026-01-08T00:59:45Z

> Aren't globals functionally thread-local in wasm? Is it different for wasi?

I think @yowl is referring to the complexities of this: https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md#thread-local-storage-tls.

Without instance-per-thread, things get a bit expensive for globals. I don't know what the current expected implementation strategy is. I know it (__stack_pointer access cost) was brought up in the shared-everything proposal evolution history and they discussed things like giving a special TLS slot just for the stack pointer. This would need more research.

Edit: found some discussion about this - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2025/THREADS-03-04.md. This also reminds us that an imported global requires two indirections instead of one.

Another edit: more - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2024/THREADS-07-09.md.

kg · 2026-01-08T01:22:35Z

> > Aren't globals functionally thread-local in wasm? Is it different for wasi?
>
> I think @yowl is referring to the complexities of this: https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md#thread-local-storage-tls.
>
> Without instance-per-thread, things get a bit expensive for globals. I don't know what the current expected implementation strategy is. I know it (__stack_pointer access cost) was brought up in the shared-everything proposal evolution history and they discussed things like giving a special TLS slot just for the stack pointer. This would need more research.
>
> Edit: found some discussion about this - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2025/THREADS-03-04.md. This also reminds us that an imported global requires two indirections instead of one.

This also appears to mention that importing a global turns it into two loads instead of one, which is a bit of a problem. But I think we have to import sp no matter what and it's just a question of how often we're going to touch it.

dotnet-policy-service · 2026-01-08T10:02:58Z

Tagging subscribers to 'arch-wasm': @lewing, @pavelsavara
See info in area-owners.md if you want to be subscribed.

jkotas · 2026-01-08T21:42:49Z

docs/design/coreclr/botr/clr-abi.md

+
+## Incoming argument ABI
+
+The linear stack pointer is the first argument to all methods. At a native->managed transition it is the value of the `$__stack_pointer` global. This global is not updated within managed code, but is updated at managed->native boundaries. Within the method the stack pointer always points at the bottom (lowest address) of the stack; generally this is a fixed offset from the value the stack pointer held on entry, except in methods that can do dynamic allocation.


For PInvokes, I expect that this update will be done around the callsite in the managed code.

Where do you expect it to be done for FCalls? We assume that FCalls have the same managed calling convention. Can this update be done inside the FCall macro somehow, so that FCalls can continue to have managed calling convention?

If we are not able to do that, I guess we will need to create some sort of FCalls wrappers. It is doable, but it is not pretty - we have been there in the past.

For reference, what does native AOT / LLVM do for FCalls currently?

For NAOT/LLVM it looks like FCalls have an extra initial arg that they ignore:

https://github.com/dotnet/runtimelab/blob/7706cd182716062d4fa550e88abd004e1a82dcd5/src/coreclr/nativeaot/Runtime/MathHelpers.cpp#L12

I don't see where the stack pointer global is updated; maybe NAOT/LLVM doesn't need this?

> For reference, what does native AOT / LLVM do for FCalls currently?

NAOT-LLVM shadow stack is allocated separately from the __stack_pointer stack, so we only need to track it for the purposes of transition frames with virtual unwinding (another way of putting it is that it is only used for managed code).

> Can this update be done inside the FCall macro somehow, so that FCalls can continue to have managed calling convention?

I don't know if it is possible with __attribute__((naked)) trickery to do it all in one function, but it is definitely possible with __asm to insert a stub with the managed calling convention (that'd do global.set __stack_pointer) into FCIMPL.

jkotas · 2026-01-08T21:45:17Z

docs/design/coreclr/botr/clr-abi.md

+
+The prolog will increment the stack pointer, home any arguments that are stored on the linear stack, and zero initialize slots on the linear stack as appropriate. It will establish a frame pointer if one is needed.
+
+It will also save a frame descriptor onto the stack, for use during GC and EH. For methods with EH or GC, a slot on the linear stack will be reserved for a "virtual IP" that will index into the EH and GC info to provide within-method information and allow external code to walk the managed stack frames.


> For methods with EH or GC

It is not clear what a "method with GC" means. Should this say "for methods with calls or EH" (i.e., non-leaf methods)?

Yes, methods with GC safe points (which will always be at calls) or EH.

jkotas · 2026-01-08T21:49:20Z

docs/design/coreclr/botr/clr-abi.md

+```
+Initially the cell will contain code to determine if the target method has R2R code or must be interpreted. If there is R2R code for the method it is fixed up as needed. Once the target is resolved the cell can be updated to just refer to the R2R code directly, if there is any, or to a thunk for invoking the interpreter.
+
+For indirect managed calls the sequence is similar, but the portable entry point is obtained by calling a resolve helper:


I think this should say "virtual managed calls". Indirect calls (calli) should get the portable entry point to call from the IL stack, so there is no need to call the resolve helper.

Also, there is a potential optimization for vtable-based virtual calls to just fetch the entrypoint by indexing into vtable like we do everywhere else.
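The lazily resolved cell described in the quoted text can be modeled roughly as follows. This is a Python sketch under stated assumptions: `PortableEntryPoint`, `make_entry_point`, `call_managed`, and the `r2r_bodies` table are hypothetical names, and fixups are not modeled. Every call goes through the cell's current target and passes the cell itself as the last argument.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PortableEntryPoint:
    """Hypothetical cell: names a method and holds whatever should be called for it."""
    method_name: str
    target: Optional[Callable] = None


def interpreter_thunk(sp, *args):
    """Fallback target: hands the call to a (pretend) interpreter."""
    *method_args, pep = args  # the cell arrives as the last argument
    return ("interpreted", pep.method_name, tuple(method_args))


def make_entry_point(method_name, r2r_bodies):
    pep = PortableEntryPoint(method_name)

    def resolve_and_call(sp, *args):
        # First call: pick R2R code if present, else the interpreter thunk,
        # then patch the cell so later calls skip this resolver.
        pep.target = r2r_bodies.get(method_name, interpreter_thunk)
        return pep.target(sp, *args)

    pep.target = resolve_and_call
    return pep


def call_managed(pep, sp, *method_args):
    """Managed call sequence: indirect through the cell, cell passed last."""
    return pep.target(sp, *method_args, pep)


def add_r2r(sp, a, b, pep):
    """Stand-in for a precompiled method body."""
    return ("r2r", pep.method_name, a + b)


r2r_bodies = {"Add": add_r2r}
add_cell = make_entry_point("Add", r2r_bodies)
mul_cell = make_entry_point("Mul", r2r_bodies)  # no R2R body available

first = call_managed(add_cell, 0x1000, 2, 3)
second = call_managed(mul_cell, 0x1000, 2, 3)
print(first, second)
```

After the first call each cell points straight at its final target, so the resolver runs at most once per cell.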

AndyAyersMS · 2026-01-09T01:09:37Z

Some crude estimates of the size costs of an SP arg vs maintaining the global SP.

Global SP always in sync

;; PROLOG

global.get $__stack_pointer    ;; (2, if can get a small index), else 6.
i32.const FRAMESIZE            ;; (2 typically)
i32.sub                        ;; 1
dup                            ;; 1
global.set $__stack_pointer    ;; (2/6)
local.set sp                   ;; 2

;; EPILOG

local.get sp                   ;; 2
i32.const FRAMESIZE            ;; 2ish
i32.add                        ;; 1
global.set $__stack_pointer    ;; 2/6

So 10/18 bytes per prolog, 7/11 bytes per epilog

No overhead at call sites. Smaller signatures.

Global SP lazy sync at boundaries

;; PROLOG

local.get sp                  ;; 2
i32.const FRAMESIZE           ;; 2
i32.sub                       ;; 1
local.set sp                  ;; 2

;; EPILOG

(empty)

So 7 bytes per prolog, 0 per epilog

;; unmanaged call sites & fcalls

local.get  sp                 ;; 2
global.set $__stack_pointer   ;; 2/6   (~0 amortizable for fcalls)

;; managed call sites

local.get  sp                 ;; 2

For the current x86 crossgen SPMI collection (which may not represent the set of methods we care about) there are 273829 methods / prologs, 312716 epilogs, 1504289 managed call sites, and 163641 helper call sites.

So assuming the optimistic case where we can encode the global SP index in one byte and wrap FCalls with a global SP update (size assumed negligible) I get size estimates like:

global SP   in sync: 5.581M bytes  (~20 bytes/method)
global SP lazy sync: 4.925M bytes  (~18 bytes/method)

If we can't get a small index for the SP global then the "sync" cost rises to 8.3M (~30 bytes/method).

This may overstate the difference somewhat. For leaf methods we generally won't need these SP sequences... I haven't tried to account for those yet.
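For anyone who wants to vary the assumptions, the arithmetic can be written down as a small Python sketch. It applies the per-sequence byte counts listed above to the quoted SPMI counts. The function and field names are made up here, and since the comment does not spell out which terms went into each quoted total, the output is not expected to match those figures digit for digit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteCounts:
    prologs: int
    epilogs: int
    managed_calls: int
    helper_calls: int


# Counts quoted above for the x86 crossgen SPMI collection.
SPMI_X86 = SiteCounts(prologs=273_829, epilogs=312_716,
                      managed_calls=1_504_289, helper_calls=163_641)


def in_sync_bytes(counts: SiteCounts, global_access: int) -> int:
    """Global SP always in sync; global_access is 2 (small index) or 6 (padded)."""
    # global.get, i32.const, i32.sub, dup, global.set, local.set
    prolog = global_access + 2 + 1 + 1 + global_access + 2
    # local.get, i32.const, i32.add, global.set
    epilog = 2 + 2 + 1 + global_access
    return counts.prologs * prolog + counts.epilogs * epilog


def lazy_sync_bytes(counts: SiteCounts, global_access: int,
                    helpers_wrapped: bool = True) -> int:
    """Global SP synced only at boundaries; sp passed as an argument."""
    prolog = 2 + 2 + 1 + 2  # local.get, i32.const, i32.sub, local.set
    managed_site = 2        # local.get sp
    # Assumption: a wrapped helper costs nothing extra at the call site.
    helper_site = 0 if helpers_wrapped else 2 + global_access
    return (counts.prologs * prolog
            + counts.managed_calls * managed_site
            + counts.helper_calls * helper_site)


for label, total in [
    ("in sync, small index", in_sync_bytes(SPMI_X86, 2)),
    ("in sync, padded index", in_sync_bytes(SPMI_X86, 6)),
    ("lazy, helpers wrapped", lazy_sync_bytes(SPMI_X86, 2)),
    ("lazy, helpers inline", lazy_sync_bytes(SPMI_X86, 2, helpers_wrapped=False)),
]:
    print(f"{label}: {total:,} bytes ({total / SPMI_X86.prologs:.1f} per method)")
```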

SingleAccretion · 2026-01-09T01:21:12Z

> Some crude estimates of the size costs of an SP arg vs maintaining the global SP.

This matches my calculations based on NAOT-LLVM data from last year almost perfectly: "50%" code size for the large index encoding, about the same for the small encoding.

> For leaf methods we generally won't need these SP sequences... I haven't tried to account for those yet.

The proportion of truly leaf methods is going to be rather low. It was on the order of < 5% on real-world NAOT-LLVM data (this is from dotnet/runtimelab#2697). This is because most methods may throw (an NRE), requiring a helper call.

jkotas · 2026-01-09T07:24:37Z

> This is because most methods may throw (an NRE), requiring a helper call.

Was this with the assumption that `this` can be null? It is a gray area. `this` can never be null in C#. I think it would be OK to assume that `this` is never null for wasm.

[Wasm RyuJIT] Initial writeup on the calling convention

4b39254

Describe the calling convention used by R2R (and perhaps someday jitted) Wasm code.

Copilot AI review requested due to automatic review settings January 7, 2026 20:07

github-actions bot added the needs-area-label label Jan 7, 2026

dotnet-policy-service bot assigned AndyAyersMS Jan 7, 2026

Copilot started reviewing on behalf of AndyAyersMS January 7, 2026 20:07

Copilot AI reviewed Jan 7, 2026

Apply suggestions from code review

cd1ddca

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

AaronRobinsonMSFT reviewed Jan 7, 2026

kg reviewed Jan 7, 2026

pavelsavara added the arch-wasm label Jan 8, 2026

pavelsavara reviewed Jan 8, 2026

JulieLeeMSFT mentioned this pull request Jan 8, 2026

[Wasm RyuJIT] Calling Convention #121215

Open

jkotas reviewed Jan 8, 2026
