@AndyAyersMS
Member

Describe the calling convention used by R2R (and perhaps someday jitted) Wasm code.

Copilot AI review requested due to automatic review settings January 7, 2026 20:07
@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jan 7, 2026
@AndyAyersMS
Member Author

PTAL @dotnet/wasm-contrib... this is a first draft so please comment / help fix.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds comprehensive documentation for the WebAssembly calling convention used by R2R (Ready-to-Run) and JIT-compiled managed code in the CLR. The documentation describes how the runtime interoperates with WebAssembly's stack model, calling sequences, and garbage collection integration.

Key Changes:

  • Adds a new "Web Assembly ABI (R2R and JIT)" section to the CLR ABI documentation
  • Documents stack layout, argument passing conventions, prolog/epilog behavior, and calling sequences
  • Explains GC reference handling at call sites and the portable entry point mechanism

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

The prolog will increment the stack pointer, home any arguments that are stored on the linear stack, and zero initialize slots on the linear stack as appropriate. It will establish a frame pointer if one is needed.

It will also save a frame descriptor onto the stack, for use during GC and EH. For methods with EH or GC, a slot on the linear stack will be reserved for a "virtual IP" that will index into the EH and GC info to provide within-method information and allow external code to walk the managed stack frames.
Member

Frame descriptor is GCInfo, yes?

Member Author

It will also refer to the EH info, so a bit more general.

Member

@BrzVlad BrzVlad Jan 8, 2026

Don't we need the stack frames to be linked together in order to support walking managed frames? From the sp value in the current frame, shouldn't we be able to obtain the sp of the parent frame in order to get the descriptor information, etc.? I guess the plan would be to fetch the previous sp from sp[-1]? Would there be methods where sp can be dynamically incremented, in which case this wouldn't work?

Member Author

Given the last frame's base address, the plan is that the frames are self-descriptive.

To get the base address (using the scheme above where sp can diverge from `$__stack_pointer`), we rely on the fact that stack walks can only happen once the R2R code has called back into native code (either helper methods or the interpreter). These calls are passed sp as an argument and save it to the global `$__stack_pointer`, and perhaps to some other global or similar for easy access by the unwinder.

The frame descriptor will be at a known offset from this saved sp (likely 0) and the size of the frame will be stored in the descriptor, so the external code can compute the address of the parent frame that way, e.g. `parent_sp = sp + sp[0].frameSize`.

For dynamic-sized frames a copy of the prior sp can likewise be stored at some other known offset from sp to provide the necessary chaining. If the frame grows then this value can be re-established to reflect the new size. Or we can equivalently store the total frame size.

If we follow Katelyn's proposal of keeping $__stack_pointer in sync for all managed methods then there's a bit less ceremony required, but from there the unwinding proceeds the same way.
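
As a rough illustration of the fixed-size chaining described above, here is a minimal Python model. The layout is an assumption made for the sketch: a 4-byte frame size at offset 0 of each frame stands in for the real frame descriptor, and `memory`, `push_frame`, and `walk_frames` are hypothetical names, not runtime APIs.

```python
import struct

# Stand-in for Wasm linear memory; the stack grows toward lower addresses.
memory = bytearray(256)

def push_frame(sp, frame_size):
    """Allocate a frame below sp and record its size at offset 0 (assumed layout)."""
    new_sp = sp - frame_size
    struct.pack_into("<I", memory, new_sp, frame_size)
    return new_sp

def walk_frames(sp, stack_base):
    """Return the sp of each frame, youngest first, until stack_base is reached."""
    frames = []
    while sp < stack_base:
        frames.append(sp)
        (frame_size,) = struct.unpack_from("<I", memory, sp)
        sp += frame_size  # parent_sp = sp + sp[0].frameSize
    return frames

stack_base = 256
sp_a = push_frame(stack_base, 32)
sp_b = push_frame(sp_a, 48)
sp_c = push_frame(sp_b, 16)
print(walk_frames(sp_c, stack_base))  # [160, 176, 224]
```

A dynamically sized frame would need the stored size (or a saved parent sp) to be rewritten whenever the frame grows, as the comment notes; the sketch does not model that case.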


## Incoming argument ABI

The linear stack pointer is the first argument to all methods. At a native->managed transition it is the value of the `$__stack_pointer` global. This global is not updated within managed code, but is updated at managed->native boundaries. Within the method the stack pointer always points at the bottom (lowest address) of the stack; generally this is a fixed offset from the value the stack pointer held on entry, except in methods that can do dynamic allocation.
Member

This means helper calls all become managed->native boundaries that require a stack pointer update at start and end, right? Is that a problem?

Member Author

There is a code size cost for each helper call... it could be amortized with a custom wrapper for each helper that just does the `$__stack_pointer` maintenance.
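
To make the wrapper idea concrete, here is a small Python model of it. Everything in it is hypothetical: `Runtime.stack_pointer` stands in for the `$__stack_pointer` global, `with_sp_sync` for a per-helper wrapper, and `alloc_helper` for an arbitrary native helper. It only illustrates that the global update can live in one wrapper per helper rather than at every call site.

```python
class Runtime:
    """Holds the stand-in for the $__stack_pointer global."""
    def __init__(self, stack_pointer):
        self.stack_pointer = stack_pointer

runtime = Runtime(stack_pointer=0x10000)

def with_sp_sync(helper):
    """Wrap a native helper so the wrapper, not each call site, publishes sp."""
    def wrapper(sp, *args):
        runtime.stack_pointer = sp  # models: global.set $__stack_pointer
        return helper(*args)
    return wrapper

@with_sp_sync
def alloc_helper(size):
    # Hypothetical native helper; it observes the published stack pointer.
    return (runtime.stack_pointer, size)

def managed_method(sp):
    sp -= 32                     # prolog: sp lives only in a local
    return alloc_helper(sp, 24)  # the call site just passes sp along

print(managed_method(0x10000))  # (65504, 24)
```

With this shape the managed call site carries only the cost of pushing sp, and the `global.set` appears once per helper instead of once per call.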


## Outgoing call ABI

For direct managed calls, Wasm uses the Portable Entry Point feature to facilitate smooth interop with interpreted code. This means all managed calls are made indirectly, and the portable entry point is also passed as the last argument.
Member

In the long run will we optimize this for cases where we know both the caller and callee were crossgen'd? I'm fine with not specifying that yet though.

Member Author

Yes we can optimize that case. Though in general there is no guarantee the runtime will use R2R compiled callee code.

On Wasm this may be less of an issue because the cases where R2R method bodies end up being disqualified may not be possible.

Member

@jkotas jkotas Jan 8, 2026

> Yes we can optimize that case

Yes.

> Though in general there is no guarantee the runtime will use R2R compiled callee code.

The fixups for the caller would have to verify that the directly called method is going to use R2R too. If the fixup fails, R2R code for the caller would have to be rejected as well.

(We do something similar for ReJIT. If there is a ReJIT request for a method that got inlined, all methods that inlined it must be invalidated as well.)

@kg
Member

kg commented Jan 7, 2026

This is probably a good time to raise that I think we shouldn't pass the stack pointer in an argument.
Reasons for it:

  • `global.get` of the stack pointer is 6 bytes when linking, because the index is emitted as a 5-byte relocation so that it can be patched by the linker, IIRC. (We're not linking, though!)
  • Loading a local might be faster than loading a global in wasm (I don't know how to verify this though)

Reasons against it:

  • Code size goes up because we need to copy the stack pointer into and out of the global at very many locations
  • More room for bugs caused by the stack pointer getting out of sync with the local
  • The extra argument makes it more likely that actual arguments won't occupy argument registers once our wasm is jitted/aot'd

Given that we're not linking I view the sp argument thing as a Potential Optimization and I feel like it's premature. Single or Yowl may have a good reason why we should still do it though based on their experiences.

I believe what we would do instead is export the stack_pointer from the runtime's wasm module (this may already be the default behavior for emscripten, to enable dynamic linking?) then grab the WebAssembly.Global for the stack_pointer and import it into every r2r module we load. Then all our code can manipulate the linear stack the same way clang generated code does.

@AndyAyersMS
Member Author

> we shouldn't pass the stack pointer in an argument.

Following on this, could we also use a small index global to pass the portable entry point? Then we would not need the interpreter->managed stub to invoke the method with an extra unused argument.

@SingleAccretion
Contributor

> Following on this, could we also use a small index global to pass the portable entry point? Then we would not need the interpreter->managed stub to invoke the method with an extra unused argument.

It doesn't seem necessary to optimize for the stub case: stubs will be a very small proportion of the overall code, while every managed call site would need to be at least 2 bytes bigger (for `global.set`). Hardcoding global indices also costs flexibility.

@yowl
Contributor

yowl commented Jan 8, 2026

How will a global.set work when threads are a thing?

@kg
Member

kg commented Jan 8, 2026

> How will a global.set work when threads are a thing?

Aren't globals functionally thread-local in wasm? Is it different for wasi?

@SingleAccretion
Contributor

SingleAccretion commented Jan 8, 2026

> Aren't globals functionally thread-local in wasm? Is it different for wasi?

I think @yowl is referring to the complexities of this: https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md#thread-local-storage-tls.

Without instance-per-thread, things get a bit expensive for globals. I don't know what the current expected implementation strategy is. I know it (__stack_pointer access cost) was brought up in the shared-everything proposal evolution history and they discussed things like giving a special TLS slot just for the stack pointer. This would need more research.

Edit: found some discussion about this - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2025/THREADS-03-04.md. This also reminds us that an imported global requires two indirections instead of one.

Another edit: more - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2024/THREADS-07-09.md.

@kg
Member

kg commented Jan 8, 2026

> > Aren't globals functionally thread-local in wasm? Is it different for wasi?
>
> I think @yowl is referring to the complexities of this: https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md#thread-local-storage-tls.
>
> Without instance-per-thread, things get a bit expensive for globals. I don't know what the current expected implementation strategy is. I know it (__stack_pointer access cost) was brought up in the shared-everything proposal evolution history and they discussed things like giving a special TLS slot just for the stack pointer. This would need more research.
>
> Edit: found some discussion about this - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2025/THREADS-03-04.md. This also reminds us that an imported global requires two indirections instead of one.

This also appears to mention that importing a global turns it into two loads instead of one, which is a bit of a problem. But I think we have to import sp no matter what and it's just a question of how often we're going to touch it.

@pavelsavara pavelsavara added the arch-wasm WebAssembly architecture label Jan 8, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to 'arch-wasm': @lewing, @pavelsavara
See info in area-owners.md if you want to be subscribed.


## Incoming argument ABI

The linear stack pointer is the first argument to all methods. At a native->managed transition it is the value of the `$__stack_pointer` global. This global is not updated within managed code, but is updated at managed->native boundaries. Within the method the stack pointer always points at the bottom (lowest address) of the stack; generally this is a fixed offset from the value the stack pointer held on entry, except in methods that can do dynamic allocation.
Member

For PInvokes, I expect that this update will be done around the callsite in the managed code.

Where do you expect it to be done for FCalls? We assume that FCalls have the same managed calling convention. Can this update be done inside the FCall macro somehow, so that FCalls can continue to have managed calling convention?

If we are not able to do that, I guess we will need to create some sort of FCalls wrappers. It is doable, but it is not pretty - we have been there in the past.

For reference, what does native AOT / LLVM do for FCalls currently?

Member Author

For NAOT/LLVM it looks like FCalls have an extra initial arg that they ignore:

https://github.com/dotnet/runtimelab/blob/7706cd182716062d4fa550e88abd004e1a82dcd5/src/coreclr/nativeaot/Runtime/MathHelpers.cpp#L12

I don't see where the stack pointer global is updated; maybe NAOT/LLVM doesn't need this?

Contributor

> For reference, what does native AOT / LLVM do for FCalls currently?

NAOT-LLVM shadow stack is allocated separately from the __stack_pointer stack, so we only need to track it for the purposes of transition frames with virtual unwinding (another way of putting it is that it is only used for managed code).

> Can this update be done inside the FCall macro somehow, so that FCalls can continue to have managed calling convention?

I don't know if it is possible with __attribute__((naked)) trickery to do it all in one function, but it is definitely possible with __asm to insert a stub with the managed calling convention (that'd do global.set __stack_pointer) into FCIMPL.


The prolog will increment the stack pointer, home any arguments that are stored on the linear stack, and zero initialize slots on the linear stack as appropriate. It will establish a frame pointer if one is needed.

It will also save a frame descriptor onto the stack, for use during GC and EH. For methods with EH or GC, a slot on the linear stack will be reserved for a "virtual IP" that will index into the EH and GC info to provide within-method information and allow external code to walk the managed stack frames.
Member

> For methods with EH or GC

It is not clear what a "method with GC" means. Should this say "for methods with calls or EH" (i.e. non-leaf methods)?

Member Author

Yes, methods with GC safe points (which will always be at calls) or EH.

Initially the cell will contain code to determine if the target method has R2R code or must be interpreted. If there is R2R code for the method it is fixed up as needed. Once the target is resolved the cell can be updated to just refer to the R2R code directly, if there is any, or to a thunk for invoking the interpreter.

For indirect managed calls the sequence is similar, but the portable entry point is obtained by calling a resolve helper:
Member

I think this should say "virtual managed calls". Indirect calls (calli) should get the portable-entry-point to call from IL stack, no need to call resolve helper.

Also, there is a potential optimization for vtable-based virtual calls to just fetch the entrypoint by indexing into vtable like we do everywhere else.
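
The lazily resolved cell described in the quoted text can be modelled roughly as follows. This is a sketch under stated assumptions: `EntryPointCell`, `r2r_bodies`, and `interpret` are hypothetical names, and the stack pointer and the trailing portable-entry-point argument from the proposed ABI are left out to keep the example short.

```python
class EntryPointCell:
    """Hypothetical model of the cell behind one target method."""

    def __init__(self, name, r2r_bodies, interpret):
        self.name = name
        self._r2r_bodies = r2r_bodies
        self._interpret = interpret
        self.target = self._resolve_and_call  # initial state: resolver code

    def _resolve_and_call(self, *args):
        body = self._r2r_bodies.get(self.name)
        if body is not None:
            self.target = body  # refer to the R2R code directly
        else:
            # No R2R body: install a thunk that invokes the interpreter.
            self.target = lambda *a: self._interpret(self.name, *a)
        return self.target(*args)

    def __call__(self, *args):
        return self.target(*args)  # every call goes through the cell

r2r_bodies = {"Add": lambda a, b: a + b}

def interpret(name, *args):
    return f"interpreted {name}{args}"

add = EntryPointCell("Add", r2r_bodies, interpret)
mul = EntryPointCell("Mul", r2r_bodies, interpret)
print(add(2, 3))  # 5
print(mul(2, 3))  # interpreted Mul(2, 3)
```

After the first call each cell no longer runs the resolver, which mirrors the "once the target is resolved the cell can be updated" step.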

@AndyAyersMS
Member Author

Some crude estimates of the size costs of an SP arg vs maintaining the global SP.

Global SP always in sync

;; PROLOG

global.get $__stack_pointer    ;; (2, if can get a small index), else 6.
i32.const FRAMESIZE            ;; (2 typically)
i32.sub                        ;; 1
dup                            ;; 1 (pseudo-op: Wasm has no dup; local.tee would stand in)
global.set $__stack_pointer    ;; (2/6)
local.set sp                   ;; 2

;; EPILOG

local.get sp                   ;; 2
i32.const FRAMESIZE            ;; 2ish
i32.add                        ;; 1
global.set $__stack_pointer    ;; 2/6

So 10/18 bytes per prolog, 7/11 bytes per epilog

No overhead at call sites. Smaller signatures.
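
The per-instruction byte counts above follow from the Wasm binary encoding: one opcode byte plus a LEB128 immediate. The Python sketch below recomputes them. It assumes the sp global and sp local both have small indices, uses a 5-byte padded index to model the linker-patchable case, and counts the `dup` placeholder as one byte as in the listing. The function names are illustrative only.

```python
def uleb128_len(value):
    """Bytes needed to encode a non-negative integer as unsigned LEB128."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n

def sleb128_len(value):
    """Bytes needed to encode an integer as signed LEB128."""
    n = 1
    while not (-64 <= value < 64):
        value >>= 7
        n += 1
    return n

PADDED_LEB = 5  # fixed-width index that can be patched in place

def index_op(index, padded=False):
    """Size of an opcode followed by an index (local.get, global.set, ...)."""
    return 1 + (PADDED_LEB if padded else uleb128_len(index))

def i32_const(value):
    """Size of i32.const with a signed LEB128 immediate."""
    return 1 + sleb128_len(value)

def in_sync_prolog(frame_size, sp_global=0, sp_local=0, padded=False):
    return (index_op(sp_global, padded)    # global.get $__stack_pointer
            + i32_const(frame_size)        # i32.const FRAMESIZE
            + 1                            # i32.sub
            + 1                            # "dup", counted as one byte
            + index_op(sp_global, padded)  # global.set $__stack_pointer
            + index_op(sp_local))          # local.set sp

def in_sync_epilog(frame_size, sp_global=0, sp_local=0, padded=False):
    return (index_op(sp_local)             # local.get sp
            + i32_const(frame_size)        # i32.const FRAMESIZE
            + 1                            # i32.add
            + index_op(sp_global, padded)) # global.set $__stack_pointer

print(in_sync_prolog(32), in_sync_prolog(32, padded=True))  # 10 18
print(in_sync_epilog(32), in_sync_epilog(32, padded=True))  # 7 11
```

Frame sizes of 64 bytes or more need a second immediate byte for `i32.const`, which is why the listing hedges that count as "typically" 2.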

Global SP lazy sync at boundaries

;; PROLOG

local.get sp                  ;; 2
i32.const FRAMESIZE           ;; 2
i32.sub                       ;; 1
local.set sp                  ;; 2

;; EPILOG

(empty)

So 7 bytes per prolog, 0 per epilog

;; unmanaged call sites & fcalls

local.get  sp                 ;; 2
global.set $__stack_pointer   ;; 2/6   (~0 amortizable for fcalls)

;; managed call sites

local.get  sp                 ;; 2

For the current x86 crossgen SPMI collection (which may not represent the set of methods we care about) there are 273829 methods / prologs, 312716 epilogs, 1504289 managed call sites, and 163641 helper call sites.

So assuming the optimistic case where we can encode the global SP index in one byte and wrap FCalls with a global SP update (size assumed negligible) I get size estimates like:

global SP   in sync: 5.581M bytes  (~20 bytes/method)
global SP lazy sync: 4.925M bytes  (~18 bytes/method)

If we can't get a small index for the SP global then the "sync" cost rises to 8.3M (~30 bytes/method).

This may overstate the difference somewhat. For leaf methods we generally won't need these SP sequences... I haven't tried to account for those yet.
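
For anyone who wants to vary the inputs, the arithmetic can be sketched in Python from the counts and per-site byte costs quoted above. The function names are illustrative. With these inputs the lazy-sync total reproduces the quoted 4.925M. The in-sync prolog and epilog terms alone come to about 4.93M, so the quoted 5.581M presumably includes a further term; it matches adding 4 bytes per helper call site, but that is an inference and not something stated in the comment.

```python
# Counts from the x86 crossgen SPMI collection quoted above.
METHODS = 273_829        # one prolog per method
EPILOGS = 312_716
MANAGED_CALLS = 1_504_289
HELPER_CALLS = 163_641

def in_sync(prolog_bytes, epilog_bytes, helper_call_bytes=0):
    """Global SP kept in sync: cost sits in prologs and epilogs."""
    return (METHODS * prolog_bytes
            + EPILOGS * epilog_bytes
            + HELPER_CALLS * helper_call_bytes)

def lazy_sync(prolog_bytes=7, managed_call_bytes=2, helper_call_bytes=0):
    """SP passed as an argument: cost sits in prologs and call sites."""
    return (METHODS * prolog_bytes
            + MANAGED_CALLS * managed_call_bytes
            + HELPER_CALLS * helper_call_bytes)

print(lazy_sync())                        # 4925381
print(in_sync(10, 7))                     # 4927302
print(in_sync(10, 7, helper_call_bytes=4))  # 5581866
print(in_sync(18, 11))                    # 8368798
```

Dividing each total by `METHODS` gives the bytes-per-method figures, for example `lazy_sync() / METHODS` is roughly 18.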

@SingleAccretion
Contributor

SingleAccretion commented Jan 9, 2026

> Some crude estimates of the size costs of an SP arg vs maintaining the global SP.

This matches my calculations based on NAOT-LLVM data from last year almost perfectly: "50%" code size for the large index encoding, about the same for the small encoding.

> For leaf methods we generally won't need these SP sequences... I haven't tried to account for those yet.

The proportion of truly leaf methods is going to be rather low. It was on the order of < 5% on real-world NAOT-LLVM data (this is from dotnet/runtimelab#2697). This is because most methods may throw (an NRE), requiring a helper call.

@jkotas
Member

jkotas commented Jan 9, 2026

> This is because most methods may throw (an NRE), requiring a helper call.

Was this with the assumption that `this` can be null? It is a gray area. `this` can never be null in C#. I think it would be ok to assume that `this` is never null for wasm.
