[Wasm RyuJIT] Initial writeup on the calling convention #122988
base: main
Describe the calling convention used by R2R (and perhaps someday jitted) Wasm code.
|
PTAL @dotnet/wasm-contrib... this is a first draft so please comment / help fix. |
Pull request overview
This PR adds comprehensive documentation for the WebAssembly calling convention used by R2R (Ready-to-Run) and JIT-compiled managed code in the CLR. The documentation describes how the runtime interoperates with WebAssembly's stack model, calling sequences, and garbage collection integration.
Key Changes:
- Adds a new "Web Assembly ABI (R2R and JIT)" section to the CLR ABI documentation
- Documents stack layout, argument passing conventions, prolog/epilog behavior, and calling sequences
- Explains GC reference handling at call sites and the portable entry point mechanism
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
|
||
| The prolog will increment the stack pointer, home any arguments that are stored on the linear stack, and zero initialize slots on the linear stack as appropriate. It will establish a frame pointer if one is needed. | ||
|
|
||
| It will also save a frame descriptor onto the stack, for use during GC and EH. For methods with EH or GC, a slot on the linear stack will be reserved for a "virtual IP" that will index into the EH and GC info to provide within-method information and allow external code to walk the managed stack frames. |
Frame descriptor is GCInfo, yes?
It will also refer to the EH info, so a bit more general.
Don't we need the stack frames to be linked together in order to support walking managed frames? So, from the sp value in the current frame, shouldn't we be able to obtain the sp of the parent frame in order to get the descriptor information, etc.? I guess the plan would be to fetch the previous sp from sp[-1]? Would there be methods where sp can be dynamically incremented, in which case this wouldn't work?
Given the last frame's base address, the plan is that the frames are self-descriptive.
To get the base address (using the scheme above where sp can diverge from `$__stack_pointer`), we rely on the fact that stack walks can only happen once the R2R code has called back into native code (either helper methods or the interpreter). These calls are passed sp as an argument and save it to the global `$__stack_pointer`, and perhaps to some other global or similar location for easy access by the unwinder.
The frame descriptor will be at a known offset from this saved sp (likely 0) and the size of the frame will be stored in the descriptor, so the external code can compute the address of the parent frame that way, e.g. `parent_sp = sp + sp[0].frameSize`.
For dynamically sized frames, a copy of the prior sp can likewise be stored at some other known offset from sp to provide the necessary chaining. If the frame grows, this value can be re-established to reflect the new size. Or we can equivalently store the total frame size.
If we follow Katelyn's proposal of keeping $__stack_pointer in sync for all managed methods then there's a bit less ceremony required, but from there the unwinding proceeds the same way.
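The self-describing frame walk described in this thread can be sketched as a small simulation. This is only an illustration under stated assumptions: linear memory is modeled as a `bytearray`, the descriptor is reduced to a single 4-byte frame size stored at offset 0 from sp, and the names `push_frame` and `walk_frames` are hypothetical, not part of the runtime.

```python
import struct

# Assumed layout for illustration: a 4-byte little-endian frame size lives at
# offset 0 from each frame's sp, so the parent frame starts at sp + frame_size.
FRAME_SIZE_FMT = "<I"


def push_frame(memory, sp, frame_size):
    """Simulate a prolog: move sp to a lower address and store the frame size there."""
    new_sp = sp - frame_size
    struct.pack_into(FRAME_SIZE_FMT, memory, new_sp, frame_size)
    return new_sp


def walk_frames(memory, sp, stack_base):
    """Yield the sp of each frame, starting at the innermost frame."""
    while sp < stack_base:
        yield sp
        (frame_size,) = struct.unpack_from(FRAME_SIZE_FMT, memory, sp)
        if frame_size == 0:
            raise ValueError("corrupt frame: size 0 would never terminate the walk")
        sp += frame_size  # parent_sp = sp + sp[0].frameSize


memory = bytearray(256)      # stand-in for Wasm linear memory
stack_base = len(memory)     # the stack grows toward lower addresses
sp = stack_base
for size in (32, 48, 16):    # three nested calls with different frame sizes
    sp = push_frame(memory, sp, size)

frames = list(walk_frames(memory, sp, stack_base))
print(frames)
```

A dynamically sized frame fits the same loop as long as the stored size (or a stored copy of the prior sp) is rewritten whenever the frame grows.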
|
|
||
| ## Incoming argument ABI | ||
|
|
||
| The linear stack pointer is the first argument to all methods. At a native->managed transition it is the value of the `$__stack_pointer` global. This global is not updated within managed code, but is updated at managed->native boundaries. Within the method the stack pointer always points at the bottom (lowest address) of the stack; generally this is a fixed offset from the value the stack pointer held on entry, except in methods that can do dynamic allocation. |
This means helper calls all become managed->native boundaries that require a stack pointer update at start and end, right? Is that a problem?
There is a code size cost for each helper call. It could be amortized with a custom wrapper for each helper that just does the `$__stack_pointer` maintenance.
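The wrapper idea can be sketched as follows. This is a simulation, not runtime code: the dictionary `wasm_globals` stands in for the `$__stack_pointer` global, `with_sp_sync` and `helper_observe_sp` are hypothetical names, and restoring the previous value on exit is an assumption about what the "update at end" would do.

```python
# Stand-in for the module's globals; only the stack pointer is modeled.
wasm_globals = {"__stack_pointer": 1024}


def with_sp_sync(helper):
    """Wrap a native helper so the sp argument is published to the global around the call."""
    def wrapper(sp, *args):
        saved = wasm_globals["__stack_pointer"]
        wasm_globals["__stack_pointer"] = sp          # update at the managed->native boundary
        try:
            return helper(*args)                      # the helper itself takes no sp argument
        finally:
            wasm_globals["__stack_pointer"] = saved   # assumed: restore on the way back
    return wrapper


@with_sp_sync
def helper_observe_sp():
    """A toy helper that reports the stack pointer it sees in the global."""
    return wasm_globals["__stack_pointer"]


seen = helper_observe_sp(960)   # managed code passes its current sp as the first argument
print(seen, wasm_globals["__stack_pointer"])
```

With one wrapper per helper, the global update is emitted once per helper rather than once per call site, which is the amortization mentioned above.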
|
|
||
| ## Outgoing call ABI | ||
|
|
||
| For direct managed calls, Wasm uses the Portable Entry Point feature to facilitate smooth interop with interpreted code. This means all managed calls are made indirectly, and the portable entry point is also passed as the last argument. |
In the long run will we optimize this for cases where we know both the caller and callee were crossgen'd? I'm fine with not specifying that yet though.
Yes we can optimize that case. Though in general there is no guarantee the runtime will use R2R compiled callee code.
On Wasm this may be less of an issue because the cases where R2R method bodies end up being disqualified may not be possible.
> Yes we can optimize that case

Yes.

> Though in general there is no guarantee the runtime will use R2R compiled callee code.

The fixups for the caller would have to verify that the directly called method is going to use R2R too. If the fixup fails, R2R code for the caller would have to be rejected as well.
(We do something similar for ReJIT. If there is a ReJIT request for a method that got inlined, all methods that inlined it must be invalidated as well.)
|
This is probably a good time to raise that I think we shouldn't pass the stack pointer in an argument.
Reasons against it:
Given that we're not linking I view the sp argument thing as a Potential Optimization and I feel like it's premature. Single or Yowl may have a good reason why we should still do it though based on their experiences. I believe what we would do instead is export the stack_pointer from the runtime's wasm module (this may already be the default behavior for emscripten, to enable dynamic linking?) then grab the WebAssembly.Global for the stack_pointer and import it into every r2r module we load. Then all our code can manipulate the linear stack the same way clang generated code does. |
Following on this, could we also use a small index global to pass the portable entry point? Then we would not need the interpreter->managed stub to invoke the method with an extra unused argument. |
It doesn't seem necessary to optimize for the stub case? Since stubs will be a very small proportion of the overall code, and every managed callsite will need to be made at least 2 bytes bigger (for |
|
How will a |
Aren't globals functionally thread-local in wasm? Is it different for wasi? |
I think @yowl is referring to the complexities of this: https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md#thread-local-storage-tls. Without instance-per-thread, things get a bit expensive for globals. I don't know what the current expected implementation strategy is. I know it ( Edit: found some discussion about this - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2025/THREADS-03-04.md. This also reminds us that an imported global requires two indirections instead of one. Another edit: more - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2024/THREADS-07-09.md. |
This also appears to mention that importing a global turns it into two loads instead of one, which is a bit of a problem. But I think we have to import sp no matter what and it's just a question of how often we're going to touch it. |
|
Tagging subscribers to 'arch-wasm': @lewing, @pavelsavara |
|
|
||
| ## Incoming argument ABI | ||
|
|
||
| The linear stack pointer is the first argument to all methods. At a native->managed transition it is the value of the `$__stack_pointer` global. This global is not updated within managed code, but is updated at managed->native boundaries. Within the method the stack pointer always points at the bottom (lowest address) of the stack; generally this is a fixed offset from the value the stack pointer held on entry, except in methods that can do dynamic allocation. |
For PInvokes, I expect that this update will be done around the callsite in the managed code.
Where do you expect it to be done for FCalls? We assume that FCalls have the same managed calling convention. Can this update be done inside the FCall macro somehow, so that FCalls can continue to have managed calling convention?
If we are not able to do that, I guess we will need to create some sort of FCalls wrappers. It is doable, but it is not pretty - we have been there in the past.
For reference, what does native AOT / LLVM do for FCalls currently?
For NAOT/LLVM it looks like FCalls have an extra initial arg that they ignore:
I don't see where the stack pointer global is updated; maybe NAOT/LLVM doesn't need this?
> For reference, what does native AOT / LLVM do for FCalls currently?
NAOT-LLVM shadow stack is allocated separately from the __stack_pointer stack, so we only need to track it for the purposes of transition frames with virtual unwinding (another way of putting it is that it is only used for managed code).
> Can this update be done inside the FCall macro somehow, so that FCalls can continue to have managed calling convention?
I don't know if it is possible with __attribute__((naked)) trickery to do it all in one function, but it is definitely possible with __asm to insert a stub with the managed calling convention (that'd do global.set __stack_pointer) into FCIMPL.
|
|
||
| The prolog will increment the stack pointer, home any arguments that are stored on the linear stack, and zero initialize slots on the linear stack as appropriate. It will establish a frame pointer if one is needed. | ||
|
|
||
| It will also save a frame descriptor onto the stack, for use during GC and EH. For methods with EH or GC, a slot on the linear stack will be reserved for a "virtual IP" that will index into the EH and GC info to provide within-method information and allow external code to walk the managed stack frames. |
> For methods with EH or GC
It is not clear what a "method with GC" means. Should this say for methods with calls or EH (ie non-leaf methods)?
Yes, methods with GC safe points (which will always be at calls) or EH.
| ``` | ||
| Initially the cell will contain code to determine if the target method has R2R code or must be interpreted. If there is R2R code for the method it is fixed up as needed. Once the target is resolved the cell can be updated to just refer to the R2R code directly, if there is any, or to a thunk for invoking the interpreter. | ||
|
|
||
| For indirect managed calls the sequence is similar, but the portable entry point is obtained by calling a resolve helper: |
I think this should say "virtual managed calls". Indirect calls (calli) should get the portable-entry-point to call from IL stack, no need to call resolve helper.
Also, there is a potential optimization for vtable-based virtual calls to just fetch the entrypoint by indexing into vtable like we do everywhere else.
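The lazily resolved call cell from the draft text can be sketched as a simulation. Everything here is illustrative: `CallCell`, `call_through`, `add_r2r`, and `interpret` are hypothetical names, the R2R code is modeled as a dictionary of Python functions, and the cell object itself plays the role of the portable entry point that is passed as the last argument.

```python
class CallCell:
    """A call cell whose target starts as a resolver and is patched after the first call."""

    def __init__(self, method_name, r2r_bodies, interpreter):
        self._method_name = method_name
        self._r2r_bodies = r2r_bodies
        self._interpreter = interpreter
        self.target = self._resolve_and_call   # initial contents: resolution code

    def _resolve_and_call(self, sp, *args):
        body = self._r2r_bodies.get(self._method_name)
        if body is not None:
            self.target = body                 # refer to the R2R code directly
        else:
            name, interp = self._method_name, self._interpreter
            self.target = lambda sp, *a: interp(name, sp, *a)   # interpreter thunk
        return self.target(sp, *args)


def call_through(cell, sp, *args):
    """Managed call sequence: indirect call, entry point passed as the last argument."""
    return cell.target(sp, *args, cell)


def add_r2r(sp, a, b, entry_point):
    """Toy precompiled body; it ignores sp and the entry point."""
    return a + b


def interpret(name, sp, *args):
    """Toy interpreter; it drops the trailing entry point argument."""
    return ("interpreted", name, args[:-1])


bodies = {"Add": add_r2r}
add_cell = CallCell("Add", bodies, interpret)
mul_cell = CallCell("Mul", bodies, interpret)

first = call_through(add_cell, 512, 2, 3)     # resolves, patches the cell, then calls
second = call_through(add_cell, 512, 4, 5)    # goes straight to the patched target
fallback = call_through(mul_cell, 512, 2, 3)  # no R2R body, so the interpreter runs
print(first, second, fallback)
```

After the first call the cell no longer runs the resolver, so the steady-state cost is the indirect call plus the extra argument.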
|
Some crude estimates of the size costs of an SP arg vs maintaining the global SP.

Global SP always in sync

So 10/18 bytes per prolog, 7/11 bytes per epilog. No overhead at call sites. Smaller signatures.

Global SP lazy sync at boundaries

For the current x86 crossgen SPMI collection (which may not represent the set of methods we care about) there are 273829 methods / prologs, 312716 epilogs, 1504289 managed call sites, and 163641 helper call sites. So assuming the optimistic case where we can encode the global SP index in one byte and wrap FCalls with a global SP update (size assumed negligible) I get size estimates like:

If we can't get a small index for the SP global then the "sync" cost rises to 8.3M (~30 bytes/method). This may overstate the difference somewhat. For leaf methods we generally won't need these SP sequences... I haven't tried to account for those yet.
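The "always in sync" totals can be reproduced from the counts quoted in the comment above. This sketch assumes that "10/18" and "7/11" mean bytes with a small versus a large global index encoding; that reading is an assumption, although it reproduces the quoted 8.3M and ~30 bytes/method figures. The "lazy sync" side is not computed because its per-site byte costs are not present in the text.

```python
# Counts quoted from the x86 crossgen SPMI collection in the comment above.
PROLOGS = 273829    # one per method
EPILOGS = 312716

# Assumed reading of "10/18 bytes per prolog, 7/11 bytes per epilog":
# (prolog bytes, epilog bytes) for each global index encoding.
BYTES = {"small_index": (10, 7), "large_index": (18, 11)}


def sync_cost(prolog_bytes, epilog_bytes):
    """Total bytes spent keeping the global SP in sync in every prolog and epilog."""
    return PROLOGS * prolog_bytes + EPILOGS * epilog_bytes


totals = {name: sync_cost(*costs) for name, costs in BYTES.items()}
per_method = {name: total / PROLOGS for name, total in totals.items()}

for name in BYTES:
    print(f"{name}: {totals[name]} bytes total, {per_method[name]:.1f} bytes/method")
```

Under these assumptions the large-index case comes to about 8.37M bytes, and the small-index case to about 4.93M bytes (roughly 18 bytes per method).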
This matches my calculations based on NAOT-LLVM data from last year almost perfectly: "50%" code size for the large index encoding, about the same for the small encoding.
The proportion of truly leaf methods is going to be rather low. It was on the order of < 5% on real-world NAOT-LLVM data (this is from dotnet/runtimelab#2697). This is because most methods may throw (an NRE), requiring a helper call. |
Was this with assumption that |