[NativeAOT] Evaluate use/benefits of compact unwinding on osx-x64 and osx-arm64 #76371
For reference, here's what the custom prolog unwinding code would look like in C. First, we would need to detect that we are in the prolog:
#if defined(TARGET_AMD64) && defined(TARGET_OSX)
    // Compact unwinding on macOS cannot properly handle unwinding the function prolog
    // so we have to handle it explicitly
    if ((PTR_UInt8)pRegisterSet->IP < (PTR_UInt8)pNativeMethodInfo->pMethodStartAddress + decoder.GetPrologSize())
    {
        return UnwindProlog(pMethodInfo, pRegisterSet, ppvRetAddrLocation);
    }
#endif
...and we would need a couple of definitions/macros (shared with existing
#ifdef TARGET_AMD64
#define SIZE64_PREFIX 0x48
#define ADD_IMM8_OP 0x83
#define ADD_IMM32_OP 0x81
#define JMP_IMM8_OP 0xeb
#define JMP_IMM32_OP 0xe9
#define JMP_IND_OP 0xff
#define LEA_OP 0x8d
#define REPNE_PREFIX 0xf2
#define REP_PREFIX 0xf3
#define POP_OP 0x58
#define PUSH_OP 0x50
#define RET_OP 0xc3
#define RET_OP_2 0xc2
#define INT3_OP 0xcc
#define IS_REX_PREFIX(x) (((x) & 0xf0) == 0x40)
#endif
...and finally the unwinding method:
bool UnixNativeCodeManager::UnwindProlog(MethodInfo * pMethodInfo,
                                         REGDISPLAY * pRegisterSet,
                                         PTR_PTR_VOID * ppvRetAddrLocation)
{
#if defined(TARGET_AMD64)
    UnixNativeMethodInfo* pNativeMethodInfo = (UnixNativeMethodInfo*)pMethodInfo;
    uint8_t* pNextByte = (uint8_t*)pNativeMethodInfo->pMethodStartAddress;
    uint32_t stackOffset = 0;

    // Walk the prolog from the method start up to the current IP and add up
    // how far each recognized instruction has moved RSP.
    while (pNextByte < (uint8_t*)pRegisterSet->IP)
    {
        if ((pNextByte[0] & 0xf8) == PUSH_OP)
        {
            // push REG
            stackOffset += 8;
            pNextByte += 1;
        }
        else if (IS_REX_PREFIX(pNextByte[0]) && ((pNextByte[1] & 0xf8) == PUSH_OP))
        {
            // push REG (with REX prefix, e.g. r8-r15)
            stackOffset += 8;
            pNextByte += 2;
        }
        else if ((pNextByte[0] & 0xf8) == SIZE64_PREFIX &&
                 pNextByte[1] == ADD_IMM8_OP &&
                 pNextByte[2] == 0xec)
        {
            // sub rsp, imm8 (opcode 0x83 with ModRM 0xec selects SUB on RSP)
            stackOffset += pNextByte[3];
            pNextByte += 4;
        }
        else if ((pNextByte[0] & 0xf8) == SIZE64_PREFIX &&
                 pNextByte[1] == ADD_IMM32_OP &&
                 pNextByte[2] == 0xec)
        {
            // sub rsp, imm32 (little-endian immediate)
            stackOffset +=
                (uint32_t)pNextByte[3] |
                ((uint32_t)pNextByte[4] << 8) |
                ((uint32_t)pNextByte[5] << 16) |
                ((uint32_t)pNextByte[6] << 24);
            pNextByte += 7;
        }
        else
        {
            // Bail out for anything that we cannot handle. This could be a breakpoint
            // (int 3) inserted by a debugger, or some more complicated prolog pattern
            // like the stack probing:
            //
            // lea r11, [rsp-XXX]
            // call __chkstk
            // mov rsp, r11
            //
            // Additionally, these sequences may establish the prolog frame but we don't
            // need to handle them since they are always the last instruction of the
            // prolog and thus regular unwinding should work:
            //
            // lea rbp, [rsp+IMM8]
            // lea rbp, [rsp+IMM32]
            return false;
        }
    }

    // The return address sits just above everything the prolog has pushed/allocated.
    *ppvRetAddrLocation = (PTR_PTR_VOID)(pRegisterSet->GetSP() + stackOffset);
    return true;
#else
    PORTABILITY_ASSERT("UnwindProlog");
#endif
} |
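To try the prolog walk outside the runtime, the same byte matching can be pulled into a standalone function. This is only a sketch: `prolog_stack_offset` and its `codeSize` bounds check are hypothetical names and additions of mine, not runtime APIs, and it handles only the `push REG` / `sub rsp, imm` patterns from the snippet above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIZE64_PREFIX 0x48
#define ADD_IMM8_OP   0x83
#define ADD_IMM32_OP  0x81
#define PUSH_OP       0x50
#define IS_REX_PREFIX(x) (((x) & 0xf0) == 0x40)

/* Hypothetical helper (not a runtime API): walks code[0..ipOffset) and reports
   how many bytes the prolog has pushed or allocated before ipOffset.
   Returns false on an unrecognized or truncated instruction. */
static bool prolog_stack_offset(const uint8_t *code, size_t codeSize,
                                size_t ipOffset, uint32_t *stackOffset)
{
    assert(code != NULL && stackOffset != NULL && ipOffset <= codeSize);

    size_t i = 0;
    uint32_t offset = 0;
    while (i < ipOffset)
    {
        size_t left = codeSize - i; /* bytes available from position i */

        if ((code[i] & 0xf8) == PUSH_OP)
        {
            /* push REG */
            offset += 8;
            i += 1;
        }
        else if (left >= 2 && IS_REX_PREFIX(code[i]) &&
                 (code[i + 1] & 0xf8) == PUSH_OP)
        {
            /* push REG with a REX prefix (r8-r15) */
            offset += 8;
            i += 2;
        }
        else if (left >= 4 && (code[i] & 0xf8) == SIZE64_PREFIX &&
                 code[i + 1] == ADD_IMM8_OP && code[i + 2] == 0xec)
        {
            /* sub rsp, imm8 */
            offset += code[i + 3];
            i += 4;
        }
        else if (left >= 7 && (code[i] & 0xf8) == SIZE64_PREFIX &&
                 code[i + 1] == ADD_IMM32_OP && code[i + 2] == 0xec)
        {
            /* sub rsp, imm32 (little-endian immediate) */
            offset += (uint32_t)code[i + 3] |
                      ((uint32_t)code[i + 4] << 8) |
                      ((uint32_t)code[i + 5] << 16) |
                      ((uint32_t)code[i + 6] << 24);
            i += 7;
        }
        else
        {
            /* Anything else (int 3, stack probe, lea rbp, ...): give up. */
            return false;
        }
    }

    *stackOffset = offset;
    return true;
}
```

For example, feeding it the bytes `55 41 57 53 48 83 ec 28 48 8d 6c 24 40` (`push rbp; push r15; push rbx; sub rsp, 0x28; lea rbp, [rsp+0x40]`) with `ipOffset` just past the `sub` should report the return address 64 bytes above RSP, while an `ipOffset` past the `lea` makes the walk bail out, which matches the intent of leaving that case to regular unwinding.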
I looked into prototyping this on ARM64 Apple platforms: https://github.com/filipnavara/runtime/pull/new/arm64-compact-unwind The branch is on top of the frameless prototype from issue #35274 (comment); only the last commit contains the JIT and ObjWriter changes relevant to this PR. While the two changes are somewhat orthogonal, I also implemented a more generic algorithm for computing the compact unwinding code, and the frameless methods provided additional test cases. The rough overview of the changes:
Challenges:
So, how well does it perform? To give you an idea of how big the difference is, I recompiled an empty .NET MAUI app with the above changes. The baseline was compiled with frameless methods, which saves around 27Kb of code size compared to The code size increase was extremely disproportionate. For example, method |
One more thought - the code size increase could likely be mitigated by enabling support for double-aligned frames, i.e. accessing locals through SP where possible. That would be a larger change though.
|
Thanks for looking into this! Cc @VSadov since he knows about unwinding |
I have a slightly more refined version of the prototype: https://github.com/filipnavara/runtime/tree/arm64-compact-unwind-1. I managed to mitigate most of the code size increase (aside from the +4 bytes for prologs with temporaries/locals and other code size changes related to alignment). It turns out ARM32 already has the optimization for turning FP-based offsets into SP-based offsets in The conservative condition is enabling the Apple-style prologs for all methods with With these tweaks the stats for
We could further tweak the heuristic to opt smaller methods with exception handling into the Apple prologs. This can likely save another 10% in the size of the DWARF unwinding data, but it's a more nuanced heuristic to get right. |
Latest branch: https://github.com/dotnet/runtime/compare/main...filipnavara:arm64-compact-unwind-3?expand=1 It passed the CI. |
Apple platforms use compact unwinding information to efficiently encode how to unwind the stack. The compact unwinding information is smaller than the DWARF CFI information that NativeAOT currently uses on macOS and Linux, but it does not encode enough information to do asynchronous unwinding in the prolog/epilog of functions. The benefit of using the compact unwinding codes would be smaller resulting binaries.
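To give a feel for why the compact format is small: each function gets a single 32-bit encoding instead of a CFI program. Below is a rough sketch for ARM64. The constant values are reproduced from memory of Apple's `compact_unwind_encoding.h` and should be checked against the header; `arm64_compact_encoding` is a hypothetical helper, not something ILCompiler or ObjWriter provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, recalled from Apple's compact_unwind_encoding.h; verify
   against the real header before relying on them. */
#define UNWIND_ARM64_MODE_FRAMELESS            0x02000000u
#define UNWIND_ARM64_MODE_DWARF                0x03000000u
#define UNWIND_ARM64_MODE_FRAME                0x04000000u
#define UNWIND_ARM64_FRAME_X19_X20_PAIR        0x00000001u
#define UNWIND_ARM64_FRAME_X21_X22_PAIR        0x00000002u
#define UNWIND_ARM64_FRAMELESS_STACK_SIZE_MASK 0x00FFF000u

/* Hypothetical helper: picks a 32-bit compact encoding for one function.
   savedPairs is a bitmask of UNWIND_ARM64_FRAME_*_PAIR flags; stackSize is
   only used for frameless functions. Falls back to "use DWARF" when the
   function shape cannot be expressed. */
static uint32_t arm64_compact_encoding(bool hasFrame, uint32_t savedPairs,
                                       uint32_t stackSize)
{
    /* Pair flags live in the low 12 bits (x19-x28 pairs, d8-d15 pairs). */
    assert((savedPairs & ~0x00000F1Fu) == 0);

    if (hasFrame)
    {
        /* Standard FP/LR frame: mode plus the saved register pairs. */
        return UNWIND_ARM64_MODE_FRAME | savedPairs;
    }

    /* Frameless: the stack size is stored in 16-byte units in bits 12-23. */
    uint32_t units = stackSize / 16;
    if (stackSize % 16 != 0 ||
        units > (UNWIND_ARM64_FRAMELESS_STACK_SIZE_MASK >> 12))
    {
        return UNWIND_ARM64_MODE_DWARF;
    }
    return UNWIND_ARM64_MODE_FRAMELESS | (units << 12) | savedPairs;
}
```

Note that nothing in the encoding describes intermediate prolog or epilog states, which is why those would have to be handled by code inspection.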
Upon investigation I found that ILCompiler already emits the DWARF CFI only for prologs and not for epilogs. UnixNativeCodeManager handles the epilogs by doing code inspection. A similar approach can be employed to unwind the prologs. As an experiment I took an osx-x64 object file produced by the NativeAOT compilation process and, for every function, compared the results of a trivial x64 prolog code walk with the offsets in the actual DWARF CFI code. For the vast majority of cases the prolog uses only two different instructions (push REG and sub RSP, <value>) before establishing the RBP frame, which can already be processed with the compact unwinding information. Only one method uses a more complex pattern, to allocate a frame that is larger than the page size and where stack probing is needed. It would be simple to recognize that pattern too.
To be able to use the combination of custom prolog unwinding and the compact unwinding for the method body, we would need to know the size of the prolog. Unfortunately, that information is currently not stored anywhere. The GcInfo structure can optionally store it in some cases, but for the majority of uses it's not present at the moment. We would likely need to store it as an extra byte in the LSDA structure.
It's not obvious whether using the compact unwinding would be a clear win. It adds code complexity that is specific to a single platform. I don't have any numbers at the moment to show how much space could be saved by the compact encoding in comparison to the current DWARF CFI encoding.