-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Arm64] Implement stack probe helper #43250
[Arm64] Implement stack probe helper #43250
Conversation
4e0e02d
to
70c14bf
Compare
4843124
to
0c46021
Compare
@BruceForstall I believe this is ready for review now. I plan to do more testing next week. I know that there are two code size regressions related to "negative fp-offsets of locals"-issue - I am planning to look into how this can be mitigated separately. It seems that computing the local addresses based on @TamarChristinaArm I would like to ask your opinion about inlining more instructions in a prolog in order to do stack probing. In particular, how far we should go? Should we go beyond cc @dotnet/jit-contrib |
I'm a little confused with your inline examples. E.g., "For example,
why are we subtracting 0x8000? Shouldn't it be:
? |
fyi @janvorli |
@BruceForstall Sure, we could subtract I decided not to go into the math and simplified it by always subtracting |
Oh, I see; you always subtract 0x8000, but that's just for probing; when the actual SP subtract happens, it's of the actual required amount. Don't we need to move SP when probing on Linux? |
We do on linux-x64. However, even on linux-x64 we still could probe below SP but not very far. There is some limit when such access becomes treated as illegal and the app will be terminated by the kernel. As far as I remember, this check in linux memory manager was enabled for linux-x64 only, but not for linux-arm, linux-arm64. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good. Mostly, I'm requesting more comments. It makes sense to extract out genPushCalleeSavedRegisters
to reduce ifdefs; that's a nice change.
; x9 - points to the lowest address on the stack frame being allocated (i.e. [InitialSp - FrameSize]) | ||
; sp - points to some byte on the last probed page | ||
; On exit: | ||
; x9 - is preserved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you mention x30 is trashed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x30
is never trashed. x30
is the same register as lr
. I am using x30
name instead of lr
since the register is not used as link register and I wanted to emphasize that.
cmp sp, x30, lsl #0 | ||
bhs ProbeLoop ; if (sp >= x30), then we need to probe at least one more page | ||
|
||
mov sp, fp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does fp
get set/saved? By PROLOG_SAVE_REG_PAIR
?
bhs ProbeLoop ; if (sp >= x30), then we need to probe at least one more page | ||
|
||
mov sp, fp | ||
EPILOG_RESTORE_REG_PAIR fp, lr, 16! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume it's ok to have a sequence:
- Function prolog
- Call probe helper
- Probe helper probes, subtracting sp, then restores sp to location at call to probe helper, returns to caller
- Function changes
sp
In particular, between 3 & 4, the probed pages remain mapped (the OS never "reclaims" them) even though sp
has been reverted. (Presumably, e.g., the OS could use the probed space then for interrupt handler, say).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is true. Stack pages can only be reclaimed after the thread exits.
#define PAGE_SIZE_LOG12 12 | ||
#define PAGE_SIZE 4096 | ||
|
||
LEAF_ENTRY JIT_StackProbe, _TEXT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should add a header comment with "on entry" and "on exit" conditions spelled out.
src/coreclr/src/jit/compiler.h
Outdated
return 2 * eeGetPageSize(); | ||
return 2 * pageSize; | ||
#elif defined(TARGET_ARM64) | ||
constexpr target_size_t ldrLargestPositiveImmByteOffset = 0x8000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be worthwhile having a comment here (even a quite detailed comment) explaining this math, or at least a pointer to someplace else in the code that has such a comment.
src/coreclr/src/jit/lclvars.cpp
Outdated
@@ -5853,6 +5853,12 @@ void Compiler::lvaAssignVirtualFrameOffsetsToLocals() | |||
{ | |||
codeGen->SetSaveFpLrWithAllCalleeSavedRegisters(true); // Force using new frames | |||
} | |||
|
|||
if (compLclFrameSize >= getVeryLargeFrameSize()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment here would be useful
src/coreclr/src/jit/codegenxarch.cpp
Outdated
@@ -8944,4 +8944,63 @@ void CodeGen::genProfilingLeaveCallback(unsigned helper) | |||
|
|||
#endif // PROFILING_SUPPORTED | |||
|
|||
/*----------------------------------------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming this is just extracted and logic is unchanged
src/coreclr/src/jit/target.h
Outdated
#define RBM_STACK_PROBE_HELPER_ARG RBM_R9 | ||
#define REG_STACK_PROBE_HELPER_CALL_TARGET REG_IP0 | ||
#define RBM_STACK_PROBE_HELPER_CALL_TARGET RBM_IP0 | ||
#define RBM_STACK_PROBE_HELPER_TRASH RBM_NONE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the trash set be x30
?
|
||
int totalFrameSize = genTotalFrameSize(); | ||
|
||
bool useStackProbeHelper = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be more overall comments about the probing logic, here, preferably with examples for the various important cases.
@@ -1262,6 +1262,26 @@ GenerateProfileHelper ProfileTailcall, PROFILE_TAILCALL | |||
|
|||
#endif | |||
|
|||
#define PAGE_SIZE_LOG12 12 | |||
#define PAGE_SIZE 4096 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit - I would prefer naming this PROBE_PAGE_SIZE to indicate that it isn't necessarily the OS page size.
bhs ProbeLoop ; if (sp >= x30), then we need to probe at least one more page | ||
|
||
mov sp, fp | ||
EPILOG_RESTORE_REG_PAIR fp, lr, 16! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is true. Stack pages can only be reclaimed after the thread exits.
That's a reasonably amount. It's mostly a code size thing more than anything really. For GCC we inline up to 4 instructions after which we emit an inline loop. So a maximum of 8 is fine.
Just a side note, the behavior of the kernel aside, that does violates the AAPCS see Also note that reading from an unallocated stack area is explicitly prohibited when MTE (Memory Tagging) is enabled. The commit ARM-software/abi-aa@c09ef09 has some more information on the salient parts. |
Hmm,
@TamarChristinaArm Doesn't this mean that the proposed algorithm would violate the rule since |
Correct it does, and on MTE enabled systems this may result in a hardware fault (depending on the mode the hardware is set to). There are proposals to slightly adjust the AAPCS to specifically allow probes (and only probes) below SP but those have yet to reach a conclusion. The reverse scheme of probing after dropping SP also has issues in that if you take a signal the signal handler doesn't know whether the SP is valid or not, so it has to check. |
The scheme where SP changes as we probe will have its own issues with stack unwinding. In particular, as in #42885 However, always calling a helper for probing seems too expensive alternative. Especially, given the fact that we need to force a specific frame type in the JIT where @TamarChristinaArm You mentioned that
meaning that on GCC you chose to break the rule and probe below SP? Perhaps, we can do the same unless we are strictly prohibited (as in the case with MTE). Or we can make such option configurable? @janvorli @BruceForstall What are your thoughts? |
long thread, I'll have a read :)
No those two parts were unrelated.. With GCC we drop then probe. We have a slightly different ABI (one that clang will also follow) for probing (which we do for stack clash mitigation) where we try to minimize the number of probes that we need to emit since the storing of So for int foo (){
volatile int x[57000];
x[0] = 0;
} we generate: foo:
sub sp, sp, #65536
str xzr, [sp, 1024]
sub sp, sp, #65536
str xzr, [sp, 1024]
sub sp, sp, #65536
str xzr, [sp, 1024]
mov x12, 31392
sub sp, sp, x12
str wzr, [sp]
add sp, sp, 2720
add sp, sp, 225280
ret with |
@TamarChristinaArm I see, the scheme is quite different from what we do. |
I guess that's why your probes are |
@echesakovMSFT could doing this cause |
@BruceForstall No, since the |
@BruceForstall no, I should have been more precise here. With MTE enabled the stack is colored based on who allocated the space. An unallocated stack space is uncolored and so any access of it is invalid. What invalid here means depends on the value of There's no real particular reason why we used |
…n.cpp to CodeGen::genPushCalleeSavedRegisters in codegenarmarch.cpp
…runhelpers.h jitinterface.h
…L_TARGET and RBM_STACK_PROBE_HELPER_TRASH in target.h
…the same as initReg we clear *pInitRegZeroed in codegenarmarch.cpp
…S src/coreclr/vm/arm64/asmhelpers.asm
91b927a
to
2772cb2
Compare
… in src/coreclr/jit/codegencommon.cpp
…p src/coreclr/jit/codegencommon.cpp
…r/jit/codegen.h src/coreclr/jit/codegenarm.cpp src/coreclr/jit/codegencommon.cpp src/coreclr/jit/codegenxarch.cpp
…rc/coreclr/vm/jitinterface.cpp
…src/coreclr/vm/arm/asmhelpers.asm
… src/coreclr/jit/codegenarm.cpp src/coreclr/jit/codegenarm64.cpp
…/coreclr/jit/lclvars.cpp src/coreclr/jit/target.h
48ae7d9
to
bcb4a74
Compare
Extracted refactoring changes to #48199 |
Fixes #13519
Below is an algorithm I propose and its comparison to the current implementation:
Case 1:
compiler->compLclFrameSize < getVeryLargeFrameSize()
. I am going to talk what should be chosen as a value returned bygetVeryLargeFrameSize()
at a later point, but note that currently it is equal to three page sizes (and let's assume for now thatpageSize = 0x1000
). In other words, if distance between a current value ofsp
and the final value ofsp
is less than some predefined value then the JIT inlines a stack probing instructions sequence into a function prolog.Currently, this is done be inserting a sequence of
mov tempReg, #imm
followed byldr wzr, [sp, tempReg]
. For example, forcompLclFrameSize = 0x2000
the JIT emitsYou can immediately see that the current way is suboptimal in both performance and code size. First, the memory access is unaligned. Second, we don't need to emit 4 instructions in order to probe 2 pages.
There is more optimal implementation I propose and it utilizes a fact that
ldr
(immediate) can address up to 32 Kbytes (in positive direction) of data. In order to to that, the JIT would emitsub tempReg, sp, 0x8000
followed byldr xzr, [tempReg, #imm1]; ldr xzr, [tempReg, #imm2]
unless reaches the stack frame boundary. For example, for the samecompLclFrameSize = 0x2000
the JIT would emitthat would save one instruction.
Case 2
compiler->compLclFrameSize >= getVeryLargeFrameSize()
. This is what I was trying originally address by this PR and replace the inlined stack probing with a helper call. The mechanics is similar to other platforms but slighly complicated by the fact that Arm64 can have up to 6 different frame types as defined in codegencommon.cppTurns out, that for this case we only care about two -
frameType = 3
andframeType = 5
. The major difference between them is a location wherefp, lr
record is stored on the stack - at the bottom (frameType = 3
) or at the top (frameType = 5
). At the moment, the JIT usesframeType = 5
for methods withlocalloc
andGS
cookiesOtherwise,
frameType = 3
is used.In order, to be able to call a helper,
lr
must be saved on the stack before the call. That means, that, in order to call the stack probing helper, the JIT would need to forceframeType = 5
. However, there is an issue with approach. At the moment, addresses of locals are computed based onfp
value andframeType = 5
meaning that their offsets are becoming negative (that might cause regressions in extreme cases with large number of locals whenldr\sdr
wouldn't be able to encode such offsets with immediate).Let's compare the JIT generated code for this case. Suppose
compLclFrameSize = 0x10000
, the current implementation of the JIT inlines a stack probing loopThe JIT with stack probe helper would emit
Note, that
x9
contains the final value ofsp
andfp, lr
are stored on the stack before the call.While it seems that we only got one instruction win - it is more that this. First, we avoided having a loop in a function prolog. As Tamar commented in #43789 (comment) such loop can cause performance degradation and in order to avoid we must ensure that the loop is properly aligned. Second, for large frame sizes that are more likely to cause StackOverflow we are not able to take advantages of JanV's work in #32167 and display stack trace during StackOverflow. Using the helper solves both issues.
Let's talk about
getVeryLargeFrameSize()
.Given how we optimized the inlined sequence of instructions it might be beneficial to increase the value and defer a moment when the JIT would need to call a helper. I propose to have the value of
getVeryLargeFrameSize()
to be such that stack probing can be done by inlining onesub tempReg, sp, 0x8000
followed by up to 8ldr xzr, [tempReg, #imm]
.For example,
compLclFrameSize = 0x3000
compLclFrameSize = 0x4000
compLclFrameSize = 0x8D80
compLclFrameSize = 0x8E00
The following is a chart comparing prolog sizes for different values of
compLclFrameSize
for the current implementation ("base clrjit.dll") and the proposed implementation ("diff clrjit.dll").As you can see, for the smaller frame sizes - the proposed implementation produces smaller code size, and it keeps inlining the above-mentioned instruction sequences until it had to compute a new value of
tempReg
and calls a helper after that point.