[Arm64] Implement stack probe helper #43250

Closed

Conversation

echesakov
Contributor

@echesakov echesakov commented Oct 10, 2020

Fixes #13519

Below is an algorithm I propose and its comparison to the current implementation:

Case 1: compiler->compLclFrameSize < getVeryLargeFrameSize(). I will discuss below what value getVeryLargeFrameSize() should return; for now, note that it currently equals three page sizes (and let's assume that pageSize = 0x1000). In other words, if the distance between the current value of sp and the final value of sp is less than some predefined value, the JIT inlines a stack probing instruction sequence into the function prolog.

Currently, this is done by inserting a sequence of mov tempReg, #imm followed by ldr wzr, [sp, tempReg]. For example, for compLclFrameSize = 0x2000 the JIT emits

        9281FFE9          movn    x9, #0xfff
        B8696BFF          ldr     wzr, [sp, x9]
        9283FFE9          movn    x9, #0x1fff
        B8696BFF          ldr     wzr, [sp, x9]

You can immediately see that the current way is suboptimal in both performance and code size. First, the memory access is unaligned. Second, we don't need to emit 4 instructions in order to probe 2 pages.
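To make the offsets in the current sequence explicit, here is a small C++ sketch that encodes the 64-bit movn form and evaluates the value it leaves in the register. The function names are made up for illustration and are not part of the JIT; only the instruction encoding itself is taken from the Arm64 instruction set.

```cpp
#include <cassert>
#include <cstdint>

// Encode the 64-bit MOVN instruction: Xd = ~(imm16 << (16 * hw)).
// Illustrative helper, not JIT code.
uint32_t encodeMovn64(uint32_t rd, uint32_t imm16, uint32_t hw = 0)
{
    assert(rd < 32 && imm16 <= 0xFFFF && hw < 4);
    return 0x92800000u | (hw << 21) | (imm16 << 5) | rd;
}

// Value that the MOVN above leaves in Xd, viewed as a signed offset from sp.
int64_t movn64Value(uint32_t imm16, uint32_t hw = 0)
{
    assert(imm16 <= 0xFFFF && hw < 4);
    return static_cast<int64_t>(~(static_cast<uint64_t>(imm16) << (16 * hw)));
}
```

With these helpers, encodeMovn64(9, 0xfff) reproduces the 9281FFE9 word from the listing, and movn64Value(0xfff) and movn64Value(0x1fff) evaluate to -0x1000 and -0x2000, i.e. the two ldr wzr, [sp, x9] instructions touch the first and second page below sp.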

I propose a more efficient implementation that uses the fact that ldr (immediate) can address up to 32 KB of data in the positive direction. To do that, the JIT would emit sub tempReg, sp, 0x8000 followed by ldr xzr, [tempReg, #imm1]; ldr xzr, [tempReg, #imm2]; and so on, until it reaches the stack frame boundary. For example, for the same compLclFrameSize = 0x2000 the JIT would emit

        D14023E9          sub     x9, sp, #8, LSL #12
        F978013F          ldr     xzr, [x9,#0x7000]
        F970013F          ldr     xzr, [x9,#0x6000]

which saves one instruction.
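The proposed sequence can be sketched as a function that lists the #imm operands of the ldr probes. This is a minimal sketch, not the JIT implementation: the names are hypothetical, and the rule "one probe per whole page of the frame" is an assumption inferred from the listings in this description.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kPageSize    = 0x1000; // page size assumed in the description
constexpr uint32_t kProbeWindow = 0x8000; // amount subtracted by `sub tempReg, sp, #0x8000`

// Returns the #imm operands of the `ldr xzr, [tempReg, #imm]` probes, highest address first.
// Assumption: one probe per whole page of frameSize, as the listings suggest.
std::vector<uint32_t> inlineProbeOffsets(uint32_t frameSize)
{
    const uint32_t pages = frameSize / kPageSize;
    assert(pages <= kProbeWindow / kPageSize); // more than 8 pages does not fit in one window

    std::vector<uint32_t> offsets;
    for (uint32_t i = 1; i <= pages; i++)
    {
        // The probed address is tempReg + offset = sp - i * kPageSize.
        offsets.push_back(kProbeWindow - i * kPageSize);
    }
    return offsets;
}
```

For frameSize = 0x2000 this yields 0x7000 and 0x6000, so the prolog needs 1 + 2 = 3 instructions, compared with the 4 instructions (two mov/ldr pairs) of the current sequence.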

Case 2: compiler->compLclFrameSize >= getVeryLargeFrameSize(). This is what I was originally trying to address with this PR: replacing the inlined stack probing with a helper call. The mechanics are similar to other platforms but slightly complicated by the fact that Arm64 can have up to 6 different frame types as defined in codegencommon.cpp

It turns out that for this case we only care about two: frameType = 3 and frameType = 5. The major difference between them is the location where the fp, lr record is stored on the stack: at the bottom (frameType = 3) or at the top (frameType = 5). At the moment, the JIT uses frameType = 5 for methods with localloc and GS cookies

Otherwise, frameType = 3 is used.
In order to be able to call a helper, lr must be saved on the stack before the call. That means that, in order to call the stack probing helper, the JIT would need to force frameType = 5. However, there is an issue with this approach. At the moment, addresses of locals are computed based on the fp value, and with frameType = 5 their offsets become negative (which might cause regressions in extreme cases with a large number of locals, when ldr/str wouldn't be able to encode such offsets as immediates).

Let's compare the JIT generated code for this case. Suppose compLclFrameSize = 0x10000, the current implementation of the JIT inlines a stack probing loop

        9281FFE9          movn    x9, #0xfff
        928001E0          movn    x0, #15
        F2BFFFC0          movk    x0, #0xfffe LSL #16
        B8696BFF          ldr     wzr, [sp, x9]
        D1400529          sub     x9, x9, #1, LSL #12
        EB09001F          cmp     x0, x9
        54FFFFA9          bls     pc-16 (-4 instructions)
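To see which addresses this loop touches, the listing can be simulated directly. The sketch below is illustrative only (the function and constant names are made up); it models the registers as 64-bit unsigned values and uses the fact that bls after cmp x0, x9 branches when x0 is lower than or the same as x9 in an unsigned comparison.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simulates the inlined probing loop from the listing and returns every offset
// (relative to sp) that `ldr wzr, [sp, x9]` touches.
std::vector<int64_t> simulateInlineProbeLoop(uint64_t x9, uint64_t x0)
{
    std::vector<int64_t> probed;
    do
    {
        probed.push_back(static_cast<int64_t>(x9)); // ldr wzr, [sp, x9]
        x9 -= 0x1000;                               // sub x9, x9, #1, LSL #12
    } while (x0 <= x9);                             // cmp x0, x9 ; bls (unsigned x0 <= x9)
    return probed;
}

// Register values set up before the loop for compLclFrameSize = 0x10000.
const uint64_t kInitialX9 = ~0xFFFull; // movn x9, #0xfff
const uint64_t kLimitX0 =
    (~15ull & ~(0xFFFFull << 16)) | (0xFFFEull << 16); // movn x0, #15 ; movk x0, #0xfffe LSL #16
```

Calling simulateInlineProbeLoop(kInitialX9, kLimitX0) gives 16 probes, from sp - 0x1000 down to sp - 0x10000, with x0 holding -0x10010 as the loop limit.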

The JIT with stack probe helper would emit

        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
        D2800209          movz    x9, #16
        F2A00029          movk    x9, #1 LSL #16
        CB2963E9          sub     x9, sp, x9, LSL #0
        94000000          bl      CORINFO_HELP_STACK_PROBE

Note, that x9 contains the final value of sp and fp, lr are stored on the stack before the call.
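The constant subtracted from sp here is 0x10010, i.e. compLclFrameSize + 16, and it needs a movz/movk pair because it does not fit in 16 bits. Below is a minimal sketch of that materialization with hypothetical helper names; it mirrors the listings in this description and is not the JIT's actual constant-materialization code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode 64-bit MOVZ / MOVK for halfword index hw (shift = 16 * hw). Illustrative helpers.
uint32_t encodeMovz64(uint32_t rd, uint32_t imm16, uint32_t hw)
{
    assert(rd < 32 && imm16 <= 0xFFFF && hw < 4);
    return 0xD2800000u | (hw << 21) | (imm16 << 5) | rd;
}

uint32_t encodeMovk64(uint32_t rd, uint32_t imm16, uint32_t hw)
{
    assert(rd < 32 && imm16 <= 0xFFFF && hw < 4);
    return 0xF2800000u | (hw << 21) | (imm16 << 5) | rd;
}

// movz for the lowest halfword, then movk for every other non-zero halfword.
std::vector<uint32_t> materializeConstant(uint32_t rd, uint64_t value)
{
    std::vector<uint32_t> code;
    code.push_back(encodeMovz64(rd, static_cast<uint32_t>(value & 0xFFFF), 0));
    for (uint32_t hw = 1; hw < 4; hw++)
    {
        const uint32_t half = static_cast<uint32_t>((value >> (16 * hw)) & 0xFFFF);
        if (half != 0)
        {
            code.push_back(encodeMovk64(rd, half, hw));
        }
    }
    return code;
}
```

materializeConstant(9, 0x10000 + 16) produces the D2800209, F2A00029 pair shown above, while a frame of 0x8E00 bytes gives 0x8E10, which fits in a single movz (D291C209).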

While it seems that we only win one instruction, there is more to it than that. First, we avoid having a loop in the function prolog. As Tamar commented in #43789 (comment), such a loop can cause performance degradation, and to avoid that we must ensure that the loop is properly aligned. Second, for large frame sizes, which are more likely to cause a StackOverflow, we are currently not able to take advantage of JanV's work in #32167 and display a stack trace during StackOverflow. Using the helper solves both issues.

Let's talk about getVeryLargeFrameSize().
Given how we optimized the inlined sequence of instructions, it might be beneficial to increase the value and defer the point at which the JIT needs to call a helper. I propose choosing the value of getVeryLargeFrameSize() so that stack probing can be done by inlining one sub tempReg, sp, 0x8000 followed by up to 8 ldr xzr, [tempReg, #imm].

For example, compLclFrameSize = 0x3000

        D14023E9          sub     x9, sp, #8, LSL #12
        F978013F          ldr     xzr, [x9,#0x7000]
        F970013F          ldr     xzr, [x9,#0x6000]
        F968013F          ldr     xzr, [x9,#0x5000]

compLclFrameSize = 0x4000

        D14023E9          sub     x9, sp, #8, LSL #12
        F978013F          ldr     xzr, [x9,#0x7000]
        F970013F          ldr     xzr, [x9,#0x6000]
        F968013F          ldr     xzr, [x9,#0x5000]
        F960013F          ldr     xzr, [x9,#0x4000]

compLclFrameSize = 0x8D80

        D14023E9          sub     x9, sp, #8, LSL #12
        F978013F          ldr     xzr, [x9,#0x7000]
        F970013F          ldr     xzr, [x9,#0x6000]
        F968013F          ldr     xzr, [x9,#0x5000]
        F960013F          ldr     xzr, [x9,#0x4000]
        F958013F          ldr     xzr, [x9,#0x3000]
        F950013F          ldr     xzr, [x9,#0x2000]
        F948013F          ldr     xzr, [x9,#0x1000]
        F940013F          ldr     xzr, [x9]

compLclFrameSize = 0x8E00

        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
        D291C209          mov     x9, #0x8e10
        CB2963E9          sub     x9, sp, x9, LSL #0
        94000000          bl      CORINFO_HELP_STACK_PROBE

The following is a chart comparing prolog sizes for different values of compLclFrameSize for the current implementation ("base clrjit.dll") and the proposed implementation ("diff clrjit.dll").

Chart

As you can see, for the smaller frame sizes the proposed implementation produces smaller code. It keeps inlining the above-mentioned instruction sequences until it would have to compute a new value of tempReg, and calls a helper from that point on.

@echesakov echesakov added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Oct 10, 2020
@echesakov echesakov self-assigned this Oct 10, 2020
@echesakov echesakov force-pushed the Arm64-Implement-Jit-StackProbe-Helper branch from 4e0e02d to 70c14bf Compare October 12, 2020 20:05
@echesakov echesakov mentioned this pull request Oct 20, 2020
29 tasks
@echesakov echesakov force-pushed the Arm64-Implement-Jit-StackProbe-Helper branch 2 times, most recently from 4843124 to 0c46021 Compare October 30, 2020 00:43
@echesakov echesakov marked this pull request as ready for review October 30, 2020 00:44
@echesakov
Contributor Author

@BruceForstall I believe this is ready for review now. I plan to do more testing next week. I know that there are two code size regressions related to "negative fp-offsets of locals"-issue - I am planning to look into how this can be mitigated separately. It seems that computing the local addresses based on sp value, as we discussed, should be sufficient.

@TamarChristinaArm I would like to ask your opinion about inlining more instructions in a prolog in order to do stack probing. In particular, how far we should go? Should we go beyond sub followed by 8 ldr-s? Or it seems to be a reasonable boundary and after that point it's better to switch to a helper call.

cc @dotnet/jit-contrib

@BruceForstall
Member

I'm a little confused with your inline examples. E.g.,

"For example, compLclFrameSize = 0x3000"

        D14023E9          sub     x9, sp, #8, LSL #12
        F978013F          ldr     xzr, [x9,#0x7000]
        F970013F          ldr     xzr, [x9,#0x6000]
        F968013F          ldr     xzr, [x9,#0x5000]

why are we subtracting 0x8000? Shouldn't it be:

        sub     x9, sp, #3, LSL #12
        ldr     xzr, [x9,#0x2000] // one ldr per page
        ldr     xzr, [x9,#0x1000]
        ldr     xzr, [x9,#0x0] // always probe the very bottom last

?

@BruceForstall
Member

fyi @janvorli

@echesakov
Contributor Author

I'm a little confused with your inline examples. E.g.,

"For example, compLclFrameSize = 0x3000"

        D14023E9          sub     x9, sp, #8, LSL #12
        F978013F          ldr     xzr, [x9,#0x7000]
        F970013F          ldr     xzr, [x9,#0x6000]
        F968013F          ldr     xzr, [x9,#0x5000]

why are we subtracting 0x8000? Shouldn't it be:

        sub     x9, sp, #3, LSL #12
        ldr     xzr, [x9,#0x2000] // one ldr per page
        ldr     xzr, [x9,#0x1000]
        ldr     xzr, [x9,#0x0] // always probe the very bottom last

?

@BruceForstall Sure, we could subtract 0x3000. In fact we could subtract any value that is min(0x8000, currentSpToFinalSp- currentSpToTempReg) and can be encoded in one sub tempReg, sp, #imm instruction (i.e. it must be either smaller than 0x1000 or be a multiple of 0x1000).

I decided not to go into the math and simplified it by always subtracting 0x8000 since ldr xzr, [tempReg, #imm] can encode any positive offset in range [0, 0x8000 - 8]
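Both encoding constraints mentioned here can be checked with a short sketch. The helper names are hypothetical and this is not JIT code; the bit layouts are the standard Arm64 encodings for SUB (immediate) and LDR (immediate, unsigned offset), where the 64-bit load scales its 12-bit offset field by 8.

```cpp
#include <cassert>
#include <cstdint>

// SUB (immediate) takes a 12-bit value, optionally shifted left by 12.
bool isSubImmEncodable(uint64_t imm)
{
    return imm < 0x1000 || ((imm & 0xFFF) == 0 && (imm >> 12) < 0x1000);
}

// Encode `sub xd, xn, #imm` (register number 31 means sp in this instruction).
uint32_t encodeSubImm64(uint32_t rd, uint32_t rn, uint64_t imm)
{
    assert(rd < 32 && rn < 32 && isSubImmEncodable(imm));
    const uint32_t sh    = imm >= 0x1000 ? 1u : 0u;
    const uint32_t imm12 = static_cast<uint32_t>(sh ? imm >> 12 : imm);
    return 0xD1000000u | (sh << 22) | (imm12 << 10) | (rn << 5) | rd;
}

// Encode `ldr xt, [xn, #byteOffset]` (register number 31 as xt means xzr).
// The offset must be a multiple of 8 in the range [0, 0x8000 - 8].
uint32_t encodeLdr64UnsignedOffset(uint32_t rt, uint32_t rn, uint32_t byteOffset)
{
    assert(rt < 32 && rn < 32 && byteOffset % 8 == 0 && byteOffset <= 0x8000 - 8);
    return 0xF9400000u | ((byteOffset / 8) << 10) | (rn << 5) | rt;
}
```

encodeSubImm64(9, 31, 0x8000) gives D14023E9 and encodeLdr64UnsignedOffset(31, 9, 0x7000) gives F978013F, matching the listings. A value such as 0x8D80 is not encodable in a single sub, which is one reason always subtracting 0x8000 keeps the math simple.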

@BruceForstall
Member

Oh, I see; you always subtract 0x8000, but that's just for probing; when the actual SP subtract happens, it's of the actual required amount.

Don't we need to move SP when probing on Linux?

@echesakov
Contributor Author

echesakov commented Nov 3, 2020

Oh, I see; you always subtract 0x8000, but that's just for probing; when the actual SP subtract happens, it's of the actual required amount.

Don't we need to move SP when probing on Linux?

We do on linux-x64. However, even on linux-x64 we still could probe below SP but not very far. There is some limit when such access becomes treated as illegal and the app will be terminated by the kernel. As far as I remember, this check in linux memory manager was enabled for linux-x64 only, but not for linux-arm, linux-arm64.

Member

@BruceForstall BruceForstall left a comment

Generally looks good. Mostly, I'm requesting more comments. It makes sense to extract out genPushCalleeSavedRegisters to reduce ifdefs; that's a nice change.

; x9 - points to the lowest address on the stack frame being allocated (i.e. [InitialSp - FrameSize])
; sp - points to some byte on the last probed page
; On exit:
; x9 - is preserved
Member

Should you mention x30 is trashed?

Contributor Author

x30 is never trashed. x30 is the same register as lr. I am using x30 name instead of lr since the register is not used as link register and I wanted to emphasize that.

cmp sp, x30, lsl #0
bhs ProbeLoop ; if (sp >= x30), then we need to probe at least one more page

mov sp, fp
Member

Where does fp get set/saved? By PROLOG_SAVE_REG_PAIR?

bhs ProbeLoop ; if (sp >= x30), then we need to probe at least one more page

mov sp, fp
EPILOG_RESTORE_REG_PAIR fp, lr, 16!
Member

I assume it's ok to have a sequence:

  1. Function prolog
  2. Call probe helper
  3. Probe helper probes, subtracting sp, then restores sp to location at call to probe helper, returns to caller
  4. Function changes sp

In particular, between 3 & 4, the probed pages remain mapped (the OS never "reclaims" them) even though sp has been reverted. (Presumably, e.g., the OS could use the probed space then for interrupt handler, say).

Member

I believe this is true. Stack pages can only be reclaimed after the thread exits.
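Step 3 of the sequence above can be modelled roughly as follows. This is only a sketch under stated assumptions: the real helper is written in assembly and only fragments of it are visible in this review, so the loop shape, the page rounding, and all names below are hypothetical. What is taken from the review is the contract: the argument points to the lowest address of the frame being allocated, sp starts somewhere on the last probed page, and sp is restored before the helper returns.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t kProbePageSize = 0x1000; // assumed probe page size

// Returns the page start addresses the helper would touch, highest first.
// Assumption: one probe per page, from the page below sp down to the page
// that contains lowestFrameAddress.
std::vector<uint64_t> simulateStackProbeHelper(uint64_t sp, uint64_t lowestFrameAddress)
{
    assert(lowestFrameAddress <= sp);
    const uint64_t savedSp = sp; // the helper restores sp before returning

    const uint64_t limit = lowestFrameAddress & ~(kProbePageSize - 1);
    sp &= ~(kProbePageSize - 1); // start of the last probed page

    std::vector<uint64_t> probed;
    while (sp > limit)
    {
        sp -= kProbePageSize; // move one page down
        probed.push_back(sp); // and touch it
    }

    sp = savedSp; // step 3: sp is back where it was at the call
    return probed;
}
```

For example, with sp = 0x100000 and a target of sp - 0x10010 the sketch touches 17 pages, ending at 0xEF000; the caller then performs the actual sp adjustment in step 4.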

#define PAGE_SIZE_LOG12 12
#define PAGE_SIZE 4096

LEAF_ENTRY JIT_StackProbe, _TEXT
Member

I think you should add a header comment with "on entry" and "on exit" conditions spelled out.

return 2 * eeGetPageSize();
return 2 * pageSize;
#elif defined(TARGET_ARM64)
constexpr target_size_t ldrLargestPositiveImmByteOffset = 0x8000;
Member

It would be worthwhile having a comment here (even a quite detailed comment) explaining this math, or at least a pointer to someplace else in the code that has such a comment.

@@ -5853,6 +5853,12 @@ void Compiler::lvaAssignVirtualFrameOffsetsToLocals()
{
codeGen->SetSaveFpLrWithAllCalleeSavedRegisters(true); // Force using new frames
}

if (compLclFrameSize >= getVeryLargeFrameSize())
Member

A comment here would be useful

@@ -8944,4 +8944,63 @@ void CodeGen::genProfilingLeaveCallback(unsigned helper)

#endif // PROFILING_SUPPORTED

/*-----------------------------------------------------------------------------
Member

I'm assuming this is just extracted and logic is unchanged

#define RBM_STACK_PROBE_HELPER_ARG RBM_R9
#define REG_STACK_PROBE_HELPER_CALL_TARGET REG_IP0
#define RBM_STACK_PROBE_HELPER_CALL_TARGET RBM_IP0
#define RBM_STACK_PROBE_HELPER_TRASH RBM_NONE
Member

Shouldn't the trash set be x30?


int totalFrameSize = genTotalFrameSize();

bool useStackProbeHelper = false;
Member

There should be more overall comments about the probing logic, here, preferably with examples for the various important cases.

@@ -1262,6 +1262,26 @@ GenerateProfileHelper ProfileTailcall, PROFILE_TAILCALL

#endif

#define PAGE_SIZE_LOG12 12
#define PAGE_SIZE 4096
Member

A nit - I would prefer naming this PROBE_PAGE_SIZE to indicate that it isn't necessarily the OS page size.


@TamarChristinaArm
Contributor

@TamarChristinaArm I would like to ask your opinion about inlining more instructions in a prolog in order to do stack probing. In particular, how far we should go? Should we go beyond sub followed by 8 ldr-s? Or it seems to be a reasonable boundary and after that point it's better to switch to a helper call.

That's a reasonable amount. It's mostly a code size thing more than anything really. For GCC we inline up to 4 instructions after which we emit an inline loop. So a maximum of 8 is fine.

We do on linux-x64. However, even on linux-x64 we still could probe below SP but not very far. There is some limit when such access becomes treated as illegal and the app will be terminated by the kernel. As far as I remember, this check in linux memory manager was enabled for linux-x64 only, but not for linux-arm, linux-arm64.

Just a side note: the behavior of the kernel aside, that does violate the AAPCS; see Universal stack constraints at https://github.com/ARM-software/abi-aa/blob/master/aapcs64/aapcs64.rst#the-stack. Some tools like valgrind use this invariant to detect invalid memory access.

Also note that reading from an unallocated stack area is explicitly prohibited when MTE (Memory Tagging) is enabled. The commit ARM-software/abi-aa@c09ef09 has some more information on the salient parts.

@echesakov
Contributor Author

Just a side note: the behavior of the kernel aside, that does violate the AAPCS; see Universal stack constraints at https://github.com/ARM-software/abi-aa/blob/master/aapcs64/aapcs64.rst#the-stack. Some tools like valgrind use this invariant to detect invalid memory access.

Hmm,

A process may only access (for reading or writing) the closed interval of the entire stack delimited by [SP, stack-base – 1].

@TamarChristinaArm Doesn't this mean that the proposed algorithm would violate the rule since sp is never adjusted before the probing finishes?
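The quoted rule can be written as a one-line predicate, which makes the conflict with probing below sp easy to see. This is a sketch for illustration only; the function names and the sample addresses used with it are made up.

```cpp
#include <cassert>
#include <cstdint>

// AAPCS64 rule quoted above: a process may only access the closed interval
// [SP, stack-base - 1].
bool isAccessAllowedByAapcs64(uint64_t address, uint64_t sp, uint64_t stackBase)
{
    assert(sp <= stackBase); // the stack grows down from stackBase
    return address >= sp && address <= stackBase - 1;
}

// Address touched by the i-th inline probe of the proposed scheme
// (i = 1 is the page right below sp): tempReg = sp - 0x8000, imm = 0x8000 - i * 0x1000.
uint64_t inlineProbeAddress(uint64_t sp, uint32_t i)
{
    assert(i >= 1 && i <= 8);
    return (sp - 0x8000) + (0x8000 - i * 0x1000ull);
}
```

Every address returned by inlineProbeAddress lies strictly below sp, so isAccessAllowedByAapcs64 returns false for all of them as long as sp has not been adjusted first.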

@TamarChristinaArm
Contributor

@TamarChristinaArm Doesn't this mean that the proposed algorithm would violate the rule since sp is never adjusted before the probing finishes?

Correct it does, and on MTE enabled systems this may result in a hardware fault (depending on the mode the hardware is set to).

There are proposals to slightly adjust the AAPCS to specifically allow probes (and only probes) below SP but those have yet to reach a conclusion.

The reverse scheme of probing after dropping SP also has issues in that if you take a signal the signal handler doesn't know whether the SP is valid or not, so it has to check.

@echesakov
Contributor Author

Correct it does, and on MTE enabled systems this may result in a hardware fault (depending on the mode the hardware is set to).

There are proposals to slightly adjust the AAPCS to specifically allow probes (and only probes) below SP but those have yet to reach a conclusion.

The reverse scheme of probing after dropping SP also has issues in that if you take a signal the signal handler doesn't know whether the SP is valid or not, so it has to check.

The scheme where SP changes as we probe will have its own issues with stack unwinding. In particular, as in #42885

However, always calling a helper for probing seems too expensive an alternative, especially given that we need to force a specific frame type in the JIT where the fp, lr pair is placed on top of the locals. This has already caused some regressions that I am investigating at the moment: we have loads from and stores to locals on the stack whose addresses are computed relative to fp, meaning all the offsets become negative and some of them are not encodable with str/ldr immediates. I should be able to resolve them, though, by allowing the local address to be computed based on the sp value.

@TamarChristinaArm You mentioned that

For GCC we inline up to 4 instructions after which we emit an inline loop. So a maximum of 8 is fine.

meaning that on GCC you chose to break the rule and probe below SP? Perhaps, we can do the same unless we are strictly prohibited (as in the case with MTE). Or we can make such option configurable?

@janvorli @BruceForstall What are your thoughts?

@TamarChristinaArm
Contributor

The scheme where SP changes as we probe will have its own issues with stack unwinding. In particular, as in #42885

long thread, I'll have a read :)

meaning that on GCC you chose to break the rule and probe below SP? Perhaps, we can do the same unless we are strictly prohibited (as in the case with MTE). Or we can make such option configurable?

No, those two parts were unrelated. With GCC we drop then probe. We have a slightly different ABI (one that clang will also follow) for probing (which we do for stack clash mitigation) where we try to minimize the number of probes that we need to emit since the storing of lr counts as an implicit probe.

So for

int foo (){
  volatile int x[57000];
  x[0] = 0;
}

we generate:

foo:
        sub     sp, sp, #65536
        str     xzr, [sp, 1024]
        sub     sp, sp, #65536
        str     xzr, [sp, 1024]
        sub     sp, sp, #65536
        str     xzr, [sp, 1024]
        mov     x12, 31392
        sub     sp, sp, x12
        str     wzr, [sp]
        add     sp, sp, 2720
        add     sp, sp, 225280
        ret

with -O2 -fstack-clash-protection on GCC 10 or newer for a guard page size of 64 KB.
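The numbers in this listing can be checked with a little arithmetic: 57000 ints occupy 228000 bytes, which is three 65536-byte steps plus a residual of 31392, and the epilogue adds the same 228000 bytes back as 2720 + 225280. The sketch below reproduces that split; the names are hypothetical and the general rule is inferred from this single example, so it should not be read as GCC's actual algorithm.

```cpp
#include <cassert>
#include <cstdint>

struct DropThenProbePlan
{
    uint64_t fullGuardSteps; // number of `sub sp, sp, #guardSize` + probe pairs
    uint64_t residual;       // remaining bytes dropped in the final step
};

// Split a frame into guard-size steps plus a residual.
// Assumption: inferred from the one listing above.
DropThenProbePlan planDropThenProbe(uint64_t frameBytes, uint64_t guardSize)
{
    assert(guardSize != 0);
    return {frameBytes / guardSize, frameBytes % guardSize};
}

struct AddImmPair
{
    uint32_t low12;         // plain 12-bit immediate
    uint32_t high12Shifted; // 12-bit immediate with LSL #12 applied
};

// Split a constant below 2^24 into two ADD/SUB immediates.
AddImmPair splitAddImmediate(uint32_t value)
{
    assert(value < (1u << 24));
    return {value & 0xFFFu, value & 0xFFF000u};
}
```

planDropThenProbe(57000 * 4, 65536) gives 3 full steps and a residual of 31392, and splitAddImmediate(228000) gives 2720 and 225280, matching the mov x12, 31392 and the two add instructions in the listing.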

@echesakov
Contributor Author

@TamarChristinaArm I see, the scheme is quite different from what we do.

@BruceForstall
Member

Also note that reading from an unallocated stack area is explicitly prohibited when MTE (Memory Tagging) is enabled

I guess that's why your probes are str instead of ldr?

@BruceForstall
Member

Oh, I see; you always subtract 0x8000, but that's just for probing; when the actual SP subtract happens, it's of the actual required amount.

@echesakovMSFT could doing this cause sp to point beyond the guard pages such that some OS activity like an interrupt handler using this stack will crash if it reads/writes to the stack?

@echesakov
Contributor Author

Oh, I see; you always subtract 0x8000, but that's just for probing; when the actual SP subtract happens, it's of the actual required amount.

@echesakovMSFT could doing this cause sp to point beyond the guard pages such that some OS activity like an interrupt handler using this stack will crash if it reads/writes to the stack?

@BruceForstall No, since the sp never changes during the probe - I am using a scratch register to store a base of the location to probe and the immediate values in ldr to compute the exact address. However, as Tamar pointed out above, such method would violate the calling convention. In fact, the current implementation also violates the convention, so I am thinking how to re-design the algorithm so it would fit into our frame types model and wouldn't cause significant regressions.

@TamarChristinaArm
Contributor

Also note that reading from an unallocated stack area is explicitly prohibited when MTE (Memory Tagging) is enabled

I guess that's why your probes are str instead of ldr?

@BruceForstall no, I should have been more precise here. With MTE enabled the stack is colored based on who allocated the space. An unallocated stack space is uncolored and so any access of it is invalid. What invalid here means depends on the value of SCTLR_ELx.TCF but one possible mode is a synchronous data exception being raised.

There's no real particular reason why we used str in this case. Both str and ldr work out to about the same functionally and performance wise in this case.

@echesakov echesakov force-pushed the Arm64-Implement-Jit-StackProbe-Helper branch from 91b927a to 2772cb2 Compare January 25, 2021 19:02
@echesakov echesakov marked this pull request as draft January 25, 2021 19:07
… src/coreclr/jit/codegenarm.cpp src/coreclr/jit/codegenarm64.cpp
…/coreclr/jit/lclvars.cpp src/coreclr/jit/target.h
@echesakov echesakov force-pushed the Arm64-Implement-Jit-StackProbe-Helper branch from 48ae7d9 to bcb4a74 Compare February 5, 2021 23:25
@echesakov
Contributor Author

Extracted refactoring changes to #48199
Will open PR with Arm64 implementation later

@echesakov echesakov closed this Feb 26, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2021
@echesakov echesakov deleted the Arm64-Implement-Jit-StackProbe-Helper branch April 13, 2021 20:00
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Development

Successfully merging this pull request may close these issues.

[Arm64] Implement stack probing using helper
5 participants