Use xmm for stack prolog zeroing rather than rep stos #32538
Conversation
Example change:

sub rsp, 88
- mov rsi, rcx
- lea rdi, [rsp+20H]
- mov ecx, 12
- xor rax, rax
- rep stosd
- mov rcx, rsi
+ xorps xmm4, xmm4
+ movdqu xmmword ptr [rsp+20H], xmm4
+ movdqu xmmword ptr [rsp+30H], xmm4
+ movdqu xmmword ptr [rsp+40H], xmm4
mov qword ptr [rsp+50H], rdx
sub rsp, 304
- mov rsi, rcx
- lea rdi, [rsp+40H]
- mov ecx, 60
- xor rax, rax
- rep stosd
- mov rcx, rsi
+ xorps xmm4, xmm4
+ mov rax, -240
+ movups xmmword ptr [rsp+rax+130H], xmm4
+ movups xmmword ptr [rsp+rax+140H], xmm4
+ movups xmmword ptr [rsp+rax+150H], xmm4
+ add rax, 48
+ jne SHORT -5 instr
mov rbp, rcx

Though this is still within the range amount of
// As we output multiple instructions for SIMD zeroing, we want to balance code size with throughput,
// so cap max size at 16 * MAX SIMD length
int initMaxSIMDSize = 16 * XMM_REGSIZE_BYTES;
if ((untrLclHi - untrLclLo) <= initMaxSIMDSize)
There may be locals in this region that do not need to be zero-initialized. Does it add up to anything significant? We may be able to do better by skipping zero-initialization for them.
That case is handled by the else branch, when genUseBlockInit is false, where it just zeros out the GC fields (not sure of the heuristics that determine it); it starts just after the shown GitHub diff, at else if (genInitStkLclCnt > 0).
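To make that split concrete, here is a rough conceptual sketch of the two paths (this is not the actual JIT code; the LocalSlot type and the zeroRange callback are invented for illustration):

#include <cstdio>
#include <vector>

struct LocalSlot
{
    int  offset;    // frame offset of the local
    int  size;      // size in bytes
    bool mustInit;  // e.g. a GC ref or an explicitly must-init local
};

// zeroRange stands in for "emit code that zeroes [offset, offset + size)".
void zeroInitFrame(bool useBlockInit, int untrLclLo, int untrLclHi,
                   const std::vector<LocalSlot>& locals,
                   void (*zeroRange)(int offset, int size))
{
    if (useBlockInit)
    {
        // Block init: clear the whole untracked range in one go
        // (rep stos before this PR, xmm stores when the range is small enough after it).
        zeroRange(untrLclLo, untrLclHi - untrLclLo);
    }
    else
    {
        // Otherwise only the locals that actually need zeroing (e.g. GC refs) are cleared.
        for (const LocalSlot& slot : locals)
        {
            if (slot.mustInit)
                zeroRange(slot.offset, slot.size);
        }
    }
}

int main()
{
    auto print = [](int offset, int size) { std::printf("zero [%XH, %XH)\n", offset, offset + size); };
    std::vector<LocalSlot> locals = { {0x20, 8, true}, {0x28, 8, false}, {0x30, 16, true} };
    zeroInitFrame(true,  0x20, 0x40, locals, print);  // block init path
    zeroInitFrame(false, 0x20, 0x40, locals, print);  // per-local path
}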
E.g. it may be worth it to build a bit-mask of what needs to be zero-initialized, and then go over the bitmask to do the zero-init.
Do have an issue open for double zeroing #2325
Could use the same logic the genInitStkLclCnt else block uses to drop any xmm blocks that don't have something to zero? (in a follow up)
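A minimal sketch of how those two ideas could combine (a hypothetical emit helper, not the real JIT code; assume one bit per 8-byte slot of the untracked range): build a mask of the slots that need zeroing and only emit a 16-byte store for chunks that contain at least one set bit.

#include <cstdint>
#include <cstdio>

// mask: bit i set => the 8-byte slot at frameOffset + 8*i must be zero-initialized.
// Emits one xmm-sized store per 16-byte chunk that has anything live in it,
// and skips chunks with nothing to zero.
void emitZeroInitFromMask(uint64_t mask, int frameOffset)
{
    for (int slot = 0; slot < 64; slot += 2) // two 8-byte slots per 16-byte store
    {
        if (((mask >> slot) & 0x3) != 0)
            printf("movups xmmword ptr [rsp+%02XH], xmm4\n", frameOffset + slot * 8);
    }
}

int main()
{
    // Example: slots 0-1 and 6-7 need zeroing; the stores for slots 2-5 are skipped.
    emitZeroInitFromMask(0b11000011, 0x20);
}

Over-zeroing within a chunk (clearing both slots when only one needs it) is harmless; the point is just to skip 16-byte chunks that contain nothing needing zero-init.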
cc @erozenfeld Do we need to go all the way to 256 bytes for methods in Kestrel? It would be good to highlight the methods that need a lot of prolog zeroing so we can see if other parts of the heuristic need reworking, or whether other changes -- either to the sources or to the jit -- might be able to remove the need for this much zeroing. Can you look at code size impact outside of corelib (say via
256 bytes looks to be what Clang/LLVM do by default (or 512 bytes for AVX2): https://godbolt.org/z/geQ9u9 Perhaps this is one of the cases where we could generate the larger code unless the user has explicitly opted for "small code generation"?
When optimizing for size; MSVC (
Is there an easy way to check if code is being produced for R2R? Could have lower limits for that?
Jitted code will have more zeroing than typical C/C++ code so a more conservative cutover may be appropriate.
Yes,
We have important customers that are not using tiered-JIT and care about code quality. I do not think we should be changing the cut-off for R2R.
As @tannergooding mentions, 16 SIMD stores is the Clang switch-over point; so either 16x xmm (256 bytes) or 16x ymm (512 bytes). Intel suggests 128 bytes as the switch-over for rep stos.
The Kestrel method I was looking at (https://gist.github.com/benaadams/0a44b0cb1c0aab57d6525a0eedc0672f) is 10x xmm. Even though it is high, hopefully it is infrequent?
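For reference, a small repro that can be fed to Clang (as in the godbolt link above) to see that switch-over; the exact cut-off depends on compiler version, optimization level and target features, so the 256/512 byte figures are observed defaults rather than guarantees.

// Compile with clang -O2 and inspect the assembly: the smaller buffer is typically
// zeroed with inline xmm/ymm stores, while the larger one is usually lowered to a
// call to memset.
extern void consume(char* p);

void small_frame()
{
    char buf[256] = {};   // at/below the ~16-store cut-off
    consume(buf);
}

void large_frame()
{
    char buf[4096] = {};  // well above the cut-off
    consume(buf);
}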
Rebased to kick-start CI
cc @dotnet/jit-contrib
Of those 168 bytes, only 64 bytes are GC refs (you can see this by looking at the untracked GC lifetimes). So perhaps the "bit vector" approach has some appeal?
If we go that route the logic should live in
Think I discovered what was going wrong in the other PR with alignment; so have included aligning
I was wondering if it could be made a little more generic and used for
Right, it would be nice to share logic with other similar cases, though there can be different constraints on codegen (GC-safe vs non-GC-safe, for instance) that impact the allowable code sequences.
Linux build issue looks like #2234.
Perf failures look like download issues, #32835.
@benaadams can you share out your perf tests? I'd like to see us run them on an AMD box and perhaps get them into the performance repo.
Sure: https://gist.github.com/benaadams/fed05032ff6c27e2d245135fe51e5496
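The gist above has the actual tests; purely as a stand-alone illustration of the comparison (assumes x86-64 with GCC/Clang-style inline asm; not the methodology or numbers from this PR), a rough C++ analogue could look like:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <emmintrin.h>

// Zero `qwords` 8-byte slots with rep stosq (rdi = destination, rcx = count, rax = value).
static void zero_rep_stos(void* dst, size_t qwords)
{
    asm volatile("rep stosq" : "+D"(dst), "+c"(qwords) : "a"(uint64_t{0}) : "memory");
}

// Zero `bytes` bytes with unrolled 16-byte xmm stores.
static void zero_xmm(void* dst, size_t bytes)
{
    __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < bytes; i += 16)
        _mm_storeu_si128(reinterpret_cast<__m128i*>(static_cast<char*>(dst) + i), zero);
}

template <typename F>
static double averageNs(F f)
{
    constexpr int iterations = 10'000'000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
    {
        f();
        asm volatile("" ::: "memory"); // keep the stores from being optimized away
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(end - start).count() / iterations;
}

int main()
{
    alignas(16) static char frame[256]; // stand-in for a 256 byte prolog zeroing block
    double repNs = averageNs([] { zero_rep_stos(frame, sizeof(frame) / 8); });
    double xmmNs = averageNs([] { zero_xmm(frame, sizeof(frame)); });
    printf("rep stos: %.2f ns/clear, xmm: %.2f ns/clear\n", repNs, xmmNs);
    return frame[0];
}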
I'll run these on my Ryzen 1800X and Ryzen 3900X tomorrow for additional reference points. Edit: Started building the branch so I can run tests, will expect results shortly.
Thanks, @tannergooding -- looks qualitatively similar to the Intel results. And @benaadams thanks for all your hard work here, this is a really nice improvement. @dotnet/jit-contrib I think this is ready to merge, any final comments?
I'll review this later today.
@erozenfeld didn't see your comment in time -- we can revise if needed.
It might be good to check the perf numbers tomorrow after they have gotten a chance to run 😄
@erozenfeld I was wondering if it changed the cost/benefit for "JIT: Support object stack allocation" any? #11192

e.g.

class Foo
{
    public long f1;
    public long f2;
    public long f3;
    public long f4;

    public Foo(long f1, long f2)
    {
        this.f1 = f1;
        this.f2 = f2;
    }
}

class Test
{
    static long f1;
    static long f2;

    public static int Main()
    {
        Foo foo = new Foo(f1, f2);
        return (int)(foo.f1 + foo.f2);
    }
}

With the object allocation on the heap, the current codegen is:

;* V00 loc0 [V00 ] ( 0, 0 ) ref -> zero-ref class-hnd exact
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T00] ( 5, 10 ) ref -> rax class-hnd exact "NewObj constructor temp"
; V03 tmp2 [V03,T01] ( 2, 4 ) long -> rdx "Inlining Arg"
; V04 tmp3 [V04,T02] ( 2, 4 ) long -> rcx "Inlining Arg"
;
; Lcl frame size = 40
G_M59477_IG01:
sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M59477_IG02:
mov rcx, 0xD1FFAB1E
call CORINFO_HELP_NEWSFAST
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
mov rcx, qword ptr [reloc classVar[0xd1ffab1e]]
mov qword ptr [rax+8], rdx
mov qword ptr [rax+16], rcx
mov edx, dword ptr [rax+8]
add edx, dword ptr [rax+16]
mov eax, edx
;; bbWeight=1 PerfScore 9.50
G_M59477_IG03:
add rsp, 40
ret
;; bbWeight=1 PerfScore 1.25

With object stack allocation enabled, and with this change:

; V00 loc0 [V00,T02] ( 3, 3 ) long -> rax class-hnd exact
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) long -> zero-ref class-hnd exact "NewObj constructor temp"
; V03 tmp2 [V03,T00] ( 2, 4 ) long -> rax "Inlining Arg"
; V04 tmp3 [V04,T01] ( 2, 4 ) long -> rdx "Inlining Arg"
; V05 tmp4 [V05 ] ( 4, 4 ) struct (40) [rsp+0x08] do-not-enreg[XSF] must-init addr-exposed "MorphAllocObjNodeIntoStackAlloc temp"
-; Lcl frame size = 48
+; Lcl frame size = 40
G_M59477_IG01:
- push rdi
- sub rsp, 48
- lea rdi, [rsp+08H]
- mov ecx, 10
- xor rax, rax
- rep stosd
+ sub rsp, 40
+ vxorps xmm4, xmm4
+ vmovdqa xmmword ptr [rsp], xmm4
+ vmovdqa xmmword ptr [rsp+10H], xmm4
+ xor rax, rax
+ mov qword ptr [rsp+20H], rax
- ;; bbWeight=1 PerfScore 27.25
+ ;; bbWeight=1 PerfScore 3.83
G_M59477_IG02:
mov rax, 0xD1FFAB1E
- mov qword ptr [rsp+08H], rax
+ mov qword ptr [rsp], rax
mov rax, qword ptr [reloc classVar[0xd1ffab1e]]
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
- mov qword ptr [rsp+10H], rax
- mov qword ptr [rsp+18H], rdx
- lea rax, [rsp+08H]
+ mov qword ptr [rsp+08H], rax
+ mov qword ptr [rsp+10H], rdx
+ lea rax, [rsp]
mov edx, dword ptr [rax+8]
add edx, dword ptr [rax+16]
mov eax, edx
;; bbWeight=1 PerfScore 10.00
G_M59477_IG03:
- add rsp, 48
+ add rsp, 40
- pop rdi
ret
;; bbWeight=1 PerfScore 1.75
From there, if the lea rax, [rsp] were folded into the address modes of the loads:

G_M59477_IG02:
mov rax, 0xD1FFAB1E
mov qword ptr [rsp], rax
mov rax, qword ptr [reloc classVar[0xd1ffab1e]]
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
mov qword ptr [rsp+08H], rax
mov qword ptr [rsp+10H], rdx
+ mov edx, dword ptr [rsp+08H]
+ add edx, dword ptr [rsp+10H]
- lea rax, [rsp]
- mov edx, dword ptr [rax+8]
- add edx, dword ptr [rax+16]
mov eax, edx

Since it's operating on the frame pointer, could then eliminate the reg->stack, stack->reg?

G_M59477_IG02:
mov rax, 0xD1FFAB1E
mov qword ptr [rsp], rax
mov rax, qword ptr [reloc classVar[0xd1ffab1e]]
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
- mov qword ptr [rsp+08H], rax
- mov qword ptr [rsp+10H], rdx
- mov edx, dword ptr [rsp+08H]
- add edx, dword ptr [rsp+10H]
+ add edx, eax
mov eax, edx

Reverse the add:

G_M59477_IG02:
mov rax, 0xD1FFAB1E
mov qword ptr [rsp], rax
mov rax, qword ptr [reloc classVar[0xd1ffab1e]]
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
- add edx, eax
- mov eax, edx
+ add eax, edx

Then remove the write that is never read:

G_M59477_IG02:
- mov rax, 0xD1FFAB1E
- mov qword ptr [rsp], rax
mov rax, qword ptr [reloc classVar[0xd1ffab1e]]
mov rdx, qword ptr [reloc classVar[0xd1ffab1e]]
add eax, edx

From there, determine we don't need any stack anyway? 🤔
I'm glad this will potentially help with CQ of methods with object stack allocation. Currently the main issue is not the quality of the code that's generated but the fact that very few allocations can be safely moved to the stack without an interprocedural escape analysis. We have some ideas on how to approach that problem but this work is not in near-term plans.
System.Text.Json tests might be one to look at
No problem. I reviewed the changes and they look good.
Thanks for the insight Ben! Looking at the data we are seeing big wins (7%-20%) across the suite of JSON tests. I have added an image of a graph from the report below. That kind of drop is what we are seeing across over half of the JSON tests, with the other half within the noise range of the tests. You can find the full reports here:
I looked but there were no new perf reports on our internal share. Must be that jitted code is now too fast to measure.
Courtesy of @sebastienros (source): +5% to Plaintext (extra 200k requests per second)
Use xmm registers to clear prolog blocks rather than rep stosd.

rep stos does have a shallower growth rate so it will eventually overtake; however that is unlikely in the size range used for stack clearance. Also its variability based on small changes in stack size is problematic, as just adding an extra variable to a method (with no other changes) could add or remove 9ns from its running time, whereas the xmm clear in this PR is much smoother and changes more predictably.
Focusing in on the more common <= 256 byte range:

And throughput:
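To make the growth-rate argument concrete, here is a toy cost model; the constants are placeholders picked only to show the shape of the curves, not measurements from this PR or any particular CPU. rep stos pays a fixed start-up cost and then scales cheaply per byte, while the xmm clear has essentially no start-up cost but needs one store per 16 bytes, so rep stos only wins once the block gets fairly large.

#include <cstdio>

// Placeholder constants, purely illustrative -- measure on real hardware.
constexpr double kRepStosStartup = 30.0;  // fixed cost before rep stos starts streaming
constexpr double kRepStosPerByte = 0.05;  // per-byte cost once streaming
constexpr double kXmmPerStore    = 1.0;   // cost per 16-byte store

double repStosCost(int bytes) { return kRepStosStartup + bytes * kRepStosPerByte; }
double xmmCost(int bytes)     { return (bytes / 16.0) * kXmmPerStore; }

int main()
{
    for (int bytes = 32; bytes <= 4096; bytes *= 2)
    {
        printf("%5d bytes: rep stos ~%6.1f, xmm ~%6.1f (arbitrary units)\n",
               bytes, repStosCost(bytes), xmmCost(bytes));
    }
}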
As the prolog doesn't support jmp labels, I have taught the x86/x64 Jit to understand relative jmps by instruction count (as per Arm) for the prolog. E.g. output prolog for a 1024 byte stack clear method:
Switch to movups rather than movdqu, as it is a shorter encoding with the same functionality, to reduce code size regressions.

Contributes to #8890
Resolves #32396
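On the relative-jmp-by-instruction-count point (the jne SHORT -5 instr in the example diff above): conceptually the emitter can turn "branch back N instructions" into a byte displacement by summing the encoded sizes of the instructions being jumped over. The sketch below uses made-up instruction sizes and an assumed convention (displacement measured from the end of the branch back to the start of the Nth previous instruction); the real emitter's bookkeeping will differ.

#include <cstdio>
#include <vector>

struct EmittedInstr
{
    const char* name;
    int         encodedSize; // made-up sizes, for illustration only
};

// Resolve "jump back instrCount instructions" (instrCount > 0) to a negative byte
// displacement, assuming x86-style relative branches measured from the end of the
// branch instruction itself.
int resolveBackwardBranch(const std::vector<EmittedInstr>& code, int branchIndex, int instrCount)
{
    int bytes = code[branchIndex].encodedSize; // step back over the branch itself
    for (int i = branchIndex - instrCount; i < branchIndex; i++)
        bytes += code[i].encodedSize;
    return -bytes;
}

int main()
{
    std::vector<EmittedInstr> prologLoop = {
        {"movups [rsp+rax+130H], xmm4", 9},
        {"movups [rsp+rax+140H], xmm4", 9},
        {"movups [rsp+rax+150H], xmm4", 9},
        {"add rax, 48",                 4},
        {"jne SHORT (loop head)",       2},
    };
    // Branch at index 4 jumping back to the first movups (4 instructions earlier
    // under this sketch's counting convention).
    printf("rel displacement: %d bytes\n", resolveBackwardBranch(prologLoop, 4, 4));
}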