-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable FEATURE_MULTICASTSTUB_AS_IL for Windows x86 #104192
Conversation
@EgorBot -intel -amd -arm64 using BenchmarkDotNet.Attributes;
using System;
namespace BenchmarkGround
{
public struct GCStruct
{
public object a, b, c, d, e, f, h, g;
}
public class Bench
{
private readonly object obj = new object();
private readonly Guid guid = Guid.NewGuid();
private readonly GCStruct gcStruct = new GCStruct
{
a = new object(),
g = new object()
};
private readonly Action<int> singleArg = (Action<int>)delegate { } + delegate { };
private readonly Action<int> manyCast = (Action<int>)delegate { }
+
delegate { }
+ delegate { }
+ delegate { }
+ delegate { }
+ delegate { };
private readonly Func<int, double, Guid, GCStruct, object, double> manyArg_RetFPU = (Func<int, double, Guid, GCStruct, object, double>)
delegate { return 123.456; }
+ delegate { return 654.321; };
[Benchmark]
public void SingleArg() => singleArg(42);
[Benchmark]
public void ManyCast() => manyCast(42);
[Benchmark]
public double ManyArg_RetFPU() => manyArg_RetFPU(42, 123.0, guid, gcStruct, obj);
}
} |
Tagging subscribers to this area: @mangod9 |
Benchmark results on Intel
|
Benchmark results on Arm64
|
Benchmark results on Amd
|
src/coreclr/vm/comdelegate.cpp
Outdated
pCode->EmitBRTRUE(invokeTraceHelper); | ||
pCode->EmitBR(debuggerCheckEnd); // Tune branch prediction to prefer non-debugging path | ||
|
||
pCode->EmitLabel(invokeTraceHelper); | ||
|
||
pCode->EmitLoadThis(); | ||
pCode->EmitLDLOC(dwLoopCounterNum); | ||
pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pCode->EmitBRTRUE(invokeTraceHelper); | |
pCode->EmitBR(debuggerCheckEnd); // Tune branch prediction to prefer non-debugging path | |
pCode->EmitLabel(invokeTraceHelper); | |
pCode->EmitLoadThis(); | |
pCode->EmitLDLOC(dwLoopCounterNum); | |
pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0); | |
pCode->EmitBRTRUE(invokeTraceHelper); |
And move the debugging path to be after RET
:
pCode->EmitRET();
#ifdef DEBUGGING_SUPPORTED
// Emit debugging support at the end of the method for better perf
pCode->EmitLabel(invokeTraceHelper);
pCode->EmitLoadThis();
pCode->EmitLDLOC(dwLoopCounterNum);
pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0);
pCode->EmitBR(debuggerCheckEnd);
#endif
This should be even better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was exactly what I did locally, and resulted in exact same codegen with current, at least for x86. The compiled asm code block order did follow IL block order exactly.
@dotnet/jit-contrib Do you have any suggestion on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has atypical loop. It looks like the JIT tried to reorder the basic blocks to turn it a more regular loop. I am not sure whether there is anything to fix in the JIT (the JIT would have to have profile data to do better).
It may be still worth it to move TraceHelper call to be at the end in IL. It provides stronger hint about the desired code layout to the JIT and makes this optimization a bit less fragile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well normalizing the loop closer to a for
loop results in about 5% improvement for ManyArg case, but 5% regression for ManyCast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codegen for x64 now:
; Assembly listing for method System.Action:IL_STUB_MulticastDelegate_Invoke():this (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data
G_M000_IG01: ;; offset=0x0000
push rsi
push rbx
sub rsp, 40
mov rbx, rcx
G_M000_IG02: ;; offset=0x0009
xor esi, esi
cmp qword ptr [rbx+0x30], 0
jle SHORT G_M000_IG05
G_M000_IG03: ;; offset=0x0012
test dword ptr [(reloc 0x7ff824c55210)], 512
jne SHORT G_M000_IG06
G_M000_IG04: ;; offset=0x001E
mov rcx, gword ptr [rbx+0x28]
cmp esi, dword ptr [rcx+0x08]
jae SHORT G_M000_IG07
mov rax, gword ptr [rcx+8*rsi+0x10]
mov rcx, gword ptr [rax+0x08]
call [rax+0x18]System.Action:Invoke():this
inc esi
movsxd rcx, esi
cmp rcx, qword ptr [rbx+0x30]
jl SHORT G_M000_IG03
G_M000_IG05: ;; offset=0x003E
add rsp, 40
pop rbx
pop rsi
ret
G_M000_IG06: ;; offset=0x0045
mov rcx, rbx
mov edx, esi
call System.StubHelpers.StubHelpers:MulticastDebuggerTraceHelper(System.Object,int)
jmp SHORT G_M000_IG04
G_M000_IG07: ;; offset=0x0051
call CORINFO_HELP_RNGCHKFAIL
int3
; Total bytes of code 87
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise. Nice simplification!
src/coreclr/vm/comdelegate.cpp
Outdated
pCode->EmitBRTRUE(invokeTraceHelper); | ||
pCode->EmitBR(debuggerCheckEnd); // Tune branch prediction to prefer non-debugging path | ||
|
||
pCode->EmitLabel(invokeTraceHelper); | ||
|
||
pCode->EmitLoadThis(); | ||
pCode->EmitLDLOC(dwLoopCounterNum); | ||
pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has atypical loop. It looks like the JIT tried to reorder the basic blocks to turn it a more regular loop. I am not sure whether there is anything to fix in the JIT (the JIT would have to have profile data to do better).
It may be still worth it to move TraceHelper call to be at the end in IL. It provides stronger hint about the desired code layout to the JIT and makes this optimization a bit less fragile.
|
||
// initialize counter | ||
pCode->EmitLDC(0); | ||
pCode->EmitSTLOC(dwLoopCounterNum); | ||
|
||
// Make the shape of the loop similar to what C# compiler emits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new shape seems to be missing the invocation of the trace helper at the end. E.g. if the invocation count is 2, the trace helper should be called 3 times. It is only called 2 times if I am reading the code correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Result codegen:
; Assembly listing for method System.Action:IL_STUB_MulticastDelegate_Invoke():this (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data
G_M000_IG01: ;; offset=0x0000
push rsi
push rbx
sub rsp, 40
mov rbx, rcx
G_M000_IG02: ;; offset=0x0009
xor esi, esi
jmp SHORT G_M000_IG04
G_M000_IG03: ;; offset=0x000D
mov rcx, gword ptr [rbx+0x28]
cmp esi, dword ptr [rcx+0x08]
jae SHORT G_M000_IG08
mov rax, gword ptr [rcx+8*rsi+0x10]
mov rcx, gword ptr [rax+0x08]
call [rax+0x18]System.Action:Invoke():this
inc esi
G_M000_IG04: ;; offset=0x0024
test dword ptr [(reloc 0x7ff84ea05210)], 512
jne SHORT G_M000_IG06
G_M000_IG05: ;; offset=0x0030
movsxd rcx, esi
cmp rcx, qword ptr [rbx+0x30]
jl SHORT G_M000_IG03
jmp SHORT G_M000_IG07
G_M000_IG06: ;; offset=0x003B
mov rcx, rbx
mov edx, esi
call System.StubHelpers.StubHelpers:MulticastDebuggerTraceHelper(System.Object,int)
jmp SHORT G_M000_IG05
G_M000_IG07: ;; offset=0x0047
add rsp, 40
pop rbx
pop rsi
ret
G_M000_IG08: ;; offset=0x004E
call CORINFO_HELP_RNGCHKFAIL
int3
; Total bytes of code 84
JIT reorders the debugging block in front of the ret
. Benchmark shows no regression. It's just a bit strange with the unconditional jumps.
Closes #103958 .
Local benchmark result for
Action<int>
on x86: