-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT should use boxing helper to box structs with many GC pointers #101699
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
Marking as Future as we already have a lot of issues for 9.0, but I think I can try to land it this week |
Working prototype (CoreCLR): EgorBo@1c9f3a4 |
@jkotas what would be the best way to handle possible gc starvation for huge structs - can we handle that on VM's side inside the |
|
I have some questions about further optimizing GC pointer copying: Currently the actual copy is done in FCall so that GC can only happen at chunk boundaries. Large copy are split in to chunks and implicit GC poll happens when transiting between managed and FCall every time. The actual |
I'd expect NEON to kick in here implicitly as is.
Probably? Depends on how often we see large inputs here, we, probably, do see them when we do various Array/Span CopyTo, but in case of the struct copy discussed here - unlikely (a struct must be really big to see impact from AVX).
I'd guess that the simplest option would be to have a dynamic ISA check in the existing code? Just like we e.g. check for atomics on arm64 in C++ code. Obviously, it means + one more condition overhead for small inputs PS: one imprortant thing we have to keep in mind is the atomicity guarantees, e.g. SIMD loads/store do have them when input is 16 byte aligned (and the alignment is not changed) on x86/64 |
hm, I finished my prototype but for some reason I am not seeing the expected perf improvements compared to raw box helper (on Windows-x64) 🤔 GcStruct9 s9;
[Benchmark]
public object BoxS9() => s9;
record struct GcStruct9(
string a1, string a2, string a3,
string a4, string a5, string a6,
string a7, string a8, string a9); Codegen in Main: ; Method MyBench:BoxS9():System.Object:this (FullOpts)
G_M000_IG01:
push rdi
push rsi
push rbx
sub rsp, 32
mov rsi, rcx
G_M000_IG02:
mov rcx, 0x7FFBDD35BE20
call CORINFO_HELP_NEWSFAST
mov rbx, rax
add rsi, 40
lea rdi, bword ptr [rbx+0x08]
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
mov rax, rbx
G_M000_IG03:
add rsp, 32
pop rbx
pop rsi
pop rdi
ret
; Total bytes of code: 92 Codegen in the prototype: ; Method MyBench:BoxS9():System.Object:this (FullOpts)
G_M5744_IG01:
push rsi
push rbx
sub rsp, 40
mov rbx, rcx
G_M5744_IG02:
mov rcx, 0x7FFB87E5B0C8 ; GcStruct9
call CORINFO_HELP_NEWSFAST
mov rsi, rax
lea rcx, bword ptr [rsi+0x08]
lea rdx, bword ptr [rbx+0x28]
mov r8d, 72
call CORINFO_HELP_ASSIGN_STRUCT
mov rax, rsi
G_M5744_IG03:
add rsp, 40
pop rbx
pop rsi
ret
; Total bytes of code: 56 Main: 13.5 ns At first, I thought that the box helper is faster because it knows that the pointers never overlap, but it seems to be calling the same routine as my prototype... |
You have not done this part: The JIT helper implementation should look like Buffer::BulkMoveWithWriteBarrier. It polls for GC among other things. |
How about inline array? I did an experiment about using AVX in copy, and disabling SSE. Benchmark code: [Params(1024, 200000, 4000000)]
public int ArraySize { get; set; }
[GlobalSetup]
public void Setup()
{
src_str = Enumerable.Range(1, ArraySize).Select(i => $"abcd{i}").ToArray();
src_str_unordered = [.. src_str];
new Random(20240430).Shuffle(src_str_unordered);
dst_str = new string[ArraySize];
src_nint = Enumerable.Range(1, ArraySize).Select(i => (nint)i).ToArray();
dst_nint = new nint[ArraySize];
}
private string[] src_str;
private string[] src_str_unordered;
private string[] dst_str;
private nint[] src_nint;
private nint[] dst_nint;
[Benchmark]
public string[] StringArray()
{
src_str.CopyTo(dst_str, 0);
return dst_str;
}
[Benchmark]
public string[] StringArray_Unordered()
{
src_str_unordered.CopyTo(dst_str, 0);
return dst_str;
}
[Benchmark]
public nint[] NintArray()
{
src_nint.CopyTo(dst_nint, 0);
return dst_nint;
}
} Results:
It seems that using longer vector size has more impact on relatively smaller size. Huge size are more limited by CPU cache I think? |
More numbers for smaller sizes:
Conclusion: using AVX is more impactful for arrays (size 256 ~ 1024), not structs (size 8 ~ 32). It can be a separated topic. |
@jkotas hm.. do you have an idea how a GC poll fixes the perf here? I am a bit puzzled as it seems to be fixing it indeed 🤔
@huoyaoyuan I guess if you figure out how to insert AVX there properly without noticeable regressions then it should be a good change? |
|
Unoptimized code is faster than optimized code: #101605 (comment)
#101605 (comment)
Prototype with some extra discussion at #101669.
The text was updated successfully, but these errors were encountered: