Investigate adjusting heuristics for unrolled block copies/initialization #82529
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak

Issue Details

Zero initialization becomes expensive when two-byte struct fields are involved. Given:

```csharp
unsafe struct S1
{
    fixed byte a[10];
    int b;
    fixed byte c[23];
    fixed byte d[24];
    fixed byte e[25];
}

unsafe struct S2
{
    fixed short a[10];
    int b;
    fixed short c[23];
    fixed short d[24];
    fixed short e[25];
}
```

Unlike S1, initializing S2 does not inline the CORINFO_HELP_MEMSET call and fails to make use of vectors:

```asm
; S1 X1() { S1 s = default; return s; }
C:X1():S1:this:
sub rsp, 24
vzeroupper
mov rax, qword ptr [(reloc)]
mov qword ptr [rsp+10H], rax
xor eax, eax
vxorps ymm0, ymm0
vmovdqu ymmword ptr[rsi], ymm0
vmovdqu ymmword ptr[rsi+20H], ymm0
vmovdqu xmmword ptr [rsi+40H], xmm0
mov qword ptr [rsi+50H], rax
mov rax, rsi
lea rdi, [(reloc)]
mov rdi, qword ptr [rdi]
cmp qword ptr [rsp+10H], rdi
je SHORT G_M61359_IG03
call [CORINFO_HELP_FAIL_FAST]
G_M61359_IG03:
nop
add rsp, 24
ret
; S2 X2() { S2 s = default; return s; }
C:X2():S2:this:
push rbx
sub rsp, 16
mov rax, qword ptr [(reloc)]
mov qword ptr [rsp+08H], rax
mov rbx, rsi
xor esi, esi
mov rdi, rbx
mov edx, 168
call [CORINFO_HELP_MEMSET]
mov rax, rbx
lea rdi, [(reloc)]
mov rdi, qword ptr [rdi]
cmp qword ptr [rsp+08H], rdi
je SHORT G_M46095_IG03
call [CORINFO_HELP_FAIL_FAST]
G_M46095_IG03:
nop
add rsp, 16
pop rbx
ret
```

The codegen can try to match that of the C++ compiler:

```asm
X2():                                  # @X2()
mov rax, rdi
vxorps xmm0, xmm0, xmm0
vmovups ymmword ptr [rdi + 128], ymm0
vmovups ymmword ptr [rdi + 96], ymm0
vmovups ymmword ptr [rdi + 64], ymm0
vmovups ymmword ptr [rdi + 32], ymm0
vmovups ymmword ptr [rdi], ymm0
mov qword ptr [rdi + 160], 0
vzeroupper
ret
```

https://godbolt.org/z/9G4dPs5jG
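For anyone who wants to reproduce these listings locally, here is a minimal sketch (assuming a console project with `AllowUnsafeBlocks` enabled; `DOTNET_JitDisasm` is the standard environment variable for dumping JIT disassembly, and the method names simply mirror the ones above):

```csharp
// Build in Release and run with, e.g.:
//   DOTNET_JitDisasm="X1 X2" dotnet run -c Release
using System.Runtime.CompilerServices;

unsafe struct S1
{
    public fixed byte a[10];
    public int b;
    public fixed byte c[23];
    public fixed byte d[24];
    public fixed byte e[25];
}

unsafe struct S2
{
    public fixed short a[10];
    public int b;
    public fixed short c[23];
    public fixed short d[24];
    public fixed short e[25];
}

class Program
{
    // NoInlining keeps each method a standalone unit of codegen,
    // so its disassembly matches the listings above.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static S1 X1() { S1 s = default; return s; }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static S2 X2() { S2 s = default; return s; }

    static void Main()
    {
        X1();
        X2();
    }
}
```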
---
Whether or not we call the helper depends on the size of the struct. It is a heuristic, so this is really about whether the heuristic should be adjusted:

runtime/src/coreclr/jit/lowerxarch.cpp, lines 475 to 496 in 3bc4f0e
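For intuition, a minimal sketch of the kind of size-based cutoff involved is shown below. The names and threshold are purely illustrative, not the actual JIT code; the real heuristic (which considers more than size alone) is in the lowerxarch.cpp lines referenced above:

```csharp
// Illustrative only: unroll the zero-init inline when the block fits
// in a handful of SIMD stores; otherwise call a memset-style helper
// (CORINFO_HELP_MEMSET in the listings above).
static bool ShouldUnrollBlockInit(int sizeInBytes, int simdRegSizeInBytes)
{
    const int MaxUnrolledStores = 4; // hypothetical cap
    return sizeInBytes <= MaxUnrolledStores * simdRegSizeInBytes;
}
```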
---
Another issue was that even without the helper call (the X1 case), clang's codegen is better:

```asm
X1():                                  # @X1()
mov rax, rdi
vxorps xmm0, xmm0, xmm0
vmovups ymmword ptr [rdi + 56], ymm0
vmovups ymmword ptr [rdi + 32], ymm0
vmovups ymmword ptr [rdi], ymm0
vzeroupper
ret
```

+ call to fail-fast (if necessary?)

---
Some of the code is due to the stack cookie that we are storing/checking, which the native compiler doesn't do. Setting that aside, clang's stores are:

```asm
vxorps xmm0, xmm0, xmm0
vmovups ymmword ptr [rdi + 56], ymm0
vmovups ymmword ptr [rdi + 32], ymm0
vmovups ymmword ptr [rdi], ymm0
```

So clang is doing an "unaligned" (overlapping) block write instead of our:

```asm
vxorps ymm0, ymm0
vmovdqu ymmword ptr[rsi], ymm0
vmovdqu ymmword ptr[rsi+20H], ymm0
vmovdqu xmmword ptr [rsi+40H], xmm0
mov qword ptr [rsi+50H], rax
```

That definitely seems like something we might be able to improve. cc @tannergooding, not sure if there would be any concerns about writes that cross cache lines that we don't already have? (Presumably not, since a struct like this is anyway only 4-byte aligned.)
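For reference, the overlapping-store idea expressed as a C# sketch (a hypothetical helper, not JIT or clang code; it assumes `size >= 32` and AVX support):

```csharp
using System.Runtime.Intrinsics;

static unsafe class BlockInit
{
    // Zero `size` bytes at `p` the way clang does above: full-width
    // unaligned 32-byte stores, finishing with one store that overlaps
    // the previous one instead of falling back to narrower stores.
    public static void ZeroOverlapping(byte* p, nuint size)
    {
        Vector256<byte> zero = Vector256<byte>.Zero;
        nuint i = 0;
        for (; i + 32 <= size; i += 32)
            zero.Store(p + i);
        if (i != size)
            zero.Store(p + size - 32); // overlapping tail store
    }
}
```

For the 88-byte S1 this produces exactly clang's three stores, at offsets 0, 32, and 56.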
---

Shouldn't be an issue for most of the scenarios we have. Cache-line splits tend to have a very small impact, so the two individual stores that would otherwise avoid the split need to be set up just right to be "faster" (or the splits need to be frequent enough to add up to something measurable). Page splits are worse, but also much less frequent.
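If one wanted to measure that, a rough BenchmarkDotNet sketch comparing a store contained in one cache line against one that deliberately crosses a 64-byte line might look like this (names and sizes are illustrative):

```csharp
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public unsafe class SplitStoreBench
{
    private byte* _buf;

    [GlobalSetup]
    public void Setup() =>
        _buf = (byte*)NativeMemory.AlignedAlloc(4096, 64); // 64-byte aligned

    [GlobalCleanup]
    public void Cleanup() => NativeMemory.AlignedFree(_buf);

    [Benchmark(Baseline = true)]
    public void AlignedStore() =>
        Vector256<byte>.Zero.Store(_buf); // bytes 0..31, one cache line

    [Benchmark]
    public void SplitStore() =>
        Vector256<byte>.Zero.Store(_buf + 48); // bytes 48..79, crosses a line
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<SplitStoreBench>();
}
```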
---

@jakobbotsch, thanks for this pointer. It looks like clang is able to prove that these functions are not vulnerable to stack smashing, so it produces the same result even with the stack protector enabled.

---
Clang still does not seem to produce any stack guards for the following code, which is clearly vulnerable to buffer overflows, even when I specify the stack-protector flag:

```cpp
void foo(char* p);
S1 X1()
{
S1 s { };
foo(s.e);
return s;
}
```

---
With that said, we could definitely omit it in some cases, I suppose. Today we just turn the check on whenever we see a struct with an unsafe fixed buffer in it.
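To make that policy concrete, a small example, assuming the trigger is simply "the struct contains a fixed buffer", as described above:

```csharp
// A local of this type would get the stack-cookie check, because the
// struct contains an unsafe fixed buffer.
unsafe struct WithFixedBuffer
{
    public fixed byte Data[16];
}

// A plain struct of similar size has no fixed buffer, so no check.
struct PlainBytes
{
    public long A, B;
}
```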
---

Setting to Future for now. I don't think we can accommodate it in .NET 8.

---
With #83638, memset is inlined for the int16 case as well:

```asm
C:X2():S2:this:
sub rsp, 24
vzeroupper
mov rax, qword ptr [(reloc)]
mov qword ptr [rsp+10H], rax
xor eax, eax
vxorps ymm0, ymm0
vmovdqu ymmword ptr[rsi], ymm0
vmovdqu ymmword ptr[rsi+20H], ymm0
vmovdqu ymmword ptr[rsi+40H], ymm0
vmovdqu ymmword ptr[rsi+60H], ymm0
vmovdqu ymmword ptr[rsi+80H], ymm0
mov qword ptr [rsi+A0H], rax
mov rax, rsi
lea rdi, [(reloc)]
mov rdi, qword ptr [rdi]
cmp qword ptr [rsp+10H], rdi
je SHORT G_M46095_IG03
call [CORINFO_HELP_FAIL_FAST]
G_M46095_IG03:
nop
add rsp, 24
ret
C:.ctor():this:
ret
```

Thanks @EgorBo! :)