-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
port SpanHelpers.IndexOfAny(ref byte, byte, byte, int) to Vector128/256 #73556
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory Issue DetailsIt's #73384 again I've tried to repro #73474 locally and I've failed. Now I want to use the CI to verify if the issue is gone or not. ./build.sh --os browser -c Release
make -C src/mono/wasm provision-wasm
export EMSDK_PATH=/home/adam/projects/runtime/src/mono/wasm/emsdk
./dotnet.sh build /t:Test /p:TargetOS=Browser /p:TargetArchitecture=wasm /p:RunAOTCompilation=true /p:EnableAggressiveTrimming=true -c Release ./src/libraries/System.Runtime/tests
|
@jkotas the |
What fixed the underlying issue? The crash was caused by reading beyond end of the buffer. The CI may be green by luck - the memory after the end of the buffer might have been initialized just right to avoid the crash in the given run. @vargaz or @lambdageek should inspect the IL to confirm that the problem is gone. Also, we should find the change that fixed it. |
Ok, #73493 fixed the out of bound read, replaced it by rejecting the whole method with BadImageFormatException. Do we still have a functional issue at runtime? |
System.Memory tests exercise this code path and they are passing on machine with AVX2
|
Do the tests exercise the AOT code of method, or are they falling back to the interpreter silently? If it is the latter, we are likely going to see perf regressions. |
I am using instructions provided by @lambdageek in #73474 (comment) to run these tests /p:RunAOTCompilation=true /p:EnableAggressiveTrimming=true I would expect that they exercise the AOT code of method. How can I check if they don't fall back to interpreter silently? |
I don’t know how to check that. Somebody from the Mono team needs to help you with at that. |
I guess |
I've tried to benchmark WASM AOT but I don't think that it's currently possible: dotnet/BenchmarkDotNet#1818 (comment) |
/azp run runtime-wasm |
Azure Pipelines successfully started running 1 pipeline(s). |
I'm trying the System.Memory tests locally. I would expect any call to IndexOfAny to throw BIFE on wasm AOT. there shouldn't be a fallback to the interpreter, AFAIK. Perhaps we're skipping some of these tests already for other reasons |
Ah interesting. So this PR is good; i'm working, separately, on adding a fallthru check to the interpreter |
Do you agree that this PR is going to introduce perf regression since the method won’t be AOT compiled anymore? |
Yes, I should qualify "good" - this PR wont' crash the AOT compiler or the runtime anymore. The perf won't be good. |
actually dynamically this code is unreachable |
@jkotas @lambdageek @vitek-karas I am using similar patterns in few other PRs and aiming to port all But why? Is it because I am using |
Yes, something in the refactored methods exposed a bug in the IL linker. Native AOT does not have the bug. It is not sharing the relevant part of the linker. |
I'm not sure, but I think I have a workaround: if I change the last else
{
Debug.Assert (Vector.IsHardwareAccelerated); then I can also delete Debug.Fail("Unreachable");
goto NotFound; and then in the IL that comes out of the linker, I don't see a conditional jump at the end of the method anymore. (I'm not sure if this has any undesired impact on CoreCLR performance) |
My guess is that the bug is triggered by certain shape of the method flowgraph. It is not about types used. |
Sorry - I didn't get to this yet (first on my list after emails/management)... Early1:
Early2:
if (something)
{
goto Early1;
}
else
{
goto Early2;
} Basically the method ends with an if/else where both branches go backwards. In that case it's invalid to remove the "false" branch, because there's nothing left in the method - and even though that branch is technically not reachable, it needs to exist for validity. |
Can it still remove it and fold |
I'll have to go through the code - but in general the code in the linker is very simplistic and tries to avoid changing the number of instructions (it's problematic to move code around in that setup) - so it tries to keep the same number of instructions. It's also relatively problematic to fully remove the ldc.0 and so on - if the condition is over a call then removing the entire call is relatively complicated - and we don't do that. Which means we need to "pop" somehow. I'm considering either disabling the optimization in this case or to force insert a dummy ret at the end of the method. |
ldnull + throw may be better. Ret requires exactly zero or one items on the evaluation stack. |
@jkotas @lambdageek @vitek-karas Thanks for the explanation! I've removed all the gotos, similified the code and ensured that there are no regressions for configs that I can benchmark (CLR x64 AVX2 AVX & arm64 AdvSIMD). The perf has not regressed, it's on par. When reading the AVX2 (x64)BenchmarkDotNet=v0.13.1.1845-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
[Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
Job-FHASLE : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-JRHZCS : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
LaunchCount=9
Code gen before: ; System.SpanHelpers.IndexOfAny(Byte ByRef, Byte, Byte, Int32)
push rsi
vzeroupper
movzx edx,dl
mov eax,edx
movzx r8d,r8b
mov r10d,r8d
xor r11d,r11d
mov esi,r9d
movsxd r9,r9d
add r9,0FFFFFFFFFFFFFFF0
js near ptr M00_L01
mov rsi,r9
jmp near ptr M00_L13
M00_L00:
add rsi,0FFFFFFFFFFFFFFF8
movzx edx,byte ptr [rcx+r11]
cmp eax,edx
je near ptr M00_L05
cmp r10d,edx
je near ptr M00_L05
movzx edx,byte ptr [rcx+r11+1]
cmp eax,edx
je near ptr M00_L06
cmp r10d,edx
je near ptr M00_L06
movzx edx,byte ptr [rcx+r11+2]
cmp eax,edx
je near ptr M00_L07
cmp r10d,edx
je near ptr M00_L07
movzx edx,byte ptr [rcx+r11+3]
cmp eax,edx
je near ptr M00_L08
cmp r10d,edx
je near ptr M00_L08
movzx edx,byte ptr [rcx+r11+4]
cmp eax,edx
je near ptr M00_L09
cmp r10d,edx
je near ptr M00_L09
movzx edx,byte ptr [rcx+r11+5]
cmp eax,edx
je near ptr M00_L10
cmp r10d,edx
je near ptr M00_L10
movzx edx,byte ptr [rcx+r11+6]
cmp eax,edx
je near ptr M00_L11
cmp r10d,edx
je near ptr M00_L11
movzx edx,byte ptr [rcx+r11+7]
cmp eax,edx
je near ptr M00_L12
cmp r10d,edx
je near ptr M00_L12
add r11,8
M00_L01:
cmp rsi,8
jae near ptr M00_L00
cmp rsi,4
jb short M00_L02
add rsi,0FFFFFFFFFFFFFFFC
movzx edx,byte ptr [rcx+r11]
cmp eax,edx
je short M00_L05
cmp r10d,edx
je short M00_L05
movzx edx,byte ptr [rcx+r11+1]
cmp eax,edx
je short M00_L06
cmp r10d,edx
je short M00_L06
movzx edx,byte ptr [rcx+r11+2]
cmp eax,edx
je short M00_L07
cmp r10d,edx
je short M00_L07
movzx edx,byte ptr [rcx+r11+3]
cmp eax,edx
je short M00_L08
cmp r10d,edx
je short M00_L08
add r11,4
M00_L02:
test rsi,rsi
je short M00_L04
M00_L03:
movzx edx,byte ptr [rcx+r11]
cmp eax,edx
je short M00_L05
cmp r10d,edx
je short M00_L05
inc r11
dec rsi
jne short M00_L03
M00_L04:
mov eax,0FFFFFFFF
vzeroupper
pop rsi
ret
M00_L05:
mov eax,r11d
jmp near ptr M00_L20
M00_L06:
lea eax,[r11+1]
jmp near ptr M00_L20
M00_L07:
lea eax,[r11+2]
jmp near ptr M00_L20
M00_L08:
lea eax,[r11+3]
jmp near ptr M00_L20
M00_L09:
lea eax,[r11+4]
jmp near ptr M00_L20
M00_L10:
lea eax,[r11+5]
jmp near ptr M00_L20
M00_L11:
lea eax,[r11+6]
jmp near ptr M00_L20
M00_L12:
lea eax,[r11+7]
jmp near ptr M00_L20
M00_L13:
cmp rsi,10
jb short M00_L16
vmovd xmm0,edx
vpbroadcastb ymm0,xmm0
vmovd xmm1,r8d
vpbroadcastb ymm1,xmm1
add rsi,0FFFFFFFFFFFFFFF0
je short M00_L15
M00_L14:
vmovupd ymm2,[rcx+r11]
vpcmpeqb ymm3,ymm0,ymm2
vpcmpeqb ymm2,ymm1,ymm2
vpor ymm2,ymm3,ymm2
vpmovmskb eax,ymm2
test eax,eax
jne near ptr M00_L19
add r11,20
cmp rsi,r11
ja short M00_L14
M00_L15:
vmovupd ymm2,[rcx+rsi]
mov r11,rsi
vpcmpeqb ymm0,ymm0,ymm2
vpcmpeqb ymm1,ymm1,ymm2
vpor ymm0,ymm0,ymm1
vpmovmskb eax,ymm0
test eax,eax
jne short M00_L19
jmp near ptr M00_L04
M00_L16:
vmovd xmm0,edx
vpbroadcastb xmm0,xmm0
vmovd xmm1,r8d
vpbroadcastb xmm1,xmm1
test rsi,rsi
je short M00_L18
M00_L17:
vmovupd xmm2,[rcx+r11]
vpcmpeqb xmm3,xmm0,xmm2
vpcmpeqb xmm2,xmm1,xmm2
vpor xmm2,xmm3,xmm2
vpmovmskb eax,xmm2
test eax,eax
jne short M00_L19
add r11,10
cmp rsi,r11
ja short M00_L17
M00_L18:
vmovupd xmm2,[rcx+rsi]
mov r11,rsi
vpcmpeqb xmm0,xmm0,xmm2
vpcmpeqb xmm1,xmm1,xmm2
vpor xmm0,xmm0,xmm1
vpmovmskb eax,xmm0
test eax,eax
je near ptr M00_L04
M00_L19:
tzcnt eax,eax
add r11,rax
jmp near ptr M00_L05
M00_L20:
vzeroupper
pop rsi
ret
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
sbb [rcx],eax
add [rax],eax
add [rax],esp
; Total bytes of code 663 Codegen after: .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2; System.SpanHelpers.IndexOfAny(Byte ByRef, Byte, Byte, Int32)
vzeroupper
xor r10d,r10d
cmp r9d,20
jl short M00_L02
nop dword ptr [rax]
mov eax,r9d
and eax,0FFFFFFE0
movzx r11d,dl
vmovd xmm0,r11d
vpbroadcastb ymm0,xmm0
movzx r11d,r8b
vmovd xmm1,r11d
vpbroadcastb ymm1,xmm1
M00_L00:
vmovdqu ymm2,ymmword ptr [rcx+r10]
vpcmpeqb ymm3,ymm0,ymm2
vpcmpeqb ymm2,ymm1,ymm2
vpor ymm2,ymm3,ymm2
vptest ymm2,ymm2
jne short M00_L01
add r10,20
cmp r10,rax
jb short M00_L00
jmp short M00_L05
nop dword ptr [rax]
M00_L01:
vpmovmskb r9d,ymm2
xor eax,eax
tzcnt eax,r9d
add eax,r10d
jmp near ptr M00_L09
M00_L02:
cmp r9d,10
jl short M00_L05
mov eax,r9d
and eax,0FFFFFFF0
movzx r11d,dl
vmovd xmm0,r11d
vpbroadcastb xmm0,xmm0
movzx r11d,r8b
vmovd xmm1,r11d
vpbroadcastb xmm1,xmm1
M00_L03:
vmovdqu xmm2,xmmword ptr [rcx+r10]
vpcmpeqb xmm3,xmm0,xmm2
vpcmpeqb xmm2,xmm1,xmm2
vpor xmm2,xmm3,xmm2
vptest xmm2,xmm2
jne short M00_L04
add r10,10
cmp r10,rax
jb short M00_L03
jmp short M00_L05
xchg ax,ax
M00_L04:
vpmovmskb edx,xmm2
xor eax,eax
tzcnt eax,edx
add eax,r10d
jmp short M00_L09
M00_L05:
movzx eax,dl
movzx edx,r8b
mov r8d,r9d
cmp r10,r8
jae short M00_L07
M00_L06:
movzx r9d,byte ptr [rcx+r10]
cmp eax,r9d
je short M00_L08
cmp edx,r9d
je short M00_L08
inc r10
cmp r10,r8
jb short M00_L06
M00_L07:
mov eax,0FFFFFFFF
vzeroupper
ret
M00_L08:
mov eax,r10d
M00_L09:
vzeroupper
ret
add [rcx],bl
add [rax],al
add [rax],al
add [rax],al
add [rax],al
add [rax],al
add [rax],al
add [rax],al
add [rax-55C021C],bh
jg short M00_L10
M00_L10:
add [rbp+48],dl
(bad)
; Total bytes of code 291 AVX (x64)BenchmarkDotNet=v0.13.1.1845-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
[Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
Job-YLMNVS : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
Job-DFRNLR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
EnvironmentVariables=COMPlus_EnableAVX2=0 LaunchCount=9
AdvSIMD(arm64)BenchmarkDotNet=v0.13.1.1845-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22408.5
[Host] : .NET 7.0.0 (7.0.22.40308), Arm64 RyuJIT AdvSIMD
Job-NZXMHK : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-LQCIXE : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
LaunchCount=9
|
I think I have a tentative fix for the linker bug (with the gotos - I couldn't find any other way to get to that state 😉 ). I'll clean it up and if possible we should try it on your previous change... anyway, thanks a lot for finding it! |
Hi @adamsitnik, I'm looking at the char equivalent of IndexOfAny. It's half done (blocked by #73777) but can add as a draft PR if you are going to find it useful. |
|
We don't need it anymore due to #73556 |
It's #73384 again
I've tried to repro #73474 locally and I've failed. Now I want to use the CI to verify if the issue is gone or not.