-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fatal error. Internal CLR error. (0x80131506) #69653
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
I can reproduce this. It looks like we get an AV and then hit a stackoverflow in EH while trying to handle that AV. The top frames of the call stack that shows the repeating segment is 0:034> kn30
# Child-SP RetAddr Call Site
00 000000da`233c3e80 00007ff9`c5911dc8 coreclr!FixupDispatcherContext+0x3c [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 5514]
01 (Inline Function) --------`-------- coreclr!FixupDispatcherContext+0x10 [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 5642]
02 000000da`233c43c0 00007ffa`155b209f coreclr!FixContextHandler+0x44 [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 5768]
03 000000da`233c4400 00007ffa`15561454 ntdll!RtlpExecuteHandlerForException+0xf
04 000000da`233c4430 00007ffa`155b0bce ntdll!RtlDispatchException+0x244
05 000000da`233c4b40 00007ffa`132c4fd9 ntdll!KiUserExceptionDispatch+0x2e
06 000000da`233c5250 00007ff9`c577d799 KERNELBASE!RaiseException+0x69
07 000000da`233c5330 00007ff9`c577d7cb coreclr!NakedThrowHelper2+0x9
08 000000da`233c5360 00007ff9`c577d7d5 coreclr!NakedThrowHelper_RspAligned+0x1e
09 000000da`233c5888 00007ff9`66927030 coreclr!NakedThrowHelper_FixRsp+0x5
0a 000000da`233c5890 00007ff9`b8f9f5dc 0x00007ff9`66927030
0b 000000da`233c58c0 00007ff9`c577c803 System_Private_CoreLib+0x1ef5dc
0c 000000da`233c5930 00007ff9`c56e11cb coreclr!CallDescrWorkerInternal+0x83
0d (Inline Function) --------`-------- coreclr!CallDescrWorkerWithHandler+0x56 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 67]
0e 000000da`233c5970 00007ff9`c5818c3b coreclr!MethodDescCallSite::CallTargetWorker+0x247 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 570]
0f (Inline Function) --------`-------- coreclr!MethodDescCallSite::Call_RetSTRINGREF+0x17 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.h @ 462]
10 000000da`233c5aa0 00007ff9`c5818c7a coreclr!GetExceptionMessage+0x9b [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 277]
11 000000da`233c5bb0 00007ff9`c57b8bff coreclr!GetExceptionMessage+0xe [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 248]
12 000000da`233c5be0 00007ffa`155b209f coreclr!ProcessCLRException+0x172cff [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 1042]
13 000000da`233c5e00 00007ffa`15561454 ntdll!RtlpExecuteHandlerForException+0xf
14 000000da`233c5e30 00007ffa`155b0bce ntdll!RtlDispatchException+0x244
15 000000da`233c6540 00007ffa`132c4fd9 ntdll!KiUserExceptionDispatch+0x2e
16 000000da`233c6c50 00007ff9`c577d799 KERNELBASE!RaiseException+0x69
17 000000da`233c6d30 00007ff9`c577d7cb coreclr!NakedThrowHelper2+0x9
18 000000da`233c6d60 00007ff9`c577d7d5 coreclr!NakedThrowHelper_RspAligned+0x1e
19 000000da`233c7288 00007ff9`66927030 coreclr!NakedThrowHelper_FixRsp+0x5
1a 000000da`233c7290 00007ff9`b8f9f5dc 0x00007ff9`66927030
1b 000000da`233c72c0 00007ff9`c577c803 System_Private_CoreLib+0x1ef5dc
1c 000000da`233c7330 00007ff9`c56e11cb coreclr!CallDescrWorkerInternal+0x83
1d (Inline Function) --------`-------- coreclr!CallDescrWorkerWithHandler+0x56 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 67]
1e 000000da`233c7370 00007ff9`c5818c3b coreclr!MethodDescCallSite::CallTargetWorker+0x247 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 570]
1f (Inline Function) --------`-------- coreclr!MethodDescCallSite::Call_RetSTRINGREF+0x17 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.h @ 462]
20 000000da`233c74a0 00007ff9`c5818c7a coreclr!GetExceptionMessage+0x9b [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 277]
21 000000da`233c75b0 00007ff9`c57b8bff coreclr!GetExceptionMessage+0xe [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 248]
22 000000da`233c75e0 00007ffa`155b209f coreclr!ProcessCLRException+0x172cff [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 1042]
23 000000da`233c7800 00007ffa`15561454 ntdll!RtlpExecuteHandlerForException+0xf
24 000000da`233c7830 00007ffa`155b0bce ntdll!RtlDispatchException+0x244
25 000000da`233c7f40 00007ffa`132c4fd9 ntdll!KiUserExceptionDispatch+0x2e
26 000000da`233c8650 00007ff9`c577d799 KERNELBASE!RaiseException+0x69
27 000000da`233c8730 00007ff9`c577d7cb coreclr!NakedThrowHelper2+0x9
28 000000da`233c8760 00007ff9`c577d7d5 coreclr!NakedThrowHelper_RspAligned+0x1e
29 000000da`233c8c88 00007ff9`66927030 coreclr!NakedThrowHelper_FixRsp+0x5
2a 000000da`233c8c90 00007ff9`b8f9f5dc 0x00007ff9`66927030
2b 000000da`233c8cc0 00007ff9`c577c803 System_Private_CoreLib+0x1ef5dc
2c 000000da`233c8d30 00007ff9`c56e11cb coreclr!CallDescrWorkerInternal+0x83
2d (Inline Function) --------`-------- coreclr!CallDescrWorkerWithHandler+0x56 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 67]
2e 000000da`233c8d70 00007ff9`c5818c3b coreclr!MethodDescCallSite::CallTargetWorker+0x247 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 570]
2f (Inline Function) --------`-------- coreclr!MethodDescCallSite::Call_RetSTRINGREF+0x17 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.h @ 462] The actual crash itself happens inside runtime/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs Lines 2401 to 2417 in 621cd59
Note the shape of runtime/src/coreclr/System.Private.CoreLib/src/System/Runtime/InteropServices/GCHandle.CoreCLR.cs Lines 16 to 23 in 621cd59
This made me strongly suspect it is related to #68739. However, we actually crash during the inlined 00007ff9`66927020 56 push rsi
00007ff9`66927021 4883ec20 sub rsp,20h
00007ff9`66927025 8bf2 mov esi,edx
00007ff9`66927027 488b4110 mov rax,qword ptr [rcx+10h]
00007ff9`6692702b 4885c0 test rax,rax
00007ff9`6692702e 740a je 00007ff9`6692703a
>>> 00007ff9`66927030 488b00 mov rax,qword ptr [rax]
00007ff9`66927033 4885c0 test rax,rax
00007ff9`66927036 7402 je 00007ff9`6692703a
00007ff9`66927038 eb06 jmp 00007ff9`66927040
00007ff9`6692703a ff1560082aff call qword ptr [00007ff9`65bc78a0] (System.RuntimeType.InitializeCache(), mdToken: 0000000006000726)
00007ff9`66927040 488bc8 mov rcx,rax
00007ff9`66927043 8bd6 mov edx,esi
00007ff9`66927045 3909 cmp dword ptr [rcx],ecx
00007ff9`66927047 ff15c3a765ff call qword ptr [00007ff9`65f81810] (System.RuntimeType+RuntimeTypeCache.GetName(System.TypeNameKind), mdToken: 00000000060007C2)
00007ff9`6692704d 90 nop
00007ff9`6692704e 4883c420 add rsp,20h
00007ff9`66927052 5e pop rsi
00007ff9`66927053 c3 ret
0:034> !DumpObj 1e1`2fce6100
Name: System.RuntimeType
MethodTable: 00007ff965bb85d8
EEClass: 00007ff965bab748
Tracked Type: false
Size: 40(0x28) bytes
Type Name: System.AccessViolationException
Type MT: 00007ff966fbdaa0
File: C:\dev\dotnet\investigations\zero\zero.sync\bin\Release\net7.0\win-x64\publish\System.Private.CoreLib.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ff9664cbab0 4000303 100 ...Reflection.Binder 0 static 000001e11a577120 s_defaultBinder
00007ff965bbe630 4000304 9aa System.Char 1 static 2e Delimiter
00007ff965e9a490 4000305 108 System.Type[] 0 static 000001e11a42bc48 EmptyTypes
00007ff965a99280 4000306 110 System.Object 0 static 000001e11a42bc60 Missing
00007ff965f90fb8 4000307 118 ...tion.MemberFilter 0 static 000001e11a42bc78 FilterAttribute
00007ff965f90fb8 4000308 120 ...tion.MemberFilter 0 static 000001e11a42bcb8 FilterName
00007ff965f90fb8 4000309 128 ...tion.MemberFilter 0 static 000001e11a42bd10 FilterNameIgnoreCase
00007ff965a99280 40002a8 8 System.Object 0 instance 0000000000000000 m_keepalive
00007ff965c294b0 40002a9 10 System.IntPtr 1 instance 0000000000096997 m_cache <- not a valid pointer
00007ff965c294b0 40002aa 18 System.IntPtr 1 instance 00007FF966FBDAA0 m_handle
00007ff965bb85d8 40002ab c8 System.RuntimeType 0 static 000001e11a418948 ValueType
00007ff965bb85d8 40002ac d0 System.RuntimeType 0 static 000001e11a418920 ObjectType
00007ff965bb85d8 40002ad d8 System.RuntimeType 0 static 000001e11a400580 StringType
0000000000000000 40002af e0 System.OleAutBinder 0 static 0000000000000000 s_ForwardCallBinder Notice that |
When I run this under debugger I see quite a few different crashes that all point to some heap corruption. It also seems to crash much more quickly than when not run under debugger. |
It looks a lot like allocations are being returned without being zeroed. In one example run we crashed immediately when trying to throw an exception: 0:000> !clrstack -f
OS Thread Id: 0xf5c0 (0)
Child SP IP Call Site
000000213D37CE70 00007FFA132C4FD9 KERNELBASE!RaiseException + 105
000000213D37CF50 00007FF9D84E9BF6 coreclr!RaiseTheExceptionInternalOnly + 678 at C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp:2805
000000213D37D080 00007FF9D85BF229 coreclr!IL_Throw + 185 at C:\dev\dotnet\runtime\src\coreclr\vm\jithelpers.cpp:4014
000000213D37D0E8 [HelperMethodFrame: 000000213d37d0e8]
000000213D37D1E0 00007ff9793f4994 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultBaseEncoded(Byte[], Byte[], Int32) + 148
000000213D37D230 00007ff97940bd99 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ImplSign(Byte[], Int32, Byte[], Byte, Byte[], Int32, Int32, Byte[], Int32) + 329
000000213D37D2E0 00007ff97940bc29 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.Sign(Byte[], Int32, Byte[], Int32, Int32, Byte[], Int32) + 105 The managed function that creates the exception looks like: 0:000> !U /d 00007ff9793f4994
Normal JIT generated code
Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultBaseEncoded(Byte[], Byte[], Int32)
ilAddr is 000001B0A42A6A28 pImport is 000001F8B79FF540
Begin 00007FF9793F4900, size 9b
00007ff9`793f4900 55 push rbp
00007ff9`793f4901 4883ec40 sub rsp,40h
00007ff9`793f4905 488d6c2440 lea rbp,[rsp+40h]
00007ff9`793f490a c5d857e4 vxorps xmm4,xmm4,xmm4
00007ff9`793f490e c5fa7f65e8 vmovdqu xmmword ptr [rbp-18h],xmm4
00007ff9`793f4913 33c0 xor eax,eax
00007ff9`793f4915 488945f8 mov qword ptr [rbp-8],rax
00007ff9`793f4919 48894d10 mov qword ptr [rbp+10h],rcx
00007ff9`793f491d 48895518 mov qword ptr [rbp+18h],rdx
00007ff9`793f4921 44894520 mov dword ptr [rbp+20h],r8d
00007ff9`793f4925 48b9a0c44779f97f0000 mov rcx,7FF97947C4A0h (MT: Org.BouncyCastle.Math.EC.Rfc8032.Ed25519+PointAccum)
00007ff9`793f492f e85c861e5f call coreclr!JIT_TrialAllocSFastMP_InlineGetThread (00007ff9`d85dcf90)
00007ff9`793f4934 488945f0 mov qword ptr [rbp-10h],rax
00007ff9`793f4938 488b4df0 mov rcx,qword ptr [rbp-10h]
00007ff9`793f493c ff15de6d0900 call qword ptr [00007ff9`7948b720] (Org.BouncyCastle.Math.EC.Rfc8032.Ed25519+PointAccum..ctor(), mdToken: 0000000006004504)
00007ff9`793f4942 488b4df0 mov rcx,qword ptr [rbp-10h]
00007ff9`793f4946 48894df8 mov qword ptr [rbp-8],rcx
00007ff9`793f494a 488b4d10 mov rcx,qword ptr [rbp+10h]
00007ff9`793f494e 488b55f8 mov rdx,qword ptr [rbp-8]
00007ff9`793f4952 ff15a06bc9ff call qword ptr [00007ff9`7908b4f8] (Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultBase(Byte[], PointAccum), mdToken: 0000000006000BE1)
00007ff9`793f4958 488b4df8 mov rcx,qword ptr [rbp-8]
00007ff9`793f495c 488b5518 mov rdx,qword ptr [rbp+18h]
00007ff9`793f4960 448b4520 mov r8d,dword ptr [rbp+20h]
00007ff9`793f4964 ff154668c9ff call qword ptr [00007ff9`7908b1b0] (Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.EncodePoint(PointAccum, Byte[], Int32), mdToken: 0000000006000BBE)
00007ff9`793f496a 85c0 test eax,eax
00007ff9`793f496c 7526 jne 00007ff9`793f4994
00007ff9`793f496e 48b9c0a5cf78f97f0000 mov rcx,7FF978CFA5C0h (MT: System.InvalidOperationException)
00007ff9`793f4978 e813861e5f call coreclr!JIT_TrialAllocSFastMP_InlineGetThread (00007ff9`d85dcf90)
00007ff9`793f497d 488945e8 mov qword ptr [rbp-18h],rax
00007ff9`793f4981 488b4de8 mov rcx,qword ptr [rbp-18h]
00007ff9`793f4985 ff1525d490ff call qword ptr [00007ff9`78d01db0] (System.InvalidOperationException..ctor(), mdToken: 000000000600156B)
00007ff9`793f498b 488b4de8 mov rcx,qword ptr [rbp-18h]
00007ff9`793f498f e8dca71c5f call coreclr!IL_Throw (00007ff9`d85bf170)
>>> 00007ff9`793f4994 90 nop
00007ff9`793f4995 4883c440 add rsp,40h
00007ff9`793f4999 5d pop rbp
00007ff9`793f499a c3 ret And the object does not look to have been zeroed: 0:000> !dumpobj 170`1d1d3708
Name: System.InvalidOperationException
MethodTable: 00007ff978cfa5c0
EEClass: 00007ff978ce5ac0
Tracked Type: false
Size: 128(0x80) bytes
File: C:\dev\dotnet\investigations\zero\zero.sync\bin\Release\net7.0\win-x64\publish\System.Private.CoreLib.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ff978ddc328 400023f 8 ...ection.MethodBase 0 instance dddddddddddddddd _exceptionMethod
00007ff978aa2ce0 4000240 10 System.String 0 instance 000001701d1d37d8 _message
00007ff978aa6af8 4000241 18 ...tions.IDictionary 0 instance 0000000000000000 _data
00007ff978aa4688 4000242 20 System.Exception 0 instance 0000000000000000 _innerException
00007ff978aa2ce0 4000243 28 System.String 0 instance dddddddddddddddd _helpURL
00007ff978cf2380 4000244 30 System.Byte[] 0 instance 000001701d1d3a08 _stackTrace
00007ff978cf2380 4000245 38 System.Byte[] 0 instance 0000000000000000 _watsonBuckets
00007ff978aa2ce0 4000246 40 System.String 0 instance 0000000000000000 _stackTraceString
00007ff978aa2ce0 4000247 48 System.String 0 instance dddddddddddddddd _remoteStackTraceString
00007ff978a1a808 4000248 50 System.Object[] 0 instance 0000000000000000 _dynamicMethods
00007ff978aa2ce0 4000249 58 System.String 0 instance 0000000000000000 _source
00007ff978a8a828 400024a 60 System.UIntPtr 1 instance 00007FF9793F4994 _ipForWatsonBuckets
00007ff978a894b0 400024b 68 System.IntPtr 1 instance DDDDDDDDDDDDDDDD _xptrs
00007ff978a49e10 400024c 70 System.Int32 1 instance -532462766 _xcode
00007ff978a49e10 400024d 74 System.Int32 1 instance -2146233079 _HResult 0:000> !verifyheap
object 000001701d1d3708: bad member DDDDDDDDDDDDDDDD at 000001701D1D3710
Last good object: 000001701D1D36C8. cc @dotnet/gc |
Could this crash be the same? build. PR doesn't have any related changes.
|
I hacked a part of code that hooks into the scheduler & thread pool. Specifically this line: It contains a primed Thanks for that! |
So initializing the exception as a static in a struct is causing it to be not initialized correctly? |
No. I think throwing the same instance is the problem. Since removing it the error has gone away. Also the error started around the time I injected that code. I can't tell for sure because those exceptions were triggering at different rates at that time. |
@tactical-drone @mangod9 The codegen above indicates that the "failing" code is constructing and immediately throwing an exception. It should not be possible for such an object to contain a field with an invalid GC ref. I still think this is a GC or runtime bug, but it might be related to something EH ends up doing because of that same exception instance. |
Yea, I am taking a wild guess here. If it is not the exception then great! |
@janvorli as FYI |
@dotnet/gc when I uncomment the 0:043> k
# Child-SP RetAddr Call Site
00 000000e5`ec57a6f8 00007ff9`45d3cad1 KERNELBASE!wil::details::DebugBreak+0x2
01 (Inline Function) --------`-------- coreclr!WKS::FATAL_GC_ERROR+0x5 [C:\dev\dotnet\runtime3\src\coreclr\gc\gcpriv.h @ 27]
02 (Inline Function) --------`-------- coreclr!WKS::verify_mem_cleared+0x4a [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 14501]
03 000000e5`ec57a700 00007ff9`45d66705 coreclr!WKS::gc_heap::adjust_limit_clr+0x591 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 15491]
04 (Inline Function) --------`-------- coreclr!WKS::gc_heap::a_fit_free_list_p+0x3fd [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 15898]
05 000000e5`ec57a780 00007ff9`45d3ee8b coreclr!WKS::gc_heap::soh_try_fit+0x435 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 16405]
06 000000e5`ec57a820 00007ff9`45d68bee coreclr!WKS::gc_heap::allocate_soh+0x28b [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 16537]
07 000000e5`ec57a8c0 00007ff9`45d36e21 coreclr!WKS::gc_heap::try_allocate_more_space+0x33e [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 17489]
08 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate_more_space+0x11 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 17961]
09 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate+0x60 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 17992]
0a 000000e5`ec57a940 00007ff9`459d1d49 coreclr!WKS::GCHeap::Alloc+0x301 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 44969]
0b 000000e5`ec57aa30 00007ff9`459d3fef coreclr!Alloc+0x1e9 [C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp @ 226]
0c 000000e5`ec57ab20 00007ff9`45a365cc coreclr!AllocateSzArray+0x1ef [C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp @ 478]
0d 000000e5`ec57ac40 00007ff8`e8688e35 coreclr!JIT_NewArr1+0x2bc [C:\dev\dotnet\runtime3\src\coreclr\vm\jithelpers.cpp @ 2625] 0:043> !clrstack -f
OS Thread Id: 0xf5f4 (43)
Child SP IP Call Site
000000E5EC57A6F8 00007FFA1335D532 KERNELBASE!wil::details::DebugBreak + 2
000000E5EC57A700 00007FF945D3CAD1 coreclr!WKS::gc_heap::adjust_limit_clr + 1425 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:15491
000000E5EC57A780 00007FF945D66705 coreclr!WKS::gc_heap::soh_try_fit + 1077 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:16405
000000E5EC57A820 00007FF945D3EE8B coreclr!WKS::gc_heap::allocate_soh + 651 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:16537
000000E5EC57A8C0 00007FF945D68BEE coreclr!WKS::gc_heap::try_allocate_more_space + 830 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:17489
000000E5EC57A940 00007FF945D36E21 coreclr!WKS::GCHeap::Alloc + 769 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:44969
000000E5EC57AA30 00007FF9459D1D49 coreclr!Alloc + 489 at C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp:226
000000E5EC57AB20 00007FF9459D3FEF coreclr!AllocateSzArray + 495 at C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp:478
000000E5EC57AC40 00007FF945A365CC coreclr!JIT_NewArr1 + 700 at C:\dev\dotnet\runtime3\src\coreclr\vm\jithelpers.cpp:2625
000000E5EC57AD88 [HelperMethodFrame: 000000e5ec57ad88]
000000E5EC57AEB0 00007ff8e8688e35 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.PointDouble(PointAccum) + 85
000000E5EC57AF20 00007ff8e862cf2d BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultStrausVar(UInt32[], UInt32[], PointAffine, PointAccum) + 461
000000E5EC57AFE0 00007ff8e862bfad BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ImplVerify(Byte[], Int32, Byte[], Int32, Byte[], Byte, Byte[], Int32, Int32) + 653
000000E5EC57B090 00007ff8e862bcf9 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.Verify(Byte[], Int32, Byte[], Int32, Byte[], Int32, Int32) + 105
000000E5EC57B100 00007ff8e862bc75 zero.cocoon.dll!zero.cocoon.identity.CcDesignation.Verify(Byte[], Int32, Int32, Byte[], Int32, Byte[], Int32) + 69 [C:\dev\dotnet\investigations\zero\zero.cocoon\identity\CcDesignation.cs @ 93]
000000E5EC57B150 00007ff8e862587c zero.cocoon.dll!zero.cocoon.models.CcDiscoveries+d__16.MoveNext() + 4908 [C:\dev\dotnet\investigations\zero\zero.cocoon\models\CcDiscoveries.cs @ 288]
000000E5EC57BB40 00007ff8e86244b6 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[zero.cocoon.models.CcDiscoveries+d__16, zero.cocoon]](d__16 ByRef) + 102
000000E5EC57BBA0 00007ff8e862442c System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1[[zero.core.patterns.bushings.contracts.IoJobMeta+JobState, zero.core]].Start[[zero.cocoon.models.CcDiscoveries+d__16, zero.cocoon]](d__16 ByRef) + 28
000000E5EC57BBD0 00007ff8e86243cd zero.cocoon.dll!zero.cocoon.models.CcDiscoveries.ConsumeAsync() + 125 Note that when cloning the repro case you may want to use an older commit as the change discussed above seems to have an effect on the issue. I reproduced it on tactical-drone/zero@6218976. |
Tagging subscribers to this area: @dotnet/gc Issue DetailsDescription
This started happening regularly and somewhat reproducibly since .Net 7 Preview 5 a couple of versions ago. My next level scheduler & async/await optimizations were also taking affect so I cannot be sure. It could just be caused by the increased load achieved instead of something in the runtime.
Reproduction StepsClone & compile https://github.com/tactical-drone/zero and run it. Cluster bootstrap tests regularly fail with that error (1/5) chance. Expected behaviorProfit. Actual behaviorRandomly every 2nd or 3rd run I get this error in RELEASE mode. I have never seen this error happen in DEBUG. What is awesome is, that it is very reproducible. Borg technology is designed to push the runtime to the limits. Regression?This worked in .net 7 preview 4 Known WorkaroundsNone ConfigurationCPU: amd 3900x Other informationI have no idea. The cluster test needs 10Gb memory in RELEASE mode, 20GB in DEBUG.
|
If the problem goes away after deleting the exception singleton (https://github.com/tactical-drone/zero/blob/760d42656136c29e0e219c757e7558ddbe18385c/zero.core/patterns/semaphore/core/IoManualResetValueTaskSourceCore.cs#L20), it is most likely a race condition in how the stacktrace is saved into the exception. Are you able to see the same |
@jkotas based on @cshung's advice, @jakobbotsch has enabled a piece of code in the GC that verifies that the memory that is given out for the allocation context is zeroed. And the assert in that check is firing. So it seems to indicate the problem of uninitialized exception object fields is in GC. While the stack trace would likely get garbled if that exception instance is thrown from multiple threads, it should not influence fields like |
Keep in mind that exception "singleton" was on a generic struct, so there are multiple of them. But the ones that were thrown in this case were of only @jkotas Yes they might throw in parallel and inside places like |
Or it can indicate a buffer overrun or incorrectly protected OBJECTREF in the VM. |
What is the exact command to run the project? |
I published the zero.sync project self-contained, copied checked bits in there and just ran zero.sync.exe. |
On the latest revision I don't see those errors happening anymore. There is still sometimes strange crashes, this time something about But to be fair, I do stress the runtime on purpose with hybrid attachments to core parts of the runtime that might be causing the problems. If you press When you are finished and what to perform the hardcore test that .net fails every time, enter |
The crash does not repro for me (Intel Core i7, 32GB, 8 logical cores). If anybody can share a full dump (w/ symbols if it is using private coreclr build), I will be happy to take a look. |
I have looked at several dumps from @jakobbotsch . I agree that this looks like bug in the GC. I do not see any pattern that would suggest a GC reporting hole. cc @dotnet/gc From what we have seen so far, the issue repros well on AMD machines. It is hard/impossible to repro on Intel machines. |
@cshung could you please try to repro this on our amd machine if it's not obvious from the dumps why this occurs? |
I like this theory. AFAIK I still think that the Naturally this is pure conjecture because the state I see makes no sense. Not even wild theories can explain. History dictates that the problem is PEBKAC though, but sometimes I wonder. The things I am trying to achieve with these instructions are complex and hard to debug, but I am convinced I have been consistently plagued by some strange phenomenon with this instruction on AMD. What is hard to believe is sometimes it works for 20million iterations, sometimes if fails after 300. The range is large. (and AMD because it has that memory bridge between the two dies that could be causing this) At another part of this chain, even slightly changing this function with an equivalent one but different IL my cluster test would bog down after 300 adjuncts have been added for clustering. So to be clear, my .net scheduler outcome is critically dependent on the outcome of this CAS. So something subtle happening here is eventually bricking my scheduler's queues that depend on the correct cores to be selected by this CAS. AFAIK this code should lead to the same outcomes. This means that there are timing issues around this Fails:
Succeeds:
|
It's possible there are two separate bugs here.
|
@jakobbotsch Good point. Let me explain the reason I mention these CAS issues, is that while I was redesigning my semaphore over and over again it seemed like these parts had the biggest effect on the program totally crashing in these strange ways. The CAS instructions lead to valuetasks that break throwing the exceptions above at high rates causing the original issue. But in a way you are correct, these are separate issues. Naturally I don't think these bugs will occur if you use .net like a civilized individual. These errors will not occur when I disable my scheduler and my scheduler depends on 2 CAS instructions to work. The rest of the concurrency issues are solved by using runtime tech instead of my own contraptions. Unless the glue provided in ZeroCore is fundamentally wrong. I spent a lot of time coming up with that design (using runtime tech instead of another contraption), so I would be surprised but sad if there was a fundamental issue in the way I multiplex To be clear. I don't want to send you guys on a wild goose chase when I mess with custom schedulers. Obviously any crashes reported is a good thing. The least I can do is tell you how to disable my custom bits so that you can decern better if you are chasing my bugs. |
Was the assert in |
Yes. |
We have tracked down the first issue above to seemingly be a WinDbg bug where the upper half of Interestingly, on Intel CPUs the case does not repro and after stepping I believe that @janvorli will investigate the second bug, which is likely the real bug reported here. |
@jakobbotsch Excellent news. W.R.T the second problem. Not sure what I am doing wrong but I have had at 3 separate code paths now where my CAS fails if I apply pressure tests. I believe one of these failures are proven to be correct code because it is runtime tech. The reason why I struggle is the pressure needs to be high for the bug to appear, so many times I think I fixed the problem when instead I made the performance worse. Current status quo is, I see old CAS instructions coming out of nowhere jamming my state. Not sure if they are stuck in the scheduler, the CPU or my broken logic. Maybe I should just run this code in Linux VM and see what happens. Interestingly if I lower the node count to more civilized values such as With higher values such as |
I can repro the
The corruption is quite strange. The object at frame 9 is of type
This is the memory at the object address:
Here is what the reference that is in the
And here is a history of that object:
|
And more more piece of the puzzle - the heap verification shows a heap corruption
That corresponds to the corruption shown above. The
However, this reference has no GC root. |
Thanks to @jakobbotsch's hint, I have found that the
Looking at the generic argument of the |
In other words, somehow the _sentinel that's an unsafely cast |
I use that sentinel as a instance of the type that I am queueing to manage concurrency. That sentinel is only inserted into the Q buffer using these CAS instructions that I complain about. Not sure if it was strictly necessary to be there. I was trying to fix the CAS by making the other side detect the sentinel instead of issuing another cmp&swap spam that could jam. If I take out the sentinel the code becomes massively unstable, which should not be. This is a lead. |
@tactical-drone It is never safe to reinterpret an object reference of one type as an object reference of another type where the same cast would not succeed. It will cause runtime instability issues, like you see in this issue. I have applied the following patch that eliminates the use of @@ -50,8 +50,8 @@ namespace zero.core.patterns.queue
//TODO: tuning
capacity = Math.Max(4, capacity);
_hwm = _capacity = 1;
- _storage = new T[32][];
- _storage[0] = _fastStorage = new T[_capacity];
+ _storage = new object[32][];
+ _storage[0] = _fastStorage = new object[_capacity];
var v = IoMath.Log2((ulong)capacity) - 1;
var scaled = false;
@@ -66,8 +66,8 @@ namespace zero.core.patterns.queue
else
{
_hwm = _capacity = capacity;
- _storage = new T[1][];
- _storage[0] = _fastStorage = new T[_capacity];
+ _storage = new object[1][];
+ _storage[0] = _fastStorage = new object[_capacity];
}
if (_blockingCollection)
@@ -83,7 +83,7 @@ namespace zero.core.patterns.queue
_curEnumerator = new IoQEnumerator<T>(this);
- _sentinel = Unsafe.As<T>(new object());
+ _sentinel = new object();
}
#region packed
@@ -92,10 +92,10 @@ namespace zero.core.patterns.queue
private readonly string _description;
- private readonly T[][] _storage;
- private readonly T[] _fastStorage;
+ private readonly object[][] _storage;
+ private readonly object[] _fastStorage;
- private readonly T _sentinel;
+ private readonly object _sentinel;
private readonly object _syncRoot = new();
private volatile IoQEnumerator<T> _curEnumerator;
@@ -161,11 +161,11 @@ namespace zero.core.patterns.queue
Debug.Assert(idx >= 0);
//if (!IsAutoScaling) return Volatile.Read(ref _fastStorage[idx % _capacity]);
- if (!IsAutoScaling) return _fastStorage[idx % _capacity];
+ if (!IsAutoScaling) return (T)_fastStorage[idx % _capacity];
idx %= Capacity;
var i = IoMath.Log2(unchecked((ulong)idx + 1));
- return _storage[i][idx - ((1 << i) - 1)];
+ return (T)_storage[i][idx - ((1 << i) - 1)];;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
protected set
@@ -207,7 +207,7 @@ namespace zero.core.patterns.queue
if (_primedForScale == 1 && Head <= cap2 && Tail <= cap2 && (Tail > Head || Tail == Head && _count < cap2) && Interlocked.CompareExchange(ref _primedForScale, 2, 1) == 1 || force)
{
var hwm = 1 << (_virility + 1);
- _storage[_virility + 1] = new T[hwm];
+ _storage[_virility + 1] = new object[hwm];
Interlocked.Add(ref _hwm, hwm);
Interlocked.Increment(ref _virility);
Interlocked.Exchange(ref _primedForScale, 0);
@@ -226,7 +226,7 @@ namespace zero.core.patterns.queue
/// <param name="compare">The compare value</param>
/// <returns>The previous value</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
- private T CompareExchange(long idx, T value, T compare)
+ private object CompareExchange(long idx, object value, object compare)
{
if (!IsAutoScaling) return Interlocked.CompareExchange(ref _fastStorage[idx % _capacity], value, compare);
@@ -401,7 +401,7 @@ namespace zero.core.patterns.queue
T latch;
while ((head = Head) >= Tail || (latch = this[head]) == _sentinel || latch == null || head != Head ||
- (slot = CompareExchange(head, _sentinel, latch)) != latch)
+ (slot = (T)CompareExchange(head, _sentinel, latch)) != latch)
{
if (_count == 0 || Zeroed)
{
@@ -418,7 +418,7 @@ namespace zero.core.patterns.queue
this[head] = null;//restore the sentinel
Interlocked.Decrement(ref _count);
Interlocked.Increment(ref _head);
return true;
}
catch (Exception e) When I apply this patch, I see exceptions like the following: 2022/05/27 13:30:43.597|ERROR[ IoZeroQ`1 ( 317)]: TryDequeue failed! System.Object ChkCastAny(Void*, System.Object) System.InvalidCastException: Unable to cast object of type 'System.Object' to type 'IoZNode[zero.core.patterns.bushings.IoSink`1[zero.core.feat.models.protobuffer.CcProtocMessage`2[Zero.Models.Protobuf.chroniton,zero.cocoon.models.batches.CcDiscoveryBatch]]]'.
at zero.core.patterns.queue.IoZeroQ`1.TryDequeue(T& slot) in C:\dev\dotnet\investigations\zero\zero.core\patterns\queue\IoZeroQ.cs:line 431 |
I've also noticed that are couple of |
And one more detail - in the dump I've taken at the crash, I can actually see the sentinel to be stored somewhere out of the IoZeroQ:
This looks like a static variable holding a Task. The ValueTuple are most likely the captured locals, the first member being of type
|
I can see that setting the string to |
@jakobbotsch Sure thing! I was totally hacking there. I was hoping Your solution casting to object should work? I might have that internal bug that is masked by this sentinel. Your cast errors is an indication of this. Let me try that patch. |
I am a bit confused now. I cant get CAS to work on objects. It throws |
You probably ended up with array covariance by storing |
Oops! I missed an entire part of the patch. Your patch passes smoke tests. I don't get those cast errors. Let me make bigger tests. |
I thought that fixed it. All my tests pass everything runs very smoothly now. The liveness under load looks better too. So smoothly in fact that I end up back at Everything just work then suddenly out of nowhere CPU flatline with Let me see if I can check in progress. |
Just an extra bit of detail that concludes how we got to the corruption. The problem was triggered by the public async ValueTask DisposeAsync(IIoNanite @from, string reason, [CallerFilePath] string filePath = null, [CallerMemberName] string methodName = null, [CallerLineNumber] int lineNumber = default)
{
if (GetType() == typeof(object))
{
Environment.FailFast("DisposeAsync called on System.Object!");
} The call stack displayed by the FailFast shown the ZeroManagedAsync was the caller. It seemed a bit strange as the caller seems to check if the reference it calls the DisposeAsync on implements
But I guess the compiler somehow eliminated the check since the type of the |
@janvorli Great stuff! Indeed, that is a bug. Dispose should not be called on the sentinel, but surely must pass that check otherwise For now I have removed the sentinel entirely anyway as I was using it to jam the CAS timing. |
@tactical-drone I assume we can close this issue now, right? |
@janvorli Thanks! Good luck! |
Description
This started happening regularly and somewhat reproducibly since .Net 7 Preview 5 a couple of versions ago. My next level scheduler & async/await optimizations were also taking affect so I cannot be sure. It could just be caused by the increased load achieved instead of something in the runtime.
zero.cocoon.models.CcDiscoveries+<ConsumeAsync>d__16.MoveNext()
is the part that reads and parses delivered msgs from UDP backend with protobuf & and tries to verify signatures. The signature verification fails. The rate at which this function is called is high, hence eventually the error happens. So the error self is very low probability but the cluster makes it higher.Reproduction Steps
Clone & compile https://github.com/tactical-drone/zero and run it.
Cluster bootstrap tests regularly fail with that error (1/5) chance.
Expected behavior
Profit.
Actual behavior
Randomly every 2nd or 3rd run I get this error in RELEASE mode. I have never seen this error happen in DEBUG.
What is awesome is, that it is very reproducible. Borg technology is designed to push the runtime to the limits.
Regression?
This worked in .net 7 preview 4
Known Workarounds
None
Configuration
CPU: amd 3900x
RAM: 64GB
7.0.0-preview.5.22266.11
Other information
I have no idea. The cluster test needs 10Gb memory in RELEASE mode, 20GB in DEBUG.
PointDouble
Does a tonne of small byte array allocations and does some basic math on it littered with bit shifts.The text was updated successfully, but these errors were encountered: