[Help] [RISC-V] GC crash on System.Diagnostics.Tests.DiagnosticSourceTest.AllSubscriberStress #105808
Comments
You need to find where the bad GC reference came from. We typically debug these types of crashes using stresslog. #45557 (comment) is an example of the stresslog setting to use. Stresslog has information about what happened during the past GCs - where the thread stopped for the GC, objects that got enumerated during stackwalk, etc. This crash is likely caused by an object reference missed during stackwalk. You should be able to find the object reference that was missed from the stresslog.
Thanks a lot for this example! I will check the StressLog output and the GC info via JitDisasm for the methods in my test case.
I collected a StressLog on riscv64; the output is binary. I converted this binary log to text with the x64 StressLogAnalyzer. The text has "BAD Method" and "BAD MethodTable" instead of real names.
Simple program:
using System;
using System.Diagnostics;
namespace Hello
{
    class Program
    {
        static void CreateString()
        {
            var mfo = new String("ASDASD");
        }
        static void Main()
        {
            // Create a string without reference
            CreateString();
            // Force garbage collection.
            GC.Collect();
            Process process = Process.GetCurrentProcess();
            System.Console.WriteLine("GC collected. Press enter to exit. Cur ID = {0}", process.Id);
            System.Console.ReadLine();
        }
    }
}
Environment variables:
declare -x DOTNET_LogFacility="0x00080001"
declare -x DOTNET_LogLevel="6"
declare -x DOTNET_StressLog="1"
declare -x DOTNET_StressLogSize="2000000"
declare -x DOTNET_TotalStressLogSize="40000000"
declare -x DOTNET_StressLogFilename="/home/runtime/StressLog.gccollect_once.bin"
Run command:
./corerun ./gccollect_once.dll
Analyze:
./StressLogAnalyzer ../../StressLog.gccollect_once.bin -o:../../StressLog.gccollect_once.txt -a
BAD names in StressLog: StressLog.txt
@jkotas Did I do something wrong?
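To gauge how widespread the unresolved names are, the analyzer's text output can be scanned for the two markers quoted above. This is a minimal sketch, not part of any runtime tooling: the function name `count_bad_names` is hypothetical, and it only assumes that the markers appear literally as "BAD Method" and "BAD MethodTable" in the text file.

```shell
#!/bin/sh
# count_bad_names FILE
# Hypothetical helper: prints how many lines of the text log mention each
# unresolved-name marker. Counts are per line, not per occurrence.
count_bad_names() {
    log="$1"
    # Lines that mention "BAD MethodTable" (-F: fixed string, -c: count lines).
    bad_mt=$(grep -c -F 'BAD MethodTable' "$log")
    # Lines that mention "BAD Method" but not "BAD MethodTable".
    # The second grep is needed because "BAD Method" is a prefix of the other marker.
    bad_m=$(grep -F 'BAD Method' "$log" | grep -v -c -F 'BAD MethodTable')
    echo "BAD Method: $bad_m"
    echo "BAD MethodTable: $bad_mt"
}
```

With the file names used above, it could be invoked as `count_bad_names ../../StressLog.gccollect_once.txt`.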
I always use the DumpLog command from the SOS debugger extension to save the stresslog into a text file (https://github.com/dotnet/diagnostics/blob/main/documentation/sos.md).
I'm currently not fixing the pretty-printing in #104999. It's something we can look into in the future. I know the GC team is also interested in support for running StressLogAnalyzer on a dump with an embedded stresslog instead of a memory-mapped log (I think we can do that in the future), which would make the pretty-printing work more reliably.
@t-mustafin Do you still need help with diagnosing this issue, or can it be closed?
@jkotas I did not find the root cause. The investigation is postponed for now, and I don't know whether the problem still reproduces on the latest main.
Would be fixed by #109512.
Common information
The test System.Diagnostics.Tests.DiagnosticSourceTest.AllSubscriberStress crashes intermittently on a VisionFive2 board with TieredCompilation disabled.
Run command and log:
The test behavior is intermittent: different asserts may fire, a SIGSEGV may occur, or the test may even pass. The most common crash path is inside Object::Validate, called from a GC invocation. Backtrace for the assert crash above is:
The test case creates 100 threads, and each thread allocates 100-104 DiagnosticListener objects. Frame 13 above, whose this pointer is of type Thread, gives information about the thread with the problematic stack that breaks the stack-walking algorithm:
Hidden gdblog
For launch and debugging purposes I used the following environment variables:
JitStdOutFile.txt, clr.log and perf-26084.map, collected together with the logs above, are attached: JitStdOut_clrlog_perfmap.tar.gz
Address of jitted code Xunit.Assert::All as Object*
Sometimes the Object* in the Object::Validate() frame points to jitted code, as in this log:
Hidden gdblog
Hijack::UpdateRegDisplay() A0 register for riscv64 as for arm64
Sometimes OnHijackTripThread occurs in the backtrace and Hijack::UpdateRegDisplay() prints to clr.log. Hijack::UpdateRegDisplay() differs between arm64 and riscv64. The arm64 code updates the X0 value:
runtime/src/coreclr/vm/arm64/stubs.cpp
Lines 782 to 790 in f4866b0
The riscv64 code does not update the A0 value:
runtime/src/coreclr/vm/riscv64/stubs.cpp
Lines 736 to 742 in f4866b0
However, the intermittent crash in GC -> Object::Validate() still occurs after applying the patch:
HeapVerify
Several (~5) runs of the test with the environment variable
export DOTNET_HeapVerify=1
did not show any other assert or backtrace for this test.
@Maoni0 @jkotas @jakobbotsch I would be grateful for any tips, advice, or directions on how to debug this intermittent crash.
cc @dotnet/samsung
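Because the crash is intermittent, repeated launches such as the HeapVerify runs described above are easier to compare when the failure count is recorded. The following is only a sketch: the function name `run_repeated` and the log file names are hypothetical, and the actual test command has to be supplied by the caller, since it is not reproduced here.

```shell
#!/bin/sh
# run_repeated N CMD [ARGS...]
# Hypothetical helper: runs CMD N times with DOTNET_HeapVerify=1 set for each
# run, saves the output of every run to heapverify_run_<i>.log in the current
# directory, and reports how many runs exited with a non-zero status
# (an assert or a SIGSEGV both end up as a non-zero status).
run_repeated() {
    runs="$1"
    shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # The variable is set only for this one invocation of the command.
        if ! DOTNET_HeapVerify=1 "$@" > "heapverify_run_$i.log" 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures of $runs runs failed"
}
```

The per-run log files make it possible to check afterwards whether different runs hit different asserts.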