[Linux][arm] System.Memory.Tests failed with SIGSEGV on rolling build #60825

Closed
krwq opened this issue Oct 25, 2021 · 11 comments
Labels
arch-arm32 area-GC-coreclr os-linux Linux OS (any supported distro)
Milestone

Comments

@krwq
Member

krwq commented Oct 25, 2021

Rolling build: https://dev.azure.com/dnceng/public/_build/results?buildId=1437467&view=results

  Discovering: System.Memory.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Memory.Tests (found 2366 of 2386 test cases)
  Starting:    System.Memory.Tests (parallel test collections = on, max threads = 4)
./RunTests.sh: line 162:   100 Segmentation fault      (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Memory.Tests.runtimeconfig.json --depsfile System.Memory.Tests.deps.json xunit.console.dll System.Memory.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/root/helix/work/workitem/e
----- end Sat Oct 23 08:57:31 UTC 2021 ----- exit code 139 ----------------------------------------------------------
exit code 139 means SIGSEGV Illegal memory access. Deref invalid pointer, overrunning buffer, stack overflow etc. Core dumped.
ulimit -c value: unlimited

In case it's needed, you can get the full log.

net7.0-Linux-Release-arm64-CoreCLR_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-16.04-helix-arm64v8-20210106155927-56c6673
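
The log's note that exit code 139 means SIGSEGV follows the usual POSIX shell convention: a child killed by signal N is reported as exit status 128 + N, and SIGSEGV is 11 on Linux. A minimal sketch of that decoding follows. The helper name `signal_from_exit_code` is hypothetical and is not part of RunTests.sh or Helix.

```cpp
#include <cassert>
#include <csignal>
#include <optional>

// Hypothetical helper (not from RunTests.sh): if the exit code follows the
// shell's "128 + signal number" convention, return the signal number.
std::optional<int> signal_from_exit_code(int exit_code) {
    if (exit_code > 128 && exit_code < 256) {
        return exit_code - 128;
    }
    return std::nullopt;  // ordinary exit code, no signal encoded
}

// True when the exit code encodes termination by SIGSEGV.
bool exited_with_segv(int exit_code) {
    const std::optional<int> sig = signal_from_exit_code(exit_code);
    return sig.has_value() && *sig == SIGSEGV;
}
```

With this sketch, `exited_with_segv(139)` is true on Linux because 139 - 128 = 11.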

@krwq krwq added arch-arm32 area-System.Memory os-linux Linux OS (any supported distro) labels Oct 25, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Oct 25, 2021
@ghost

ghost commented Oct 25, 2021

Tagging subscribers to this area: @GrabYourPitchforks, @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: krwq
Assignees: -
Labels:

arch-arm32, area-System.Memory, os-linux

Milestone: -

@danmoseley
Member

Can we get a dump?

@carlossanlop
Member

❯ runfo get-helix-payload -o "D:\repos\runfo_results" -j "a89b1515-26cd-4906-b64b-bd64b5f31c4b" -w "System.Memory.Tests"
Payload 5c08e442-3e29-432f-92c9-d2c425e436b5.zip => D:\repos\runfo_results\correlation-payload\5c08e442-3e29-432f-92c9-d2c425e436b5.zip
Payload 396d48f2-9cf8-4ec2-a943-dd73c3bb5676.zip => D:\repos\runfo_results\correlation-payload\396d48f2-9cf8-4ec2-a943-dd73c3bb5676.zip
Payload test-runtime-net7.0-Linux-Release-arm64.zip => D:\repos\runfo_results\correlation-payload\test-runtime-net7.0-Linux-Release-arm64.zip
Payload cd9644f0-5899-4135-a7cf-269eca232b22.zip => D:\repos\runfo_results\correlation-payload\cd9644f0-5899-4135-a7cf-269eca232b22.zip

------ Downloading files for: System.Memory.Tests -------

WorkItem System.Memory.Tests => D:\repos\runfo_results\workitems\System.Memory.Tests\System.Memory.Tests.zip
how-to-debug-dump.md => D:\repos\runfo_results\workitems\System.Memory.Tests\how-to-debug-dump.md
core.1001.100 => D:\repos\runfo_results\workitems\System.Memory.Tests\core.1001.100
console.3ef66c74.log => D:\repos\runfo_results\workitems\System.Memory.Tests\console.3ef66c74.log

There's a dump file in this folder:

❯ ls .\runfo_results\workitems\System.Memory.Tests\

        Directory: D:\repos\runfo_results\workitems\System.Memory.Tests


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a---        10/26/2021  12:08 PM          14922    console.3ef66c74.log
-a---        10/26/2021  12:08 PM      575078400    core.1001.100
-a---        10/26/2021  12:03 PM           6301    how-to-debug-dump.md
-a---        10/26/2021  12:03 PM        1298410    System.Memory.Tests.zip

But since it's a Linux arm64 dump file, you need to follow the instructions in the how-to-debug-dump.md file:

Download the Cross DAC Binaries, open it and choose the flavor that matches the dump you are to debug, and copy those files to C:\helix_payload\System.Memory.Tests\shared\Microsoft.NETCore.App\7.0.0.

The C:\helix_payload folder seems to refer to the contents of the zip file located in D:\repos\runfo_results\correlation-payload\test-runtime-net7.0-Linux-Release-arm64.zip. So I extracted it, and then copied the contents of the downloaded DAC Binaries zip file into the right location.

I opened the dump file with WinDbg, making sure to specify the right symbols folder, and this is what I got:

Error message:

Unable to read memory at <Unavailable>

Callstack:

0:000> k
 # Child-SP          RetAddr               Call Site
00 (Inline Function) --------`--------     libcoreclr!WKS::gc_heap::mark_array_marked+0x14 [/__w/1/s\src/coreclr/gc/gc.cpp @ 8317] 
01 (Inline Function) --------`--------     libcoreclr!WKS::gc_heap::background_mark1+0x14 [/__w/1/s\src/coreclr/gc/gc.cpp @ 22429] 
02 (Inline Function) --------`--------     libcoreclr!WKS::gc_heap::background_mark_simple+0x14 [/__w/1/s\src/coreclr/gc/gc.cpp @ 23647] 
03 0000007e`ef7fad80 0000007f`97aed178     libcoreclr!WKS::gc_heap::background_promote+0xc4 [/__w/1/s\src/coreclr/gc/gc.cpp @ 23735] 
04 0000007e`ef7fada0 0000007f`979131cc     libcoreclr!GcInfoDecoder::EnumerateLiveSlots+0xe44 [/__w/1/s\src/coreclr/vm/gcinfodecoder.cpp @ 938] 
05 0000007e`ef7fb190 0000007f`97a30004     libcoreclr!EECodeManager::EnumGcRefs+0x114 [/__w/1/s\src/coreclr/vm/eetwain.cpp @ 5287] 
06 0000007e`ef7fb340 0000007f`9799da7c     libcoreclr!GcStackCrawlCallBack+0x258 [/__w/1/s\src/coreclr/vm/gcenv.ee.common.cpp @ 310] 
07 0000007e`ef7fb4a0 0000007f`9799dca8     libcoreclr!Thread::MakeStackwalkerCallback+0x98 [/__w/1/s\src/coreclr/vm/stackwalk.cpp @ 833] 
08 0000007e`ef7fb4f0 0000007f`9799e024     libcoreclr!Thread::StackWalkFramesEx+0x184 [/__w/1/s\src/coreclr/vm/stackwalk.cpp @ 914] 
09 0000007e`ef7fb8c0 0000007f`97a2c780     libcoreclr!Thread::StackWalkFrames+0xb8 [/__w/1/s\src/coreclr/vm/stackwalk.cpp @ 996] 
0a 0000007e`ef7fc5d0 0000007f`97a2c50c     libcoreclr!ScanStackRoots+0x1b0 [/__w/1/s\src/coreclr/vm/gcenv.ee.cpp @ 173] 
0b 0000007e`ef7fc680 0000007f`97b61740     libcoreclr!GCToEEInterface::GcScanRoots+0x100 [/__w/1/s\src/coreclr/vm/gcenv.ee.cpp @ 269] 
0c 0000007e`ef7fc6e0 0000007f`97b6030c     libcoreclr!WKS::gc_heap::background_mark_phase+0x890 [/__w/1/s\src/coreclr/gc/gc.cpp @ 33248] 
0d 0000007e`ef7fc7d0 0000007f`97b7824c     libcoreclr!WKS::gc_heap::gc1+0x394 [/__w/1/s\src/coreclr/gc/gc.cpp @ 20321] 
0e 0000007e`ef7fc880 0000007f`97a2f1c0     libcoreclr!WKS::gc_heap::bgc_thread_function+0xb8 [/__w/1/s\src/coreclr/gc/gc.cpp @ 34268] 
0f (Inline Function) --------`--------     libcoreclr!<unnamed-class>::operator()+0x60 [/__w/1/s\src/coreclr/vm/gcenv.ee.cpp @ 1409] 
10 0000007e`ef7fc8e0 0000007f`97d09ac0     libcoreclr!<unnamed-class>::__invoke+0x78 [/__w/1/s\src/coreclr/vm/gcenv.ee.cpp @ 1388] 
11 0000007e`ef7fc920 0000007f`982aefc4     libcoreclr!CorUnix::CPalThread::ThreadEntry+0x1f4 [/__w/1/s\src/coreclr/pal/src/thread/thread.cpp @ 1832] 
12 0000007e`ef7fc9d0 00000000`00000000     libpthread_2_23!_pthread_get_minstack+0x1404

coreclr/gc/gc.cpp:

inline unsigned int gc_heap::mark_array_marked(uint8_t* add)
{
    return mark_array [mark_word_of (add)] & (1 << mark_bit_bit_of (add));
}
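
The faulting frame is a table lookup indexed by the object address, so one plausible reading (not a diagnosis) is that an address outside the range the table covers, or a bad table pointer, makes the read land on unmapped memory. The sketch below shows the same word/bit indexing with a bounds check added. The constants and the names `mark_word_index`, `mark_bit_index`, `is_marked` and `set_marked` are assumptions made for illustration and are not taken from gc.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative constants only: assumptions for this sketch, not values read from gc.cpp.
constexpr std::size_t kBytesPerMarkBit = 16;  // heap bytes covered by one mark bit
constexpr std::size_t kBitsPerMarkWord = 32;  // mark bits stored in one table word
constexpr std::size_t kBytesPerMarkWord = kBytesPerMarkBit * kBitsPerMarkWord;

// Index of the table word that holds the mark bit for addr.
constexpr std::size_t mark_word_index(std::uintptr_t addr) {
    return addr / kBytesPerMarkWord;
}

// Position of the mark bit for addr inside its table word.
constexpr unsigned mark_bit_index(std::uintptr_t addr) {
    return static_cast<unsigned>((addr / kBytesPerMarkBit) % kBitsPerMarkWord);
}

// Bounds-checked lookup: returns false instead of reading past the table.
bool is_marked(const std::vector<std::uint32_t>& mark_words, std::uintptr_t addr) {
    const std::size_t word = mark_word_index(addr);
    if (word >= mark_words.size()) {
        return false;  // an unchecked lookup would read outside the table here
    }
    return (mark_words[word] & (std::uint32_t{1} << mark_bit_index(addr))) != 0;
}

// Sets the mark bit for addr; vector::at throws if addr is outside the table.
void set_marked(std::vector<std::uint32_t>& mark_words, std::uintptr_t addr) {
    mark_words.at(mark_word_index(addr)) |= std::uint32_t{1} << mark_bit_index(addr);
}
```

With a four-word table, marking address 0x210 sets bit 1 of word 1, while a lookup for 0x100000 maps to word 2048 and is rejected by the bounds check instead of dereferencing memory outside the table.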

@ghost

ghost commented Oct 26, 2021

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: krwq
Assignees: -
Labels:

arch-arm32, os-linux, area-GC-coreclr, untriaged

Milestone: -

@danmoseley
Member

Thanks @carlossanlop! GC folks, do you believe this is a GC issue, or is it more likely memory corruption caused by a libraries bug?

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Oct 26, 2021
@mangod9 mangod9 added this to the 7.0.0 milestone Oct 26, 2021
@mangod9
Member

mangod9 commented Oct 26, 2021

Is this a consistent repro? This looks to be some kind of heap/stack corruption.

@carlossanlop
Member

@krwq how often is it happening?

@krwq
Member Author

krwq commented Oct 27, 2021

Definitely not a consistent repro. For rolling builds, this is the only instance I've seen in the last month. I'm not sure about PRs or other pipelines. I'll see if I can find the data (runfo is crashing when I try the search).

@danmoseley
Member

Jobs
| where Queued > ago (180d)
| where Source == "ci/public/dotnet/runtime/refs/heads/main"
| distinct Name
| join WorkItems on $left.Name == $right.JobName
| where FriendlyName == "System.Memory.Tests"
| where ExitCode == 139
| project WorkItemId, Finished, MachineName, QueueName//, ConsoleUri
WorkItemId  Finished                     MachineName   QueueName
752668191   2021-08-24 03:12:44.8050000  ddvsotx2l245  ubuntu.1804.armarch.open
774521271   2021-09-14 14:52:11.9630000  ddvsotx2l167  ubuntu.1804.armarch.open
777373949   2021-09-16 09:54:37.6740000  ddvsotx2l234  ubuntu.1804.armarch.open
783435802   2021-09-22 08:16:38.3710000  ddvsotx2l008  ubuntu.1804.armarch.open
810964964   2021-10-23 08:59:49.8400000  ddvsotx2l182  ubuntu.1804.armarch.open

Just looking at the last 30 days, Arm32 does seem particularly prone to segfaults:

Jobs
| where Queued > ago (30d)
| where Source == "ci/public/dotnet/runtime/refs/heads/main"
| distinct Name
| join WorkItems on $left.Name == $right.JobName
| where ExitCode == 139
| summarize count() by FriendlyName, QueueName
| sort by count_ desc
FriendlyName                                                       QueueName                  count_
DllImportGenerator.Unit.Tests                                      osx.1100.arm64.open        15
System.Text.Json.Tests                                             ubuntu.1804.armarch.open   3
System.Dynamic.Runtime.Tests                                       ubuntu.1804.armarch.open   3
Microsoft.NETCore.Platforms.Tests                                  ubuntu.1804.armarch.open   3
System.Runtime.Serialization.Xml.Tests                             ubuntu.1804.armarch.open   3
System.Runtime.Serialization.Json.Tests                            ubuntu.1804.armarch.open   3
System.Runtime.Serialization.Xml.ReflectionOnly.Tests              ubuntu.1804.armarch.open   3
Microsoft.CSharp.Tests                                             ubuntu.1804.armarch.open   3
System.ServiceModel.Syndication.Tests                              ubuntu.1804.armarch.open   3
System.Runtime.Serialization.Json.ReflectionOnly.Tests             ubuntu.1804.armarch.open   3
System.Text.RegularExpressions.Tests                               ubuntu.1604.amd64.open.rt  1
System.Text.RegularExpressions.Tests                               ubuntu.1804.amd64.open.rt  1
Microsoft.Extensions.DependencyInjection.ExternalContainers.Tests  ubuntu.1804.armarch.open   1
System.Memory.Tests                                                ubuntu.1804.armarch.open   1

That might be consistent with a runtime/GC bug.
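
To make the per-queue picture in the 30-day result easier to see, the rows can be totalled by queue name. This is only a sketch over the numbers quoted above. The `Row` type and the functions `segfault_rows` and `totals_by_queue` are hypothetical names for illustration and are not part of any Helix or Kusto tooling.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One row of the summarized query result quoted above.
struct Row {
    std::string friendly_name;
    std::string queue_name;
    int count;
};

// The 30-day rows, copied from the result in this thread.
std::vector<Row> segfault_rows() {
    const std::string arm = "ubuntu.1804.armarch.open";
    return {
        {"DllImportGenerator.Unit.Tests", "osx.1100.arm64.open", 15},
        {"System.Text.Json.Tests", arm, 3},
        {"System.Dynamic.Runtime.Tests", arm, 3},
        {"Microsoft.NETCore.Platforms.Tests", arm, 3},
        {"System.Runtime.Serialization.Xml.Tests", arm, 3},
        {"System.Runtime.Serialization.Json.Tests", arm, 3},
        {"System.Runtime.Serialization.Xml.ReflectionOnly.Tests", arm, 3},
        {"Microsoft.CSharp.Tests", arm, 3},
        {"System.ServiceModel.Syndication.Tests", arm, 3},
        {"System.Runtime.Serialization.Json.ReflectionOnly.Tests", arm, 3},
        {"System.Text.RegularExpressions.Tests", "ubuntu.1604.amd64.open.rt", 1},
        {"System.Text.RegularExpressions.Tests", "ubuntu.1804.amd64.open.rt", 1},
        {"Microsoft.Extensions.DependencyInjection.ExternalContainers.Tests", arm, 1},
        {"System.Memory.Tests", arm, 1},
    };
}

// Sums the exit-code-139 counts for each queue.
std::map<std::string, int> totals_by_queue(const std::vector<Row>& rows) {
    std::map<std::string, int> totals;
    for (const Row& row : rows) {
        totals[row.queue_name] += row.count;
    }
    return totals;
}
```

Totalled this way, ubuntu.1804.armarch.open accounts for 29 of the 46 segfaults in the window, osx.1100.arm64.open for 15, and the two amd64 queues for 1 each. The tally is by queue name only and says nothing about which tests or commits were involved.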

@mangod9
Member

mangod9 commented Aug 3, 2022

This was being hit early in the .NET 7 timeframe, but the same assert hasn't reproduced since January, so closing for now.

@mangod9 mangod9 closed this as completed Aug 3, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Sep 2, 2022