Segmentation fault in LibraryImportGenerator.Unit.Tests #68443

Closed
jkotas opened this issue Apr 23, 2022 · 19 comments
Labels
area-GC-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' tenet-reliability Reliability/stability related issue (stress, load problems, etc.)

@jkotas
Member

jkotas commented Apr 23, 2022

Use this issue to track only segmentation faults with a stack trace similar to #68443 (comment). Open a separate issue for other types of failures in the LibraryImportGenerator.Unit.Tests test.

Log:

/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: LibraryImportGenerator.Unit.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  LibraryImportGenerator.Unit.Tests (found 121 of 126 test cases)
  Starting:    LibraryImportGenerator.Unit.Tests (parallel test collections = on, max threads = 4)
    LibraryImportGenerator.UnitTests.Compiles.ValidateSnippetsWithMarshalType [SKIP]
      No current scenarios to test.
./RunTests.sh: line 168:    24 Segmentation fault      (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig LibraryImportGenerator.Unit.Tests.runtimeconfig.json --depsfile LibraryImportGenerator.Unit.Tests.deps.json 

Details (failed in #68436): https://dev.azure.com/dnceng/public/_build/results?buildId=1734505&view=ms.vss-test-web.build-test-results-tab&runId=46962718&resultId=188668&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Earlier issue tracking the same failure #67031

@jkotas jkotas added arch-arm64 os-linux Linux OS (any supported distro) blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' labels Apr 23, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 23, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@jkotas jkotas added tenet-reliability Reliability/stability related issue (stress, load problems, etc.) area-System.Runtime.InteropServices labels Apr 23, 2022
@ghost

ghost commented Apr 23, 2022

Tagging subscribers to this area: @dotnet/interop-contrib
See info in area-owners.md if you want to be subscribed.

@danmoseley
Member

From the one above:

(lldb) bt
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV
  * frame #0: 0x0000007fadf67e34 libcoreclr.so`MethodTable::GetLoaderAllocatorObjectForGC() [inlined] Assembly::GetLoaderAllocator(this=0x580000007f3c3d92) at assembly.hpp:145:84
    frame #1: 0x0000007fadf67e34 libcoreclr.so`MethodTable::GetLoaderAllocatorObjectForGC() [inlined] Module::GetLoaderAllocator(this=<unavailable>) at ceeload.inl:464
    frame #2: 0x0000007fadf67e30 libcoreclr.so`MethodTable::GetLoaderAllocatorObjectForGC() [inlined] MethodTable::GetLoaderAllocator(this=0x0000001561a122c0) at methodtable.inl:101
    frame #3: 0x0000007fadf67e2c libcoreclr.so`MethodTable::GetLoaderAllocatorObjectForGC() [inlined] MethodTable::GetLoaderAllocatorObjectHandle(this=0x0000001561a122c0) at methodtable.inl:1360
    frame #4: 0x0000007fadf67e2c libcoreclr.so`MethodTable::GetLoaderAllocatorObjectForGC(this=0x0000001561a122c0) at methodtable.cpp:8550
    frame #5: 0x0000007fae239df4 libcoreclr.so`WKS::gc_heap::mark_object_simple(po=<unavailable>) at gc.cpp:24041:17
    frame #6: 0x0000007fae23fd38 libcoreclr.so`WKS::gc_heap::mark_through_cards_for_uoh_objects(void (*)(unsigned char**), int, int) [inlined] WKS::gc_heap::mark_through_cards_helper(poo=0x000000155f05f070, cg_pointers_found=0x0000007e9caeeba8, fn=(libcoreclr.so`WKS::gc_heap::mark_object_simple(unsigned char**) at gc.cpp:24018), nhigh=0x0000000000000000, next_boundary=0x0000000000000000, condemned_gen=<unavailable>, current_gen=2)(unsigned char**), unsigned char*, unsigned char*, int, int) at gc.cpp:36813:9
    frame #7: 0x0000007fae23fbd4 libcoreclr.so`WKS::gc_heap::mark_through_cards_for_uoh_objects(fn=<unavailable>, gen_num=<unavailable>, relocating=NO)(unsigned char**), int, int) at gc.cpp:42225
    frame #8: 0x0000007fae22e750 libcoreclr.so`WKS::gc_heap::mark_phase(condemned_gen_number=<unavailable>, mark_only_p=NO) at gc.cpp:25788:25
    frame #9: 0x0000007fae22a880 libcoreclr.so`WKS::gc_heap::gc1() at gc.cpp:20608:13
    frame #10: 0x0000007fae236a88 libcoreclr.so`WKS::gc_heap::garbage_collect(n=<unavailable>) at gc.cpp:0
    frame #11: 0x0000007fae2249f8 libcoreclr.so`WKS::GCHeap::GarbageCollectGeneration(this=<unavailable>, gen=0, reason=reason_alloc_soh) at gc.cpp:45988:9
    frame #12: 0x0000007fae227690 libcoreclr.so`WKS::gc_heap::try_allocate_more_space(acontext=0x0000007e9d044088, size=104, flags=2, gen_number=0) at gc.cpp:17479:21
    frame #13: 0x0000007fae253964 libcoreclr.so`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) [inlined] WKS::gc_heap::allocate_more_space(acontext=0x0000007e9d044088, size=<unavailable>, flags=2, alloc_generation_number=0) at gc.cpp:17954:18
    frame #14: 0x0000007fae253950 libcoreclr.so`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) at gc.cpp:17985
    frame #15: 0x0000007fae253910 libcoreclr.so`WKS::GCHeap::Alloc(this=0x0000007fad7f3bc0, context=0x0000007e9d044088, size=104, flags=2) at gc.cpp:44950
    frame #16: 0x0000007fae08a418 libcoreclr.so`Alloc(size=104, flags=GC_ALLOC_CONTAINS_REF) at gchelpers.cpp:226:48
    frame #17: 0x0000007fae088cd0 libcoreclr.so`AllocateSzArray(pArrayMT=0x0000007f3bc97a08, cElements=5, flags=GC_ALLOC_CONTAINS_REF) at gchelpers.cpp:0
    frame #18: 0x0000007fae0b1454 libcoreclr.so`JIT_NewArr1(arrayMT=0x0000007f3bc97a08, size=5) at jithelpers.cpp:2627:16
    frame #19: 0x0000007f3b3af240
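
Since this issue is scoped to crashes with a similar stack trace, it can help to reduce a `bt` listing to a short list of function names before comparing dumps. The following is a minimal sketch, not an existing tool: `FRAME_RE` and `crash_signature` are hypothetical names, and the pattern assumes the lldb frame layout shown above (frame number, address, module, backtick, function).

```python
import re

# Matches lines such as:
#   frame #5: 0x0000007fae239df4 libcoreclr.so`WKS::gc_heap::mark_object_simple(po=...) at gc.cpp:24041:17
# Groups: frame number, module name, first function name on the line.
FRAME_RE = re.compile(r"frame #(\d+): 0x[0-9a-fA-F]+ ([^`\s]+)`([\w:~]+)")

def crash_signature(bt_text, depth=6):
    """Return up to `depth` leading function names from an lldb `bt` listing.

    Consecutive frames that start with the same function (for example a run
    of [inlined] frames) are collapsed into one entry. Frames without a
    module and symbol, such as JIT-compiled code, are skipped.
    """
    names = []
    for line in bt_text.splitlines():
        match = FRAME_RE.search(line)
        if match and (not names or names[-1] != match.group(3)):
            names.append(match.group(3))
    return names[:depth]
```

Applied to the backtrace above, the first entries would be `MethodTable::GetLoaderAllocatorObjectForGC` followed by `WKS::gc_heap::mark_object_simple`, which is the part of the stack that identifies this failure.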

I can't get managed frames, if they matter, as below. Not sure why:

(lldb) runtimes
#0 .NET Core runtime at 0000007FADBEE000 size 009A756D
    Runtime module path: /home/dan/helix_payload/LibraryImportGenerator.Unit.Tests/shared/Microsoft.NETCore.App/7.0.0/libcoreclr.so
    Runtime module directory: /home/dan/helix_payload/LibraryImportGenerator.Unit.Tests/shared/Microsoft.NETCore.App/7.0.0
    DAC: /home/dan/helix_payload/LibraryImportGenerator.Unit.Tests/shared/Microsoft.NETCore.App/7.0.0/libmscordaccore.so
(lldb) sosstatus
Target OS: LINUX Architecture: X64 ProcessId: 24 (0x18)
#0 .NET Core runtime at 0000007FADBEE000 size 009A756D
    Runtime module path: /home/dan/helix_payload/LibraryImportGenerator.Unit.Tests/shared/Microsoft.NETCore.App/7.0.0/libcoreclr.so
    Runtime module directory: /home/dan/helix_payload/LibraryImportGenerator.Unit.Tests/shared/Microsoft.NETCore.App/7.0.0
    DAC: /home/dan/helix_payload/LibraryImportGenerator.Unit.Tests/shared/Microsoft.NETCore.App/7.0.0/libmscordaccore.so


Cache: /home/dan/.dotnet/symbolcache
Server: https://msdl.microsoft.com/download/symbols/
Directory: /home/dan/helix_payload/LibraryImportGenerator.Unit.Tests/shared/Microsoft.NETCore.App/7.0.0
(lldb) clrstack
Failed to load data access module, 0x80004002
Can not load or initialize libmscordaccore.so. The target runtime may not be initialized.

For more information see https://go.microsoft.com/fwlink/?linkid=2135652
ClrStack  failed
(lldb)

@danmoseley
Member

BTW, to find future such dumps (currently this is the only recent one), run this query:

https://engsrvprod.kusto.windows.net/engineeringdata

// Failed LibraryImportGenerator.Unit.Tests work items from jobs queued in the last 3 days
let wi =
WorkItems
| join kind=leftsemi (Jobs | where Queued > ago (3d) ) on $left.JobName == $right.Name
| where ExitCode != 0
| where FriendlyName == "LibraryImportGenerator.Unit.Tests";
Files
| lookup kind=inner wi on $left.WorkItemName == $right.Name
// 139 = 128 + 11 (SIGSEGV), i.e. the process died with a segmentation fault
| where ExitCode ==139
// Keep only the how-to-debug-dump.md file, which links to the dump
| where FileName == "how-to-debug-dump.md"
| join WorkItems on $left.WorkItemName == $right.Name
| join Jobs on $left.JobName == $right.Name
| extend PhaseName = tostring(parse_json(Properties)["System.PhaseName"]),
Pipeline = tostring(parse_json(Properties).DefinitionName),
BuildId = tostring(parse_json(Properties).BuildId)
// Skip the jitstress pipelines
| where Pipeline !contains("jitstress")
| project Timestamp, QueueName, ExitCode, Uri, ConsoleUri, PhaseName, Pipeline, BuildId

The Uri column contains a link to the how-to-debug-dump.md for each work item, which in turn links to the dump, etc.
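
For reference, the `ExitCode ==139` filter in the query selects runs that died from a segmentation fault: POSIX shells report a process killed by signal N as exit status 128 + N, and SIGSEGV is signal 11. A small sketch of that decoding follows; `describe_exit_code` is a hypothetical helper name, not part of any Helix tooling.

```python
import signal

def describe_exit_code(code):
    """Describe a shell-style exit status.

    By shell convention, a status above 128 means the process was
    terminated by signal number (status - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"

print(describe_exit_code(139))  # killed by SIGSEGV
```

This convention is specific to POSIX shells, so it says nothing about how a crash on Windows would show up in the ExitCode column.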

@hez2010
Contributor

hez2010 commented Apr 23, 2022

@danmoseley
Member

Seems this is #68112 ?

@jkotas
Member Author

jkotas commented Apr 25, 2022

I can't get managed frames, if they matter, as below. Not sure why

This is an arm64 dump and you are set up for x64 (the log you shared says "Target OS: LINUX Architecture: X64").

This is the managed part of the stack for reference (as you have said, it probably does not matter).

(lldb) clrstack
OS Thread Id: 0x2e (1)
        Child SP               IP Call Site
0000007E9CAEF0F0 0000007fadf67e34 [HelperMethodFrame: 0000007e9caef0f0]
0000007E9CAEF270 0000007F3B3AF240 System.Reflection.Metadata.MetadataReader.ReadStreamHeaders(System.Reflection.Metadata.BlobReader ByRef) [/_/src/libraries/System.Reflection.Metadata/src/System/Reflection/Metadata/MetadataReader.cs @ 243]
0000007E9CAEF330 0000007F3B3AAF4C System.Reflection.Metadata.MetadataReader..ctor(Byte*, Int32, System.Reflection.Metadata.MetadataReaderOptions, System.Reflection.Metadata.MetadataStringDecoder, System.Object) [/_/src/libraries/System.Reflection.Metadata/src/System/Reflection/Metadata/MetadataReader.cs @ 107]
0000007E9CAEF4B0 0000007F3B3A8E60 System.Reflection.Metadata.PEReaderExtensions.GetMetadataReader(System.Reflection.PortableExecutable.PEReader, System.Reflection.Metadata.MetadataReaderOptions, System.Reflection.Metadata.MetadataStringDecoder) [/_/src/libraries/System.Reflection.Metadata/src/System/Reflection/Metadata/PEReaderExtensions.cs @ 87]
0000007E9CAEF520 0000007F3D8A8DD4 Microsoft.CodeAnalysis.PEModule.InitializeMetadataReader()
0000007E9CAEF560 0000007F3C943AD0 Microsoft.CodeAnalysis.PEModule.get_MetadataReader()
0000007E9CAEF580 0000007F3D8A802C Microsoft.CodeAnalysis.PEModule.GetMetadataModuleNamesOrThrow()
0000007E9CAEF5E0 0000007F3B39BF2C Microsoft.CodeAnalysis.AssemblyMetadata.GetOrCreateData()
0000007E9CAEF660 0000007F3D8A1E14 Microsoft.CodeAnalysis.AssemblyMetadata.GetModules()
0000007E9CAEF690 0000007F3D8A1008 Microsoft.CodeAnalysis.AssemblyMetadata.IsValidAssembly()
0000007E9CAEF6C0 0000007F3D8A598C Microsoft.CodeAnalysis.CommonReferenceManager`2[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].GetMetadata(Microsoft.CodeAnalysis.PortableExecutableReference, Microsoft.CodeAnalysis.CommonMessageProvider, Microsoft.CodeAnalysis.Location, Microsoft.CodeAnalysis.DiagnosticBag)
0000007E9CAEF780 0000007F3B367F80 Microsoft.CodeAnalysis.CommonReferenceManager`2[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].ResolveMetadataReferences(System.__Canon, System.Collections.Generic.Dictionary`2<System.String,System.Collections.Generic.List`1<ReferencedAssemblyIdentity<System.__Canon,System.__Canon>>>, System.Collections.Immutable.ImmutableArray`1<Microsoft.CodeAnalysis.MetadataReference> ByRef, System.Collections.Generic.IDictionary`2<System.ValueTuple`2<System.String,System.String>,Microsoft.CodeAnalysis.MetadataReference> ByRef, System.Collections.Immutable.ImmutableArray`1<Microsoft.CodeAnalysis.MetadataReference> ByRef, System.Collections.Immutable.ImmutableArray`1<AssemblyData<System.__Canon,System.__Canon>> ByRef, System.Collections.Immutable.ImmutableArray`1<Microsoft.CodeAnalysis.PEModule> ByRef, Microsoft.CodeAnalysis.DiagnosticBag)
0000007E9CAEFC30 0000007F3B363D74 Microsoft.CodeAnalysis.CSharp.CSharpCompilation+ReferenceManager.CreateAndSetSourceAssemblyFullBind(Microsoft.CodeAnalysis.CSharp.CSharpCompilation)
...
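
The architecture mismatch noted above (an arm64 dump opened with an x64 setup) can be caught before loading the dump, because Linux core dumps are ELF files and the ELF header records the target machine. The sketch below is only an illustration under that assumption; `EM_NAMES`, `elf_architecture` and `core_dump_architecture` are hypothetical names and are not part of lldb or SOS.

```python
import struct

# e_machine values from the ELF specification: EM_X86_64 = 62, EM_AARCH64 = 183.
EM_NAMES = {62: "x64", 183: "arm64"}

def elf_architecture(header):
    """Return the architecture named in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # e_ident[5] (EI_DATA): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is the 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return EM_NAMES.get(machine, f"unknown ({machine})")

def core_dump_architecture(path):
    """Read just enough of a dump file to report its architecture."""
    with open(path, "rb") as f:
        return elf_architecture(f.read(20))
```

Calling `core_dump_architecture` on the dump path and comparing the result with the architecture of the debugging machine would flag the problem before `clrstack` fails to load the DAC.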

@jkotas
Member Author

jkotas commented Apr 25, 2022

Seems this is #68112 ?

I do not think it is #68112. The problem is that the object that we are marking in frame 5 is invalid. We happen to take the path for collectible types based on the invalid data.

(lldb) frame select 5
frame #5: 0x0000007fae239df4 libcoreclr.so`WKS::gc_heap::mark_object_simple(po=<unavailable>) at gc.cpp:24041:17
(lldb) frame variable
(uint8_t **) po = <variable not available>

(const int) thread = 0
(uint8_t *) o = 0x0000001561a16fe8 "\xc1\"\xa1a\U00000015"
(int) condemned_gen = 0
(size_t) s = 127
(uint8_t *) class_obj = <variable not available>

(uint8_t **) poo = <no location, value may have been optimized out>

(lldb) dumpobj 0x0000001561a16fe8
<Note: this object has an invalid CLASS field>
Invalid object
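
The string that lldb prints for `o` is simply the first bytes at that address rendered as text. As a worked check, the sketch below decodes those bytes as a little-endian 64-bit word and masks the low bits. Treating the low three bits as flag bits is an assumption made for illustration, as is padding the unseen upper bytes with zeros; `decode_header_word` is a hypothetical helper name.

```python
# Bytes lldb printed as the summary of `o`: "\xc1\"\xa1a\U00000015"
first_bytes = bytes([0xC1, ord('"'), 0xA1, ord("a"), 0x15])

def decode_header_word(raw, flag_mask=0x7):
    """Interpret raw as a little-endian 64-bit word.

    Returns (pointer, flags), where flags are the low bits selected by
    flag_mask and pointer is the word with those bits cleared.
    """
    word = int.from_bytes(raw.ljust(8, b"\x00"), "little")
    return word & ~flag_mask, word & flag_mask

pointer, flags = decode_header_word(first_bytes)
print(hex(pointer), flags)  # 0x1561a122c0 1
```

The result, 0x1561a122c0, matches the `this=0x0000001561a122c0` value in frames 2 through 4, so that is the value being used as the MethodTable pointer. It lies in the same address range as the object itself (0x0000001561a16fe8) rather than near the array MethodTable seen in frames 17 and 18 (0x0000007f3bc97a08), which is consistent with the object's first word being invalid data.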

@jkotas
Member Author

jkotas commented Apr 25, 2022

!verifyheap prints thousands of missing card_table entry messages, like:

Object 00000015657b1220:  missing card_table entry for 00000015657B1228
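
For readers unfamiliar with the message: in a generational collector, a card table records which ranges of the older generations may hold references into younger ones, so that a young-generation GC can find those references without scanning the whole heap. A "missing card_table entry" means such a reference exists but its card is not set. The toy model below illustrates the idea only; it is not the CoreCLR implementation, and `CARD_SIZE`, `ToyCardTable` and `find_missing_cards` are hypothetical names and values.

```python
CARD_SIZE = 256  # bytes covered by one card; illustrative value only

class ToyCardTable:
    """Toy model: one dirty flag per CARD_SIZE-byte range of the heap."""

    def __init__(self):
        self.dirty = set()

    def mark(self, slot_address):
        # What a write barrier would do after storing an old-to-young reference.
        self.dirty.add(slot_address // CARD_SIZE)

    def is_marked(self, slot_address):
        return slot_address // CARD_SIZE in self.dirty

def find_missing_cards(cards, cross_gen_slots):
    """Return the slots holding old-to-young references whose card is not set."""
    return [slot for slot in cross_gen_slots if not cards.is_marked(slot)]

cards = ToyCardTable()
cards.mark(0x15657B1000)               # the barrier ran for this range only
slots = [0x15657B1008, 0x15657B1228]   # both hold old-to-young references
for slot in find_missing_cards(cards, slots):
    print(f"missing card_table entry for {slot:016X}")
# prints: missing card_table entry for 00000015657B1228
```

In this model, a young-generation GC that trusts the cards would never visit the second slot, so the object it points to could be treated as dead while still referenced. That is the general way missing cards can lead to invalid objects like the one in frame 5, though it is not a diagnosis of this particular dump.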

@jkotas
Member Author

jkotas commented Apr 25, 2022

I believe that this is most likely a GC bug, potentially related to enabling regions.

@mangod9 @Maoni0 Could you please take a look?

@ghost

ghost commented Apr 25, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@jkotas
Member Author

jkotas commented Apr 25, 2022

The crash with the same signature is getting hit on both Windows and Linux, and both x64 and arm64. I am removing the linux-arm64 specific label.

@jkotas jkotas removed arch-arm64 os-linux Linux OS (any supported distro) labels Apr 25, 2022
@jkotas jkotas changed the title Segmentation fault in LibraryImportGenerator.Unit.Tests (linux-arm64) Segmentation fault in LibraryImportGenerator.Unit.Tests Apr 25, 2022
@mangod9
Member

mangod9 commented Apr 25, 2022

From the query @danmoseley provided above, it appears that the failure is Linux only. Are there any Windows dumps available, per Jan's comment above?

@danmoseley
Member

Remove this from the query:

| where ExitCode ==139

@mangod9

This comment was marked as off-topic.

@jkotas
Member Author

jkotas commented Apr 25, 2022

With that removed I observe that the latest windows failure looks unrelated?

Yes, this is an unrelated problem from runs triggered by an unmerged PR that has a bug on the startup path, so all tests are crashing on it.

I only looked at the crashes from main or from PRs that got merged. I am not sure whether there is a good way to filter out the PR-specific problems in the Kusto query.

@mangod9
Member

mangod9 commented Apr 25, 2022

Thanks, we will take a look. @PeterSolMS

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Apr 26, 2022
@mangod9 mangod9 added this to the 7.0.0 milestone Apr 26, 2022
@jkotas
Member Author

jkotas commented May 10, 2022

Fixed by #69106

@jkotas jkotas closed this as completed May 10, 2022
PeterSolMS added a commit to PeterSolMS/runtime-1 that referenced this issue May 18, 2022
PeterSolMS added a commit that referenced this issue May 30, 2022
#69496)

Let's figure out later whether this test should be part of the GC stress test and/or be added to a CI config.
@ghost ghost locked as resolved and limited conversation to collaborators Jun 9, 2022