Application exception exit: segfault #32848

Closed · zhxymh opened this issue Feb 26, 2020 · 25 comments
Labels: area-TypeSystem-coreclr, untriaged (new issue has not been triaged by the area owner)

zhxymh commented Feb 26, 2020

We run the application in Docker, and occasionally it crashes. I found an earlier issue, #10856, that looks similar, but that one has already been fixed.

I don't know why this is happening.

Here are the details:

Environment:

.NET Core SDK (reflecting any global.json):
 Version:   3.1.100
 Commit:    cd82f021f4

Runtime Environment:
 OS Name:     debian
 OS Version:  10
 OS Platform: Linux
 RID:         debian.10-x64
 Base Path:   /usr/share/dotnet/sdk/3.1.100/

Exception information:

 kernel: [19444845.285303] dotnet[19785]: segfault at f0 ip 00007f475694e1a0 sp 00007f474f893fd0 error 4 in libcoreclr.so[7f475691a000+2ef000]

coredump:

(lldb) bt
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV
  * frame #0: 0x00007f475694e1a0 libcoreclr.so`ClassLoader::LoadTypeHandleForTypeKey_Body(TypeKey*, TypeHandle, ClassLoadLevel) + 1280
    frame #1: 0x00007f475694a2bf libcoreclr.so`ClassLoader::LoadTypeHandleForTypeKey(TypeKey*, TypeHandle, ClassLoadLevel, InstantiationContext const*) + 255
    frame #2: 0x00007f475694a17b libcoreclr.so`ClassLoader::LoadConstructedTypeThrowing(TypeKey*, ClassLoader::LoadTypesFlag, ClassLoadLevel, InstantiationContext const*) + 507
    frame #3: 0x00007f475694bc7c libcoreclr.so`ClassLoader::LookupTypeDefOrRefInModule(Module*, unsigned int, ClassLoadLevel*) + 12
    frame #4: 0x00007f475680005d libcoreclr.so`SigPointer::GetTypeHandleThrowing(Module*, SigTypeContext const*, ClassLoader::LoadTypesFlag, ClassLoadLevel, int, Substitution const*, ZapSig::Context const*) const + 5117
    frame #5: 0x00007f4756807b7c libcoreclr.so`MetaSig::GetReturnTypeNormalized(TypeHandle*) const + 188
    frame #6: 0x00007f47567b0851 libcoreclr.so`TransitionFrame::PromoteCallerStackHelper(void (*)(Object**, ScanContext*, unsigned int), ScanContext*, MethodDesc*, MetaSig*) + 225
    frame #7: 0x00007f47567b0274 libcoreclr.so`TransitionFrame::PromoteCallerStack(void (*)(Object**, ScanContext*, unsigned int), ScanContext*) + 676
    frame #8: 0x00007f475687a09c libcoreclr.so`GcStackCrawlCallBack(CrawlFrame*, void*) + 764
    frame #9: 0x00007f475680933d libcoreclr.so`Thread::MakeStackwalkerCallback(CrawlFrame*, StackWalkAction (*)(CrawlFrame*, void*), void*) + 221
    frame #10: 0x00007f4756809571 libcoreclr.so`Thread::StackWalkFramesEx(REGDISPLAY*, StackWalkAction (*)(CrawlFrame*, void*), void*, unsigned int, Frame*) + 497
    frame #11: 0x00007f475680994c libcoreclr.so`StackWalkFunctions(Thread*, StackWalkAction (*)(CrawlFrame*, void*), void*) + 12
    frame #12: 0x00007f474f895690
    frame #13: 0x00007f4756a387cc libcoreclr.so
    frame #14: 0x00007f4756a385f5 libcoreclr.so`GCToEEInterface::GcScanRoots(void (*)(Object**, ScanContext*, unsigned int), int, int, ScanContext*) + 325
    frame #15: 0x00007f47569e096f libcoreclr.so`SVR::gc_heap::mark_phase(int, int) + 1007
    frame #16: 0x00007f47569ddd1d libcoreclr.so`SVR::gc_heap::gc1() + 525
    frame #17: 0x00007f47569d0443 libcoreclr.so`SVR::gc_heap::garbage_collect(int) + 2723
    frame #18: 0x00007f47569cf4e2 libcoreclr.so`SVR::gc_heap::gc_thread_function() + 738
    frame #19: 0x00007f47569cf236 libcoreclr.so`SVR::gc_heap::gc_thread_function() + 54
    frame #20: 0x00007f4756a3ab98 libcoreclr.so`HndCreateHandleTable(unsigned int const*, unsigned int) + 40
    frame #21: 0x00007f4756b4cb6d libcoreclr.so`CorUnix::CPalThread::WaitForStartStatus() + 45
    frame #22: 0x00007f47577fcfa3 libpthread.so.0`start_thread + 243
    frame #23: 0x00007f47574074cf libc.so.6`clone + 63

Dotnet-GitSync-Bot added the untriaged label Feb 26, 2020
jeffschwMSFT (Member) commented:

@fadimounir

fadimounir self-assigned this Feb 26, 2020
zhxymh (Author) commented Mar 4, 2020

Hi @fadimounir, do you think this is a bug or something else?

fadimounir (Contributor) commented:

It looks like a bug from the callstack. I didn't get a chance to take a look yet, but does this reproduce consistently or non-deterministically?

Also, what are the repro steps?

zhxymh (Author) commented Mar 5, 2020

It reproduces non-deterministically: after the program has been running for hours or days, the crash happens occasionally. Do you need any other information?

fadimounir (Contributor) commented:

Can you share your program so I can run it on my end and capture the failure under a debugger? Without a repro or a crash dump, I won't be able to investigate the root cause.

zhxymh (Author) commented Mar 6, 2020

It takes some configuration to run our program, and it's hard to reproduce the problem right away. Would it help if I shared the crash dump with you directly?

https://drive.google.com/open?id=1BaVppnjLNba8MP9hHJmpLZOLjr6rXK2Y

fadimounir (Contributor) commented:

Yes, thanks for sharing the crash dump. I was able to download it and get the symbols, and I'll start taking a look to understand what happened.

fadimounir (Contributor) commented:

@zhxymh Could you please also share all the dlls that were used by the app when this dump was produced?

zhxymh (Author) commented Mar 7, 2020

Sorry, I lost that version of the dlls. If you can't debug without them, I can only share them with you the next time this problem comes up.

Thank you very much for your help!

fadimounir (Contributor) commented:

If you have any other version of the dlls, it could also help with the investigation at this time, until you are able to capture another crash dump the next time the problem comes up.

zhxymh (Author) commented Mar 10, 2020

Try this version of the dlls, thanks!
https://drive.google.com/open?id=1BaVppnjLNba8MP9hHJmpLZOLjr6rXK2Y

fadimounir (Contributor) commented Mar 10, 2020

When you captured the dump, was the application compiled into ReadyToRun images? (i.e. did you set PublishReadyToRun=true in your csproj, or on the dotnet publish command line?)

The reason I ask is that the dump shows some evidence that your app assemblies were compiled to R2R, and from the stack traces and thread states I suspect the underlying root cause might be #608 or something related to it.

The app assemblies you provided all seem to be IL images. Perhaps you packaged and uploaded the build folder instead of the publish folder, which would contain the R2R images? (ReadyToRun is a publish-only scenario.)

If you are able to confirm that you used ReadyToRun images while running your scenario, and provide me with those images, I should be able to quickly check whether a certain bit was missing from them, which would confirm that the issue really is #608.
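
For reference, ReadyToRun is produced at publish time; assuming a Linux x64 target like the one in this report, the publish command would look roughly like this (project and RID are placeholders to adjust for your app):

 dotnet publish -c Release -r linux-x64 -p:PublishReadyToRun=true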

zhxymh (Author) commented Mar 11, 2020

We did not set PublishReadyToRun in the project or on the command line.
But we do load assemblies dynamically in our code; could that have anything to do with this?
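
(For context, a minimal sketch of the kind of dynamic loading meant here; the path and type name are hypothetical placeholders, not the actual application code:)

 using System;
 using System.Reflection;

 static class PluginLoader
 {
     // Load an assembly at runtime and instantiate a type from it.
     // The file path and type name below are hypothetical.
     public static object LoadEntryPoint()
     {
         Assembly asm = Assembly.LoadFrom("/app/plugins/SomePlugin.dll");
         Type type = asm.GetType("SomePlugin.EntryPoint", throwOnError: true);
         return Activator.CreateInstance(type);
     }
 }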

fadimounir (Contributor) commented:

Ok, thank you for confirming. I don't think loading assemblies in your code would cause this. The issue is that a particular type needed to be eagerly loaded by the type loader before a certain method executed, but the load was deferred to a later stage. Then two threads attempted to load that type at the same time, and one of them was a GC thread, which is not supposed to be loading any types at all. Right now I'm investigating why the type was not eagerly loaded as it should have been.

zhxymh (Author) commented Mar 12, 2020

Ok, thank you very much!

zhxymh (Author) commented Mar 16, 2020

Hi @fadimounir, the crash has come up again. We are now using .NET Core 3.1.102, and the stack trace for this crash is not quite the same as last time. I'm sharing the dump file and dlls from the crash with you.

Dump file and dlls:
https://drive.google.com/open?id=1wwaIz8GQtiGU4GOfFLqMk8N8BE9PxwMU

Exception:

kernel: [20964391.607910] dotnet[14276]: segfault at 10 ip 00007fd2ad465dda sp 00007fcfbc094cf0 error 4 in libcoreclr.so[7fd2ad362000+24e000]

Environment:

.NET Core SDK (reflecting any global.json):
 Version:   3.1.102
 Commit:    573d158fea

Runtime Environment:
 OS Name:     debian
 OS Version:  10
 OS Platform: Linux
 RID:         debian.10-x64
 Base Path:   /usr/share/dotnet/sdk/3.1.102/

coredump:

(lldb) bt
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV
  * frame #0: 0x00007fd2ad465dda libcoreclr.so`LoaderAllocator::SetHandleValue(unsigned long, Object*) + 186
    frame #1: 0x00007fd2ad465c58 libcoreclr.so`LoaderAllocator::FreeHandle(unsigned long) + 24
    frame #2: 0x00007fd2ad4b43de libcoreclr.so`ThreadLocalBlock::FreeTLM(unsigned long, int) + 190
    frame #3: 0x00007fd2ad4b449a libcoreclr.so`ThreadLocalBlock::FreeTable() + 74
    frame #4: 0x00007fd2ad4abaaf libcoreclr.so`Thread::OnThreadTerminate(int) + 143
    frame #5: 0x00007fd2ad4d1813 libcoreclr.so`ThreadpoolMgr::WorkerThreadStart(void*) + 1635
    frame #6: 0x00007fd2ad7e3b7d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
    frame #7: 0x00007fd2ae493fa3 libpthread.so.0`start_thread + 243
    frame #8: 0x00007fd2ae09e4cf libc.so.6`clone + 63

fadimounir (Contributor) commented:

@janvorli This call stack from the second crash looks like something you've been looking at recently. There is a dump for it here if you need one.

janvorli (Member) commented:

@fadimounir The call stack is different from what I was seeing when looking into issue #32171, but that doesn't mean it cannot be caused by the same issue; failures caused by it could hit at various places. I'll look at the dump to see if there is any sign of correlation with that fixed issue.

janvorli (Member) commented:

@zhxymh From the dump I am unable to confirm whether this is due to the same issue as #32171 (and so has already been fixed) or not; the dump is missing some information. Could you please try to hit the issue again, but run the following from the same shell before starting your app?

echo 0x3f > /proc/self/coredump_filter

This makes the core dump "beefier", containing more data.

I could also share a modified System.Private.CoreLib.dll with you that has a little hack fixing the issue from #32171 so that we can confirm whether the issue is caused by the same thing or not.
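
For example, the sequence would look something like this in the shell (or container entrypoint) that launches the app; the filter setting is inherited by child processes, and the application name here is only a placeholder:

 echo 0x3f > /proc/self/coredump_filter
 dotnet /app/YourApp.dll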

fadimounir (Contributor) commented:

@zhxymh I have identified the root cause behind the first crash dump you provided, and have a fix for it in progress. The fix will be in .NET 5. I don't know yet whether it will be ported to a 3.x servicing update.

zhxymh (Author) commented Mar 19, 2020

@zhxymh From the dump I am unable to confirm whether this is due to the same issue as #32171 (and so has already been fixed) or not; the dump is missing some information. Could you please try to hit the issue again, but run the following from the same shell before starting your app?

echo 0x3f > /proc/self/coredump_filter

This makes the core dump "beefier", containing more data.

I could also share a modified System.Private.CoreLib.dll with you that has a little hack fixing the issue from #32171 so that we can confirm whether the issue is caused by the same thing or not.

Thx, we will set it and test again.

zhxymh (Author) commented Mar 19, 2020

@zhxymh I have identified the root cause behind the first crash dump you provided, and have a fix for it in progress. The fix will be in .NET 5. I don't know yet whether it will be ported to a 3.x servicing update.

Thanks! And in case we can't get the fixed version soon: when does this happen, and can we optimize our own code to avoid it?

fadimounir (Contributor) commented:

When does this happen?

This happens when a GC is triggered at the same time a certain virtual generic method is being invoked (in your scenario: Grpc.Core.DefaultCallInvoker<BlockWithTransactions, PeerDialException>). The issue is that, for some reason, the return type of this API is not yet loaded in the runtime, and we attempt to load it during the GC, which the runtime should not allow. That's the bug.

Can we optimize our own code to avoid this?

That might be tricky, given that the crash is most likely triggered by library code your app uses.
Here's what I would recommend:

  1. Use regular (workstation) GC instead of server GC (may or may not help...)
  2. Add some dummy code to your startup and make sure it doesn't get optimized away, for example (see the sketch after this list):
     var dummy = typeof(Grpc.Core.DefaultCallInvoker.CallInvocationDetails<BlockWithTransactions, PeerDialException>);

The second suggestion will force this type to be loaded eagerly, but unfortunately things might still fail on another type if a race between execution and the GC happens at an equally unlucky time.
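
A minimal sketch of suggestion 2, assuming the type name quoted above resolves in the application (the generic type arguments are copied from the analysis in this thread and are placeholders for whatever closed generic types your app actually exercises):

 using System;

 internal static class StartupTypePreload
 {
     // Referencing the closed generic type via typeof forces the type loader
     // to load it eagerly at startup, rather than lazily during a call that
     // could race with a GC. The type arguments are placeholders from this thread.
     internal static void Run()
     {
         var preloaded = typeof(Grpc.Core.DefaultCallInvoker.CallInvocationDetails<BlockWithTransactions, PeerDialException>);
         GC.KeepAlive(preloaded); // keep the reference so the statement is not optimized away
     }
 }

Calling StartupTypePreload.Run() early in Main would mirror the workaround described above. Suggestion 1 corresponds to disabling server GC, e.g. "System.GC.Server": false in runtimeconfig.json or <ServerGarbageCollection>false</ServerGarbageCollection> in the project file.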

zhxymh (Author) commented Mar 20, 2020

Thank you very much for your help!

fadimounir (Contributor) commented:

Fixed via #33733

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020