# Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr ************ in tid **** (SGen worker) #105389

Open
dcrespo-isaac opened this issue Jul 23, 2024 · 25 comments
@dcrespo-isaac

dcrespo-isaac commented Jul 23, 2024

Android framework version

net8.0-android

Affected platform version

.NET 8.0.100, .NET 8.0.303

Description

Our Android application started to crash randomly 3 months ago when we upgraded it to .NET 6.
We tried adding more logging, analyzing the logcat, and reproducing the crash, but all in vain.

In an attempt to stabilize the app, we upgraded to net8.0-android and the number of crashes was significantly reduced.
This problem happens on different kinds of tablets, and we ruled out firmware or hardware as the cause
because they did not change, and with .NET Framework (Xamarin) everything was running smoothly (a rare crash vs. hundreds per day now).

Our dev and testing (QA) teams are not able to reproduce the problem. Looking at the logs (SIGSEGV_Samples_logcat-with-symbols.txt), we see that all of the SIGSEGV (SEGV_MAPERR) crashes happen on the SGen worker thread but give different backtraces.
At the end of the log file, you’ll find a SIGABRT from SGen. We’ve kept it there for reference, in case it’s related to the SIGSEGV issues.

We used dotnet-symbol to retrieve a debug version of libmonosgen-2-0.so (with symbols). Then we ran ndk-stack on our
logcat to symbolize the backtraces.

The application catches nothing, so we don't have any other traces to give from it. The logcat contains the most detailed
ones we have.

FYI, we split our crashes into 3 categories. We're not sure that the other 2 are related to the SIGSEGV,
but to give you all the data, they are:

  • Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid ***** (|ANR-WatchDog|), pid ***** (com.isaac.onthego)
  • Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid ***** (enableInternal), pid ***** (com.android.nfc)
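
A minimal sketch of how the crash categories above could be tallied from a logcat dump, assuming the crash lines follow the "Fatal signal ..." shape quoted in this issue. The function name `tally_fatal_signals` and the regular expression are illustrative assumptions, not part of any tool mentioned in the thread.

```python
import re
from collections import Counter

# Matches the "Fatal signal" lines quoted in this issue. The "fault addr"
# part is optional because the SIGABRT lines do not carry one.
FATAL_RE = re.compile(
    r"Fatal signal (?P<num>\d+) \((?P<signal>\w+)\), "
    r"code (?P<code>-?\d+) \((?P<code_name>\w+)\)"
    r"(?:, fault addr \S+)? in tid \S+ \((?P<thread>[^)]*)\)"
)


def tally_fatal_signals(lines):
    """Count crashes per (signal name, thread name) pair."""
    counts = Counter()
    for text in lines:
        match = FATAL_RE.search(text)
        if match:
            counts[(match.group("signal"), match.group("thread"))] += 1
    return counts


# Sample lines copied from this issue (addresses and ids are masked there too).
SAMPLE = [
    "Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr ************ in tid **** (SGen worker)",
    "Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid ***** (|ANR-WatchDog|), pid ***** (com.isaac.onthego)",
    "Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid ***** (enableInternal), pid ***** (com.android.nfc)",
    "some unrelated logcat line",
]

if __name__ == "__main__":
    for (signal, thread), count in tally_fatal_signals(SAMPLE).items():
        print(signal, thread, count)
```

Grouping by thread name as well as signal keeps the SGen worker crashes separate from the ANR-WatchDog and NFC aborts, which is the split described above.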

What we tried :

  • Reproducing by running our acceptance tests on the tablet
    -> Nothing happens unless we run the day long tests and it's very rare. It does not give more information than what we have in production.

  • Upgrading .net version (from 8.0.100 to 8.0.303)
    -> Nothing changed.

  • Reverting to .NET Framework (Xamarin)
    -> We did it in order to identify whether the problem is related to our changes or to the framework. And it worked!
    The number of crashes was back to normal (as before .NET 6 and .NET 8).

  • *** EDIT *** We also tried to disable trimming by setting <PublishTrimmed>false</PublishTrimmed> in all of our .csproj files.
    -> Our APK was a little bit heavier (of course) and there was no change in the number of crashes.

What we are still probing :

  • Disabling concurrency in the Mono garbage collector and changing the bridge-implementation to old
    -> We’ve seen numerous issues and discussions about the stability of SGen. We are hoping that this change can make a difference, as the backtraces seem to point to a problem that could be within it. If this works, it could allow us to stay with .net8 until a fix is found.

Thank you in advance for your help. We will remain vigilant if you need more details.

Steps to Reproduce

Unfortunately we are unable to reproduce the problem step by step, as it appears randomly in production.

Did you find any workaround?

Currently, reverting to the .Net Framework appears to be effective for us.

Relevant log output

See attached files of the description
@grendello
Contributor

grendello commented Jul 24, 2024

@dcrespo-isaac thanks for the logs! They universally show that the issue is somewhere in the MonoVM's GC (SGEN), so I think the issue belongs in dotnet/runtime. Would you happen to still have the logs that have SIGABRT in them? The reason I ask is that they would contain more information before the native stack trace is dumped. The information will contain the actual assertion message logged by MonoVM - and this message is crucial to understanding what's going on. If you could share these logs, or scour them for any log lines containing any assertions coming from MonoVM and paste them here, it would be very helpful. Thanks!

@akoeplinger akoeplinger transferred this issue from dotnet/android Jul 24, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jul 24, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jul 24, 2024
@akoeplinger akoeplinger added area-GC-mono and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jul 24, 2024
@grendello grendello removed their assignment Jul 24, 2024
Contributor

Tagging subscribers to this area: @BrzVlad
See info in area-owners.md if you want to be subscribed.

@akoeplinger
Member

@BrzVlad looks like a GC issue

@grendello
Contributor

@dcrespo-isaac Scan the logs for messages containing strings similar to * Assertion at which are tagged either with an empty string, DOTNET or a fragment of your application's ID. The message will also contain a file path which should have mono component in it.
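
The scan described above could be sketched as follows. This is a hedged example, not an official tool: the function name `find_mono_assertions` and the regular expression are assumptions, based only on the assertion lines quoted in this thread.

```python
import re

# Based on the assertion lines quoted in this thread. The condition is opened
# with a backtick and closed with either a backtick or a single quote.
ASSERTION_RE = re.compile(
    r"Assertion at (?P<path>[^\s:]+):(?P<line>\d+), "
    r"condition `(?P<cond>.+?)['`] not met"
)


def find_mono_assertions(lines):
    """Return (path, line number, condition) for assertions in mono sources."""
    hits = []
    for text in lines:
        match = ASSERTION_RE.search(text)
        # Keep only assertions whose file path has a "mono" component.
        if match and "mono" in match.group("path").split("/"):
            hits.append(
                (match.group("path"), int(match.group("line")), match.group("cond"))
            )
    return hits


# Sample lines taken from later comments in this thread, plus one noise line.
SAMPLE = [
    "[monodroid] * Assertion at /__w/1/s/src/mono/mono/metadata/sgen-tarjan-bridge.c:1176, "
    "condition `xref_count == xref_index` not met, function: processing_build_callback_data",
    "Assertion at /__w/1/s/src/mono/mono/metadata/object.c:4410, "
    "condition `is_ok (error)' not met, function:mono_unhandled_exception_internal",
    "[libc] Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 20651 (My.App), pid 20651 (My.App)",
]

if __name__ == "__main__":
    for path, line, cond in find_mono_assertions(SAMPLE):
        print(f"{path}:{line} -> {cond}")
```

The log tag is deliberately not matched, since the comment above says it can be empty, DOTNET, or a fragment of the application ID.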

@dcrespo-isaac
Author

Thank you for the triage.

I've found the logcat I had analyzed for the SIGSEGV errors. In the logcat-Assertion.txt, you'll find two samples of SIGABRT preceded by :

Assertion at /__w/1/s/src/mono/mono/metadata/object.c:4410, condition `is_ok (error)' not met, function:mono_unhandled_exception_internal, (null) assembly:System.Private.CoreLib.dll type:ArgumentNullException member:(null)

Additional notes:

  • The SIGABRT assertion error is always the same, but it sometimes occurs on the ANR_WatchDog thread and other times on what appears to be the main thread com.isaac.onthego.
  • The two samples show errors just after the start of the application. It’s true that some errors occur then, but others occur at different moments – sadly at random times – throughout the lifetime of the application.
  • The number of SIGSEGV crashes is substantially higher than that of SIGABRT.
    -> We attempted to identify a pattern between SIGSEGV and SIGABRT, but found nothing significant. This means that they don’t precede or follow each other in any particular way.

I will continue to update the issue if I encounter any other assertion errors.

@lambdageek
Member

@dcrespo-isaac I don't have any insight for the SIGSEGV, but for SIGABRT - does your app have an AppDomain.CurrentDomain.UnhandledException event handler installed? The assertion on object.c:4410 looks like it's due to an unhandled null reference exception during the execution of the unhandled exception handler.

@dcrespo-isaac
Author

Hi @lambdageek,

First and foremost, thank you for your response. It is greatly appreciated. We have identified the origin of the SIGABRT:

  • There was indeed a null reference that was triggered due to a race condition at startup.
  • We discovered a setter in ANR_WatchDog that allows us to handle an ANR in our own way, rather than allowing the tool to crash our application.

Regarding this open issue, we continue to experience problems with SIGSEGV, which is the primary subject of this issue and, regrettably, the cause of the majority of our crashes.

We are currently deploying our reverted version to Xamarin, as it has proven to be significantly more stable. We will remain vigilant for any responses, fixes, or requests for additional information regarding this issue.

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Aug 1, 2024
@mangod9 mangod9 added this to the Future milestone Aug 1, 2024
@dcrespo-isaac
Author

Hi Team, @mangod9 ,

We noticed a few weeks ago that the issue was moved to the “Future” milestones.

While “No due date” is quite clear, it leaves us with a significant span of uncertainty, especially since Xamarin is no longer supported. We would like to manage our expectations regarding this issue. Will it be fixed or at least investigated by your team of experts?

Currently, we have had to completely roll back all our projects to the old .NET Framework/Xamarin to reduce crashes due to SIGSEGV.

Please let us know if there are any troubleshooting techniques or tools we can use to provide you with more information, or ideally, find a fix ourselves.

We look forward to hearing more on this thread,
Best regards to your team.

@mangod9
Member

mangod9 commented Oct 16, 2024

Hello @dcrespo-isaac, you mention that the issue is reduced in 8, but still occurs occasionally?

@dcrespo-isaac
Author

dcrespo-isaac commented Oct 16, 2024

Hi @mangod9,

It appears there may have been a misunderstanding regarding the issue.
The concern is about an increase in the number of SIGSEGV occurrences on .NET 8 (specifically on the SGen Worker), not a decrease.

Our workaround was to revert to Xamarin because it:

  • Fixed the SIGSEGV errors on the SGen Worker, which is the primary focus of this thread.
  • Generally reduced the number of crashes and SIGSEGVs (null dereference). These errors are challenging to troubleshoot, but we believe they are not an issue with .NET. Consequently, we did not open an issue for them. Still we'd appreciate any insights you could give us.

@mangod9
Member

mangod9 commented Oct 16, 2024

Hi @vitek-karas, since this is on Android can you please provide guidance on diagnostic steps. Is there a way to collect some dumps when the failure occurs?

@BrzVlad
Member

BrzVlad commented Oct 22, 2024

This issue looks like a duplicate of #100311. Did disabling concurrent GC fix the crashes? This issue seems nearly impossible to investigate without a repro. However, I think we can attempt to investigate whether something is wrong with the GC by trying some debug flags like MONO_GC_DEBUG=check-remset-consistency,mod-union-consistency-check,collect-before-allocs,verify-before-allocs. Some of these options might be bitrotten, so it would be ideal if we could run the application ourselves.

@dcrespo-isaac
Author

dcrespo-isaac commented Oct 22, 2024

Hi @BrzVlad,

I reviewed issue #100311, which pertains to SIGABRT. We no longer face issues with SIGABRT; this thread is actually about SIGSEGV.
EDIT: SIGSEGV issues did not give us assertion errors (in logcat) while SIGABRT would usually do.

However, @pijnappel has created an issue that seems similar to ours: dotnet/maui#21479.
Additionally, #109459 appears to be related as well.

We plan to beta test the version without concurrent GC in a few days and will keep everyone updated. Thank you for the MONO_GC_DEBUG options; I was unaware of some of them, and they could be helpful.

What exactly would you need to run everything on your end?

@BrzVlad
Member

BrzVlad commented Oct 22, 2024

When there are GC problems, the application can crash in a variety of ways, they are non-deterministic by nature. I am almost certain that the underlying issue is the same and based on comments on other issues it seems that the culprit is somewhere in the concurrent collector.

Note that I'm getting crashes with some of these debug options locally on some of our tests suites that are somewhat heavier on the GC. I will have to do a little bit of investigation to see whether they are false positives or not and see if these constraints from the debug options are respected in our own tests suites, before investigating anything on a mobile app.

Assuming we can't reproduce this issue, the second approach I would take is to understand more or less what the application was doing when it crashed and then attempt to do more or less the same thing locally, with some different gc params / debug options, in the hope of catching some inconsistencies in the GC. I think an apk with the application with some instructions on what to do in the app could be useful. @grendello I am assuming one could easily unzip an apk, patch libmonosgen-2.0.so and then zip it back to have the application run with custom runtime, right ?

@grendello
Contributor

@BrzVlad Replacing a shared library in the apk requires that you re-align and re-sign the archive, or it won't install and run on the device/emulator. Signing process records sha256 hashes of the shared libraries. When you need to do that, ping us on Discord in the #android channel, we'll help you out :)
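
To illustrate why a patched library invalidates the recorded hashes, here is a small sketch that lists the native libraries inside an APK (a zip archive) with their SHA-256 digests. The helper names and the in-memory stand-in APK are assumptions made for the example; the real re-align and re-sign steps mentioned above are not shown.

```python
import hashlib
import io
import zipfile


def native_lib_digests(apk_bytes):
    """Map each lib/**/*.so entry of an APK to its SHA-256 hex digest."""
    digests = {}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for name in apk.namelist():
            if name.startswith("lib/") and name.endswith(".so"):
                digests[name] = hashlib.sha256(apk.read(name)).hexdigest()
    return digests


def make_fake_apk(lib_payload):
    """Build a tiny in-memory zip that stands in for a real APK (hypothetical)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("lib/arm64-v8a/libmonosgen-2.0.so", lib_payload)
        apk.writestr("classes.dex", b"not a real dex file")
    return buffer.getvalue()


if __name__ == "__main__":
    before = native_lib_digests(make_fake_apk(b"original runtime"))
    after = native_lib_digests(make_fake_apk(b"patched runtime"))
    for name in before:
        print(name, before[name] != after[name])
```

Comparing the digests before and after a swap shows that the patched library no longer matches what the original signature recorded, which is why the archive has to be signed again.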

@BrzVlad
Member

BrzVlad commented Oct 23, 2024

@dcrespo-isaac In addition to testing with concurrent gc disabled, I think it would be worthwhile to keep concurrent gc enabled but use the flags MONO_GC_PARAMS=evacuation-threshold=0,no-precleaning which disables some functionality that seems involved in the crash reports.

@dcrespo-isaac
Author

@BrzVlad,

Thank you. We will develop a plan to test the different options and will get back to you with the data. At the very least, we will have two versions:

  • Concurrent GC disabled, debug improved with check-remset-consistency,mod-union-consistency-check,collect-before-allocs,verify-before-allocs
  • Concurrent GC enabled, debug improved with check-remset-consistency,mod-union-consistency-check,collect-before-allocs,verify-before-allocs, options evacuation-threshold=0,no-precleaning

Additionally, I am compiling information to create a brief documentation on how you can test our .apk on your end.

@BrzVlad
Member

BrzVlad commented Oct 24, 2024

Don't use mod-union-consistency-check; there seem to be some issues with it. Also note that collect-before-allocs,verify-before-allocs are absurdly heavy. Maybe use them for just a few manual runs to see whether they report anything; I don't think stress testing with them would be worthwhile. I would say it is best to first run without any debug flags, just to see whether the GC options fix the crash, and only afterwards run with the debug flags on the crashing configurations.

So some explicit configurations could be

  • concurrent gc disabled
  • concurrent gc enabled with evacuation-threshold=0,no-precleaning
  • concurrent gc enabled with check-remset-consistency
  • concurrent gc enabled with check-remset-consistency,collect-before-allocs=100

@dcrespo-isaac
Author

Hi, Quick update,

We're sorry for the delay, we should be able to test and give you more data by next week.

@dcrespo-isaac
Author

dcrespo-isaac commented Dec 9, 2024

Hi, @BrzVlad,

Sorry this took so long; we had a lot of work to do to release our new beta version.

Quick reminder:

  • Releasing a version is our only way to reproduce crashes.
  • Automation tests and tests by developers don't trigger issues.

About the .apk:

  • We considered sending you an apk, but you wouldn't be able to run it. Our application requires a specific setup to run (our manufactured Android tablet, pairing to our truck gateway for telemetry, etc.).

What we have found:

  • Upgraded to 8.0.402
    Does not solve the issue.

  • concurrent gc disabled
    It works great ! The number of crashes is equal to that of Xamarin versions. This could be a great workaround to stay on .NET 8 instead of rolling back to Xamarin. We have not experienced any trouble with garbage collection times so far, but it will be a concern if we release the version outside the beta.

  • concurrent gc enabled with evacuation-threshold=0,no-precleaning
    It works better than no options. The number of crashes was reduced, but not as much as disabling concurrency or using the Xamarin version.

  • concurrent gc enabled with check-remset-consistency
    Untested

  • concurrent gc enabled with check-remset-consistency,collect-before-allocs=100
    Untested

Do you suggest we investigate the last two options, or should we stick with disabling concurrency for now?
Have you found anything on your side?

@BrzVlad
Member

BrzVlad commented Dec 10, 2024

My current understanding is that concurrent GC was enabled by default with the new .NET versions (5+). Concurrent GC was supported in legacy Xamarin but you needed to explicitly add it in the csproj (or click the option in the project editor). This would mean that using this configuration for .NET8 is not really a change in behavior and you shouldn't encounter any problems with pause times.

The check-remset-consistency option had some issues which I addressed recently. The new dotnet version with the fix should be available in 1-2 months. I would recommend its usage only once the new version is available, but for the sole purpose of helping us investigate the issue with concurrent GC that nobody is able to reliably reproduce so far. This debug option can end up more aggressively killing the runtime if a problem is found in the GC, so I understand if you wouldn't want to deploy an app with this enabled. However, giving it a try for localized testing would be appreciated.

@IainS1986

@BrzVlad How are you building with the concurrent GC disabled?

Are you just doing it in the csproj configuration with

<AndroidEnableSGenConcurrent>false</AndroidEnableSGenConcurrent>

Or is there other build params/steps?

@Th3L0x

Th3L0x commented Jan 17, 2025

I'm encountering an issue that seems specific to Xiaomi devices. The error details are as follows:

[My.App] * Assertion at /__w/1/s/src/mono/mono/metadata/sgen-tarjan-bridge.c:1176, condition `xref_count == xref_index` not met, function: processing_build_callback_data, xref_count is 642 but we added 638 xrefs  
[monodroid] * Assertion at /__w/1/s/src/mono/mono/metadata/sgen-tarjan-bridge.c:1176, condition `xref_count == xref_index` not met, function: processing_build_callback_data, xref_count is 642 but we added 638 xrefs  
[monodroid] Abort at mono-log-adapter.cc:46:3 ('static void xamarin::android::internal::MonodroidRuntime::mono_log_handler(const char *, const char *, const char *, mono_bool, void *)')  
[libc] Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 20651 (My.App), pid 20651 (My.App)  

This error does not occur on other Android devices or on iOS, only on Xiaomi devices.

This config: <AndroidEnableSGenConcurrent>false</AndroidEnableSGenConcurrent> didn't solve it.

@BrzVlad
Member

BrzVlad commented Jan 20, 2025

@IainS1986 That is correct.

@Th3L0x That is a completely unrelated issue. See #106410 for workaround until we are able to investigate it.

@Syed-RI

Syed-RI commented Jan 30, 2025

Would this comment be worth investigating? #106410 (comment)
