Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture AccessViolationException #12077

Closed
BaronKiko opened this issue Feb 20, 2019 · 37 comments
Closed

Capture AccessViolationException #12077

BaronKiko opened this issue Feb 20, 2019 · 37 comments
Labels
area-Diagnostics-coreclr question Answer questions and provide assistance, not an issue with source code or documentation.
Milestone

Comments

@BaronKiko
Copy link

Hi
I would like to be able to capture an AccessViolationException to provide useful information about why the application crashed. I don't intend to continue running the application, I just want some of the information that would be available when run inside visual studio.
Currently the program silently closes without reporting anything when run outside visual studio. At the very least I would like a dialog box saying the application crashed due to an AccessViolationException.
Taking this example program: https://gist.github.com/gdkchan/bd18dea1679283ff6e2c152b5f836927
In .Net framework I can simply use a try/catch around line 115. (In 4.0 onward there is some tags needed)
I would like to be able to do that in Core however the tags aren't honored so it's not possible
Thank you for your time.

@filipnavara
Copy link
Member

I'd say it's generally a bad idea to execute any code when this happens. Windows Error Reporting (aka Watson) can provide memory dumps in these occasions. This can be used both for local diagnosis and for tracking the errors from customers through the Partner Portal (as long as the application is signed with valid code signing certificate).

@BaronKiko
Copy link
Author

I don't want to use Window error reporting because:

  • To enable it each user need to mess with the registry which most people just aren't comfortable doing.
  • We would have to sign our application which we don't currently do and see no reason to do so.
  • It only works on windows, defeating the whole point of using core (saying that for our case if windows allowed 4kb pages in the first place non of this would be needed)
  • It would require the user to actively work with us to debug the issue when most users, if you are lucky. report it and never follow up.
  • Many of the issues we will likely face will be hard to replicate so after the first crash it may never come up again, especially if the error only comes up after extended use.

In short it's a terrible solution for us.

I'm well aware that it's not smart to continue running but ultimately I think that decision should be left to the developer, not the .net runtime. Just like it is with framework. And by default there needs to be something to tell the user a crash has occurred, not just closing the application.

As a bare minimum the documentation needs to make it clear that this feature doesn't work in .net core, really the tag should just be removed if it's not going to be supported (it should be supported)
https://docs.microsoft.com/en-us/dotnet/api/system.runtime.exceptionservices.handleprocesscorruptedstateexceptionsattribute?redirectedfrom=MSDN&view=netcore-2.2

@RussKeldorph
Copy link
Contributor

@dotnet/dotnet-diag for suggestions

@filipnavara
Copy link
Member

filipnavara commented Feb 20, 2019

saying that for our case if windows allowed 4kb pages in the first place non of this would be needed

I kinda started writing a long follow-up before I noticed this sentence... Can you elaborate on why 64Kb pages are a problem for you? How would 4Kb pages help you? Are you trying to create some guard page scheme?

I will post my original follow-up anyway in case someone else stumbles upon it:

We are shipping a large .NET application and we spend quite some time with trying many error reporting tools (in-app, systems for collecting, external tools, etc.) and 10+ years on debugging various errors with a desktop application deployed to more than million of users. Maybe my perspective will help you understand why I suggest to rely on WER (or any other operating system mechanism) for these kind of errors.

I understand that it's nearly inevitable that AccessViolationException will happen at some point if you deploy application at scale to unknown environments. However, in .NET programs the major cause of these exceptions are unmanaged libraries and improper interfaces to unmanaged libraries. It's very rare that the access violation is caused by a managed unsafe code because you have to go out of your way to allow it. For properly written code it should never happen (this is not kernel code where you may do it intentionally). The unmanaged libraries could be something you distribute with your application (such as SQLite) or a 3rd-party library loaded through some extension method (injected hooks through registry, codec loaded by sound player, shell extension loaded into process through open file dialog, network library loaded through the WinSock layer, etc.).

For normal managed exceptions the exception itself usually contains enough information to reasonably identify the cause or at least the location of the error. That's not the case for AccessViolationException. Only very rarely the actual memory corruption happens at the time it is discovered. And even if it was the case then you will likely be interested in the native stack trace, not the managed one. No matter how hard you try to recover from it or capture enough information [in process] it's likely not going to be enough to solve it.

We eventually ended up with multiple ways to debug these errors:

  1. Unhandled exceptions that can be caught are processed in managed code. We use custom handler and serialize the exception to file and then run another process to submit it. Open source tools like HockeyApp or Sentry can do this for you and deliver reasonably good job at it.
  2. We check for existence of our "ClrDump.exe" tool in the exception handler filter (exposed as when keyword in C#) and run it to produce memory dumps. The actual memory dump is created out-of-process, but it's triggered from in-process code which is still unusable for access violations.
  3. If the error is in any way reproducible we use the ProcDump tool to launch our application. It can be configured to produce memory dumps on first chance exceptions or many other conditions and it can be deployed to customers easily.
  4. If everything else fails we resort to WER, either through registry or through the developer portal

Given this introduction to where I come from let me try to answer your questions:

We would have to sign our application which we don't currently do and see no reason to do so.

Two reasons: 1) Windows SmartScreen is less annoying for signed applications 2) You get access to the error reports with the built-in system mechanism, no special effort necessary on your part; the certification you sign with serves as a proof of ownership that gives you access to the the collected error reports

It only works on windows, defeating the whole point of using core

WER works on Windows. Other systems have similar mechanisms. On Linux you often get core dumps automatically, on macOS they can be enabled for local debugging or there's WER-like system with online submissions.

It would require the user to actively work with us to debug the issue when most users, if you are lucky. report it and never follow up.

Nope. The nice thing about WER is that is in every machine and it's just one click of a button (often not even that) for the user to submit the memory dump which is created automatically. For errors like access violation that's the best information you can get, much better than anything you can read from the managed exception object.

@filipnavara
Copy link
Member

As a bare minimum the documentation needs to make it clear that this feature doesn't work in .net core, really the tag should just be removed if it's not going to be supported (it should be supported)

Agreed. If the documentation is wrong it should be fixed. Note that the documentation is hosted on GitHub, so issues/PRs can be submitted for it.

@BaronKiko
Copy link
Author

To start the reason all this is needed is because we are writing a switch emulator called Ryujinx (https://github.com/Ryujinx/Ryujinx)
So your point here kinda sums of the whole situation:

For properly written code it should never happen (this is not kernel code where you may do it intentionally)

This basically is kernel code. At least it's emulating one. We want to be able to allocate pages but if the game accesses an unmapped page we want to capture it. As we are emulating switch games it's very rare the game is actually broken and probably the emulator.
Doing the check in code is reasonably simple however every access has to have a page table translation so we need that to be as fast as possible. It's only a few instructions. Adding in validation vastly increases the amount of code so we just can't really do that.
What we want to do is act like a real kernel and intentionally let it throw this exception. This gives us checking without any extra asm.

This also leads on to this question:

Can you elaborate on why 64Kb pages are a problem for you?

If we could use 4kb pages we could directly map the switch (ARM) address space to windows. Meaning we can cut out all page table translation as the hardware will do it for up. On linux this is available and it works giving us the fastest solution possible. On windows you can't do that unless you are a driver, for unknown reason as it internally uses 4kb pages -_-
Sadly even if this were implemented into windows today it wouldn't be useful to us because older versions of windows wont allow it and we want to support them.

Regarding WER
It's just not a very good fit here. Signing an unlicenced open source application just isn't going to go down well. Who would manage the keys, have the right to make a new build etc. Builds are currently just dumped out by appveyor.
There is also the issue that WER just wont ever provide enough info for us like the game the user is running, what they did to create the error etc. While it would be nice to have and may well be added at some point (probably after long talks about how to manage the whole thing) we want something we control so we can work out whatever we need to and builds that aren't signed still provide useful information.

My point
All this shouldn't really matter anyway. It's not a decision .net core should be making for you. I can understand hiding it behind a flag to avoid accidental use but it should be left to the developer to make the final decision. Just like in .net framework. There was nothing broken about it before so why isn't it good enough for .net core.

That was a long post so if I missed out something from your question I apologize, just let me know.

@BaronKiko
Copy link
Author

Thinking about it. I should really ask.
Is it going to take significant dev time to add the flag into core? If not then I see no reason not to add it. It's totally optional, disabled by default, increases compatibility with porting from framework and lets you do things you currently just can't do in core but can in other languages and variants of .net.

@cshung
Copy link
Member

cshung commented Feb 21, 2019

If you only care about running on Windows - you might be able to "handle" your EXCEPTION_ACCESS_VIOLATION by registering your own VectoredExceptionHandler.

You need to make sure you:

  1. Call the AddVectoredExceptionHandler() after CLRAddVectoredExceptionHandler() is called and make sure it is registered as the first handler - now your vectored exception handler get to deal with the EXCEPTION_ACCESS_VIOLATION earlier then CoreCLR can.

  2. Identify if the EXCEPTION_ACCESS_VIOLATION is yours - this can be hard - all you have is a CONTEXT.

  3. If the EXCEPTION_ACCESS_VIOLATION is yours, do whatever (in native - don't try to call back to managed, that would be a disaster) you want to handle the EXCEPTION_ACCESS_VIOLATION, and you can even return EXCEPTION_CONTINUE_EXECUTION if continue the execution is what you want, or

  4. If it is not your EXCEPTION_ACCESS_VIOLATION (or in fact, anything else), simply return EXCEPTION_CONTINUE_SEARCH.

Fingers crossed, it might just work.

Adding a vectored exception handler after CoreCLR did and so it acts as the first one is a total hack. It is neither recommended nor supported, but it might solve your problem. Would love to hear from you if it works.

@BaronKiko
Copy link
Author

That does sound like a total hack. I doubt it will be accepted on released code but I will mess around with it tomorrow anyway because it just seems interesting. I would still like a proper solution.
Thank you.

@noahfalk
Copy link
Member

Taking this example program: https://gist.github.com/gdkchan/bd18dea1679283ff6e2c152b5f836927
In .Net framework I can simply use a try/catch around line 115. (In 4.0 onward there is some tags needed)

@janvorli am I remembering right that you did a bunch of the earlier exception handling work for .Net Core? Do you recall if there was a conscious decision one way or another to support corrupting state exceptions? Looking in the code I see conflicting evidence, for example FEATURE_CORRUPTING_EXCEPTIONS appears enabled, but the test tree has no tests for it and in practice it doesn't appear to work.

@janvorli
Copy link
Member

I've verified that if you set env var COMPlus_legacyCorruptedStateExceptionsPolicy=1, you can catch the exception in .NET Core.

The HandleProcessCorruptedStateExceptionsAttribute should work too, based on what I can see in the runtime here:
https://github.com/dotnet/coreclr/blob/master/src/vm/excep.cpp#L11205-L11257
But for some reason, it doesn't work.

@BaronKiko
Copy link
Author

Could it be related to this: dotnet/coreclr#10957 ?

@noahfalk
Copy link
Member

noahfalk commented Feb 21, 2019

@janvorli To the best of your understanding the fact that it doesn't work is simply a bug we should fix (and hopefully add tests for) rather than a feature we deliberately wanted to disable for some reason?

@noahfalk
Copy link
Member

dotnet/coreclr#10957 certainly looks related. @jkotas perhaps you could add a little background since you filed #7272?
Assuming we are sticking with the decision in #7272, then we have an open issue that the .Net Core documentation is now in a misleading state that doesn't give any indication that the functionality is deprecated.

@noahfalk
Copy link
Member

Reading through #7272 more thoroughly, a few bits of interest:

https://github.com/dotnet/coreclr/issues/9045#issuecomment-292452208

I think it would be reasonable to keep the legacyCorruptedStateExceptionsPolicy config switch that disables the fail fast on corrupting exceptions and let them propagate as regular exceptions.

https://github.com/dotnet/coreclr/issues/9045#issuecomment-290159433

Handling of corrupted state exceptions is set of low-level desktop features that tried to make it possible to write managed code for what would be best written in plain C. We have abandoned this direction and not included features from this set in .NET Core whenever we had opportunity to do so.

https://github.com/dotnet/coreclr/issues/9045#issuecomment-289949603

In particular, any processing of dirty AVs is security liability. It is best to fail fast for them right away.

@janvorli
Copy link
Member

Ok, it seems all the details /reasons are now clear. The legacyCorruptedStateExceptionsPolicy config switch is what @BaronKiko can use.

@BaronKiko
Copy link
Author

So I should be using https://docs.microsoft.com/en-us/dotnet/framework/configure-apps/file-schema/runtime/legacycorruptedstateexceptionspolicy-element
Which is unavailable because it requires editing the App.config which is a framework thing not core?

@jkotas
Copy link
Member

jkotas commented Feb 21, 2019

Currently the program silently closes without reporting anything when run outside visual studio.

This sounds like a bug. The runtime should not exit silently on fatal crash.

Yes, we should fix the docs.

I would not recommend using legacyCorruptedStateExceptionsPolicy as a permanent solution, it is ok to use it as a temporary workaround to unblock migration from .NET Framework to .NET Core.

For cases where folks want to get details about fatal crashes (ie implement their own WER or fatal crash detail uploader), I think we should be looking at introducing unmanaged C API to provide these details.

@janvorli
Copy link
Member

I have taken a look why it exits silently. The issue is in this function:
https://github.com/dotnet/coreclr/blob/master/src/vm/eepolicy.cpp#L931-L1111
The only case where we log anything to the console is for FailFast exception (exitCode==COR_E_FAILFAST). In this case, the exitCode is 0xC0000005 (access violation) and so we skip the logging.

@janvorli
Copy link
Member

It seems it would be simple to change that to log the call stack and exception details for all exitCode values.

@mikem8361
Copy link
Member

Sounds to me like something that should be done. Jan do you have time?

@janvorli
Copy link
Member

Yes, I can do it.

@BaronKiko
Copy link
Author

To sum up all this:

This sounds like a bug. The runtime should not exit silently on fatal crash.

It seems it would be simple to change that to log the call stack and exception details for all exitCode values.

  • You are going print out a stack trace and details when it crashes rather than just silently exiting

Yes, we should fix the docs.

  • The docs need to get updated

The legacyCorruptedStateExceptionsPolicy config switch is what @BaronKiko can use.

I would not recommend using legacyCorruptedStateExceptionsPolicy as a permanent solution, it is ok to use it as a temporary workaround to unblock migration from .NET Framework to .NET Core.

  • legacyCorruptedStateExceptionsPolicy should be supported to help migrate but shouldn't be used as a permanent solution.

For cases where folks want to get details about fatal crashes (ie implement their own WER or fatal crash detail uploader), I think we should be looking at introducing unmanaged C API to provide these details.

  • A C API to implement your own handling would be nice to have

  • No specific quote but WER can be used if you sign your application and want a windows only solution

Does that sum up the situation correctly?
Thank you all for your time so far.

@BaronKiko
Copy link
Author

@cshung I tried your strange solution. I only spent an hour or so on it but I couldn't get it to work. Was quickly getting messy but I think it would technically work. Even if it's frankly too hacky for real use.

@janvorli
Copy link
Member

@BaronKiko have you tried to set the COMPlus_legacyCorruptedStateExceptionsPolicy environment variable to 1 before running your app? That sets the legacyCorruptedStateExceptionsPolicy config switch that should allow you to catch the exception as a temporary solution.

@BaronKiko
Copy link
Author

@janvorli I tried setting it on the first line in main with:
Environment.SetEnvironmentVariable("COMPlus_legacyCorruptedStateExceptionsPolicy", "1");

This didn't change anything but I'm assuming by the time it gets to main it's already too late.
Setting it from powershell prior to running does indeed allow capture of the exception and I can presumable make a wrapper to do that automatically. While not ideal it does work.

@janvorli
Copy link
Member

Yes, it needs to be done before running the app.

@tommcdon
Copy link
Member

It looks like the question has been answered. Feel free to re-open if needed.

@BaronKiko
Copy link
Author

BaronKiko commented Apr 17, 2019

I know it outputs error info on a crash now, is this on .net core 2 or only on 3?
Has legacyCorruptedStateExceptionsPolicy been re-enabled and has a C API been added or will it be added?
I think the C API, if it's going to be done, should be a separate issue.

Edit: Have the docs been updated?

@ericwj
Copy link

ericwj commented Apr 24, 2019

Perhaps my interest is very specialized, but might it be possible to opt into handling AV's and maybe SEH's for a particular memory region? I understand why this behavior is being made but I regret how completely it appears the goal is to try to erradicate any access at all.

I am not interested in diagnostics or crash dumps and I would hope not to have to set that environment variable and launch a child process or tricks like that. I would be happy installing a vectored handler on top of the one of the CLR, were it not that the native code will spread like the plague because the native/managed transitions are wicked way too slow for my purpose. And the whole code path gets blocked on UWP because it does not have the one API to set the handler though it has all others that fit my needs.

Short of going native all the way and get 4x faster AV's, but well...going native, wouldn't it be great for some scenario's to be able to let those AV's through in a filter/when either provided by me or opted into from .NET Core in plain UWP-compatible C# without COMPlus flags or custom hosts?

@BaronKiko
Copy link
Author

The flag just allows you to catch the exceptions with a try catch so you can check the mem address then and handle it however you want. I'm pretty sure it gives you the address in the exception but worst case you can get it from the exception string.
Maybe I'm missing the point of your question.

@ericwj
Copy link

ericwj commented Apr 24, 2019

Setting the flag defeats all these pulls that change the behavior so that AccessViolationException is not caught by accident in places that are meant to catch application errors such as overflow or file not found just because you didn't write out the exception type(s) in the handler filters for example.

And it is just not suitable to have to force the flag on short of writing a custom CLR host, plus in UWP apps afaik there is no way to set the flag at all.

@kirsan31
Copy link

I know it outputs error info on a crash now, is this on .net core 2 or only on 3?
Has legacyCorruptedStateExceptionsPolicy been re-enabled and has a C API been added or will it be added?
I think the C API, if it's going to be done, should be a separate issue.

Edit: Have the docs been updated?

@BaronKiko @jkotas @janvorli
Any news on this?
We are migrating our WinForm apps to core 3.0+. One of the main requirement to our apps is that we must to do some action on exit (regardless of exit type - normal, crash etc).
On .net we use legacyCorruptedStateExceptionsPolicy and AppDomain.CurrentDomain.UnhandledException for this.
During migration we found that, this not working any more. For coreclr bugs, for example.

For cases where folks want to get details about fatal crashes (ie implement their own WER or fatal crash detail uploader), I think we should be looking at introducing unmanaged C API to provide these details.

What current state on this problem? What you will advice?
Thanks.

@janvorli
Copy link
Member

@kirsan31 we have not introduced any API for reporting fatal crashes yet. So you are still left with the COMPlus_legacyCorruptedStateExceptionsPolicy=1 as a temporary workaround.

@kirsan31
Copy link

kirsan31 commented Nov 18, 2019

@janvorli

we have not introduced any API for reporting fatal crashes yet.

Sadly to hear :( Any eta, and/or issue to track on this?

So you are still left with the COMPlus_legacyCorruptedStateExceptionsPolicy=1 as a temporary workaround.

It's very strange, but It's not working for us (and not only for us).
See: https://stackoverflow.com/q/48682489/6075536 and https://stackoverflow.com/q/49357461/6075536
We are trying both console and WinForms core 3.0 apps and no luck with this. Exception from first link normally captured when targeting .net 472 and not captured in core 3.0.
We are tried:
via cmd:

set COMPlus_legacyCorruptedStateExceptionsPolicy=1
AccessViolationExceptionTest.exe // our test app

or via launcher app:

Environment.SetEnvironmentVariable("COMPlus_legacyCorruptedStateExceptionsPolicy", "1");
Process.Start("AccessViolationExceptionTest.exe");

and/or setting dword legacyCorruptedStateExceptionsPolicy to 1 in HKCU\Software\Microsoft\.NETFramework.

During all experiments, only in first attempt with COMPlus_legacyCorruptedStateExceptionsPolicy we got it to work. After this success, all subsequent attempts - failed :( And we are not 100% sure that that success was on core - may be we had run .net472 version by mistake?? Any way, we can't get it to work for now :(

@AntonLapounov
Copy link
Member

@kirsan31 Note that if you use Marshal.ReadInt32((IntPtr)42) instead (and set the environment variable), you will catch the AccessViolationException as expected. The problem is that AV during Marshal.StructureToPtr execution is not translated to the AccessViolationException as it used to be in .NET Framework 4.x.

Interestingly, that seems to be caused by statically linking VC Runtime to coreclr.dll. In .NET Framework 4.x the AV call stack on x64 contains the following frames:

ntdll!KiUserExceptionDispatch
VCRUNTIME140_CLR0400!MoveSmall
clr!memcpyNoGCRefs
clr!MarshalNative::StructureToPtr

and AV's address is in the 'VCRUNTIME140_CLR0400' module. Compare to .NET Core 3.0:

ntdll!KiUserExceptionDispatch
coreclr!MoveSmall
coreclr!memcpyNoGCRefs
coreclr!MarshalNative::StructureToPtr

where AV's address (the ip variable below) is now in the 'coreclr' module (g_pMSCorEE). That violates the internal contract regarding exceptions and leads to a fatal error in CLRVectoredExceptionHandlerPhase3: https://github.com/dotnet/coreclr/blob/ed5dc831b09a0bfed76ddad684008bebc86ab2f0/src/vm/excep.cpp#L7508-L7510

@kirsan31 Feel free to open a new issue for this defect. Mentioning @janvorli for awareness.

@kirsan31
Copy link

@AntonLapounov Thank you very match for you answer. Marshal.ReadInt32((IntPtr)42) actually work!
I will create new issue about this.
And what do you think about second post - can it be related some how? I've created simple c++ dll with one method:

int AVE()
{
	int ptr = *(int*)0 = 0;
	return ptr;
}

and calling it from .net core app generates AVE, that successfully captured...

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Diagnostics-coreclr question Answer questions and provide assistance, not an issue with source code or documentation.
Projects
None yet
Development

No branches or pull requests