Initialization failures logging #75097

Closed

Conversation

janvorli
Member

@janvorli janvorli commented Sep 5, 2022

This change enables logging of E_FAIL error source so that when coreclr
initialization fails with this generic error (which can stem from more
than 300 places), the exact location can be quickly identified.

With this change, besides the following message in corerun (and a similar one
in the dotnet host / apphost):
`BEGIN: coreclr_initialize failed - Error: 0x80004005`
we now get something like:
`HR=0x80004005 at CORECLR! WKS::gc_heap::initialize_gc + 0x12 (0x00007ffd'277241cb) F:\git\runtime7\src\coreclr\gc\gc.cpp:13361`

On macOS, we can only get the symbol name and offset programmatically;
however, we also print the module offset. This can be passed to the `atos` command
line tool to get the source line:

`atos -o artifacts/bin/coreclr/OSX.arm64.Debug/libcoreclr.dylib.dwarf 0x7ac654 -fullPath`
`EEStartupHelper() (in libcoreclr.dylib.dwarf) (/Users/janvorli/git/runtime/src/coreclr/vm/ceemain.cpp:1001)`

On Linux, we cannot even get the symbol, but similarly to macOS, there is
a command line tool, `llvm-symbolizer`, that can extract both the symbol and
the source file:

`llvm-symbolizer-14 --obj /home/janvorli/git/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/libcoreclr.so 0xbac1de --pretty-print`
`EEStartupHelper() at /home/janvorli/git/runtime/src/coreclr/vm/ceemain.cpp:1001:16`

While the change logs only the sources of E_FAIL errors, it is not limited to those
and can be used to log any errors. E_FAIL was picked because, in 99% of
cases, the coreclr failures that we couldn't diagnose easily in the past
were caused by E_FAIL.

I have made it use a ring buffer to handle cases where there is a chain of
E_FAIL errors (or possibly other ones, if we add logging for them).
All the entries from the buffer are dumped when coreclr_initialize exits
with an error code.
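
Roughly, the idea looks like the following (an illustrative sketch only, not the
actual implementation; the names and the buffer size are made up):

```cpp
// Illustrative sketch of the ring-buffer idea, not the actual runtime code.
#include <cstddef>
#include <cstdio>

struct ErrorEntry
{
    int   hr;       // the failing HRESULT (e.g. E_FAIL)
    void* address;  // code address where the failure was recorded
};

static const size_t MaxEntries = 16;
static ErrorEntry g_entries[MaxEntries];
static size_t g_count = 0;

// Record a failure; when the buffer is full, the oldest entry is overwritten.
void RecordError(int hr, void* address)
{
    g_entries[g_count % MaxEntries] = { hr, address };
    g_count++;
}

// Dump all recorded entries, oldest first; called when coreclr_initialize fails.
void DumpErrors()
{
    size_t start = (g_count > MaxEntries) ? g_count - MaxEntries : 0;
    for (size_t i = start; i < g_count; i++)
    {
        const ErrorEntry& e = g_entries[i % MaxEntries];
        std::printf("HR=0x%08x at %p\n", (unsigned int)e.hr, e.address);
    }
}
```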

Besides the added logging, the change also enhances the assert stack trace by
adding source line information. This is a side effect of the change, as
the symbol detail extraction is shared with that stack trace. But it is
something I had been missing for a long time, since the assert trace
contained only symbol names / offsets, which are not quickly mappable to
the source locations, especially when inlining happens.

@janvorli janvorli added this to the 8.0.0 milestone Sep 5, 2022
@janvorli janvorli self-assigned this Sep 5, 2022
@jkotas
Member

jkotas commented Sep 5, 2022

This assumes that the symbols are available on the target machine. That is not the case in a typical deployment. In order to use this logging in a typical support case, you will need to teach the customer how to upload the right symbols to the target machine. That is pretty non-trivial.

Would it be better to just include filename and line number in the log message?

@jkotas
Member

jkotas commented Sep 5, 2022

The majority of places that this is touching have nothing to do with initialization. On the other hand, I see a number of places in the initialization that won't be captured by this logging. Would it make more sense to just make this focused on the initialization sequence?

@AaronRobinsonMSFT
Member

Would it be better to just include filename and line number in the log message?

Agree. We should just use the __FILE__ and __LINE__ macros, which are well-defined: https://en.cppreference.com/w/cpp/preprocessor/replace.
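
For example, something along these lines would do it (a minimal sketch; LogFailureHR and LOG_HR are made-up names, not existing runtime helpers):

```cpp
// Minimal sketch of file/line-based failure logging using the preprocessor macros.
#include <cstdio>

typedef long HRESULT;

void LogFailureHR(HRESULT hr, const char* file, int line)
{
    std::fprintf(stderr, "HR=0x%08lx at %s:%d\n", (unsigned long)hr, file, line);
}

// The macro captures the source location of the failure site at compile time.
#define LOG_HR(hr) LogFailureHR((hr), __FILE__, __LINE__)

// Usage at a failure site:
//     LOG_HR(E_FAIL);
//     return E_FAIL;
```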

Member

@AaronRobinsonMSFT AaronRobinsonMSFT left a comment

I'm not sure this is the best approach. This is a rather narrow but troublesome issue, yet the solution here is incredibly complex. Since the primary use of this mechanism is for initialization (though there are likely other uses), I'd argue for providing a TLS value that stores the string/line pair from the compiler.

// Returns:
// HRESULT indicating status of the operation. S_OK if the error info was successfully retrieved
//
CORECLR_HOSTING_API(coreclr_get_error_info,
Member

This is a big decision. We should include our hosting owners.

/cc @vitek-karas @elinor-fung @agocke

@janvorli
Member Author

janvorli commented Sep 6, 2022

This assumes that the symbols are available on the target machine.

It actually doesn't. In case the symbols are not available, which is the majority of the cases we care about, we still get module base / offset, which is sufficient for post mortem debugging of the problem. As I've mentioned, on Linux, we never actually get the symbols dumped by the runtime. For post mortem debugging, it is then sufficient to have the .so and .dbg files from the official build and to run the llvm-symbolizer or addr2line tool with the module offset and the .so as arguments to dump the symbol / source / line.
I believe that on Windows, you can load the coreclr.dll into windbg and find the symbol / source in a similar way.

@janvorli
Member Author

janvorli commented Sep 6, 2022

The majority of places that this is touching have nothing to do with initialization.

That is true; however, I don't see a reliable way to make sure that we report only (and reliably all) the places that are executed during initialization.

@janvorli
Member Author

janvorli commented Sep 6, 2022

On the other hand, I see a number of places in the initialization that won't be captured by this logging.

Can you share some examples? I believe I am reporting every E_FAIL that is in the runtime (except for the stuff in the debugger-related subdirectories, which I've excluded intentionally).

@janvorli
Member Author

janvorli commented Sep 6, 2022

We should just use the __FILE__ and __LINE__ macros, which are well-defined.

I've spent quite some time considering the options, and these macros were obviously the first option that came to my mind. However, the fact that we would bloat the binary size with the file name strings made me dismiss this approach.

My approach just needs to get / record the address of the location where the E_FAIL was generated, which can very easily be translated into a symbol / line post mortem, even without any symbol files being available at runtime, on all platforms.
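
For illustration, the principle is roughly the following (a sketch using GCC/Clang intrinsics; RecordFailure is a made-up helper, the actual change uses its own plumbing; MSVC has the equivalent _ReturnAddress() intrinsic):

```cpp
// Sketch of recording only the code address of the failure site.
#include <cstdio>

typedef long HRESULT;

// Kept out of line so the captured return address points right after the call
// at the failure site; that address can be symbolized post mortem.
__attribute__((noinline))
HRESULT RecordFailure(HRESULT hr)
{
    void* site = __builtin_return_address(0);  // GCC/Clang; _ReturnAddress() on MSVC
    std::fprintf(stderr, "HR=0x%08lx at %p\n", (unsigned long)hr, site);
    return hr;
}

// Usage at a failure site:
//     return RecordFailure(E_FAIL);
```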

@janvorli
Member Author

janvorli commented Sep 6, 2022

the solution here is incredibly complex

I don't see how using __FILE__ and __LINE__ would make it simpler. We would still need to have a logging function or macro at the place where E_FAIL is generated.

I also don't feel like it is complex. It is true that it touches a lot of places, but at 95% of the places, it was just a search-and-replace change. The only reason for the other changes was to make sure that we get the exact location of the E_FAIL creation. Most of these were places where we were just setting hr = E_FAIL at the beginning of a function and then various failures during the function execution would just return that, so logging the initial assignment would not provide good enough information. There were also a couple of places where hr = E_FAIL was set at the beginning of a function, but then we would always set it unconditionally later anyway. I've removed those because I had to keep coming back to them and double-checking that I had not forgotten any real E_FAIL source.

@jkotas
Member

jkotas commented Sep 6, 2022

In case the symbols are not available, which is the majority of the cases we care about, we still get module base / offset, which is sufficient for post mortem debugging of the problem.

Right. It means that it is unlikely to be self-service for the customer and we will need to assist in debugging the problem.

Can you share some examples? I believe I am reporting every E_FAIL that is in the runtime

The problem with hard-to-diagnose startup failures is not limited to E_FAIL. E_OUTOFMEMORY is another common hard-to-diagnose startup HRESULT (e.g. see #13027 and issues linked to it). Most HRESULTs that you get from startup are non-obvious to diagnose.

@janvorli
Member Author

janvorli commented Sep 6, 2022

It means that it is unlikely to be self-service for the customer and we will need to assist in debugging the problem.

I would argue that showing the source code file and line is the same case. If a customer is not advanced enough to install windbg (the Windows app from the Microsoft Store) or to run one of the unix tools + dotnet-symbol, I doubt that a source code file and line will shed more light on the issue for such a customer, so the issue will get reported to us anyway.

The problem with hard-to-diagnose startup failures is not limited to E_FAIL

I agree; that's why I have made it so that we can easily add logging of any HRESULT we find difficult to diagnose. OOM was one of those to consider in a next iteration. So far, since the time I added the detailed error codes for PAL initialization, I've seen many cases where the failure is E_FAIL and one case with OOM. I don't remember seeing any other failure error codes.
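
For illustration (a made-up helper, not the actual code), extending the set of logged HRESULTs could then be a one-line change:

```cpp
// Illustrative filter deciding which HRESULTs get recorded.
typedef long HRESULT;
const HRESULT E_FAIL        = (HRESULT)0x80004005L;
const HRESULT E_OUTOFMEMORY = (HRESULT)0x8007000EL;

static bool ShouldRecordFailure(HRESULT hr)
{
    return hr == E_FAIL;  // add: || hr == E_OUTOFMEMORY
}
```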

@jkotas
Member

jkotas commented Sep 6, 2022

Diagnostics for host startup are done via a verbose log. Would it make sense to use a similar verbose log for coreclr startup? Ideally, integrate it with the host startup log so that we have a one-stop solution for diagnosing all types of startup failures.

@janvorli
Member Author

janvorli commented Sep 6, 2022

@jkotas I am already using the host logging stuff to report the failures - see the src/native/corehost/hostpolicy/coreclr.cpp change at line 23. Or did you mean something else?
Only corerun prints that to stderr directly.

@jkotas
Member

jkotas commented Sep 6, 2022

I meant follow the same logging style too.

Member

@AaronRobinsonMSFT AaronRobinsonMSFT left a comment

With respect to __FILE__ and __LINE__, I appreciate the concern about bloat, but without a concrete understanding of how much of an increase there would be, I'm less convinced on that front.

The complexity I am alluding to is the "yet another logging mechanism". I would prefer we simply change the LogHR function to internally use EventPipe and log the address and HRESULT as we currently have, since the file and line are a concern. Given some of the comments, we all agree this isn't so much a "help themselves" sort of feature, so why not use our existing mechanisms and make it align with our other diagnostics stories.

src/coreclr/minipal/Unix/symbols.cpp
src/coreclr/minipal/Windows/symbols.cpp
@vitek-karas
Member

My main worry about using EventPipe is that if something fails before we can initialize EventPipe, we're back where we started, with no information.

We have basically two logging systems currently: the host uses a "simple" text file, the runtime uses the tracing infra (EventPipe).
Ideally we would use just one (so probably EventPipe), but porting the host to EventPipe is really tricky and we would still suffer from the "it fails before we can trace" problem. So I think some kind of "simple" text file is still needed.

Personally I'm OK if we have to live with both systems, but there should always be something written somewhere. I could imagine using host logging up to the point where EventPipe is operational, and then switching to EventPipe - but that sounds relatively tricky to achieve.

@jkotas
Member

jkotas commented Sep 6, 2022

My main worry about using EventPipe is that if something fails before we can initialize EventPipe

You also need to be able to connect to EventPipe. It is not always possible - when the target runs in a container or on a locked down machine.

@AaronRobinsonMSFT
Member

You also need to be able to connect to EventPipe. It is not always possible - when the target runs in a container or on a locked down machine.

Stress log then?

@jkotas
Member

jkotas commented Sep 6, 2022

Stress log needs a debugger to be attached. It has a similar set of issues as EventPipe. Logging to the console tends to work best in situations where nothing works and you do not have full control over the target.

Comment on lines 50 to 68
//
// Type of the callback function called by the coreclr_get_error_info
//
typedef void (*error_info_callback)(const char* line, void* arg);

//
// Gets the error information describing the initialization failure. Calls the callback for each line of the error info
//
// Parameters:
// callBack - callback that will be called for each line of the error info
// arg - argument to pass to the callback
//
// Returns:
// HRESULT indicating status of the operation. S_OK if the error info was successfully retrieved
//
CORECLR_HOSTING_API(coreclr_get_error_info,
error_info_callback callBack,
void* arg);

Member

Since this basically hooks this up to the host tracing system, I would prefer it use the same mechanism we use inside the host for this. Both hostfxr and hostpolicy expose "set error writer" entry points (hostfxr_set_error_writer and corehost_set_error_writer). It would feel "natural" to do the same for coreclr, so something like:

typedef void (*coreclr_error_writer_fn) (const pal::char_t *message);

CORECLR_HOSTING_API(coreclr_set_error_writer, coreclr_error_writer_fn error_writer);

The semantics are that the parent component (in this case hostpolicy) calls set_error_writer before doing anything else (right after LoadLibrary/dlopen), and the component (coreclr in this case) can call the callback function at any point in time it wants.

The components "propagate" these, so that if something called set_error_writer on hostpolicy, it will propagate all errors into that writer, including those coming from child components (coreclr).

This is super helpful (important) for native hosting scenarios, as sometimes the caller wants to hide/swallow errors, or redirect them. For example, the comhost will buffer all errors and then propagate them using COM interfaces; Windows apps will buffer the errors and send them to the EventLog (and also print them out), and so on.
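
For illustration, the hostpolicy side could then look roughly like this (a sketch only; the signature is simplified to plain char, and everything except the coreclr_set_error_writer name is illustrative):

```cpp
// Sketch: register the proposed coreclr_set_error_writer right after loading
// libcoreclr, before coreclr_initialize, so startup errors flow into the host log.
#include <cstdio>
#include <dlfcn.h>

typedef void (*coreclr_error_writer_fn)(const char* message);
typedef int  (*coreclr_set_error_writer_fn)(coreclr_error_writer_fn error_writer);

// Forward coreclr errors into the host's own tracing (stderr here for simplicity).
static void write_coreclr_error(const char* message)
{
    std::fprintf(stderr, "coreclr error: %s\n", message);
}

int main()
{
    void* coreclr = dlopen("libcoreclr.so", RTLD_NOW | RTLD_LOCAL);
    if (coreclr == nullptr)
        return 1;

    // Register the writer before doing anything else with coreclr.
    auto set_error_writer = reinterpret_cast<coreclr_set_error_writer_fn>(
        dlsym(coreclr, "coreclr_set_error_writer"));
    if (set_error_writer != nullptr)
        set_error_writer(write_coreclr_error);

    // ... resolve coreclr_initialize and start the runtime as usual ...
    return 0;
}
```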

Member Author

The things I don't like much about such an approach are that:

  • The output happens at "random" times (not synchronized with other possible output of the runtime).
  • Not every E_FAIL necessarily has to lead to a runtime failure. We don't know that in advance, so we would log even the benign ones. I am not sure if that can happen now, but I can imagine such a thing being added anytime.
  • Formatting the string needs to happen "inline". So basically, when we detect E_FAIL (or any other HR that we decide to handle in the future), we would need to perform the (I assume) heavyweight lookup and string formatting even if the E_FAIL in the end turns out to be benign (handled gracefully). I guess it is not a big deal for the coreclr initialization, but it might be an issue if we decided to keep the writer on all the time instead of just during the coreclr_initialize call.

Member Author

Actually, I have realized that if we call it just at the end of coreclr_initialize, the concerns I have listed above are moot. And it would give us the extra ability to use it in the future for additional error logging if needed. I'll convert it to this model.

Member Author

@vitek-karas I have updated the change based on your suggested new API; can you please take a look and check whether it matches your expectations?

@AaronRobinsonMSFT
Member

Stress log needs a debugger to be attached. It has a similar set of issues as EventPipe. Logging to the console tends to work best in situations where nothing works and you do not have full control over the target.

Sigh. We've got so much infrastructure for this kind of stuff, but none of it works for simple text-based logging - this seems like a bug to me and an area where we could reduce mechanisms. Ideally, we could just throw an exception with the details, catch it on the boundary, record those details, and be done with it. However, given the vast number of these places, that would be a very long bug tail and cause havoc in other places.

I appreciate the problem space here, but it seems like we have a real fundamental issue, and this approach, although well designed and slim, seems like yet more infrastructure for solving a problem in yet another way.

@janvorli
Member Author

janvorli commented Sep 6, 2022

I would prefer we simply change the LogHR function to internally use EventPipe and log the address and HRESULT as we currently have, since the file and line are a concern.

@AaronRobinsonMSFT the problem is that one of the serious cases where we are hitting issues with E_FAIL is Microsoft Office. There all they have is a Watson log and some extra data logged by their dotnet host (they use a custom one). For such cases, the EventPipe would not be helpful.

@AaronRobinsonMSFT
Member

There all they have is a Watson log and some extra data logged by their dotnet host (they use a custom one). For such cases, the EventPipe would not be helpful.

If they can enable the host tracing, can't they also set/enable a StressLog setting and we can view the details in the DMP?

@janvorli
Member Author

janvorli commented Sep 6, 2022

@AaronRobinsonMSFT there are no dumps. They just know that for a subset of customers, coreclr initialization fails with 0x80004005. Maybe it is not exactly Watson that they use; it is on macOS.

@AaronRobinsonMSFT
Member

@AaronRobinsonMSFT there are no dumps. They just know that for a subset of customers, coreclr initialization fails with 0x80004005. Maybe it is not exactly Watson that they use; it is on macOS.

Okay, I submit. Let's go with this and see what happens.

@jkotas
Member

jkotas commented Sep 6, 2022

@AaronRobinsonMSFT there are no dumps. They just know that for a subset of customers, coreclr initialization fails with 0x80004005. Maybe it is not exactly Watson that they use; it is on macOS.

So what is the exact mechanism that is going to be used to egress the error from the customer machine in this specific case?

@janvorli
Member Author

janvorli commented Sep 8, 2022

So what is the exact mechanism that is going to be used to egress the error from the customer machine in this specific case?

The Office telemetry will report the output from this added logging and the version of the .NET runtime used. For the official build on macOS, it will report the libcoreclr.dylib offset and symbol name (even the official build on macOS contains symbols, which I didn't expect), but just the offset is sufficient.
Then the dotnet-symbol tool would be run on the matching official libcoreclr.dylib, which downloads the libcoreclr.dylib.dwarf.
Finally, the atos tool would be run on the dwarf file and the offset.
It will print out the symbol and the source file / line.

Example:
We get offset 0x21ef4c and runtime version 7.0.0-preview.5.22301.12

dotnet-symbol ~/dotnet7/shared/Microsoft.NETCore.App/7.0.0-preview.5.22301.12/libcoreclr.dylib
atos -o ~/dotnet7/shared/Microsoft.NETCore.App/7.0.0-preview.5.22301.12/libcoreclr.dylib.dwarf 0x21ef4c -fullPath
HandleSuspensionForInterruptedThread(_CONTEXT*) (in libcoreclr.dylib.dwarf) (/Users/runner/work/1/s/src/coreclr/vm/threadsuspend.cpp:5828)

Member

@vitek-karas vitek-karas left a comment

The host parts look good.

src/native/corehost/hostpolicy/coreclr.cpp
@janvorli
Member Author

janvorli commented Nov 9, 2022

I am closing this PR. I am working on a different approach based on the feedback from @jkotas.

@janvorli janvorli closed this Nov 9, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2022