Initialization failures logging #75097

Closed

Conversation

janvorli
Member

@janvorli janvorli commented Sep 5, 2022

This change enables logging of E_FAIL error source so that when coreclr
initialization fails with this generic error (which can stem from more
than 300 places), the exact location can be quickly identified.

With this change, besides the following message in corerun (and a similar one
in the dotnet host / apphost):
`BEGIN: coreclr_initialize failed - Error: 0x80004005`
we now get something like:
`HR=0x80004005 at CORECLR! WKS::gc_heap::initialize_gc + 0x12 (0x00007ffd'277241cb) F:\git\runtime7\src\coreclr\gc\gc.cpp:13361`

On macOS, we can only get the symbol name and offset programmatically;
however, we also print the module offset. This can be passed to the `atos` command
line tool to get the source line:

`atos -o artifacts/bin/coreclr/OSX.arm64.Debug/libcoreclr.dylib.dwarf 0x7ac654 -fullPath`
`EEStartupHelper() (in libcoreclr.dylib.dwarf) (/Users/janvorli/git/runtime/src/coreclr/vm/ceemain.cpp:1001)`

On Linux, we cannot even get the symbol, but similarly to macOS, there is
a command line tool, `llvm-symbolizer`, that can extract both the symbol and
the source file:

`llvm-symbolizer-14 --obj /home/janvorli/git/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/libcoreclr.so 0xbac1de --pretty-print`
`EEStartupHelper() at /home/janvorli/git/runtime/src/coreclr/vm/ceemain.cpp:1001:16`

While the change logs only the sources of E_FAIL errors, it is not limited to those
and can be used to log any errors. E_FAIL was picked because, in 99% of
cases, the coreclr failures that we couldn't diagnose easily in the past
were caused by E_FAIL.

I have made it use a ring buffer to handle cases where there is a chain of
E_FAIL errors (or possibly other ones, if we add logging for them).
All the entries from the buffer are dumped when coreclr_initialize exits
with an error code.
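
Roughly, the idea looks like the following (an illustrative sketch only, not the
actual implementation; the names and the buffer size are made up):

```cpp
// Illustrative sketch of the ring-buffer idea, not the actual runtime code.
#include <cstddef>
#include <cstdio>

struct ErrorEntry
{
    int   hr;       // the failing HRESULT (e.g. E_FAIL)
    void* address;  // code address where the failure was recorded
};

static const size_t MaxEntries = 16;
static ErrorEntry g_entries[MaxEntries];
static size_t g_count = 0;

// Record a failure; when the buffer is full, the oldest entry is overwritten.
void RecordError(int hr, void* address)
{
    g_entries[g_count % MaxEntries] = { hr, address };
    g_count++;
}

// Dump all recorded entries, oldest first; called when coreclr_initialize fails.
void DumpErrors()
{
    size_t start = (g_count > MaxEntries) ? g_count - MaxEntries : 0;
    for (size_t i = start; i < g_count; i++)
    {
        const ErrorEntry& e = g_entries[i % MaxEntries];
        std::printf("HR=0x%08x at %p\n", (unsigned int)e.hr, e.address);
    }
}
```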

Besides the added logging, the change also enhances the assert stack trace by
adding source line information. This is a side effect of the change, as
the symbol detail extraction is shared with that stack trace. But it is
something I had been missing for a long time, since the assert trace
contained only symbol names / offsets, which are not quickly mappable to
the source locations, especially when inlining happens.

@janvorli janvorli added this to the 8.0.0 milestone Sep 5, 2022
@janvorli janvorli self-assigned this Sep 5, 2022
@jkotas
Member

jkotas commented Sep 5, 2022

This assumes that the symbols are available on the target machine. That is not the case in a typical deployment. In order to use this logging in a typical support case, you will need to teach the customer how to upload the right symbols to the target machine. That is pretty non-trivial.

Would it be better to just include filename and line number in the log message?

@jkotas
Member

jkotas commented Sep 5, 2022

The majority of places that this is touching have nothing to do with initialization. On the other hand, I see a number of places in the initialization that won't be captured by this logging. Would it make more sense to just make this focused on the initialization sequence?

@AaronRobinsonMSFT
Member

Would it be better to just include filename and line number in the log message?

Agree. We should just use the __FILE__ and __LINE__ macros, which are well-defined: https://en.cppreference.com/w/cpp/preprocessor/replace.
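
For example, something along these lines would do it (a minimal sketch; LogFailureHR and LOG_HR are made-up names, not existing runtime helpers):

```cpp
// Minimal sketch of file/line-based failure logging using the preprocessor macros.
#include <cstdio>

typedef long HRESULT;

void LogFailureHR(HRESULT hr, const char* file, int line)
{
    std::fprintf(stderr, "HR=0x%08lx at %s:%d\n", (unsigned long)hr, file, line);
}

// The macro captures the source location of the failure site at compile time.
#define LOG_HR(hr) LogFailureHR((hr), __FILE__, __LINE__)

// Usage at a failure site:
//     LOG_HR(E_FAIL);
//     return E_FAIL;
```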

Member

@AaronRobinsonMSFT AaronRobinsonMSFT left a comment

I'm not sure this is the best approach. This is a rather narrow but troublesome issue, yet the solution here is incredibly complex. Since the primary use of this mechanism is for initialization (though there are likely other uses), I'd argue for providing a TLS value that stores the string/line pair from the compiler.

// Returns:
// HRESULT indicating status of the operation. S_OK if the error info was successfully retrieved
//
CORECLR_HOSTING_API(coreclr_get_error_info,
Member

This is a big decision. We should include our hosting owners.

/cc @vitek-karas @elinor-fung @agocke

@janvorli
Member Author

janvorli commented Sep 6, 2022

This assumes that the symbols are available on the target machine.

It actually doesn't. In case the symbols are not available, which is the majority of the cases we care about, we still get module base / offset, which is sufficient for post mortem debugging of the problem. As I've mentioned, on Linux, we never actually get the symbols dumped by the runtime. For post mortem debugging, it is then sufficient to have the .so and .dbg files from the official build and to run the llvm-symbolizer or addr2line tool with the module offset and the .so as arguments to dump the symbol / source / line.
I believe that on Windows, you can load the coreclr.dll into windbg and find the symbol / source in a similar way.

@janvorli
Member Author

janvorli commented Sep 6, 2022

The majority of places that this is touching have nothing to do with initialization.

That is true; however, I don't see a reliable way to make sure that we report only (and reliably all) the places that are executed during initialization.

@janvorli
Member Author

janvorli commented Sep 6, 2022

On the other hand, I see a number of places in the initialization that won't be captured by this logging.

Can you share some examples? I believe I am reporting every E_FAIL that is in the runtime (except for the stuff in the debugger-related subdirectories, which I've excluded intentionally).

@janvorli
Member Author

janvorli commented Sep 6, 2022

We should just use the __FILE__ and __LINE__ macros, which are well-defined.

I've spent quite some time considering the options, and these macros were obviously the first option that came to my mind. However, the fact that we would bloat the binary size with the file name strings made me dismiss this approach.

My approach just needs to get / record the address of the location where the E_FAIL was generated, which can very easily be translated into a symbol / line post mortem, even without any symbol files being available at runtime, on all platforms.
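
For illustration, the principle is roughly the following (a sketch using GCC/Clang intrinsics; RecordFailure is a made-up helper, the actual change uses its own plumbing; MSVC has the equivalent _ReturnAddress() intrinsic):

```cpp
// Sketch of recording only the code address of the failure site.
#include <cstdio>

typedef long HRESULT;

// Kept out of line so the captured return address points right after the call
// at the failure site; that address can be symbolized post mortem.
__attribute__((noinline))
HRESULT RecordFailure(HRESULT hr)
{
    void* site = __builtin_return_address(0);  // GCC/Clang; _ReturnAddress() on MSVC
    std::fprintf(stderr, "HR=0x%08lx at %p\n", (unsigned long)hr, site);
    return hr;
}

// Usage at a failure site:
//     return RecordFailure(E_FAIL);
```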

@janvorli
Member Author

janvorli commented Sep 6, 2022

the solution here is incredibly complex

I don't see how using __FILE__ and __LINE__ would make it simpler. We would still need to have a logging function or macro at the place where E_FAIL is generated.

I also don't feel like it is complex. It is true that it touches a lot of places, but at 95% of the places, it was just a search-and-replace change. The only reason for the other changes was to make sure that we get the exact location of the E_FAIL creation. Most of these were places where we were just setting hr = E_FAIL at the beginning of a function and then various failures during the function execution would just return that, so logging the initial assignment would not provide good enough information. There were also a couple of places where hr = E_FAIL was set at the beginning of a function, but then we would always set it unconditionally later anyway. I've removed those because I had to keep coming back to them and double-checking that I had not forgotten any real E_FAIL source.

@jkotas
Member

jkotas commented Sep 6, 2022

In case the symbols are not available, which is the majority of the cases we care about, we still get module base / offset, which is sufficient for post mortem debugging of the problem.

Right. It means that it is unlikely to be self-service for the customer and we will need to assist in debugging the problem.

Can you share some examples? I believe I am reporting every E_FAIL that is in the runtime

The problem with hard-to-diagnose startup failures is not limited to E_FAIL. E_OUTOFMEMORY is another common hard-to-diagnose startup HRESULT (e.g. see #13027 and issues linked to it). Most HRESULTs that you get from startup are non-obvious to diagnose.

@janvorli
Member Author

janvorli commented Sep 6, 2022

It means that it is unlikely to be self-service for the customer and we will need to assist in debugging the problem.

I would argue that showing the source code file and line is the same case. If a customer is not advanced enough to install windbg (the Windows app from the Microsoft Store) or to run one of the unix tools + dotnet-symbol, I doubt that a source code file and line will shed more light on the issue for such a customer, so the issue will get reported to us anyway.

The problem with hard-to-diagnose startup failures is not limited to E_FAIL

I agree; that's why I have made it so that we can easily add logging of any HRESULT we find difficult to diagnose. OOM was one of those to consider in a next iteration. So far, since the time I added the detailed error codes for PAL initialization, I've seen many cases where the failure is E_FAIL and one case with OOM. I don't remember seeing any other failure error codes.
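
For illustration (a made-up helper, not the actual code), extending the set of logged HRESULTs could then be a one-line change:

```cpp
// Illustrative filter deciding which HRESULTs get recorded.
typedef long HRESULT;
const HRESULT E_FAIL        = (HRESULT)0x80004005L;
const HRESULT E_OUTOFMEMORY = (HRESULT)0x8007000EL;

static bool ShouldRecordFailure(HRESULT hr)
{
    return hr == E_FAIL;  // add: || hr == E_OUTOFMEMORY
}
```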

@jkotas
Member

jkotas commented Sep 6, 2022

Diagnostics for host startup are done via a verbose log. Would it make sense to use a similar verbose log for coreclr startup? Ideally, integrate it with the host startup log so that we have a one-stop solution for diagnosing all types of startup failures.

@janvorli
Member Author

janvorli commented Sep 6, 2022

@jkotas I am already using the host logging stuff to report the failures - see the src/native/corehost/hostpolicy/coreclr.cpp change at line 23. Or did you mean something else?
Only corerun prints that to stderr directly.

@jkotas
Member

jkotas commented Sep 6, 2022

I meant follow the same logging style too.

Member

@AaronRobinsonMSFT AaronRobinsonMSFT left a comment

With respect to __FILE__ and __LINE__, I appreciate the concern about bloat, but without a concrete understanding of how much of an increase there would be, I'm less convinced on that front.

The complexity I am alluding to is the "yet another logging mechanism". I would prefer we simply change the LogHR function to internally use EventPipe and log the address and HRESULT as we currently have, since the file and line are a concern. Given some of the comments, we all agree this isn't so much a "help themselves" sort of feature, so why not use our existing mechanisms and make it align with our other diagnostics stories.

src/coreclr/minipal/Unix/symbols.cpp
src/coreclr/minipal/Windows/symbols.cpp
@vitek-karas
Member

My main worry about using EventPipe is that if something fails before we can initialize EventPipe, we're back where we started, with no information.

We have basically two logging systems currently: the host uses a "simple" text file, the runtime uses the tracing infra (EventPipe).
Ideally we would use just one (so probably EventPipe), but porting the host to EventPipe is really tricky and we would still suffer from the "it fails before we can trace" problem. So I think some kind of "simple" text file is still needed.

Personally I'm OK if we have to live with both systems, but there should always be something written somewhere. I could imagine using host logging up to the point where EventPipe is operational, and then switching to EventPipe - but that sounds relatively tricky to achieve.

@jkotas
Member

jkotas commented Sep 6, 2022

My main worry about using EventPipe is that if something fails before we can initialize EventPipe

You also need to be able to connect to EventPipe. It is not always possible - when the target runs in a container or on a locked down machine.

@AaronRobinsonMSFT
Member

You also need to be able to connect to EventPipe. It is not always possible - when the target runs in a container or on a locked down machine.

Stress log then?

@jkotas
Member

jkotas commented Sep 6, 2022

Stress log needs a debugger to be attached. It has a similar set of issues as EventPipe. Logging to the console tends to work best in situations where nothing works and you do not have full control over the target.

Comment on lines 50 to 68
//
// Type of the callback function called by the coreclr_get_error_info
//
typedef void (*error_info_callback)(const char* line, void* arg);

//
// Gets the error information describing the initialization failure. Calls the callback for each line of the error info
//
// Parameters:
// callBack - callback that will be called for each line of the error info
// arg - argument to pass to the callback
//
// Returns:
// HRESULT indicating status of the operation. S_OK if the error info was successfully retrieved
//
CORECLR_HOSTING_API(coreclr_get_error_info,
error_info_callback callBack,
void* arg);

Member

Since this basically hooks this up to the host tracing system, I would prefer it use the same mechanism we use inside the host for this. Both hostfxr and hostpolicy expose "set error writer" entry points (hostfxr_set_error_writer and corehost_set_error_writer). It would feel "natural" to do the same for coreclr, so something like:

typedef void (*coreclr_error_writer_fn) (const pal::char_t *message);

CORECLR_HOSTING_API(coreclr_set_error_writer, coreclr_error_writer_fn error_writer);

The semantics are that the parent component (in this case hostpolicy) calls set_error_writer before doing anything else (right after LoadLibrary/dlopen), and the component (coreclr in this case) can call the callback function at any point in time it wants.

The components "propagate" these, so that if something called set_error_writer on hostpolicy, it will propagate all errors into that writer, including those coming from child components (coreclr).

This is super helpful (important) for native hosting scenarios, as sometimes the caller wants to hide/swallow errors, or redirect them. For example, the comhost will buffer all errors and then propagate them using COM interfaces; Windows apps will buffer the errors and send them to the EventLog (and also print them out), and so on.
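
For illustration, the hostpolicy side could then look roughly like this (a sketch only; the signature is simplified to plain char, and everything except the coreclr_set_error_writer name is illustrative):

```cpp
// Sketch: register the proposed coreclr_set_error_writer right after loading
// libcoreclr, before coreclr_initialize, so startup errors flow into the host log.
#include <cstdio>
#include <dlfcn.h>

typedef void (*coreclr_error_writer_fn)(const char* message);
typedef int  (*coreclr_set_error_writer_fn)(coreclr_error_writer_fn error_writer);

// Forward coreclr errors into the host's own tracing (stderr here for simplicity).
static void write_coreclr_error(const char* message)
{
    std::fprintf(stderr, "coreclr error: %s\n", message);
}

int main()
{
    void* coreclr = dlopen("libcoreclr.so", RTLD_NOW | RTLD_LOCAL);
    if (coreclr == nullptr)
        return 1;

    // Register the writer before doing anything else with coreclr.
    auto set_error_writer = reinterpret_cast<coreclr_set_error_writer_fn>(
        dlsym(coreclr, "coreclr_set_error_writer"));
    if (set_error_writer != nullptr)
        set_error_writer(write_coreclr_error);

    // ... resolve coreclr_initialize and start the runtime as usual ...
    return 0;
}
```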

Member Author

The things I don't like much about such an approach are that:

  • The output happens at "random" times (not synchronized with other possible output of the runtime).
  • Not every E_FAIL necessarily has to lead to a runtime failure. We don't know that in advance, so we would log even the benign ones. I am not sure if that can happen now, but I can imagine such a thing being added anytime.
  • Formatting the string needs to happen "inline". So basically, when we detect E_FAIL (or any other HR that we decide to handle in the future), we would need to perform the (I assume) heavyweight lookup and string formatting even if the E_FAIL in the end turns out to be benign (handled gracefully). I guess it is not a big deal for the coreclr initialization, but it might be an issue if we decided to keep the writer on all the time instead of just during the coreclr_initialize call.

Member Author

Actually, I have realized that if we call it just at the end of coreclr_initialize, the concerns I have listed above are moot. And it would give us the extra ability to use it in the future for additional error logging if needed. I'll convert it to this model.

Member Author

@vitek-karas I have updated the change based on your suggested new API; can you please take a look and check whether it matches your expectations?

@AaronRobinsonMSFT
Member

Stress log needs a debugger to be attached. It has a similar set of issues as EventPipe. Logging to the console tends to work best in situations where nothing works and you do not have full control over the target.

Sigh. We've got so much infrastructure for this kind of stuff, but none of it works for simple text-based logging - this seems like a bug to me and an area where we could reduce mechanisms. Ideally, we could just throw an exception with the details, catch it on the boundary, record those details, and be done with it. However, given the vast number of these places, that would be a very long bug tail and cause havoc in other places.

I appreciate the problem space here, but it seems like we have a real fundamental issue, and this approach, although well designed and slim, seems like yet more infrastructure for solving a problem in yet another way.

@janvorli
Member Author

janvorli commented Sep 6, 2022

I would prefer we simply change the LogHR function to internally use EventPipe and log the address and HRESULT as we currently have, since the file and line are a concern.

@AaronRobinsonMSFT the problem is that one of the serious cases where we are hitting issues with E_FAIL is Microsoft Office. There all they have is a Watson log and some extra data logged by their dotnet host (they use a custom one). For such cases, the EventPipe would not be helpful.

@AaronRobinsonMSFT
Member

There all they have is a Watson log and some extra data logged by their dotnet host (they use a custom one). For such cases, the EventPipe would not be helpful.

If they can enable the host tracing, can't they also set/enable a StressLog setting and we can view the details in the DMP?

@janvorli
Member Author

janvorli commented Sep 6, 2022

@AaronRobinsonMSFT there are no dumps. They just know that for a subset of customers, coreclr initialization fails with 0x80004005. Maybe it is not exactly Watson that they use; it is on macOS.

@AaronRobinsonMSFT
Member

@AaronRobinsonMSFT there are no dumps. They just know that for a subset of customers, coreclr initialization fails with 0x80004005. Maybe it is not exactly Watson that they use; it is on macOS.

Okay, I submit. Let's go with this and see what happens.

@jkotas
Member

jkotas commented Sep 6, 2022

@AaronRobinsonMSFT there are no dumps. They just know that for a subset of customers, coreclr initialization fails with 0x80004005. Maybe it is not exactly Watson that they use; it is on macOS.

So what is the exact mechanism that is going to be used to egress the error from the customer machine in this specific case?

@janvorli
Member Author

janvorli commented Sep 8, 2022

So what is the exact mechanism that is going to be used to egress the error from the customer machine in this specific case?

The Office telemetry will report the output from this added logging and the version of the .NET runtime used. For the official build on macOS, it will report the libcoreclr.dylib offset and symbol name (even the official build on macOS contains symbols, which I didn't expect), but just the offset is sufficient.
Then the dotnet-symbol tool would be run on the matching official libcoreclr.dylib, which downloads the libcoreclr.dylib.dwarf.
Finally, the atos tool would be run on the dwarf file and the offset.
It will print out the symbol and the source file / line.

Example:
We get offset 0x21ef4c and runtime version 7.0.0-preview.5.22301.12

dotnet-symbol ~/dotnet7/shared/Microsoft.NETCore.App/7.0.0-preview.5.22301.12/libcoreclr.dylib
atos -o ~/dotnet7/shared/Microsoft.NETCore.App/7.0.0-preview.5.22301.12/libcoreclr.dylib.dwarf 0x21ef4c -fullPath
HandleSuspensionForInterruptedThread(_CONTEXT*) (in libcoreclr.dylib.dwarf) (/Users/runner/work/1/s/src/coreclr/vm/threadsuspend.cpp:5828)

Member

@vitek-karas vitek-karas left a comment

The host parts look good.

src/native/corehost/hostpolicy/coreclr.cpp
@janvorli
Member Author

janvorli commented Nov 9, 2022

I am closing this PR. I am working on a different approach based on the feedback from @jkotas.

@janvorli janvorli closed this Nov 9, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2022