Initialization failures logging #75097
Conversation
This change enables logging of the `E_FAIL` error source so that when coreclr initialization fails with this generic error (which can stem from more than 300 places), the exact location can be quickly identified.

With this change, besides the following message in corerun (and a similar one in the dotnet host / apphost):

`BEGIN: coreclr_initialize failed - Error: 0x80004005`

we now get something like:

`HR=0x80004005 at CORECLR! WKS::gc_heap::initialize_gc + 0x12 (0x00007ffd'277241cb) F:\git\runtime7\src\coreclr\gc\gc.cpp:13361`

On macOS, we can only get the symbol name and offset programmatically; however, we also print the module offset. This can be passed to the `atos` command line tool to get the source line:

`atos -o artifacts/bin/coreclr/OSX.arm64.Debug/libcoreclr.dylib.dwarf 0x7ac654 -fullPath`

`EEStartupHelper() (in libcoreclr.dylib.dwarf) (/Users/janvorli/git/runtime/src/coreclr/vm/ceemain.cpp:1001)`

On Linux, we cannot get even the symbol, but similar to macOS, there is a command line tool, llvm-symbolizer, that can extract both the symbol and the source file:

`llvm-symbolizer-14 --obj /home/janvorli/git/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/libcoreclr.so 0xbac1de --pretty-print`

`EEStartupHelper() at /home/janvorli/git/runtime/src/coreclr/vm/ceemain.cpp:1001:16`

While the change logs only the sources of `E_FAIL` errors, it is not limited to those and can be used to log any errors. `E_FAIL` was picked because in 99% of cases, the coreclr failures that we couldn't diagnose easily in the past were caused by `E_FAIL`.

I have made it use a ring buffer to handle possible cases where there is a chain of `E_FAIL` errors (or possibly other ones, if we add logging for them). All the entries from the buffer are dumped when `coreclr_initialize` exits with an error code.

Besides the added logging, the change also enhances the assert stack trace by adding source line information. This is a side effect of the change, as the symbol details extraction is shared with that stack trace. But it is something I have been missing for a long time, since the assert trace contained only symbol names / offsets, which are not quickly mappable to source locations, especially when inlining happens.
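A minimal sketch of how such a ring buffer of failure sites could look, assuming hypothetical names (`ErrorEntry`, `RecordHrFailure`, `DumpRecordedFailures`) and GCC/Clang builtins; this is illustrative only, not the PR's actual implementation:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct ErrorEntry
{
    void*    site; // code address just after the call that observed the failure
    uint32_t hr;   // the failing HRESULT, e.g. E_FAIL = 0x80004005
};

static constexpr size_t RingSize = 16;
static ErrorEntry s_entries[RingSize];
static std::atomic<size_t> s_next{0};

// Marked noinline so that __builtin_return_address(0) points into the caller,
// i.e. at the place where the failing HRESULT was detected.
__attribute__((noinline)) void RecordHrFailure(uint32_t hr)
{
    size_t i = s_next.fetch_add(1, std::memory_order_relaxed) % RingSize;
    s_entries[i] = { __builtin_return_address(0), hr };
}

// Called when coreclr_initialize is about to return a failure; a real
// implementation would symbolize 'site' into module+offset or symbol/line.
void DumpRecordedFailures()
{
    size_t count = s_next.load(std::memory_order_relaxed);
    if (count > RingSize) count = RingSize;
    for (size_t i = 0; i < count; i++)
        fprintf(stderr, "HR=0x%08x at %p\n", s_entries[i].hr, s_entries[i].site);
}
```

Because only a raw code address and an HRESULT are stored per entry, the potentially expensive symbol lookup and string formatting can be deferred until the buffer is actually dumped.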
This assumes that the symbols are available on the target machine. That is not the case in a typical deployment. In order to use this logging in a typical support case, you will need to teach the customer how to upload the right symbols to the target machine. That is pretty non-trivial. Would it be better to just include the filename and line number in the log message?
The majority of the places that this is touching have nothing to do with initialization. On the other hand, I see a number of places in the initialization that won't be captured by this logging. Would it make more sense to focus this just on the initialization sequence?
Agree. We should just use the |
I'm not sure this is the best approach. This is a rather narrow but troublesome issue, yet the solution here is incredibly complex. Since the primary use of this mechanism is for initialization, and there are likely other uses, I'd argue for providing a TLS value that stores the string/line pair from the compiler.
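A rough sketch of what such a TLS-based alternative might look like; the names (`LastErrorLocation`, `RECORD_FAILURE`) are hypothetical and only illustrate the idea of capturing the compiler-provided file/line at the failure site:

```cpp
#include <cstdint>

// Hypothetical per-thread slot holding the most recent failure location.
struct LastErrorLocation
{
    const char* file;  // from __FILE__ (a string that ends up in the binary)
    int         line;  // from __LINE__
    uint32_t    hr;    // the failing HRESULT
};

static thread_local LastErrorLocation t_lastError = {};

// Illustrative macro a failing call site could use; checks the HRESULT
// severity bit (the FAILED(hr) condition) before recording the location.
#define RECORD_FAILURE(hrExpr)                               \
    do {                                                     \
        uint32_t hr_ = (uint32_t)(hrExpr);                   \
        if ((hr_ & 0x80000000u) != 0)                        \
            t_lastError = { __FILE__, __LINE__, hr_ };       \
    } while (0)
```

The trade-off raised later in this thread is that the `__FILE__` strings end up embedded in the binary, which is the size concern argued against this approach.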
src/coreclr/hosts/inc/coreclrhost.h
Outdated
// Returns:
//  HRESULT indicating status of the operation. S_OK if the error info was successfully retrieved
//
CORECLR_HOSTING_API(coreclr_get_error_info,
This is a big decision. We should include our hosting owners.
It actually doesn't. In case the symbols are not available, which is the majority of the cases we care about, we still get the module base / offset, which is sufficient for post-mortem debugging of the problem. As I've mentioned, on Linux we never actually get the symbols dumped by the runtime. For post-mortem debugging it is then sufficient to have the .so and .dbg files from the official build and to run the llvm-symbolizer or addr2line tool with the module offset and the .so as arguments to recover the symbol / source / line.
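For illustration, here is a small sketch (not taken from the PR) of how a code address can be reduced to a module path + offset on Linux/macOS using `dladdr`; that pair is exactly what llvm-symbolizer or atos then resolve against the matching binary:

```cpp
#include <cstdint>
#include <cstdio>
#include <dlfcn.h>

// Prints "module + offset" (and the symbol, when one is visible to dladdr)
// for an arbitrary code address.
void PrintModuleRelativeAddress(void* address)
{
    Dl_info info;
    if (dladdr(address, &info) != 0 && info.dli_fbase != nullptr)
    {
        uintptr_t offset = (uintptr_t)address - (uintptr_t)info.dli_fbase;
        // e.g. "libcoreclr.so + 0xbac1de" -> feed the offset to llvm-symbolizer/atos
        printf("%s + 0x%zx", info.dli_fname, (size_t)offset);
        if (info.dli_sname != nullptr && info.dli_saddr != nullptr)
            printf(" (%s + 0x%zx)", info.dli_sname,
                   (size_t)((uintptr_t)address - (uintptr_t)info.dli_saddr));
        printf("\n");
    }
}
```

Note that `dladdr` typically only sees exported symbols, which matches the observation above that on Linux the runtime itself often cannot recover the symbol name and the offline tools have to be used instead.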
That is true; however, I don't see a reliable way to make sure that we report only (and reliably all) the places that are executed during initialization.
Can you share some examples? I believe I am reporting every E_FAIL that is in the runtime (except for the stuff in the debugger-related subdirectories, which I've excluded intentionally).
I've spent quite some time considering the options and these macros were obviously the first option that came to my mind. However, the fact that we would bloat the binary file size by the file name strings made me dismiss this approach. My approach just needs to get / record the address of the location where the |
I don't see how using I also don't feel like it is complex. It is true that it touches a lot of places, but at 95% of the places, it was just a search and replace change. The only reason for the other changes was to make sure that we get the exact location of the |
Right. It means that it is unlikely to be self-service for the customer and we will need to assist in debugging the problem.
The problem with hard-to-diagnose startup failures is not limited to E_FAIL. E_OUTOFMEMORY is another common hard-to-diagnose startup HRESULT (e.g. see #13027 and the issues linked to it). Most HRESULTs that you get from startup are non-obvious to diagnose.
I would argue that showing the source file and line is the same case. If a customer is not advanced enough to install windbg (the Windows app from the Microsoft Store) or to run one of the unix tools + dotnet-symbol, I doubt that a source file and line will shed more light on the issue for such a customer, so the issue will get reported to us anyway.
I agree; that's why I have made it so that we can easily add logging of any HRESULT we find difficult to diagnose. OOM was one of those to consider in a next iteration. So far, since the time I added the detailed error codes for PAL initialization, I've seen many cases where the failure is
The diagnostics for host startup are done via a verbose log. Would it make sense to use a similar verbose log for coreclr startup? Ideally, integrate it with the host startup log so that we have a one-stop solution for diagnosing all types of startup failures.
@jkotas I am already using the host logging stuff to report the failures - see the src/native/corehost/hostpolicy/coreclr.cpp change at line 23. Or did you mean something else?
I meant following the same logging style too.
With respect to `__FILE__` and `__LINE__`, I appreciate the concern with bloat, but without a concrete understanding of how much increase there would be, I'm less convinced on that front.

The complexity I am alluding to is the "yet another logging mechanism". I would prefer we simply change the `LogHR` function to internally use EventPipe and log the address and `HRESULT` as we currently have, since the file and line are a concern. Given some of the comments, we all agree this isn't so much a "help themselves" sort of feature, so why not use our existing mechanisms and make it align with our other diagnostics stories.
My main worry about using EventPipe is that if something fails before we can initialize EventPipe, we're back where we started, with no information. We have basically two logging systems currently: the host uses a "simple" text file, the runtime uses the tracing infra (EventPipe). Personally I'm OK if we have to live with both systems, but there should always be something written somewhere. I could imagine we use host logging up to the point where EventPipe is operational, and then switch to EventPipe - but that sounds relatively tricky to achieve.
You also need to be able to connect to EventPipe. That is not always possible - for example, when the target runs in a container or on a locked-down machine.
Stress log then?
Stress log needs a debugger to be attached. It has a similar set of issues as EventPipe. Logging to the console tends to work best in the situation where nothing else works and you do not have full control over the target.
src/coreclr/hosts/inc/coreclrhost.h
Outdated
//
// Type of the callback function called by the coreclr_get_error_info
//
typedef void (*error_info_callback)(const char* line, void* arg);

//
// Initialize the CoreCLR. Creates and starts CoreCLR host and creates an app domain
//
// Parameters:
//  callBack - callback that will be called for each line of the error info
//  arg - argument to pass to the callback
//
// Returns:
//  HRESULT indicating status of the operation. S_OK if the error info was successfully retrieved
//
CORECLR_HOSTING_API(coreclr_get_error_info,
    error_info_callback callBack,
    void* arg);
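To make the shape of this API concrete, here is a hedged sketch of how a host could consume it after a failed `coreclr_initialize`; the `dlsym` lookup and helper names are illustrative and only assume the signatures shown in the diff above:

```cpp
#include <cstdio>
#include <dlfcn.h>

// Signatures as proposed in the diff above.
typedef void (*error_info_callback)(const char* line, void* arg);
typedef int  (*coreclr_get_error_info_fn)(error_info_callback callBack, void* arg);

// Callback invoked once per line of error info.
static void PrintErrorLine(const char* line, void* /*arg*/)
{
    fprintf(stderr, "coreclr: %s\n", line);
}

// Hypothetical helper a host could call when coreclr_initialize returns a failure.
void DumpCoreclrErrorInfo(void* coreclrLib)
{
    auto getErrorInfo =
        (coreclr_get_error_info_fn)dlsym(coreclrLib, "coreclr_get_error_info");
    if (getErrorInfo != nullptr)
        getErrorInfo(PrintErrorLine, nullptr);
}
```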
Since this is basically hooking this up to the host tracing system, I would prefer it used the same mechanism we use inside the host for this. Both `hostfxr` and `hostpolicy` expose "set error writer" entry points (`hostfxr_set_error_writer` and `corehost_set_error_writer`). It would feel "natural" to do the same for `coreclr`, so something like:

typedef void (*coreclr_error_writer_fn) (const pal::char_t *message);
CORECLR_HOSTING_API(coreclr_set_error_writer, coreclr_error_writer_fn error_writer);

The semantics are that the parent component (in this case `hostpolicy`) calls `set_error_writer` before doing anything else (right after LoadLibrary/dlopen), and the component (`coreclr` in this case) can call the callback function at any point in time it wants.

The components "propagate" these, so that if something called `set_error_writer` on `hostpolicy`, it will propagate all errors into that writer, including those coming from child components (`coreclr`).

This is super helpful (important) for native hosting scenarios, as sometimes the caller wants to hide/swallow errors, or redirect them. For example, the comhost will buffer all errors and then propagate them using COM interfaces. Windows apps will buffer the errors and send them to the EventLog (and also print them out), and so on.
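A sketch of the propagation model described above, with hypothetical glue code on the `hostpolicy` side (using `const char*` instead of `pal::char_t` for simplicity); the names other than `coreclr_set_error_writer` are illustrative:

```cpp
#include <cstdio>
#include <dlfcn.h>

typedef void (*coreclr_error_writer_fn)(const char* message);
typedef void (*coreclr_set_error_writer_fn)(coreclr_error_writer_fn writer);

// Writer registered with hostpolicy by its own parent (hostfxr, a native host, ...).
static coreclr_error_writer_fn g_externalWriter = nullptr;

// Receives errors from coreclr and forwards them to whoever registered a writer
// with hostpolicy; falls back to stderr when nobody did.
static void hostpolicy_error_writer(const char* message)
{
    if (g_externalWriter != nullptr)
        g_externalWriter(message);   // e.g. comhost buffering, EventLog, custom host
    else
        fprintf(stderr, "%s\n", message);
}

// Called right after the coreclr library is loaded, before any other coreclr call,
// so that every error coreclr reports flows into the registered writer.
void on_coreclr_loaded(void* coreclrLib)
{
    auto setWriter =
        (coreclr_set_error_writer_fn)dlsym(coreclrLib, "coreclr_set_error_writer");
    if (setWriter != nullptr)
        setWriter(hostpolicy_error_writer);
}
```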
The things I don't like much about such an approach are that:
- The output happens at "random" times (not synchronized with other possible output of the runtime).
- Not every E_FAIL necessarily has to lead to a runtime failure. We don't know that in advance, so we would log even the benign ones. I am not sure if that can happen now, but I can imagine such a thing being added anytime.
- Formatting the string needs to happen "inline". So basically, when we detect E_FAIL (or any other HR that we decide to handle in the future), we would need to perform the (I assume) heavyweight lookup and string formatting even if the E_FAIL turns out to be benign (handled gracefully) in the end. I guess it is not a big deal for the coreclr initialization, but it might be an issue if we decided to keep the writer on all the time instead of just during the coreclr_initialize call.
Actually, I have realized that if we call it just at the end of coreclr_initialize, the concerns I listed above are moot. And it would give us the extra ability to use it in the future for additional error logging if needed. I'll convert to this model.
@vitek-karas I have updated the change based on your suggested new API; can you please take a look to see whether it matches your expectations?
Sigh. We have so much infrastructure for this kind of stuff, but none of it works for simple text-based logging - this seems like a bug to me and an area where we could reduce mechanisms. Ideally, we could just throw an exception with the details, catch it on the boundary, record those details, and be done with it. However, given the vast number of these places, that would be a very long bug tail and cause havoc in other places. I appreciate the problem space here, but it seems like we have a real fundamental issue, and this approach, although designed well and slim, seems like yet more infrastructure for solving a problem in yet another way.
@AaronRobinsonMSFT the problem is that one of the serious cases when we are hitting issues with |
If they can enable the host tracing, can't they also set/enable a StressLog setting so we can view the details in the DMP?
@AaronRobinsonMSFT there are no dumps. They just know that for a subset of customers, coreclr initialization fails with 0x80004005. Maybe it is not exactly Watson that they use; it is on macOS.
Okay, I submit. Let's go with this and see what happens.
So what is the exact mechanism that is going to be used to egress the error from the customer machine in this specific case?
So that we don't have to implement it for mono now
The Office telemetry will report the output from this added logging and the version of the .NET runtime used. For the official build on macOS, it will report the libcoreclr.dylib offset and symbol name (even the official build on macOS contains symbols, which I didn't expect). But just the offset is sufficient. Example:
The host parts look good.
I am closing this PR. I am working on a different approach based on the feedback from @jkotas.