Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CoreCLR initialization failures logging #78790

Merged
merged 7 commits into from
Dec 7, 2022

Conversation

janvorli
Copy link
Member

This change adds detecting and logging of failures during coreclr initialization. For logging, it uses a new host API coreclr_set_error_writer to register a callback to report the errors to the host. The hosts have support for optional usage of this API so that they can work with older runtime versions as well.

The logging checks and reports failures with:

  • System.Private.CoreLib.dll
  • GC initialization
  • JIT initialization
  • libSystem.Native.so/dylib on Unix

The logging messages should allow customers to self-diagnose the issues causing the failures.

This change adds detecting and logging of failures during coreclr
initialization. For logging, it uses a new host API
`coreclr_set_error_writer` to register a callback to report the errors
to the host. The hosts have support for optional usage of this API so
that they can work with older runtime versions as well.

The logging checks and reports failures with:
* System.Private.CoreLib.dll
* GC initialization
* JIT initialization
* libSystem.Native.so/dylib on Unix

The logging messages should allow customers to self-diagnose the issues
causing the failures.
@janvorli
Copy link
Member Author

Here are some examples of the error logs:

Missing SPC file

Failed to load System.Private.CoreLib.dll (error code 0x80070057)
Path: 'F:\git\runtime8\artifacts\tests\coreclr\windows.x64.Debug\Tests\Core_Root\System.Private.CoreLib.dll'

Could not load file or assembly 'F:\git\runtime8\artifacts\tests\coreclr\windows.x64.Debug\Tests\Core_Root\System.Private.CoreLib.dll'. The parameter is incorrect. (0x80070057 (E_INVALIDARG))

Another assembly instead of SPC

Failed to load System.Private.CoreLib.dll (error code 0x80131522)
Path: 'F:\git\runtime8\artifacts\tests\coreclr\windows.x64.Debug\Tests\Core_Root\System.Private.CoreLib.dll'

Could not load type 'System.Object' from assembly 'System.ObjectModel, Version=7.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'.

Text file instead of SPC

Failed to load System.Private.CoreLib.dll (error code 0x80131018)
Path: 'F:\git\runtime8\artifacts\tests\coreclr\windows.x64.Debug\Tests\Core_Root\System.Private.CoreLib.dll'

Could not load file or assembly 'F:\git\runtime8\artifacts\tests\coreclr\windows.x64.Debug\Tests\Core_Root\System.Private.CoreLib.dll'. The module was expected to contain an assembly manifest. (0x80131018 (COR_E_ASSEMBLYEXPECTED))

Wrong version of SPC (x86)

Failed to load System.Private.CoreLib.dll (error code 0x8007000B)
Path: 'F:\git\runtime8\artifacts\tests\coreclr\windows.x64.Debug\Tests\Core_Root\System.Private.CoreLib.dll'

Could not load file or assembly 'F:\git\runtime8\artifacts\tests\coreclr\windows.x64.Debug\Tests\Core_Root\System.Private.CoreLib.dll'. An attempt was made to load a program with an incorrect format. (0x8007000B (COR_E_BADIMAGEFORMAT))

Wrong version of SPC (x64 from 3.1):

Failed to load System.Private.CoreLib.dll (error code 0x80131522)
Path: 'F:\git\runtime8\artifacts\tests\coreclr\windows.x64.Debug\Tests\Core_Root\System.Private.CoreLib.dll'

Could not load type 'System.Runtime.CompilerServices.CastHelpers' from assembly 'System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e'.

OOM

Failed to load System.Private.CoreLib.dll (error code 0x8007000E)
Path:
F:\git\runtime8\artifacts\tests\coreclr\windows.x64.Debug\Tests\Core_Root\System.Private.CoreLib.dll
Error message:
Out Of Memory

Wrong setting of GC log file

GC: Cannot create log file
GC heap initialization failed with error 0x80004005

@@ -947,6 +950,21 @@ void EEStartupHelper()
g_MiniMetaDataBuffMaxSize, MEM_COMMIT, PAGE_READWRITE);
#endif // FEATURE_MINIMETADATA_IN_TRIAGEDUMPS

#ifdef TARGET_UNIX
// Preload the libSystem.Native.so/dylib to detect possible problems with loading it early
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to work for singlefilehost?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, singlefilehost should have been excluded here

@jkotas
Copy link
Member

jkotas commented Nov 24, 2022

Missing SPC file

Where is the invalid arg comming from in this case? Would it be feasible to change it to file not found?

Copy link
Member

@AaronRobinsonMSFT AaronRobinsonMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the work here a lot.

src/coreclr/hosts/corerun/corerun.cpp Outdated Show resolved Hide resolved
src/coreclr/hosts/corerun/corerun.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/codeman.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/codeman.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/codeman.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/codeman.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/codeman.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/codeman.cpp Outdated Show resolved Hide resolved
@janvorli
Copy link
Member Author

Where is the invalid arg coming from in this case? Would it be feasible to change it to file not found?

It comes from here:

GO_WITH_HRESULT(E_INVALIDARG);

When the S.P.C. is not found at the expected location, corerun tries to find it in the TPA list. To iterate over the list, it calls the GetNextTPAPath with dllOnly argument set to true. The E_INVALIDARG is returned because it comes to the corerun.exe on the TPA list. The behavior of the iteration is kind of weird, as it stops iterating over the list. I would expect it to skip non-dll entries instead in this case.
The iteration loop is here:

for (SString::Iterator i = sTrustedPlatformAssemblies.Begin(); i != sTrustedPlatformAssemblies.End(); )
{
SString fileName;
SString simpleName;
bool isNativeImage = false;
HRESULT pathResult = S_OK;
IF_FAIL_GO(pathResult = GetNextTPAPath(sTrustedPlatformAssemblies, i, /*dllOnly*/ true, fileName, simpleName, isNativeImage));
if (pathResult == S_FALSE)
{
break;
}
if (simpleName.EqualsCaseInsensitive(sCoreLibSimpleName))
{
sCoreLib = fileName;
found = true;
break;
}
}

* revert the try_get_export to not to have the isOptional argument
* Unify the error logging in JIT loading to single variadic macro
* Also fixes Unix build, the code in gcenv.unix.cpp cannot use the GCToEEInterface
@jkotas
Copy link
Member

jkotas commented Nov 24, 2022

I would expect it to skip non-dll entries instead in this case.

Yes, I agree. (A separate question is why we need to be looking for CoreLib in so many places.)

@@ -947,6 +950,29 @@ void EEStartupHelper()
g_MiniMetaDataBuffMaxSize, MEM_COMMIT, PAGE_READWRITE);
#endif // FEATURE_MINIMETADATA_IN_TRIAGEDUMPS

#ifdef TARGET_UNIX
// Check if the current code is executing in the single file host or in libcoreclr.so. The libSystem.Native is linked
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to just check if (!g_coreclr_embedded)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking for a check like that and have failed to locate this one :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could also use the CORECLR_EMBEDDED define?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nice. I have not noticed that the ceemain.cpp is one of the few files that are compiled separately for single file host and libcoreclr.so.

Copy link
Member Author

@janvorli janvorli Nov 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elinor-fung I have found that when S.P.C. is bundled, this NativeLibrary::LoadLibraryByName call doesn't work as the S.P.C. doesn't exist in the filesystem, just in the bundle. This causes e.g. the
Microsoft.NET.HostModel.Tests.BundleAndRun.TestWithRelativePaths test to fail.
Do I need to pass something else than SystemDomain::SystemAssembly() to NativeLibrary::LoadLibraryByName to make it work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it seems I should pass in the DLLIMPORTSEARCHPATH_ASSEMBLYDIRECTORY.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was actually different, in the EEStartupHelper, the native dll search paths from the properties passed by the host are not set yet. Moving the check to the end of the CreateAppDomainWithManager resolved the issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh, yet one more host test failing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that might be a variation of #75387? The case that is failing uses an extraction path that has a backslash on Unix and loading native libraries in PAL_LoadLibraryDirect goes through FILEDosToUnixPathA.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, running it locally with the FILEDosToUnixPathA modified to be dummy fixed the issue. There was one more mono related issue that I've fixed, but now it seems fine.

@janvorli janvorli closed this Dec 2, 2022
@janvorli janvorli reopened this Dec 2, 2022
@janvorli
Copy link
Member Author

janvorli commented Dec 2, 2022

@jkotas, @AaronRobinsonMSFT the CI tests have passed except for one known issue and one unrelated issue. I believe I've addressed your feedback. Can you please take another look?

@jkotas
Copy link
Member

jkotas commented Dec 2, 2022

Could you please make the message for missing CoreLib better (file not found instead of invalid arg)?

@janvorli
Copy link
Member Author

janvorli commented Dec 5, 2022

@jkotas I've fixed that in the last commit.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you!

Comment on lines +166 to +169
// Parameters:
// errorWriter - callback that will be called for each line of the error info
// - passing in NULL removes a callback that was previously set
//
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add that the callback can be called from any thread and thus must be thread safe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not the case at the moment, it is always called from the thread running the initialization. Do you mean that we should add that comment to make it possible if we ever needed that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't realize we don't need it currently. But I would add it anyway... it's really easy to forget about this and it could break in really weird ways (the current host implementation is thread safe fortunately).

@janvorli
Copy link
Member Author

janvorli commented Dec 6, 2022

The AllSubsets_Mono_LLVMFullAot_RuntimeTests llvmfullaot legs for x64 and arm64 are failing due to a sigsegv after I've pushed the fix for the error code. I don't see how that could have introduced such issue and I am unable to repro it on my local Ubuntu 21.04. Trying to repro in the centos 7 container that the lab uses.

@MichalStrehovsky
Copy link
Member

The AllSubsets_Mono_LLVMFullAot_RuntimeTests llvmfullaot legs for x64 and arm64 are failing due to a sigsegv after I've pushed the fix for the error code. I don't see how that could have introduced such issue and I am unable to repro it on my local Ubuntu 21.04. Trying to repro in the centos 7 container that the lab uses.

I think you're hitting #78580 (comment) and it's not your PR.

@janvorli
Copy link
Member Author

janvorli commented Dec 6, 2022

@MichalStrehovsky ah, great, thank you for letting me know!

@janvorli janvorli merged commit ffffc4b into dotnet:main Dec 7, 2022
@janvorli janvorli deleted the initialization-failures-logging-2 branch December 7, 2022 01:29
@ghost ghost locked as resolved and limited conversation to collaborators Jan 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants