Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Release/7.0] CoreCLR initialization failures logging #81134

Conversation

janvorli
Copy link
Member

Port of #78484, #78790, #80104 and #80294
This change adds detecting and logging of failures during coreclr initialization. For logging, it uses a new host API coreclr_set_error_writer to register a callback to report the errors to the host. The hosts have support for optional usage of this API so that they can work with older runtime versions as well.

The logging checks and reports failures with:

  • System.Private.CoreLib.dll
  • GC initialization
  • JIT initialization
  • libSystem.Native.so/dylib on Unix

The logging messages should allow customers to self-diagnose the issues causing the failures.

This change also adds backport of support for standalone GC back compatibility that is a prerequisite for it.

Customer Impact

Without this change, .NET core apps can fail with a dreaded error code 0x80004005 (E_FAIL) in many cases like issues with System.Private.CoreLib.dll loading, GC initialization and other cases. That effectively prevents self-diagnosis of the underlying issue. Both 3rd party customers and internal teams including the Excel team have been asking for more detailed diagnostic information for this kind of failures for quite a long time.

Testing

Local testing with manual injections of various failures like broken / incorrect version / missing System.Private.CoreLib.dll, JIT, libSystem.Native.so/dylib on Unix or wrong GC configuration. The Excel team also tested this port with their .NET host.

Risk

Low, the change influences only the failure code paths during coreclr initialization when the app is already on its way to be aborted with the E_FAIL. It has been in main for 1.5 months now without any issues other than one issue with single file exe that was discovered and fixed quickly.

Port of dotnet#78484, dotnet#78790, dotnet#80104 and dotnet#80294
This change adds detecting and logging of failures during coreclr
initialization. For logging, it uses a new host API
`coreclr_set_error_writer` to register a callback to report the errors
to the host. The hosts have support for optional usage of this API so
that they can work with older runtime versions as well.

The logging checks and reports failures with:
* System.Private.CoreLib.dll
* GC initialization
* JIT initialization
* libSystem.Native.so/dylib on Unix

The logging messages should allow customers to self-diagnose the issues
causing the failures.

This change also adds backport of support for standalone GC back compatibility that
is a prerequisite for it.
@janvorli janvorli added Servicing-consider Issue for next servicing release review area-VM-coreclr labels Jan 24, 2023
@janvorli janvorli added this to the 7.0.x milestone Jan 24, 2023
@janvorli janvorli self-assigned this Jan 24, 2023
Copy link
Member

@jeffschwMSFT jeffschwMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved. we will take for consideration in 7.0.x

Copy link
Member

@Maoni0 Maoni0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately I didn't know when this was checked into 8.0 and I do want to change some things there but they won't change anything functionally. so I'll just make the changes in 8.0 and those don't need to be backported.

@Maoni0
Copy link
Member

Maoni0 commented Jan 25, 2023

actually we are still discussing this on our side. please don't merge yet.

* Fix GC interfaces versioning

The change that introduced GC interfaces versioning had a bug preventing
it from using .NET 8 GC being used with .NET 7 runtime. The
GC_VersionInfo return value should have been the minimum supported
version, not the current one. But more importantly there was also
some confusion on what interface the GC_INTERFACE_MAJOR_VERSION
represents. While the change considered it to be a version of the
IGCToCLR interface, it really means the version of the IGCHeap
interface. This change creates a separate version,
EE_INTERFACE_MAJOR_VERSION for versioning o the IGCToCLR interface to
rectify that.
@Maoni0 also didn't like the way of creating a new version of
IGCToCLR interface for each major version change, so I am changing it to
single IGCToCLR interface.
@janvorli
Copy link
Member Author

@Maoni0 I have added the change based on your feedback.

@janvorli
Copy link
Member Author

@dotnet/dnceng some of the CI legs in this PR are consistently failing with

NuGet.Configuration.NuGetConfigurationException : Failed to read NuGet.Config due to unauthorized access. Path: '/.nuget/NuGet/NuGet.Config'.
---- System.UnauthorizedAccessException : Access to the path '/.nuget' is denied.
-------- System.IO.IOException : Permission denied

Is it a known issue?

@MattGal
Copy link
Member

MattGal commented Jan 26, 2023

@dotnet/dnceng some of the CI legs in this PR are consistently failing with

NuGet.Configuration.NuGetConfigurationException : Failed to read NuGet.Config due to unauthorized access. Path: '/.nuget/NuGet/NuGet.Config'.
---- System.UnauthorizedAccessException : Access to the path '/.nuget' is denied.
-------- System.IO.IOException : Permission denied

Is it a known issue?

I think so. It's hard in a pipeline with over 100 jobs where you haven't specified which of the jobs you're talking about to deduce what you specifically mean, but if you mean logs like https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-81134-merge-894df07df5fc42d8a6/Microsoft.Extensions.Configuration.UserSecrets.Tests/1/console.446ceeaf.log , this is because you're using mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-arm64v8-20210910135810-8a6f4f3. ARM64 ubuntu machines have moved from on-premises to Azure, and in doing so the UID of the Helix bot user changed. This fix has been in place for months.

Regardless of whether I guessed correctly (again, please include the name of a AzDO job or Helix correlation id when asking these questions), you should move to running on alpine-3.14-helix-arm64v8 (no date/ version) so you can float and pick up updates like this in the future. The current version is 20230125182149-e516922, a full 16 months newer version of the image.

@rbhanda rbhanda added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Jan 26, 2023
@rbhanda rbhanda modified the milestones: 7.0.x, 7.0.4 Jan 26, 2023
@MattGal
Copy link
Member

MattGal commented Jan 26, 2023

@MattGal For example, this one: https://dev.azure.com/dnceng-public/public/_build/results?buildId=149037&view=ms.vss-test-web.build-test-results-tab&runId=3173025&resultId=125715&paneView=debug

Ah. This is specifically what I was trying to point out earlier. Change your mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8-20210531091519-97d8652 to mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8 and it should start working again.

The basic reason this started happening recently because we only recently moved the ubuntu.1804.armarch* queues to Azure VMs, which necessitated a parallel change in docker images from last year. There was also a dncpartners email from approximately 9/26/2022 requesting you to make this move you can refer to.

@Maoni0
Copy link
Member

Maoni0 commented Jan 26, 2023

@Maoni0 I have added the change based on your feedback.

thanks @janvorli!

@carlossanlop
Copy link
Member

Approved by Tactics for 7.0.4.
Signed-off by area owners.
No OOB changes needed.
CI failures unrelated: A fix has been merged for the arm64 unauthorized exception issues that this PR hit: #81712. The fixed issues are #81544 #81391 #81619. The other issue I see in the CI is #81123 but it happened in the same machines, so maybe it had the same root cause.
Ready to merge. :shipit:

@carlossanlop carlossanlop merged commit c67e577 into dotnet:release/7.0 Feb 8, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Mar 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-VM-coreclr Servicing-approved Approved for servicing release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants