@richlander
Member

@richlander richlander commented May 9, 2025

Follow on from: #61755

@halter73 @wtgodbe

@richlander richlander requested review from a team and wtgodbe as code owners May 9, 2025 16:08
@github-actions github-actions bot added the area-infrastructure label (Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework) May 9, 2025
@richlander
Member Author

Looks like mostly crypto related issues.

@bartonjs @vcsjones

@vcsjones
Member

vcsjones commented May 9, 2025

2025-05-09T16:51:48.2734338Z    System.Security.Cryptography.CryptographicException : Error occurred during a cryptographic operation.
2025-05-09T16:51:48.2734366Z   Stack Trace:
2025-05-09T16:51:48.2738127Z      at System.Security.Cryptography.X509Certificates.OpenSslX509ChainProcessor.MapOpenSsl30Code(X509VerifyStatusCode code)
2025-05-09T16:51:48.2738755Z    at System.Security.Cryptography.X509Certificates.OpenSslX509ChainProcessor.AddElementStatus(X509VerifyStatusCode errorCode, List`1 elementStatus, List`1 overallStatus, Boolean& overallHasNotSignatureValid)
2025-05-09T16:51:48.2741390Z    at System.Security.Cryptography.X509Certificates.OpenSslX509ChainProcessor.AddElementStatus(ErrorCollection errorCodes, List`1 elementStatus, List`1 overallStatus, Boolean& overallHasNotSignatureValid)
2025-05-09T16:51:48.2741443Z    at System.Security.Cryptography.X509Certificates.OpenSslX509ChainProcessor.BuildChainElements(WorkingChain workingChain, List`1& overallStatus)
2025-05-09T16:51:48.2741468Z    at System.Security.Cryptography.X509Certificates.OpenSslX509ChainProcessor.Finish(OidCollection applicationPolicy, OidCollection certificatePolicy)
2025-05-09T16:51:48.2741551Z    at System.Security.Cryptography.X509Certificates.ChainPal.BuildChainCore(Boolean useMachineContext, ICertificatePal cert, X509Certificate2Collection extraStore, OidCollection applicationPolicy, OidCollection certificatePolicy, X509RevocationMode revocationMode, X509RevocationFlag revocationFlag, X509Certificate2Collection customTrustStore, X509ChainTrustMode trustMode, DateTime verificationTime, TimeSpan timeout, Boolean disableAia)
2025-05-09T16:51:48.2741667Z    at System.Security.Cryptography.X509Certificates.ChainPal.BuildChain(Boolean useMachineContext, ICertificatePal cert, X509Certificate2Collection extraStore, OidCollection applicationPolicy, OidCollection certificatePolicy, X509RevocationMode revocationMode, X509RevocationFlag revocationFlag, X509Certificate2Collection customTrustStore, X509ChainTrustMode trustMode, DateTime verificationTime, TimeSpan timeout, Boolean disableAia)
2025-05-09T16:51:48.2757190Z    at System.Security.Cryptography.X509Certificates.X509Chain.Build(X509Certificate2 certificate, Boolean throwOnException)
2025-05-09T16:51:48.2757275Z    at System.Net.Security.SslStreamCertificateContext.Create(X509Certificate2 target, X509Certificate2Collection additionalCertificates, Boolean offline, SslCertificateTrust trust, Boolean noOcspFetch)
2025-05-09T16:51:48.2765116Z    at Microsoft.AspNetCore.Server.Kestrel.Https.Internal.HttpsConnectionMiddleware..ctor(ConnectionDelegate next, HttpsConnectionAdapterOptions options, HttpProtocols httpProtocols, ILoggerFactory loggerFactory, KestrelMetrics metrics) in /_/src/Servers/Kestrel/Core/src/Middleware/HttpsConnectionMiddleware.cs:line 108

Looks like dotnet/runtime#114129, but I haven't been able to reproduce it. Without a repro it's a little hard to diagnose, but maybe we can add some diagnostic code.

Basically OpenSSL 3 is giving us an error we don't know how to handle, but I can't get it to error myself :-D.

@bartonjs What do you think about including the numeric value in the exception? That would at least give us a clue to even understand what part of chain building is failing so we could better understand it.
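
A minimal sketch of that idea, written in Python rather than the runtime's C#: a lookup maps known verify codes to status names and, for a code it does not recognize, raises an error whose message carries the numeric value. The names `ChainError`, `KNOWN_CODES`, and `map_verify_code` are hypothetical, and the two table entries are illustrative placeholders, not the runtime's actual mapping.

```python
class ChainError(Exception):
    """Hypothetical stand-in for CryptographicException."""


# Illustrative subset only; not the runtime's real mapping table.
KNOWN_CODES = {
    10: "NotTimeValid",
    18: "UntrustedRoot",
}


def map_verify_code(code: int) -> str:
    """Return the status name for a known code, or raise with the raw value."""
    try:
        return KNOWN_CODES[code]
    except KeyError:
        # Put the numeric value in the message so release-build logs show it.
        raise ChainError(f"Unrecognized chain verification code: {code}") from None
```

With something along these lines, a CI log would show which code was unmapped instead of only a generic message.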

@bartonjs
Member

bartonjs commented May 9, 2025

> What do you think about including the numeric value in the exception?

We have it in the assert... but yeah, that's not so nice when it happens outside of a debug build.

I'm torn between whether we want to change the message here, or be weird and do a double-throw so it's contained only in an InnerException. It's probably fine to just change the message here.
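
The two options can be compared with a small Python sketch, under the assumption that Python's exception chaining (`raise ... from ...`) is a fair analogue of a .NET `InnerException`. The outer error keeps the generic message while the numeric code lives only on the chained inner error. `ChainError` and `fail_with_detail` are hypothetical names used for illustration.

```python
class ChainError(Exception):
    """Hypothetical stand-in for CryptographicException."""


GENERIC_MESSAGE = "Error occurred during a cryptographic operation."


def fail_with_detail(code: int) -> None:
    """Raise a generic error whose cause carries the numeric code."""
    try:
        # First throw: the detailed error with the raw code.
        raise ChainError(f"Unrecognized chain verification code: {code}")
    except ChainError as inner:
        # Second throw: generic message outside, detail kept as the cause.
        raise ChainError(GENERIC_MESSAGE) from inner
```

The trade-off is that callers matching on the outer message see no change, but anyone reading only the top-level message in a log misses the code, which is the argument for simply changing the message.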

I was on a thread about this particular one a week or so ago, and no one can repro it outside of Helix, so something weird seems to be happening on these machines. Might need to pull it down from docker.

@richlander
Member Author

Might be useful to investigate in the VM itself. We're not using a container in this scenario.

https://dev.azure.com/dnceng/internal/_wiki/wikis/DNCEng%20Services%20Wiki/915/Investigating-Helix-VM-images

@vcsjones
Member

vcsjones commented May 9, 2025

> Might be useful to investigate in the VM itself.

As a GitHub employee, I don't have access to most things, including pretty much everything in that link, so that looks difficult.

Even if I could get on a VM, I don't know what I would do there, short of "Try running AspNetCore against a debug build of the runtime" and I don't know where to begin with that, either.

@vcsjones
Member

vcsjones commented May 9, 2025

> Might need to pull it down from docker.

I did, using the same image @LoopedBard3 mentioned in their report. My concern is that it might be a network condition, AzSecPack doing something goofy, etc.

@vcsjones
Member

We merged dotnet/runtime#115485, which will at least help us identify what the error is instead of a generic CryptographicException. We should wait until that dotnet/runtime change flows over here and re-run the build to get better error information.

@richlander
Member Author

That sounds great @vcsjones. Thanks for doing that.

@richlander
Member Author

@vcsjones -- time to take another run?

@dotnet-policy-service dotnet-policy-service bot added the pending-ci-rerun label (When assigned to a PR indicates that the CI checks should be rerun) May 20, 2025
@vcsjones
Member

@richlander I'm less familiar with how changes flow into aspnetcore, but you can merge in main and we'll see what happens.

@richlander
Member Author

I just checked. We need to wait another day.

@richlander
Member Author

We need this PR to merge: #62019.

It includes this runtime commit: dotnet/runtime@29638e8

@richlander
Member Author

/azp run

@dotnet-policy-service dotnet-policy-service bot removed the pending-ci-rerun label (When assigned to a PR indicates that the CI checks should be rerun) May 22, 2025
@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@richlander
Member Author

That PR merged so we're ready to go again. Re-running CI.

@vcsjones
Member

To also set an expectation: it's possible, or actually likely, that the failure is non-deterministic. You might need to run a couple of times to see it again. (This was an observation from the performance folks who were running into this.)
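
As a rough way to size "a couple of times": if each CI run fails independently with probability p, the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The sketch below computes that, plus the number of runs needed to reach a target confidence. The failure rate is unknown here, so the 20% used in the usage note is purely an assumed figure, and the function names are hypothetical.

```python
import math


def chance_of_seeing_failure(p: float, runs: int) -> float:
    """Probability of at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - p) ** runs


def runs_needed(p: float, confidence: float) -> int:
    """Smallest number of runs that reaches the target confidence."""
    # Solve 1 - (1 - p)^n >= confidence for n, then round up.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
```

Under the assumed 20% rate, `chance_of_seeing_failure(0.2, 3)` is about 0.49 and `runs_needed(0.2, 0.9)` is 11, so a few clean runs would not rule the failure out.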

@vcsjones
Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@richlander
Member Author

Failure looks unrelated:

Microsoft.AspNetCore.Hosting.WebHostTests.WebHostStopAsyncUsesDefaultTimeoutIfNoTokenProvided [FAIL]
2025-05-23T02:28:03.7696823Z [xUnit.net 00:00:31.33]       Assert.Equal() Failure: Values differ
2025-05-23T02:28:03.7698753Z [xUnit.net 00:00:31.33]       Expected: Task<VoidTaskResult,<StopAsync>d__32> { Status = RanToCompletion }
2025-05-23T02:28:03.7700631Z [xUnit.net 00:00:31.33]       Actual:   Task { Status = RanToCompletion }
2025-05-23T02:28:03.7702476Z [xUnit.net 00:00:31.33]       Stack Trace:
2025-05-23T02:28:03.7704601Z [xUnit.net 00:00:31.33]         /_/src/Hosting/Hosting/test/WebHostTests.cs(264,0): at Microsoft.AspNetCore.Hosting.WebHostTests.WebHostStopAsyncUsesDefaultTimeoutIfNoTokenProvided()
2025-05-23T02:28:03.7706513Z [xUnit.net 00:00:31.33]         --- End of stack trace from previous location ---

@vcsjones
Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@vcsjones
Member

Another try to get that crypto exception to raise.

@richlander
Member Author

Thanks for the effort! When I made these changes, the AL3 VM wasn't ready. It now is. I'm thinking we should switch to it now, assuming we don't see the issue show up. Sound good?

@richlander
Member Author

This sucks ... CI keeps on passing! /s

@richlander
Member Author

Let's see how this works with AL3.

@richlander
Member Author

The remaining issues seem Windows-specific. I think we can merge. Good? If so, I'll rename the PR to make it more descriptive of the final change.

@vcsjones @wtgodbe

@vcsjones
Member

What is the scope of the helix queue change?

Is this moving all Linux testing from Ubuntu 20.04 to AZL3?

@richlander
Member Author

Right. Original plan was 20.04 -> 22.04. The AL3 helix VMs became available through that process, so I'm now proposing that we make that move, somewhat matching what we're doing in runtime. I knew that while we (meaning you) were investigating the 22.04 issue. That seemed like a potentially legitimate issue, so I decided to see it through.

@dotnet-policy-service dotnet-policy-service bot added the pending-ci-rerun label (When assigned to a PR indicates that the CI checks should be rerun) Jun 2, 2025
@richlander
Member Author

Best to wait on this change until updated Azure Linux VMs are deployed.

@bartonjs
Member

bartonjs commented Jun 2, 2025

> Original plan was 20.04 -> 22.04. The AL3 helix VMs became available through that process, so I'm now proposing that we make that move

AFAIK (which could be wrong, I don't look at the telemetry all that often), Ubuntu is where the customers are, AZL is not. So (to me) if there's only one Linux, it should be Ubuntu.

Saying that AZL is gaining ground for Azure hosted things, and therefore should also be tested is fine; but I raise a very skeptical eye at it being the only Linux.

@richlander
Member Author

It's not the only Linux there. It's the only VM.

I see now that I should switch $(HelixQueueAzureLinux); to Ubuntu. I'll do that when the Azure Linux VM is re-ready.

@richlander
Member Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@dotnet-policy-service dotnet-policy-service bot removed the pending-ci-rerun label (When assigned to a PR indicates that the CI checks should be rerun) Jun 5, 2025
@richlander richlander changed the title from "Upgrade Ubuntu VMs" to "Upgrade Ubuntu and Azure Linux VMs and Containers" Jun 5, 2025
@richlander
Member Author

@wtgodbe -- this is ready to go.

Build is clean, Azure Linux and Ubuntu are updated, and we have both covered.

@wtgodbe wtgodbe merged commit 6a96694 into main Jun 5, 2025
27 checks passed
@wtgodbe wtgodbe deleted the upgrade-ubuntu branch June 5, 2025 16:32
@dotnet-policy-service dotnet-policy-service bot added this to the 10.0-preview6 milestone Jun 5, 2025
richlander added a commit that referenced this pull request Jun 6, 2025
richlander added a commit that referenced this pull request Jun 10, 2025

Labels

area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework

4 participants