Az commands failing intermittently with StackOverflowException when CheckForUpgrade is enabled #26623

Open
onetocny opened this issue Nov 4, 2024 · 4 comments
Labels: bug, customer-reported, Tracking

onetocny commented Nov 4, 2024

Description

The Az modules used in the Azure DevOps AzurePowerShell task are failing intermittently. We use the Az.Accounts module to authenticate the PowerShell scope against Azure resources. After importing Az.Accounts and running the very first Az command (usually Get-AzConfig), the PowerShell process intermittently exits with the following error output (see the related issues below for more details about the symptoms):

Process is terminated due to StackOverflowException.
...
Exit code: 57005

The complete code responsible for Az module initialization can be found here. This is a shortened version:

$module = Get-Module -Name "Az.Accounts" -ListAvailable | Sort-Object Version -Descending | Select-Object -First 1
Import-Module -Name $module.Path -Global -PassThru -Force
...
Get-AzConfig -AppliesTo Az -Scope CurrentUser

The process does not terminate immediately after Get-AzConfig but exits at random points later in the script, presumably because overflowing the stack does not take a constant amount of time. We have noticed that the issue appears across all versions starting with 3.0.0.

Preliminary RCA

We were able to find these records in the event log on the build agent machines where the issue happens. The error messages point to Azure Watson dump records; here is an example. Looking at the call stack there, I was able to narrow the issue down to UpgradeNotificationHelper.RefreshVersionInfo. We mitigated the issue on our side by leveraging this code and disabling the whole feature with the following command:

Update-AzConfig -CheckForUpgrade $false -AppliesTo Az -Scope Process

However, Update-AzConfig has to be the very first Az command we call; otherwise the StackOverflowException occurs again. I believe that is because the version check is triggered automatically after every Az command completes, in AzurePSCmdlet.EndProcessing.
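
For illustration, here is a minimal sketch of the command order the mitigation relies on (based on the shortened snippet above; the exact task code differs):

# Import Az.Accounts first, as in the shortened snippet above.
$module = Get-Module -Name "Az.Accounts" -ListAvailable |
    Sort-Object Version -Descending |
    Select-Object -First 1
Import-Module -Name $module.Path -Global -PassThru -Force

# Disable the upgrade check for this process before any other Az cmdlet runs,
# so that AzurePSCmdlet.EndProcessing should not kick off the background version check.
Update-AzConfig -CheckForUpgrade $false -AppliesTo Az -Scope Process

# Only now run the remaining Az commands.
Get-AzConfig -AppliesTo Az -Scope CurrentUser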

Related issues

isra-fel commented Nov 7, 2024

Thanks for the analysis. We do indeed start a background thread to check for updates, although it is still unclear why that would result in a stack overflow.

trevors20 commented:

I am reposting a comment from this issue: microsoft/azure-pipelines-tasks#20156
henriblMSFT suggests a possible reason why this may be happening, and it would be good for you folks to look into it. The following is from henriblMSFT:
I'm intermittently hitting the stack overflow with version 5.248.3. Looking at the call stack in the linked item, I believe UpgradeNotificationHelper exposes the problem but isn't the root cause. I believe the root cause is that calling a PowerShell-defined assembly resolver from a background thread, when the assembly being resolved is a resource (satellite) assembly for a localized string, causes a stack overflow. This happens with UpgradeNotificationHelper because it runs in the background, but any other background thread may cause the issue.

Unfortunately, I don't have the call stack from my failed run so I am extrapolating here.

If we look at the call stack from the linked Az PowerShell issue, the stack overflow is around assembly resolution. The assembly resolver that's causing the stack overflow is a PowerShell ScriptBlock:

System_Management_Automation_ni!System.Management.Automation.ScriptBlock.InvokeAsDelegateHelper+0x2b [d:\os\src\onecore\admin\monad\src\engine\lang\Scriptblock.cs @ 774]
unknown!DynamicClass.lambda_method+0x111
Digging further, it gets into PowerShell internals, where it seems to be triggering an error:

System_Management_Automation_ni!System.Management.Automation.ErrorCategoryInfo.Ellipsize+0x56 [d:\os\src\onecore\admin\monad\src\engine\ErrorPackage.cs @ 511]
System_Management_Automation_ni!System.Management.Automation.ScriptBlock.GetContextFromTLS+0xc97f60
Looking at what GetContextFromTLS does:

internal ExecutionContext GetContextFromTLS()
{
    ExecutionContext executionContextFromTLS = LocalPipeline.GetExecutionContextFromTLS();
    if (executionContextFromTLS == null)
    {
        string original = ToString();
        original = ErrorCategoryInfo.Ellipsize(CultureInfo.CurrentUICulture, original);
        PSInvalidOperationException ex = PSTraceSource.NewInvalidOperationException(ParserStrings.ScriptBlockDelegateInvokedFromWrongThread, original);
        ex.SetErrorId("ScriptBlockDelegateInvokedFromWrongThread");
        throw ex;
    }
    return executionContextFromTLS;
}
It is trying to emit a ScriptBlockDelegateInvokedFromWrongThread error, but to do this it needs a localized version of the error string. Since the resource assembly isn't loaded, the runtime attempts to load it, which goes through the PowerShell script-block resolver. Because we are on a background thread, invoking that resolver triggers the same error, which again needs the resource assembly, which again runs the PowerShell assembly resolver, and we are now in our stack overflow scenario.

Searching azure-pipelines-tasks for cases where an assembly resolver is registered with a PowerShell script block, we find two instances:

VstsAzureHelpers_/Utility.ps1#L216
VstsAzureRestHelpers_/VstsAzureRestHelpers_.psm1#L1400
It turns out that when the AzurePowerShell task runs, the VstsAzureHelpers_ module is loaded.
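
For context, this is a minimal sketch of what such a script-block assembly resolver registration generally looks like (illustrative only, not the actual code from the files linked above):

# Illustrative script-block assembly resolver. The delegate wraps a PowerShell
# ScriptBlock, so invoking it requires a PowerShell ExecutionContext on the
# calling thread.
$onAssemblyResolve = [System.ResolveEventHandler] {
    param($sender, $e)
    foreach ($assembly in [System.AppDomain]::CurrentDomain.GetAssemblies()) {
        if ($assembly.FullName -eq $e.Name) {
            return $assembly
        }
    }
    return $null
}
[System.AppDomain]::CurrentDomain.add_AssemblyResolve($onAssemblyResolve)

When the CLR raises AssemblyResolve on a background thread, a delegate like this ends up in ScriptBlock.GetContextFromTLS without an ExecutionContext, which is exactly the path in the call stack below.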

So, to truly fix all the cases where this can happen, VstsAzureHelpers_ would need to be updated to use a C# assembly resolver instead of a PowerShell script block.
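
One possible shape of that change, sketched here as an assumption (the type and member names below are hypothetical, not the actual VstsAzureHelpers_ code):

# Hypothetical sketch: compile a small C# resolver with Add-Type so the
# AssemblyResolve handler no longer depends on a PowerShell ExecutionContext
# and can safely run on any thread.
Add-Type -TypeDefinition @"
using System;
using System.Reflection;

public static class SafeAssemblyResolver
{
    public static void Register()
    {
        AppDomain.CurrentDomain.AssemblyResolve += (sender, args) =>
        {
            // Return an already-loaded assembly when the name matches;
            // otherwise return null and let the default binder fail normally
            // instead of recursing through a script block.
            foreach (Assembly assembly in AppDomain.CurrentDomain.GetAssemblies())
            {
                if (assembly.FullName == args.Name)
                {
                    return assembly;
                }
            }
            return null;
        };
    }
}
"@

[SafeAssemblyResolver]::Register()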

cc @starkmsu who introduced this change

Call stack for reference

unknown!DynamicClass.lambda_method+0x111
mscorlib_ni!System.AppDomain.OnAssemblyResolveEvent+0xb1 [f:\dd\ndp\clr\src\BCL\system\appdomain.cs @ 3188]
clr!CallDescrWorkerInternal+0x83
clr!CallDescrWorkerWithHandler+0x47
clr!MethodDescCallSite::CallTargetWorker+0xfa
clr!AppDomain::RaiseAssemblyResolveEvent+0x26d9f6
clr!AppDomain::TryResolveAssembly+0x90
clr!AppDomain::PostBindResolveAssembly+0xcc
clr!AppDomain::BindAssemblySpec+0x26aeac
clr!AssemblySpec::LoadDomainAssembly+0x1e6
clr!AssemblyNative::Load+0x3c4
mscorlib_ni!System.Reflection.RuntimeAssembly.InternalGetSatelliteAssembly+0x155 [f:\dd\ndp\clr\src\BCL\system\reflection\assembly.cs @ 2937]
mscorlib_ni!System.Resources.ManifestBasedResourceGroveler.GetSatelliteAssembly+0xf5 [f:\dd\ndp\clr\src\BCL\system\resources\manifestbasedresourcegroveler.cs @ 555]
mscorlib_ni!System.Resources.ManifestBasedResourceGroveler.GrovelForResourceSet+0x1eb [f:\dd\ndp\clr\src\BCL\system\resources\manifestbasedresourcegroveler.cs @ 89]
mscorlib_ni!System.Resources.ResourceManager.InternalGetResourceSet+0x40f [f:\dd\ndp\clr\src\BCL\system\resources\resourcemanager.cs @ 808]
mscorlib_ni!System.Resources.ResourceManager.InternalGetResourceSet+0x2b [f:\dd\ndp\clr\src\BCL\system\resources\resourcemanager.cs @ 752]
mscorlib_ni!System.Resources.ResourceManager.GetString+0x235 [f:\dd\ndp\clr\src\BCL\system\resources\resourcemanager.cs @ 1316]
System_Management_Automation_ni!System.Management.Automation.ErrorCategoryInfo.Ellipsize+0x56 [d:\os\src\onecore\admin\monad\src\engine\ErrorPackage.cs @ 511]
System_Management_Automation_ni!System.Management.Automation.ScriptBlock.GetContextFromTLS+0xc97f60 [d:\os\src\onecore\admin\monad\src\engine\lang\Scriptblock.cs @ 799]
System_Management_Automation_ni!System.Management.Automation.ScriptBlock.InvokeAsDelegateHelper+0x2b [d:\os\src\onecore\admin\monad\src\engine\lang\Scriptblock.cs @ 774]
unknown!DynamicClass.lambda_method+0x111
mscorlib_ni!System.AppDomain.OnAssemblyResolveEvent+0xb1 [f:\dd\ndp\clr\src\BCL\system\appdomain.cs @ 3188]
clr!CallDescrWorkerInternal+0x83
clr!CallDescrWorkerWithHandler+0x47
clr!MethodDescCallSite::CallTargetWorker+0xfa
clr!AppDomain::RaiseAssemblyResolveEvent+0x26d9f6
clr!AppDomain::TryResolveAssembly+0x90
clr!AppDomain::PostBindResolveAssembly+0xcc
clr!AppDomain::BindAssemblySpec+0x26aeac
clr!AssemblySpec::LoadDomainAssembly+0x1e6
clr!AssemblyNative::Load+0x3c4
mscorlib_ni!System.Reflection.RuntimeAssembly.InternalGetSatelliteAssembly+0x155 [f:\dd\ndp\clr\src\BCL\system\reflection\assembly.cs @ 2937]
mscorlib_ni!System.Resources.ManifestBasedResourceGroveler.GetSatelliteAssembly+0xf5 [f:\dd\ndp\clr\src\BCL\system\resources\manifestbasedresourcegroveler.cs @ 555]
mscorlib_ni!System.Resources.ManifestBasedResourceGroveler.GrovelForResourceSet+0x1eb [f:\dd\ndp\clr\src\BCL\system\resources\manifestbasedresourcegroveler.cs @ 89]
mscorlib_ni!System.Resources.ResourceManager.InternalGetResourceSet+0x40f [f:\dd\ndp\clr\src\BCL\system\resources\resourcemanager.cs @ 808]
mscorlib_ni!System.Resources.ResourceManager.InternalGetResourceSet+0x2b [f:\dd\ndp\clr\src\BCL\system\resources\resourcemanager.cs @ 752]
mscorlib_ni!System.Resources.ResourceManager.GetString+0x235 [f:\dd\ndp\clr\src\BCL\system\resources\resourcemanager.cs @ 1316]

isra-fel commented:

@YanaXu can you take a look and see whether we have action items to take?

YanaXu commented Jan 10, 2025

Hi @onetocny, I understand what you described and how to work around this issue. I'm sorry for the inconvenience.
I'd like to reproduce it, but my build succeeded.
Here is my pipeline.

pr: none
trigger: none

steps:

- task: AzurePowerShell@5
  displayName: Test with default version
  inputs:
    azureSubscription: 'my-service-connection-name'
    ScriptPath: 'ps/test.ps1'
    azurePowerShellVersion: LatestVersion

- task: AzurePowerShell@5
  displayName: Test with latest version
  inputs:
    azureSubscription: 'my-service-connection-name'
    ScriptType: 'FilePath'
    ScriptPath: 'ps/test.ps1'
    preferredAzurePowerShellVersion: '13.0.0'

The content of ps/test.ps1 is:

Write-Host "---- Step 1"

$module = Get-Module -Name "Az.Accounts" -ListAvailable | Sort-Object Version -Descending | Select-Object -First 1
Import-Module -Name $module.Path -Global -PassThru -Force
Get-AzConfig -AppliesTo Az -Scope CurrentUser

Write-Host "---- Step 2"

Get-AzConfig

Write-Host "---- Step 3"

In my build, the agent image is https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20250105.1 and the AzurePowerShell task version is 5.249.1. Azure PowerShell 12.1.0 is pre-installed and 13.0.0 is downloaded.

Could you try my pipeline and tell me whether it fails, or how you reproduce this issue?
I'd also like to know whether you're using a self-hosted agent or a default (Microsoft-hosted) agent from Azure Pipelines.
