
MSBuildWorkspace.OpenSolutionAsync hangs indefinitely on 4.9.0+ in Windows 2022 Azure CI/CD #75967

Closed
kamronbatman opened this issue Nov 19, 2024 · 34 comments

@kamronbatman

kamronbatman commented Nov 19, 2024

Version Used:

        <PackageReference Include="Microsoft.Build.Locator" Version="1.7.1" />
        <PackageReference Include="Microsoft.CodeAnalysis.CSharp" Version="4.11.0" />
        <PackageReference Include="Microsoft.CodeAnalysis.CSharp.Workspaces" Version="4.11.0" />
        <PackageReference Include="Microsoft.CodeAnalysis.Workspaces.MSBuild" Version="4.11.0" />

Steps to Reproduce:

  1. Write a dotnet tool that opens a solution and iterates through all of its projects. This code hangs indefinitely (see the self-contained sketch after the snippet):
        MSBuildLocator.RegisterDefaults();

        using var workspace = MSBuildWorkspace.Create();
        workspace.WorkspaceFailed += (_, args) => Console.WriteLine(args.Diagnostic.Message);

        var solution = await workspace.OpenSolutionAsync(solutionPath);
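
For reference, here is a self-contained sketch of the tool's entry point, assuming a top-level program with implicit usings and the packages listed above; the command-line argument for the solution path is just an assumption about how the tool is invoked:

        using Microsoft.Build.Locator;
        using Microsoft.CodeAnalysis.MSBuild;

        // Register the installed .NET SDK's MSBuild before using any workspace APIs.
        MSBuildLocator.RegisterDefaults();

        var solutionPath = args[0]; // assumed: the .sln path is passed on the command line

        using var workspace = MSBuildWorkspace.Create();
        workspace.WorkspaceFailed += (_, e) => Console.WriteLine(e.Diagnostic.Message);

        // On the affected Azure Pipelines image this call never returns.
        var solution = await workspace.OpenSolutionAsync(solutionPath);

        foreach (var project in solution.Projects)
            Console.WriteLine(project.Name);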

Expected Behavior:

The call to OpenSolutionAsync should complete, returning the loaded solution, and the application should finish running.

Actual Behavior:

Specifically on Azure Pipelines, latest Windows 2022, with .NET 9:

jobs:
- job: BuildWindows
  displayName: 'Windows Server 2022'
  pool:
    vmImage: 'windows-latest'

  steps:
  - task: UseDotNet@2
    displayName: 'Install .NET 9'
    inputs:
      useGlobalJson: true

The code above hangs indefinitely.

Link to the PR in our repo that replicates this:
https://github.com/modernuo/SerializationGenerator/pull/43/files#diff-f2b522070e7bfd744704829ee17ae88ea259383f12f3a45fb2d860a2a0c3e19c

In that PR, I also tried to clean up any possible deadlocks, but nothing changed.

Repo that uses this tool and its PR:

https://github.com/modernuo/ModernUO/pull/1984/files#diff-7afd3bcf0d0c06d6f87c451ef06b321beef770ece4299b3230bb280470ada2f6

The azure pipelines yaml file we use is here:
https://github.com/modernuo/ModernUO/pull/1984/files#diff-7915b9b726a397ae7ba6af7b9703633d21c031ebf21682f3ee7e6a4ec52837a5

The tool is run through a cmd file as part of the build process:
https://github.com/modernuo/ModernUO/blob/1d08572926ad3e6376fbebb1ec169624f9cfdb74/publish.cmd#L46

I cannot replicate this locally on Windows 11, nor on other operating systems. If I had to guess, it is related to how .NET is installed/pathed on that Docker image.

dotnet-issue-labeler bot added the Area-IDE and untriaged labels Nov 19, 2024
@kamronbatman
Author

Per our conversation on Discord, @dibarbet, you said you would try to take a look. Please contact me or let me know how I can help. I haven't had time to set up a separate Azure Pipelines project just for a minimal reproducible snippet, sorry.

kamronbatman changed the title from "MSBuildWorkspace.OpenSolutionAsync hangs indefinitely on 4.9+ in Windows 2022 Azure CI/CD" to "MSBuildWorkspace.OpenSolutionAsync hangs indefinitely on 4.9.0+ in Windows 2022 Azure CI/CD" Nov 19, 2024
@dibarbet
Member

Been taking a look at this and nothing obvious is standing out. Can you try the following:

Update the code to pass in the progress parameter to OpenSolutionAsync that writes to the console:

public static async Task<IEnumerable<Project>> GetProjectsAsync(string solutionPath)
{
    ...
    var solution = await workspace.OpenSolutionAsync(solutionPath, progress: new Progress());
    ...
}

private class Progress : IProgress<ProjectLoadProgress>
{
    public void Report(ProjectLoadProgress value)
    {
        Console.WriteLine($"{value.Operation} completed for {value.FilePath} ({value.TargetFramework}) in {value.ElapsedTime.TotalMilliseconds}ms");
    }
}

That should write logs similar to

Running Migrations for C:\Users\dabarbet\source\repos\trash\SerializationGenerator\SerializationGenerator.sln
Evaluate completed for C:\Users\dabarbet\source\repos\trash\SerializationGenerator\ModernUO.Serialization.Generator\ModernUO.Serialization.Generator.csproj () in 636.8839ms
Build completed for C:\Users\dabarbet\source\repos\trash\SerializationGenerator\ModernUO.Serialization.Generator\ModernUO.Serialization.Generator.csproj () in 276.8386ms
Resolve completed for C:\Users\dabarbet\source\repos\trash\SerializationGenerator\ModernUO.Serialization.Generator\ModernUO.Serialization.Generator.csproj (netstandard2.0) in 141.4972ms
...

Beyond that I think we'll have to get a hang dump (probably via procdump)

@dibarbet
Member

I also wasn't able to get a minimal repro. For example I created this dotnet tool that loads a solution successfully on windows-2022 github action images - https://github.com/dibarbet/MSBuildWorkspaceConsole/actions/runs/11923292319/job/33231361423

Based on this link - https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/hosted?view=azure-devops&tabs=yaml , the github image should be the same as the azure devops one.

I would guess something related to the specific solution or how the environment is being set up is causing an issue, though I'm not sure what that would be.

@kamronbatman
Author

kamronbatman commented Nov 20, 2024

Update the code to pass in the progress parameter to OpenSolutionAsync that writes to the console [...] Beyond that I think we'll have to get a hang dump (probably via procdump)

I'll do this and see what happens. Regarding a procdump, if you have steps I can throw together for the CI/CD to do that, let me know.

Note that with no change to the code or the solution being opened, v4.8.0 worked fine.

EDIT: With the progress change, here is the run:
https://dev.azure.com/modernuo/modernuo/_build/results?buildId=6447

@kamronbatman
Author

kamronbatman commented Nov 20, 2024

So the run shows:

Generating serialization migration schema...
Running Migrations for ModernUO.sln
Evaluate completed for D:\a\1\s\Projects\Server\Server.csproj () in 1587.7229ms

Contrast that with running locally (or in other CI/CD runs, like GitHub Actions with Linux/macOS):

Running Migrations for C:\Repositories\ModernUO\ModernUO.sln
Evaluate completed for C:\Repositories\ModernUO\Projects\Server\Server.csproj () in 418.6091ms
Build completed for C:\Repositories\ModernUO\Projects\Server\Server.csproj () in 537.2811ms
Resolve completed for C:\Repositories\ModernUO\Projects\Server\Server.csproj (net9.0) in 101.3583ms
Evaluate completed for C:\Repositories\ModernUO\Projects\UOContent\UOContent.csproj () in 303.8839ms
Build completed for C:\Repositories\ModernUO\Projects\UOContent\UOContent.csproj () in 463.5592ms
Resolve completed for C:\Repositories\ModernUO\Projects\UOContent\UOContent.csproj (net9.0) in 54.8129ms
Evaluate completed for C:\Repositories\ModernUO\Projects\Server.Tests\Server.Tests.csproj () in 52.6765ms
Build completed for C:\Repositories\ModernUO\Projects\Server.Tests\Server.Tests.csproj () in 198.41ms
Evaluate completed for C:\Repositories\ModernUO\Projects\Application\Application.csproj () in 28.9806ms
Build completed for C:\Repositories\ModernUO\Projects\Application\Application.csproj () in 63.5944ms
Resolve completed for C:\Repositories\ModernUO\Projects\Application\Application.csproj (net9.0) in 4.6206ms
Resolve completed for C:\Repositories\ModernUO\Projects\Server.Tests\Server.Tests.csproj (net9.0) in 108.1894ms
Evaluate completed for C:\Repositories\ModernUO\Projects\UOContent.Tests\UOContent.Tests.csproj () in 37.7079ms
Build completed for C:\Repositories\ModernUO\Projects\UOContent.Tests\UOContent.Tests.csproj () in 101.6578ms
Resolve completed for C:\Repositories\ModernUO\Projects\UOContent.Tests\UOContent.Tests.csproj (net9.0) in 5.6494ms
Completed migrations for Server

@dibarbet
Member

@kamronbatman interesting note here - I took your .NET9 PR and ran it against github actions - appeared to have no issues there
https://github.com/dibarbet/ModernUO/tree/actions_only - I don't have access to an azure pipeline to run it against. I'll see if I can figure out procdump instructions tomorrow.


@kamronbatman
Author

@kamronbatman interesting note here - I took your .NET9 PR and ran it against github actions - appeared to have no issues there https://github.com/dibarbet/ModernUO/tree/actions_only - I don't have access to an azure pipeline to run it against. I'll see if I can figure out procdump instructions tomorrow.

Good to know, thanks. I stopped using GitHub Actions because it wasn't covered by the unlimited Actions minutes for public repos. There was also another issue in the past, but I don't remember what it was.

@dibarbet
Member

@kamronbatman try something similar to this - dibarbet/ModernUO@202ae64

You may have to tweak it a bit to make sure it captures the dump on hang and uploads correctly.

It is incredibly mystifying that the GitHub Actions run passes but the AzDO run fails with the exact same image.

@kamronbatman
Author

kamronbatman commented Nov 21, 2024

@dibarbet - The tool never exits, so can we put a 5-minute timer on the procdump PS1 and kill the process? Otherwise I don't see how this would work. 🤔

PR Changes for the azure run:
https://github.com/modernuo/ModernUO/pull/1984/files#diff-7915b9b726a397ae7ba6af7b9703633d21c031ebf21682f3ee7e6a4ec52837a5

https://dev.azure.com/modernuo/modernuo/_build/results?buildId=6452&view=logs&j=3e8797c7-5b0a-5f2c-07a4-1bc5e60a122e&t=8bbe3afb-863e-5d44-0040-a95fcabd47cd

  • It's "hung" and doesn't dump.

EDIT: I added messages, and it never gets past running the schema generator. We might need to tweak the ps1 so it's non-blocking?

@dibarbet
Member

dibarbet commented Nov 21, 2024

@kamronbatman procdump should collect a hang dump after the process doesn't respond for 5 seconds. IIRC it should also exit after that (and we only wait on the procdump process exiting). Either procdump doesn't actually exit after it collects a dump or procdump isn't detecting a hang. How we fix it depends on which one is happening.

I've adjusted the script so it should live write the process output to the console, instead of waiting until the end. This should help us tell if procdump is collecting anything or not - dibarbet/ModernUO@52a27e1

@kamronbatman
Author

@kamronbatman procdump should collect a hang dump after the process doesn't respond for 5 seconds. IIRC it should also exit after that (and we only wait on the procdump process exiting). Either procdump doesn't actually exit after it collects a dump or procdump isn't detecting a hang. How we fix it depends on which one is happening.

I've adjusted the script so it should live write the process output to the console, instead of waiting until the end. This should help us tell if procdump is collecting anything or not - dibarbet/ModernUO@52a27e1

Generating serialization migration schema...
Procdump folder: D:\a\1\s\/procDump
Downloading procdump
Running schema generator
Running procdump
Waiting for proc dump exit
Running Migrations for ModernUO.sln

ProcDump v11.0 - Sysinternals process dump utility
Copyright (C) 2009-2022 Mark Russinovich and Andrew Richards
Sysinternals - www.sysinternals.com

Process:               dotnet.exe (5328)
Process image:         C:\hostedtoolcache\windows\dotnet\dotnet.exe
CPU threshold:         n/a
Performance counter:   n/a
Commit threshold:      n/a
Threshold seconds:     10
Hung window check:     Enabled
Log debug strings:     Disabled
Exception monitor:     Disabled
Exception filter:      [Includes]
                       *
                       [Excludes]
Terminate monitor:     Disabled
Cloning type:          Disabled
Concurrent limit:      n/a
Avoid outage:          n/a
Number of dumps:       1
Dump folder:           D:\a\1\s\/procDump\
Dump filename/mask:    PROCESSNAME_YYMMDD_HHMMSS
Queue to WER:          Disabled
Kill after dump:       Disabled


Press Ctrl-C to end monitoring without terminating the process.

Evaluate completed for D:\a\1\s\Projects\Server\Server.csproj () in 1578.1493ms

@dibarbet
Member

Thanks - it looks like procdump isn't detecting a hang and isn't collecting any dumps.

Try this instead - it will collect a dump every 5 seconds up to 5 total. You may have to adjust the time depending on how long it takes to get to the broken state - dibarbet/ModernUO@14ea3c6

@kamronbatman
Author

@dibarbet I sent you a link to the dumps via discord.

@dibarbet
Member

Unfortunately I hit another snag that I didn't realize beforehand - the process ID from launching the dotnet tool is only a wrapper around another process that actually runs the tool, and there are a couple more processes involved in the MSBuildWorkspace project loading. The previous dumps only captured the wrapper process, which doesn't have anything really useful in it.

Procdump is definitely not very ergonomic to use for this case, but unfortunately I don't think there's anything better 😆 . 
I've modified the dump capture to instead look for processes with matching command line arguments. Hopefully this is the last iteration...
dibarbet/ModernUO@0946e25

@kamronbatman
Author

Thanks @dibarbet, I sent you the dumps.

@dibarbet
Member

@kamronbatman took a look at the dumps this morning. Nothing extremely concrete yet, but I do have some clues.

  1. It looks like the hang is happening writing the build output when building the Server.csproj project.
  2. I'm seeing error code 8 as we try to serialize the project info back to the host. I believe this corresponds to not enough memory:
    [screenshot]
  3. There's an OutOfMemory exception in the dump (unrooted)
    [screenshot]
  4. Azure pipelines images run with 7GB RAM (link), whereas github actions run with 16GB (link), which may explain why it works on github actions but not azdo.

It's not super clear why it is OOM'ing, however; the dump itself doesn't show a huge amount of memory being used. I need to look at other issues, but will continue this investigation later.

@kamronbatman
Author

Ok, thanks. It is concerning if the project requires more than 7 or 8 GB to build or analyze. Regular builds also hang indefinitely on Linux with minimal memory and a single vCPU when they shouldn't. We got around that by simply not building on the VM.

@jasonmalinowski
Member

@dibarbet The unrooted OutOfMemoryException could be a red herring; I think the CLR may preallocate an OutOfMemoryException to throw in the case it needs to, since otherwise if you're out of memory it won't be able to allocate one! That said, the error code still looks extremely suspicious, and @kamronbatman's comment implies there are similar symptoms elsewhere.

Was that OOM happening in the build host process? We should have code in the host app to deal with sudden disappearances of the build host process but maybe we're losing that somewhere. Might be easy to try to reproduce by adding some logic to the build host just to start having it allocate endless memory when we process a project until it crashes.
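
For what it's worth, a minimal sketch of that kind of fault injection, assuming a hypothetical snippet dropped into the build host's project-loading path (this is not existing Roslyn code):

// Hypothetical fault injection: allocate until the build host OOMs, so we can
// check whether the client notices the crashed process or just hangs forever.
var leak = new List<byte[]>();
while (true)
{
    var chunk = new byte[64 * 1024 * 1024]; // 64 MB per iteration
    Array.Fill(chunk, (byte)1);             // touch the pages so memory is committed
    leak.Add(chunk);                        // keep it rooted so the GC can't reclaim it
}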

@dibarbet
Member

dibarbet commented Nov 26, 2024

Was that OOM happening in the build host process? We should have code in the host app to deal with sudden disappearances of the build host process but maybe we're losing that somewhere.

@jasonmalinowski Build host process was showing the out of memory error code writing the project info to the stream. Both processes were still running, just seemingly stuck. Neither the workspace dumps nor the buildhost dumps showed a huge amount of memory buildup, so a bit strange.

One idea I was thinking about earlier today was whether there was potentially an issue writing the project info contents to the stdout buffer due to a buffer size limit.

@jasonmalinowski
Member

stdout buffer due to a buffer size limit.

Possibly! I guess anything written to a pipe (or something like stdout) has to go somewhere in memory. Maybe that's not really accounted to the process but rather it's just kernel memory being used up. And that can run out. Does something somewhere track bytes written?
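
For illustration, this is the classic shape of that deadlock. It's a generic sketch, not the actual MSBuildWorkspace/build host code, and "child.exe" is a stand-in for any process that writes a lot to a redirected stdout:

using System.Diagnostics;

var psi = new ProcessStartInfo("child.exe") // hypothetical chatty child process
{
    RedirectStandardOutput = true,
    UseShellExecute = false,
};

using var child = Process.Start(psi)!;

// The redirected stdout is a pipe with a fixed-size kernel buffer. Once the child
// has written enough to fill it, its next write blocks until the parent reads...
child.WaitForExit();                           // ...but the parent blocks here without reading,
var output = child.StandardOutput.ReadToEnd(); // so neither side ever makes progress.

The usual fix is to drain stdout concurrently (or, as ends up being discussed later in this thread, to move the protocol off stdout entirely).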

arunchndr added the IDE-MSBuildWorkspace and Bug labels and removed the untriaged label Nov 26, 2024
arunchndr removed the IDE-MSBuildWorkspace label Nov 26, 2024
arunchndr added this to the 17.14 P1 milestone Nov 26, 2024
@kamronbatman
Author

kamronbatman commented Nov 27, 2024

Azure pipelines images run with 7GB RAM (link), whereas github actions run with 16GB (link), which may explain why it works on github actions but not azdo. [...]

Not sure if it also makes a difference, but it's 2 vCPUs in Azure instead of 4 for GitHub Actions.

Regarding the indefinite hang on Linux, I'll try to get more info. What I know for now is that it is also a 2 vCPU machine with 4 GB of RAM, manually set up with Debian and .NET 8 on GCP. Increasing the machine to 4 vCPUs and 8 GB of RAM by choosing another instance type and rebooting resolves the issue. It hangs while building the projects, not while running the tool I made.

Also let me know if there is anything else I can do to help move this further along. Hope everyone has a great holiday this week!

@dibarbet
Member

I will have to get back to this next week. I think the next step is to create a similarly spec'd VM to see if I can reproduce the issue outside of azure pipelines.

@jasonmalinowski
Member

@dibarbet What happens if we make the build host just allocate a bunch of memory? Can that force the OOM to happen more easily?

@dibarbet
Member

dibarbet commented Dec 4, 2024

Made some progress on this today. It looks like if anything writes to stdout from the build host process, we will see an indefinite hang. This includes a simple Console.Write (or related), any process exception message (for example when an OOM is thrown, an error message is written to the console), any MSBuild unhandled exception, and possibly other scenarios.

The cause appears to be a json serialization exception thrown here - https://github.com/dotnet/roslyn/blob/main/src/Workspaces/Core/MSBuild/Rpc/RpcClient.cs#L52 which ends up never being observed, and the requests infinitely hang.

@jasonmalinowski I have a draft change here to at least observe the failures - #76252 . However I think we need to do what we did in the LSP server - switch to named pipes. stdin/stdout is too prone to issues like this.

@kamronbatman - I'm fairly confident this is related, but not 100% sure. I'll reply back here once my first changes merge and we have official packages you can use. While it won't fix the hang, it should at least tell us whether there is an erroring stdout message and what it is.
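
To make the failure mode concrete, here is a heavily simplified sketch of the pattern; it is not the actual RpcClient code, and names like pending and ReadLoopAsync are made up for illustration (assuming a top-level program with implicit usings):

using System.Collections.Concurrent;
using System.Text.Json.Nodes;

var pending = new ConcurrentDictionary<int, TaskCompletionSource<JsonNode?>>();

async Task ReadLoopAsync(TextReader reader)
{
    while (await reader.ReadLineAsync() is { } line)
    {
        // Anything non-JSON on the stream (a stray Console.Write, an unhandled
        // exception message, MSBuild output, ...) makes this parse throw.
        var message = JsonNode.Parse(line);
        var id = (int)message!["id"]!;
        pending[id].TrySetResult(message["result"]);
    }
}

// If the loop is started fire-and-forget and the thrown exception is never
// observed, nothing ever completes or faults the pending requests, so every
// caller awaiting pending[id].Task hangs indefinitely.
_ = ReadLoopAsync(Console.In);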

@kamronbatman
Author

kamronbatman commented Dec 4, 2024

At least one step further! Thank you

When I was debugging, I posted this line in Discord, but didn't know if I was chasing a red herring:

So I guess I was close-ish. It's hard to follow the code because of how dynamic it is.

@jasonmalinowski
Member

@jasonmalinowski I have a draft change here to at least observe the failures - #76252 . However I think we need to do what we did in the LSP server - switch to named pipes. stdin/stdout is too prone to issues like this.

Agreed, moving to pipes would be a good thing to do.
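
For context, the named-pipe shape looks roughly like this. It's a generic sketch of the approach rather than the actual Roslyn change, with both ends shown in one small top-level program just to keep it runnable:

using System.IO.Pipes;

// Each connection gets its own dedicated pipe, so stray Console.Write output in
// the child can no longer corrupt the RPC stream the way stdout could.
var pipeName = $"build-host-{Guid.NewGuid():N}";

using var server = new NamedPipeServerStream(
    pipeName, PipeDirection.InOut, maxNumberOfServerInstances: 1,
    PipeTransmissionMode.Byte, PipeOptions.Asynchronous);

// In the real scenario the pipe name would be handed to the child process (for
// example via a command-line argument) and the child would connect back to it.
using var client = new NamedPipeClientStream(
    ".", pipeName, PipeDirection.InOut, PipeOptions.Asynchronous);

await Task.WhenAll(server.WaitForConnectionAsync(), client.ConnectAsync());
// RPC traffic now flows over server/client, independent of stdin/stdout.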

dibarbet added a commit that referenced this issue Dec 4, 2024
Related to #75967

Previously, any errors processing messages would just cause an infinite
hang. This updates the implementation to do two things:
1. Catch the specific string that failed to be deserialized (if any)
2. If processing failed for any reason, report the exception to the
outstanding requests and trigger disconnect.
@JoeRobich
Member

I think we need to do what we did in the LSP server - switch to named pipes.

Agreed, moving to pipes would be a good thing to do.

@dibarbet @jasonmalinowski Please see #76259

@dibarbet
Member

dibarbet commented Dec 5, 2024

@kamronbatman would you be able to try package version 4.13.0-3.24605.1 (you will need to add the dotnet-tools azdo feed as they are not published to nuget.org)? This should have both the changes to observe failures and the named pipe fix.

@kamronbatman
Author

@kamronbatman would you be able to try package version 4.13.0-3.24605.1 (you will need to add the dotnet-tools azdo feed as they are not published to nuget.org)? This should have both the changes to observe failures and the named pipe fix.

I may not have time today, but I can definitely try tomorrow.

@kamronbatman
Author

kamronbatman commented Dec 6, 2024

@dibarbet - It's fixed! - Thank you! I look forward to a release! And if I could be selfish, I would love a 4.12.x release, instead of waiting for 4.13.0, especially if it requires more testing/merges.

https://dev.azure.com/modernuo/modernuo/_build/results?buildId=6474&view=logs&j=3e8797c7-5b0a-5f2c-07a4-1bc5e60a122e&t=dd40485f-f45f-5a73-1682-431718955049

Not sure if this fix would be related to MSBuild itself also hanging indefinitely on 1 vCPU machines. If this change will be included in the MSBuild binary, let me know and I can arrange to test that after the holidays.

@dibarbet
Member

dibarbet commented Dec 6, 2024

Great to hear!

And if I could be selfish, I would love a 4.12.x release, instead of waiting for 4.13.0, especially if it requires more testing/merges.

Unfortunately this is unlikely to meet the bar for servicing the 4.12 packages - it will have to go out with the normal 4.13 versions. However if you're happy with how the version I linked above is working, you can continue to use it. It's an official Roslyn build, just not the final one published to nuget.org.

Not sure if this fix would be related to msbuild itself also hanging indefinitely on 1 vCPU machines. If this change will be included in the msbuild binary, let me know and I can arrange testing that for after the holidays.

If MSBuild itself is hanging in a regular build this won't help. This only affects loading projects/slns with Roslyn's MSBuildWorkspace.

@jasonmalinowski
Member

@dibarbet should we close this now or do we still have something more to do?

@dibarbet
Member

dibarbet commented Jan 2, 2025

Closing - I believe the only thing left is for this to flow into real packages (17.13) which should happen soon with the regular flow.

dibarbet closed this as completed Jan 2, 2025