MSBuildWorkspace.OpenSolutionAsync hangs indefinitely on 4.9.0+ in Windows 2022 Azure CI/CD #75967
Per conversation on Discord, @dibarbet you said you would try to take a look. Please contact me or let me know how I can help. I haven't had time to set up another Azure Pipelines run just for a minimal reproducible snippet, sorry. |
Been taking a look at this and nothing obvious is standing out. Can you try the following: update the code to pass in the progress parameter to OpenSolutionAsync:

```csharp
public static async Task<IEnumerable<Project>> GetProjectsAsync(string solutionPath)
{
    ...
    var solution = await workspace.OpenSolutionAsync(solutionPath, progress: new Progress());
    ...
}

private class Progress : IProgress<ProjectLoadProgress>
{
    public void Report(ProjectLoadProgress value)
    {
        Console.WriteLine($"{value.Operation} completed for {value.FilePath} ({value.TargetFramework}) in {value.ElapsedTime.TotalMilliseconds}ms");
    }
}
```

That should write per-project load progress to the console.
Beyond that I think we'll have to get a hang dump (probably via procdump). |
I also wasn't able to get a minimal repro. For example, I created this dotnet tool that loads a solution successfully on windows-2022 GitHub Actions images - https://github.com/dibarbet/MSBuildWorkspaceConsole/actions/runs/11923292319/job/33231361423 Based on this link - https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/hosted?view=azure-devops&tabs=yaml - the GitHub image should be the same as the Azure DevOps one. I would guess something related to the specific solution or environment setup is causing an issue, though I'm not sure what that would be. |
I'll do this and see what happens. Regarding a procdump: if you have steps I can throw together for the CI/CD to collect one, let me know. Note that with no change to the code or the solution being opened, v4.8.0 worked fine. EDIT: With the progress change, here is the run: |
So the run shows:
Contrast with locally (or other CI/CD runs like GitHub Actions with Linux/macOS):
|
@kamronbatman interesting note here - I took your .NET 9 PR and ran it against GitHub Actions - it appeared to have no issues there |
Good to know, thanks. I stopped using GitHub Actions because it wasn't part of the unlimited actions for public repos. There was also another issue in the past, but I don't remember what it was. |
@kamronbatman try something similar to this - dibarbet/ModernUO@202ae64 You may have to tweak it a bit to make sure it captures the dump on hang and uploads correctly. It is incredibly mystifying that the GitHub Actions run passes but the AzDo run fails with the exact same image. |
@dibarbet - The tool never exits, so can we put a 5-minute timer on the procdump PS1 and kill the process? Otherwise I don't see how this would work. 🤔 PR changes for the Azure run:
EDIT: I added messages, and it never gets past running the schema generator. We might need to tweak the ps1 so it's non-blocking? |
@kamronbatman procdump should collect a hang dump after the process doesn't respond for 5 seconds. IIRC it should also exit after that (and we only wait on the procdump process exiting). Either procdump doesn't actually exit after it collects a dump, or procdump isn't detecting a hang; how we fix it depends on which one is happening. I've adjusted the script so it should live-write the process output to the console instead of waiting until the end. This should help us tell whether procdump is collecting anything or not - dibarbet/ModernUO@52a27e1 |
Thanks - it looks like procdump isn't detecting a hang and isn't collecting any dumps. Try this instead - it will collect a dump every 5 seconds, up to 5 total. You may have to adjust the timing depending on how long it takes to get to the broken state - dibarbet/ModernUO@14ea3c6 |
@dibarbet I sent you a link to the dumps via discord. |
Unfortunately I hit another snag that I didn't realize beforehand - the processId from launching the dotnet tool belongs to a wrapper around another process that actually runs the tool. There are also a couple more processes involved in the MSBuildWorkspace project loading. The previous dumps only captured the wrapper process, which doesn't have anything really useful in it. Procdump is definitely not very ergonomic to use for this case, but unfortunately I don't think there's anything better 😆 . |
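A minimal sketch of how the real child processes could be located by parent PID on Windows (an assumed approach using the System.Management package, for illustration; the linked dump script may do this differently):

```csharp
using System;
using System.Collections.Generic;
using System.Management; // System.Management NuGet package (Windows only)

static class ChildProcessFinder
{
    // Returns the PIDs of direct children of the given process, so a dump tool
    // can be pointed at the actual tool/build-host processes instead of the wrapper.
    public static IReadOnlyList<int> GetChildProcessIds(int parentPid)
    {
        using var searcher = new ManagementObjectSearcher(
            $"SELECT ProcessId FROM Win32_Process WHERE ParentProcessId = {parentPid}");

        var ids = new List<int>();
        foreach (ManagementBaseObject mo in searcher.Get())
        {
            ids.Add(Convert.ToInt32(mo["ProcessId"]));
        }
        return ids;
    }
}
```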
Thanks @dibarbet, I sent you the dumps. |
@kamronbatman took a look at the dumps this morning. Nothing extremely concrete yet, but I do have some clues.
It's not super clear why it is OOM'ing, however; the dump itself doesn't show a huge amount of memory being used. I need to look at other issues, but will continue this investigation later. |
Ok thanks. It is concerning if the project requires more than 7 or 8 GB to build or analyze. Regular builds also hang indefinitely on Linux with minimal memory and a single vCPU, when they shouldn't. We got around that by simply not building on the VM. |
@dibarbet The unrooted OutOfMemoryException could be a red herring; I think the CLR may preallocate an OutOfMemoryException to throw in the case it needs to, since otherwise if you're out of memory it won't be able to allocate one! That said, the error code still looks extremely suspicious and @kamronbatman's comment implies there's similar symptoms elsewhere. Was that OOM happening in the build host process? We should have code in the host app to deal with sudden disappearances of the build host process but maybe we're losing that somewhere. Might be easy to try to reproduce by adding some logic to the build host just to start having it allocate endless memory when we process a project until it crashes. |
@jasonmalinowski Build host process was showing the out of memory error code writing the project info to the stream. Both processes were still running, just seemingly stuck. Neither the workspace dumps nor the buildhost dumps showed a huge amount of memory buildup, so a bit strange. One idea I was thinking about earlier today was if there was potentially an issue writing the project info contents to the stdout buffer due to a buffer size limit. |
Possibly! I guess anything written to a pipe (or something like stdout) has to go somewhere in memory. Maybe that's not really accounted to the process but rather it's just kernel memory being used up. And that can run out. Does something somewhere track bytes written? |
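A minimal sketch of the pipe-backpressure failure mode being discussed here, assuming a generic parent/child process pair (child.exe is hypothetical); this is not Roslyn's actual code:

```csharp
using System.Diagnostics;

// Classic redirected-stdout backpressure: the child fills the pipe buffer,
// its writes to stdout block, and the parent waits forever for it to exit.
var psi = new ProcessStartInfo("child.exe") // hypothetical child process
{
    RedirectStandardOutput = true,
    UseShellExecute = false,
};

using var child = Process.Start(psi)!;

// Problematic ordering: waiting before draining the stream. Once the child has
// written more than the pipe buffer can hold, it blocks inside its write call
// and never exits, so neither process shows much memory growth -- they just stall.
child.WaitForExit();
string output = child.StandardOutput.ReadToEnd();

// One fix: drain concurrently (ReadToEndAsync / OutputDataReceived) and only
// then wait for exit.
```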
Not sure if it also makes a difference, but it's 2 vCPUs in Azure instead of 4 for GitHub Actions. Regarding the indefinite hang on Linux, I'll try to get more info. What I know for now is that it is also a 2 vCPU machine with 4 GB of RAM, manually set up with Debian and .NET 8 on GCP. Increasing the machine to 4 vCPUs and 8 GB of RAM by choosing another instance type and rebooting resolves the issue. It hangs while building the projects, not while running the tool I made. Also let me know if there is anything else I can do to help move this further along. Hope everyone has a great holiday this week! |
I will have to get back to this next week. I think the next step is to create a similarly spec'd VM to see if I can reproduce the issue outside of azure pipelines. |
@dibarbet What happens if we make the build host just allocate a bunch of memory? Can that force the OOM to happen more easily? |
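A minimal sketch of that experiment, assuming the goal is simply to root allocations in the build host until the runtime throws:

```csharp
using System.Collections.Generic;

// Keep allocating and rooting 100 MB buffers so the GC cannot reclaim them,
// which eventually forces an OutOfMemoryException in this process.
var rooted = new List<byte[]>();
while (true)
{
    rooted.Add(new byte[100 * 1024 * 1024]);
}
```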
Made some progress on this today. It looks like if anything writes to stdout from the build host process, we will see an indefinite hang - this includes even a simple console write. The cause appears to be a JSON serialization exception thrown here - https://github.com/dotnet/roslyn/blob/main/src/Workspaces/Core/MSBuild/Rpc/RpcClient.cs#L52 - which ends up never being observed, so the requests hang forever. @jasonmalinowski I have a draft change here to at least observe the failures - #76252 . However, I think we need to do what we did in the LSP server and switch to named pipes; stdin/stdout is too prone to issues like this. @kamronbatman - I'm fairly confident this is related, but not 100% sure. I'll reply back here once my first changes merge and we have official packages you can use. While it won't fix the hang, it should at least tell us whether there is an erroring stdout message and what it is. |
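Roughly the failure shape being described, as a hedged sketch rather than the actual RpcClient implementation: a read loop deserializes stdout lines, a stray console write makes deserialization throw, the exception goes unobserved, and the pending requests never complete.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

class RpcReadLoopSketch
{
    // Each outstanding request is awaiting one of these completion sources.
    private readonly ConcurrentDictionary<int, TaskCompletionSource<JsonElement>> _pending = new();

    // Hypothetical wire format, for illustration only.
    private sealed record Response(int Id, JsonElement Result);

    public async Task RunAsync(TextReader stdout)
    {
        try
        {
            string? line;
            while ((line = await stdout.ReadLineAsync()) is not null)
            {
                // A stray Console.WriteLine from the other process lands here
                // and makes this throw a JsonException.
                var response = JsonSerializer.Deserialize<Response>(line)!;
                if (_pending.TryRemove(response.Id, out var tcs))
                    tcs.TrySetResult(response.Result);
            }
        }
        catch (Exception ex)
        {
            // Without this catch the exception is never observed, the loop dies,
            // and every caller awaiting a _pending entry hangs forever. Faulting
            // the outstanding requests (and disconnecting) surfaces the failure.
            foreach (var tcs in _pending.Values)
                tcs.TrySetException(ex);
        }
    }
}
```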
At least one step further! Thank you. When I was debugging, I posted this line in Discord, but didn't know if I was chasing a red herring:
So I guess I was close-ish. It's hard to follow the code because of how dynamic it is. |
Agreed, moving to pipes would be a good thing to do. |
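The general shape of a named-pipe transport, as a generic sketch (not the specific change made in Roslyn): the host creates a uniquely named pipe, passes the name to the build host on its command line, and both sides communicate over it so stdout is no longer load-bearing.

```csharp
using System.IO.Pipes;
using System.Threading.Tasks;

static class PipeTransportSketch
{
    // Host process: create a uniquely named pipe, then launch the build host
    // passing the pipe name on its command line (e.g. "--pipe <name>").
    public static async Task<NamedPipeServerStream> CreateServerAsync(string pipeName)
    {
        var server = new NamedPipeServerStream(
            pipeName, PipeDirection.InOut, maxNumberOfServerInstances: 1,
            PipeTransmissionMode.Byte, PipeOptions.Asynchronous);
        await server.WaitForConnectionAsync();
        return server;
    }

    // Build host process: connect back using the name it was handed, leaving
    // stdout free for incidental writes that no longer corrupt the protocol.
    public static async Task<NamedPipeClientStream> ConnectClientAsync(string pipeName)
    {
        var client = new NamedPipeClientStream(
            ".", pipeName, PipeDirection.InOut, PipeOptions.Asynchronous);
        await client.ConnectAsync();
        return client;
    }
}
```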
Related to #75967. Previously, any errors processing messages would just cause an infinite hang. This updates the implementation to do two things: 1. Catch the specific string that failed to be deserialized (if any). 2. If processing failed for any reason, report the exception to the outstanding requests and trigger a disconnect.
@dibarbet @jasonmalinowski Please see #76259 |
@kamronbatman would you be able to try package version 4.13.0-3.24605.1? (You will need to add the dotnet-tools AzDo feed, as these builds are not published to nuget.org.) This should have both the change to observe failures and the named pipe fix. |
I may not have time today, but I can definitely try tomorrow. |
@dibarbet - It's fixed! Thank you! I look forward to a release! And if I could be selfish, I would love a 4.12.x release instead of waiting for 4.13.0, especially if the latter requires more testing/merges. I'm not sure if this fix would also be related to MSBuild itself hanging indefinitely on 1 vCPU machines. If this change will be included in the MSBuild binary, let me know and I can arrange to test that after the holidays. |
Great to hear!
Unfortunately this is unlikely to meet the bar for servicing the 4.12 packages - it will have to go out with the normal 4.13 versions. However if you're happy with how the version I linked above is working, you can continue to use it. It's an official Roslyn build, just not the final one published to nuget.org.
If MSBuild itself is hanging in a regular build this won't help. This only affects loading projects/slns with Roslyn's MSBuildWorkspace. |
@dibarbet should we close this now or do we still have something more to do? |
Closing - I believe the only thing left is for this to flow into real packages (17.13) which should happen soon with the regular flow. |
Version Used:
Steps to Reproduce:
Expected Behavior:
The code should move past OpenSolutionAsync, returning the workspace and finish running the application.
Actual Behavior:
Specifically on Azure Pipelines, latest Windows 2022, with .NET 9:
The code above hangs indefinitely.
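For context, the loader is roughly shaped like this (reconstructed from the snippets quoted earlier in the thread; names and details are approximate, not the exact code from the PR):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Build.Locator;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.MSBuild;

static class SolutionLoader
{
    public static async Task<IEnumerable<Project>> GetProjectsAsync(string solutionPath)
    {
        // Locate the installed MSBuild before any MSBuild types are used.
        MSBuildLocator.RegisterDefaults();

        using var workspace = MSBuildWorkspace.Create();

        // On 4.9.0+ in the Azure Pipelines windows-2022 image, this await never completes.
        var solution = await workspace.OpenSolutionAsync(solutionPath);
        return solution.Projects;
    }
}
```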
Link to the PR in our repo that replicates this:
https://github.com/modernuo/SerializationGenerator/pull/43/files#diff-f2b522070e7bfd744704829ee17ae88ea259383f12f3a45fb2d860a2a0c3e19c
In that PR, I also tried to clean up any possible deadlocks, but nothing changed.
Repo that uses this tool, and its PR:
https://github.com/modernuo/ModernUO/pull/1984/files#diff-7afd3bcf0d0c06d6f87c451ef06b321beef770ece4299b3230bb280470ada2f6
The azure pipelines yaml file we use is here:
https://github.com/modernuo/ModernUO/pull/1984/files#diff-7915b9b726a397ae7ba6af7b9703633d21c031ebf21682f3ee7e6a4ec52837a5
The tool is run through a cmd file as part of the build process:
https://github.com/modernuo/ModernUO/blob/1d08572926ad3e6376fbebb1ec169624f9cfdb74/publish.cmd#L46
I cannot replicate this locally on Windows 11, nor on other operating systems. If I had to guess, it is related to how .NET is installed/path'd on that Docker image.