Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build-server shutdown causes issues in the VMR orchestrator #4175

Open
ViktorHofer opened this issue Feb 27, 2024 · 21 comments
Open

build-server shutdown causes issues in the VMR orchestrator #4175

ViktorHofer opened this issue Feb 27, 2024 · 21 comments

Comments

@ViktorHofer
Copy link
Member

ViktorHofer commented Feb 27, 2024

We use the build-server shutdown CLI command in the VMR orchestrator after each repo build so that we can clean the repository's artifacts (including the package cache). There are two issues with that appraoch:

  1. dotnet build-server shutdown will only shutdown the compiler server from its own install: dotnet.exe keeps running in background sdk#10358 (comment). Repositories that use the toolset compiler package can't be cleaned-up correctly as the compiler assemblies in the package cache are locked.
  2. When enabling parallel build in the VMR the build-server shutdown execs could cause issues, hopefully just perf and not stability.

Is there a better way to handle this without killing all build-servers and only for the process tree spawned per repository? If not, should we disable shared compilation completely? What if we would disallow toolset compiler packages entirely and require the VMR to build with the compilers within the SDK? While that might me possible at some point, there are times when i.e. runtime depends on a newer language feature than what's in a given SDK.

I meanwhile worked around is for razor which was hitting this issue by disabling the compiler toolset package: dotnet/installer#18776

cc @jaredpar @mmitche @dotnet/source-build-internal

@dotnet-issue-labeler dotnet-issue-labeler bot added area-build Improvements in source-build's own build process untriaged labels Feb 27, 2024
@mthalman
Copy link
Member

Here's some history on the use of build-server shutdown: dotnet/installer#15488

@mmitche
Copy link
Member

mmitche commented Feb 27, 2024

Repositories that use the toolset compiler package can't be cleaned-up correctly as the compiler assemblies in the package cache are locked.

This is surprising. In source-only modes it's pretty important to not use the toolset package to avoid prebuilts. I generally think we should not allow use of the toolset package in all VMR modes. Where were you seeing that?

@mmitche
Copy link
Member

mmitche commented Feb 27, 2024

I think the right thing to do is probably propagate the remotes from the outer->inner

@jaredpar
Copy link
Member

jaredpar commented Feb 27, 2024

Soapbox Start
I honestly think that the build-server shutdown approach is the wrong direction overall. What we actually desire is a dotnet build command which, upon completion, terminates all of its child processes. This type of build end event that can be used for tasks to clean up and shut down gracefully. The build-server shutdown command is instead more of a dotnet build didn't do the right thing so lets clean up after it.

Imagine for a second if we didn't need to have a bunch of places in our infra where we called dotnet build-server shutdown , kill VBCSCompiler, plumbing through -nr:false or ExitWithExitCode (then hours plumbing $ci and $prepareMachine through the build scripts and YML files). What if instead dotnet build was a self contained event?
Soapbox End

I'm not sure what the design of build-server shutdown is anymore (been a while since we implemented it). If someone can point to the docs we can see why it doesn't work for the toolset compiler.

@ViktorHofer
Copy link
Member Author

This is surprising. In source-only modes it's pretty important to not use the toolset package to avoid prebuilts. I generally think we should not allow use of the toolset package in all VMR modes. Where were you seeing that?

I.e. when enabling razor: dotnet/installer#18776. See the conversation in there and the builds that I queued. At the end I just disabled the toolset package as I wasn't able to kill the actual process that still had a lock on the toolset compiler assemblies.

I'm not sure what the design of build-server shutdown is anymore (been a while since we implemented it). If someone can point to the docs we can see why it doesn't work for the toolset compiler.

@marcpopMSFT would you know who owns the build-server CLI frontend and the protocol to the compiler servers?

@jaredpar
Copy link
Member

@baronfel may know where it is too.

@baronfel
Copy link
Member

baronfel commented Feb 27, 2024

There's not really a unifying protocol for the different build servers - there's just an interface with a void Shutdown() method that we have a few different implementations of

There's not a ton of commonality.

  • MSBuild delegates the shutdown entirely to the engine loaded from the current SDK.
  • Roslyn calls -shutdown on the VBCSCompiler.dll bundled in the SDK.
  • Razor is the one with the most sophisticated process:
    • There is a list of files in the user's home directory that track PIDs and dll paths for existing Razor build servers
    • the shutdown command reads data from all of them and shuts down each with a call to the shutdown command to each DLL

SDK generally owns the frontend here, though I expect different teams have had the source ideas about how to shut down their own servers.

@ViktorHofer
Copy link
Member Author

Even if we would teach Roslyn's shutdown implementation to kill PIDs based on files on disk (what razor does), we would still shutdown all of the repo build servers, right?

For the VMR, we are building multiple repositories in parallel with a single SDK (single dotnet) and only want to terminate the running PIDs that were triggered for the individual repo build.

Example: roslyn and runtime build in parallel and when roslyn finishes, we only want to kill the PIDs that were opened for the roslyn repo build and not the ones currently running for the runtime repo build. Is that somehow possible today with build-server? If not, should we disable shared compilation completely? Alternatively, what if we just look for the open handles and kill the corresponding PIDs ourselves?

@baronfel
Copy link
Member

baronfel commented Feb 27, 2024

Today, unless Roslyn and Runtime were using different SDKs (different layouts on disk) then we have no way to shutdown just the servers for each individual repo using the build-server shutdown command. If this is a recurring problem we could design some extensions to build-server shutdown that would accept some discriminator for the PIDs/servers and let you specify a set to terminate I suppose, but we haven't really thought about this space in detail for a while so we don't have any designs ready-to-go.

@agocke
Copy link
Member

agocke commented Feb 27, 2024

When enabling parallel build in the VMR the build-server shutdown execs could cause issues, hopefully just perf and not stability.

Roslyn's version is perfectly safe. You can kill it an any time and it will only impact perf.

@agocke
Copy link
Member

agocke commented Feb 27, 2024

Is there a better way to handle this without killing all build-servers and only for the process tree spawned per repository?

I think the long-term solution is that we shouldn't have to be cleaning up extra artifacts. The VMR should be reasonably efficient once we get used to building the product in one big run. I don't see the need to do intermediate cleanup.

@jaredpar
Copy link
Member

There is a list of files in the user's home directory that track PIDs and dll paths for existing Razor build servers

Was PID recycling considered in this design?

MSBuild delegates the shutdown entirely to the engine loaded from the current SDK.

Is there a way that tasks can hook that engine? Like can I register an IBuildServerOwner factory where I can control the shutdown in the same code that launched the compiler server?

@MichaelSimons
Copy link
Member

Is there a better way to handle this without killing all build-servers and only for the process tree spawned per repository?

I think the long-term solution is that we shouldn't have to be cleaning up extra artifacts. The VMR should be reasonably efficient once we get used to building the product in one big run. I don't see the need to do intermediate cleanup.

IDK, I see this as more of a repo level problem, there is a lot of content in the individual repo artifacts directories (e.g. 34GB in a linux Source build). Distro maintainers were hitting size limits in their build systems which is what prompted the changes to optionally clean after each repo build.

@agocke
Copy link
Member

agocke commented Feb 27, 2024

As a long-term goal we probably want to build the VMR as fast as possible and generating extraneous artifacts in the repo builds probably impedes the perf goals. I expect we should get rid of that stuff through natural optimization.

@rainersigwald
Copy link
Member

MSBuild delegates the shutdown entirely to the engine loaded from the current SDK.

Is there a way that tasks can hook that engine? Like can I register an IBuildServerOwner factory where I can control the shutdown in the same code that launched the compiler server?

No but I think it's reasonable to want and build this.

@ViktorHofer
Copy link
Member Author

I honestly think that the build-server shutdown approach is the wrong direction overall. What we actually desire is a dotnet build command which, upon completion, terminates all of its child processes. This type of build end event that can be used for tasks to clean up and shut down gracefully. The build-server shutdown command is instead more of a dotnet build didn't do the right thing so lets clean up after it.

Here's another recent incident that highlights why we really need this.

When building the product from source, analyzer assemblies that target the compiler API either use the N-1 or live version of it. N-1 when the repo doesn't depend on the roslyn repo being built first and live when it does. That means that the compiler used during the build (which loads these live built analyzer assemblies) must match. Here's what happens when it doesn't:

CSC : error CS9057: The analyzer assembly '/vmr/src/sdk/artifacts/sb/package-cache/microsoft.aspnetcore.analyzers/9.0.0-preview.4.24177.3/analyzers/dotnet/cs/Microsoft.AspNetCore.Analyzers.dll' references version '4.11.0.0' of the compiler, which is newer than the currently running version '4.10.0.0'. [/vmr/src/sdk/src/WebSdk/Web/Tasks/Microsoft.NET.Sdk.Web.Tasks.csproj::TargetFramework=net9.0]

from dotnet/installer#19221

Source-build doesn't enforce the use of the N-1 / live compiler toolset package which results in the nonmatch. If a repository doesn't explicitly use the compiler toolset package, the compiler from the SDK is used. Whenever the compiler API revs its major/minor version and the compiler toolset package isn't used, the build starts to fail.

If source-build would now starts enforcing the compiler toolset package then the current build-server shutdown approach in the VMR would stop working: https://github.com/dotnet/dotnet/blob/73e529885113639d84f71652c93a9acacb808719/repo-projects/Directory.Build.targets#L521

I wonder how difficult it would be to design the protocol to achieve what Jared mentioned above.

@jaredpar
Copy link
Member

I'm very open to ideas on how to solve this. Spent a bit of time trying to reason this out but can't find a way that is sufficiently robust to make me want to use it.

The basic problem is that we need a way to identify either the VBCSCompiler processes or the location they were launched from. Ideally the processes though cause killing those is more robust. The ideas I've considered are the following:

  1. Have every server write a key to a place on disk. There are multiple keys I could use (named pipe name, location, etc ...). The build-server shutdown could then use that to kill the launched processes. Problems:
    a. Ask three runtime devs where you can validly write a file on Unix and you will get four different answers
    b. How do you avoid polluting the user file system over time?
  2. Have msbuild implement a "build start / stop" event and just kill VBCSCompiler on build stop. Basically make it a one shot deal.

Think there is potential for (1) to be a solution but can't get any kind of consensus on where a valid location for writing files is. Also it would only be a solution going forward.

@akoeplinger
Copy link
Member

akoeplinger commented Mar 28, 2024

Have every server write a key to a place on disk

This is actually what the razor server does already: https://github.com/dotnet/sdk/blob/main/src/Cli/dotnet/BuildServer/RazorServer.cs

It uses a path in the user profile: https://github.com/dotnet/sdk/blob/main/src/Cli/dotnet/BuildServer/BuildServerProvider.cs#L72

edit sorry, just realized this was already mentioned earlier in the thread 😄

@baronfel
Copy link
Member

One semi-serious proposal for this - to avoid bloating use a sqlite db for the data?

@jaredpar
Copy link
Member

This is actually what the razor server does already:

They also write PIDs which means it's subject to PID recycle issues.

It uses a path in the user profile

Understood but when you ask runtime team if this is the right decision you get my "ask three people and you get four answers" problem.

@akoeplinger
Copy link
Member

Right, I just wonder given the Razor precedent whether that solution is "good enough" to make progress even though it doesn't cover every obscure setup. There is an env var to override the PID file location as an escape hatch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

No branches or pull requests

9 participants