Skip to content

Comments

workload repair can recover from corrupt workload sets#52434

Merged
nagilson merged 24 commits intodotnet:release/10.0.1xxfrom
nagilson:nagilson-recover-manifests-on-install
Jan 27, 2026
Merged

workload repair can recover from corrupt workload sets#52434
nagilson merged 24 commits intodotnet:release/10.0.1xxfrom
nagilson:nagilson-recover-manifests-on-install

Conversation

@nagilson
Copy link
Member

@nagilson nagilson commented Jan 13, 2026

context

for #52090
copilot failed attempt #52338

Issue: The Workload Manifest Reader can fail with ManifestFromWorkloadSetNotFound when calling GetManifests when package managers clean up manifests from older sdk versions that were still needed by the newer version workloads.

❗ This PR is better reviewed with hiding whitespace changes.

Manual Testing Loop Demo

./artifacts/bin/redist/Debug/dotnet/dotnet exec ./artifacts/bin/redist/Debug/dotnet/sdk/10.0.103-dev/dotnet.dll --debug workload install macos
rm -rf  /workspaces/sdk2/sdk/artifacts/bin/redist/Debug/dotnet/sdk-manifests/10.0.100/microsoft.net.workload.emscripten.current/10.0.102/

./artifacts/bin/redist/Debug/dotnet/dotnet exec ./artifacts/bin/redist/Debug/dotnet/sdk/10.0.103-dev/dotnet.dll --debug workload --info

./artifacts/bin/redist/Debug/dotnet/dotnet exec ./artifacts/bin/redist/Debug/dotnet/sdk/10.0.103-dev/dotnet.dll --debug workload repair
image image

What's this do

I considered removing the workload set entirely during workload manifest search if the manifests didn't exist and it was corrupt but this would prevent the repair from working and still require user interaction.

Removing the error is not sufficient because then we never recover from the bad state or detect the bad state.

So, now we have a separate class which tries to repair missing manifests before we start the manifest lookup. The errors for missing manifests are disabled in the recorder because the recorder will fail to initialize if the manifests are corrupted and it doesn't have the data available to repair or recover from that state. We do the sub-manifest repair because the manifests need to be in a good state for the rest of repair to have the correct data to function. The repairer being hooked into the Installer means commands such as repair will get the hook to do so, but commands such as --info never initialize an installer through the factory, so they instead print a prettier error.

Notes

I tested and confirmed:
repair, restore, update are fixed.
--info, workload --info, workload list show the small error.

Plan: (this commit is step 1 of the plan but idk if it works completely yet)

- Workload Manifest Reader will always fail with `ManifestFromWorkloadSetNotFound` when calling GetManifests.

I considered removing the workload set entirely during workload set search if the manifests didn't exist and it was corrupt but this would prevent the repair from working and still require user interaction.

Removing the error is not sufficient because then we never recover from the bad state or detect the bad state.

So,

1. Add a function to repair so it can repair missing manifests, because before it called ReinstallWorkloadsBasedOnCurrentManifests which calls install workloads, which calls install packs, which needs to read the manifests to know the packs, so repair would fail.

2. Once repair is able to fix the issue, resolve the other commands to be able to fix the issue by A - detecting corrupt issue, and then calling the repair subcomponent to install missing manifests

3. Make sure repair isn't cyclical and doesn't try to call the sub repair option (should be ok, because we don't get to the missing manifests code until we've already repaired the manifests)

4. Add boolean for manifest reader or whatever to print during info, version or list that there was a corrupt workload set  likely from pkg manager but still allow repairing it (or decide if we don't want to do that.)
workload restore is not setup for the injection so didn't add it there but the code is shared between the two so testing like this should be decent
…rkload command so it can be shared"

This reverts commit a3959ba.
… also fixed

this prevents needing to inherit or modify many command files at once
workload history recorder tries to get manifests and it does this before file baesd installer is created.
…ble to repair the manifests - the repair family commands should be able repair
the install state is empty on disk - this is not safe to rely on. Also if we refresh manifests we will create a dupe manifest
@nagilson nagilson changed the title workload x can recover from corrupt workload sets workload repair can recover from corrupt workload sets Jan 15, 2026
@nagilson nagilson marked this pull request as ready for review January 15, 2026 00:51
@nagilson nagilson requested a review from a team January 15, 2026 00:52
Copy link
Member Author

@nagilson nagilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some polish needed if this isnt urgent:

  1. The error string should likely contain the workload id and not just the version
  2. We should try to dedupe the message if possible, despite separate assemblies
  3. The MissingManifests logic is also implemented in two approaches, which has potential for consolidation
  4. It may not be required to attach in the factory method and we should reconsider if we can do that at the IInstaller level
  5. The sdkFeatureBand parameter is unnecessary in the CorruptionRepairer

@nagilson
Copy link
Member Author

The test failure seems to be due to a change in the new, more specific error message presenting in some cases instead of an existing error message.

@marcpopMSFT
Copy link
Member

I tested in codespaces. I can confirm that if I delete any workload manifest (after installing), I will get errors during --info, workload --info, and build. I can also confirm that repair, restore, and update all fix the problem.

The error experiences themselves could probably be cleaned up a bit as they give full exception stacks in 2/3 of those flows (workload --info is the best Ux). But this is sufficient for unblocking the customer. Need to fix the tests and review.

we only searched the manifest  root at dotnetDir provided by the installer factory  but the resolver can search all roots, so the global root and or the user local root.
@nagilson
Copy link
Member Author

/xlf

@github-actions
Copy link
Contributor

▶️ XLF update workflow started. Track progress in this workflow run.

@nagilson
Copy link
Member Author

nagilson commented Jan 21, 2026

There's some polish needed if this isnt urgent:

  1. The error string should likely contain the workload id and not just the version
  2. We should try to dedupe the message if possible, despite separate assemblies
  3. The MissingManifests logic is also implemented in two approaches, which has potential for consolidation
  4. It may not be required to attach in the factory method and we should reconsider if we can do that at the IInstaller level
  5. The sdkFeatureBand parameter is unnecessary in the CorruptionRepairer
  1. there is no id, just the set version. I've improved the err message
  2. not easily done as it'd make a circular reference
  3. fixed and realized one only searched 1 manifest root
  4. not investing the time to do that
  5. fixed

@nagilson
Copy link
Member Author

I was still able to confirm the fix worked for me
image

@nagilson nagilson requested a review from dsplaisted January 21, 2026 23:57
@nagilson
Copy link
Member Author

Failures now are due to dotnet-watch

@nagilson nagilson merged commit 27df5c7 into dotnet:release/10.0.1xx Jan 27, 2026
26 checks passed
@marcpopMSFT
Copy link
Member

@nagilson can you send a belated email to tactics for approval as this still falls under servicing. 2xx was still open for changes today.

@nagilson
Copy link
Member Author

nagilson commented Jan 28, 2026

Thanks, will do - We should discuss tomorrow: I believe this was for the March release, not intending for this to go into Feb since it'd need to be in servicing and only for 1.0.2xx. I'm surprised we still do servicing since this isn't in dotnet/dotnet and march will open soon, though it will impact tests results. I suppose it's because 10 at large is in servicing since it's GA'd. Anyways, will send that out.

@nagilson
Copy link
Member Author

/backport to release/10.0.2xx

@github-actions
Copy link
Contributor

Started backporting to release/10.0.2xx (link to workflow run)

@github-actions
Copy link
Contributor

@nagilson backporting to release/10.0.2xx failed, the patch most likely resulted in conflicts. Please backport manually!

git am output
$ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch

Applying: Workload Repair should detect Corrupt Workload Sets
Applying: improve workload set check
Applying: syntax errors
Applying: Consider different install roots.
Applying: workload update also tested
Applying: move the mini corrupt wkl set repair logic into installing workload command so it can be shared
Using index info to reconstruct a base tree...
M	src/Cli/dotnet/Commands/Workload/Install/WorkloadInstallCommand.cs
M	src/Cli/dotnet/Commands/Workload/InstallingWorkloadCommand.cs
M	src/Cli/dotnet/Commands/Workload/Repair/WorkloadRepairCommand.cs
M	src/Cli/dotnet/Commands/Workload/Update/WorkloadUpdateCommand.cs
M	test/dotnet.Tests/CommandTests/Workload/Update/GivenDotnetWorkloadUpdate.cs
Falling back to patching base and 3-way merge...
Auto-merging src/Cli/dotnet/Commands/Workload/Install/WorkloadInstallCommand.cs
Auto-merging src/Cli/dotnet/Commands/Workload/InstallingWorkloadCommand.cs
Auto-merging src/Cli/dotnet/Commands/Workload/Repair/WorkloadRepairCommand.cs
CONFLICT (content): Merge conflict in src/Cli/dotnet/Commands/Workload/Repair/WorkloadRepairCommand.cs
Auto-merging src/Cli/dotnet/Commands/Workload/Update/WorkloadUpdateCommand.cs
Auto-merging test/dotnet.Tests/CommandTests/Workload/Update/GivenDotnetWorkloadUpdate.cs
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0006 move the mini corrupt wkl set repair logic into installing workload command so it can be shared
Error: The process '/usr/bin/git' failed with exit code 128

Link to workflow output

nagilson added a commit to nagilson/sdk that referenced this pull request Jan 28, 2026
Co-authored-by: Marc Paine <marcpop@microsoft.com>
Copilot AI added a commit that referenced this pull request Jan 30, 2026
This ports the fixes from PR #52700 (backport of #52434) to main branch, including:
- Workload repair can now recover from corrupt workload sets
- Added WorkloadManifestCorruptionRepairer to detect and repair missing manifests
- Updated WorkloadInstallerFactory to use the repairer
- Added proper error messages for corrupt workload sets
- Copilot improvements: clarified nullability check, removed extra blank lines, removed unused import

Resolves workload corruption issue where manifests are removed by package managers.

Co-authored-by: nagilson <23152278+nagilson@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants