workload repair can recover from corrupt workload sets#52434
workload repair can recover from corrupt workload sets#52434nagilson merged 24 commits intodotnet:release/10.0.1xxfrom
workload repair can recover from corrupt workload sets#52434Conversation
Plan: (this commit is step 1 of the plan but idk if it works completely yet) - Workload Manifest Reader will always fail with `ManifestFromWorkloadSetNotFound` when calling GetManifests. I considered removing the workload set entirely during workload set search if the manifests didn't exist and it was corrupt but this would prevent the repair from working and still require user interaction. Removing the error is not sufficient because then we never recover from the bad state or detect the bad state. So, 1. Add a function to repair so it can repair missing manifests, because before it called ReinstallWorkloadsBasedOnCurrentManifests which calls install workloads, which calls install packs, which needs to read the manifests to know the packs, so repair would fail. 2. Once repair is able to fix the issue, resolve the other commands to be able to fix the issue by A - detecting corrupt issue, and then calling the repair subcomponent to install missing manifests 3. Make sure repair isn't cyclical and doesn't try to call the sub repair option (should be ok, because we don't get to the missing manifests code until we've already repaired the manifests) 4. Add boolean for manifest reader or whatever to print during info, version or list that there was a corrupt workload set likely from pkg manager but still allow repairing it (or decide if we don't want to do that.)
workload restore is not setup for the injection so didn't add it there but the code is shared between the two so testing like this should be decent
…ommand so it can be shared
…rkload command so it can be shared" This reverts commit a3959ba.
… also fixed this prevents needing to inherit or modify many command files at once
workload history recorder tries to get manifests and it does this before file baesd installer is created.
…b.com/nagilson/sdk into nagilson-recover-manifests-on-install
…ble to repair the manifests - the repair family commands should be able repair
the install state is empty on disk - this is not safe to rely on. Also if we refresh manifests we will create a dupe manifest
src/Cli/dotnet/Commands/Workload/WorkloadManifestCorruptionRepairer.cs
Outdated
Show resolved
Hide resolved
workload x can recover from corrupt workload setsworkload repair can recover from corrupt workload sets
nagilson
left a comment
There was a problem hiding this comment.
There's some polish needed if this isnt urgent:
- The error string should likely contain the workload id and not just the version
- We should try to dedupe the message if possible, despite separate assemblies
- The MissingManifests logic is also implemented in two approaches, which has potential for consolidation
- It may not be required to attach in the factory method and we should reconsider if we can do that at the IInstaller level
- The sdkFeatureBand parameter is unnecessary in the CorruptionRepairer
|
The test failure seems to be due to a change in the new, more specific error message presenting in some cases instead of an existing error message. |
|
I tested in codespaces. I can confirm that if I delete any workload manifest (after installing), I will get errors during --info, workload --info, and build. I can also confirm that repair, restore, and update all fix the problem. The error experiences themselves could probably be cleaned up a bit as they give full exception stacks in 2/3 of those flows (workload --info is the best Ux). But this is sufficient for unblocking the customer. Need to fix the tests and review. |
test/dotnet.Tests/CommandTests/Workload/Install/CorruptWorkloadSetTestHelper.cs
Outdated
Show resolved
Hide resolved
we only searched the manifest root at dotnetDir provided by the installer factory but the resolver can search all roots, so the global root and or the user local root.
|
/xlf |
|
|
|
|
Failures now are due to dotnet-watch |
|
@nagilson can you send a belated email to tactics for approval as this still falls under servicing. 2xx was still open for changes today. |
|
Thanks, will do - We should discuss tomorrow: I believe this was for the March release, not intending for this to go into Feb since it'd need to be in servicing and only for 1.0.2xx. I'm surprised we still do servicing since this isn't in |
|
/backport to release/10.0.2xx |
|
Started backporting to |
|
@nagilson backporting to git am output$ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch
Applying: Workload Repair should detect Corrupt Workload Sets
Applying: improve workload set check
Applying: syntax errors
Applying: Consider different install roots.
Applying: workload update also tested
Applying: move the mini corrupt wkl set repair logic into installing workload command so it can be shared
Using index info to reconstruct a base tree...
M src/Cli/dotnet/Commands/Workload/Install/WorkloadInstallCommand.cs
M src/Cli/dotnet/Commands/Workload/InstallingWorkloadCommand.cs
M src/Cli/dotnet/Commands/Workload/Repair/WorkloadRepairCommand.cs
M src/Cli/dotnet/Commands/Workload/Update/WorkloadUpdateCommand.cs
M test/dotnet.Tests/CommandTests/Workload/Update/GivenDotnetWorkloadUpdate.cs
Falling back to patching base and 3-way merge...
Auto-merging src/Cli/dotnet/Commands/Workload/Install/WorkloadInstallCommand.cs
Auto-merging src/Cli/dotnet/Commands/Workload/InstallingWorkloadCommand.cs
Auto-merging src/Cli/dotnet/Commands/Workload/Repair/WorkloadRepairCommand.cs
CONFLICT (content): Merge conflict in src/Cli/dotnet/Commands/Workload/Repair/WorkloadRepairCommand.cs
Auto-merging src/Cli/dotnet/Commands/Workload/Update/WorkloadUpdateCommand.cs
Auto-merging test/dotnet.Tests/CommandTests/Workload/Update/GivenDotnetWorkloadUpdate.cs
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0006 move the mini corrupt wkl set repair logic into installing workload command so it can be shared
Error: The process '/usr/bin/git' failed with exit code 128 |
Co-authored-by: Marc Paine <marcpop@microsoft.com>
This ports the fixes from PR #52700 (backport of #52434) to main branch, including: - Workload repair can now recover from corrupt workload sets - Added WorkloadManifestCorruptionRepairer to detect and repair missing manifests - Updated WorkloadInstallerFactory to use the repairer - Added proper error messages for corrupt workload sets - Copilot improvements: clarified nullability check, removed extra blank lines, removed unused import Resolves workload corruption issue where manifests are removed by package managers. Co-authored-by: nagilson <23152278+nagilson@users.noreply.github.com>

context
for #52090
copilot failed attempt #52338
Issue: The Workload Manifest Reader can fail with
ManifestFromWorkloadSetNotFoundwhen callingGetManifestswhen package managers clean up manifests from older sdk versions that were still needed by the newer version workloads.❗ This PR is better reviewed with hiding whitespace changes.
Manual Testing Loop Demo
What's this do
I considered removing the workload set entirely during workload manifest search if the manifests didn't exist and it was corrupt but this would prevent the repair from working and still require user interaction.
Removing the error is not sufficient because then we never recover from the bad state or detect the bad state.
So, now we have a separate class which tries to repair missing manifests before we start the manifest lookup. The errors for missing manifests are disabled in the recorder because the recorder will fail to initialize if the manifests are corrupted and it doesn't have the data available to repair or recover from that state. We do the sub-manifest repair because the manifests need to be in a good state for the rest of repair to have the correct data to function. The repairer being hooked into the Installer means commands such as repair will get the hook to do so, but commands such as
--infonever initialize an installer through the factory, so they instead print a prettier error.Notes
I tested and confirmed:
repair, restore, update are fixed.
--info, workload --info, workload list show the small error.