[Bug]: Restore corrupts NuGet cache #12047

Closed
manfred-brands opened this issue Aug 24, 2022 · 9 comments

@manfred-brands

NuGet Product Used

dotnet.exe

Product Version

dotnet 6.0.400

Worked before?

Pre dotnet 5.0.0?

Impact

It's more difficult to complete my work

Repro Steps & Context

We have seen several instances of "corrupt" NuGet caches, possibly caused by parallel dotnet builds that require different versions of the same NuGet package.

Say one build requires version 1.0.0 and another 2.0.0 of a nuget package.
1.0.0 was previously restored correctly.
We then build a project requiring version 2.0.0, which triggers a fresh restore of the new package version.
That package is restored correctly, but sometimes one of the 2.0.0 DLLs, e.g. the one under lib/net48, somehow makes its way into the 1.0.0 lib/net48 folder in the NuGet cache.
If these versions are not compatible, we get compile errors the next time we compile a project that requires 1.0.0.
If they are compatible it might work by stealth and nobody will notice.
We did not see this previously, but now that we strong name our nuget packages, we see unexpected binding redirects.

We use central package management

Verbose Logs

No response

@zivkan
Member

zivkan commented Aug 27, 2022

@manfred-brands do your builds do anything like assembly signing (of 3rd party assemblies, not just the assembly being compiled), or something else that may write/overwrite files?

Alternatively, are your CI machines "stateful" (not wiped between builds)? NuGet is designed around the principle of immutable packages. This means that if you re-use a package version at any point while developing your packages, any machine that already has the old contents will not refresh them (unless someone explicitly deletes the old package contents from the machine). For example, if your project changed its version to 2.0.0 while still in development (rather than using 2.0.0-prerelease.* dynamic versions) and only later changed its APIs (from the 1.0.0 APIs to the 2.0.0 APIs), then any machine that cached the 2.0.0 package version while it still had the 1.0.0 APIs will have the issue you described.

Something else that might help to investigate is if you can check the *.nuspec file in the package version's root directory, to see if it's only the dll under lib\net48 that is incorrect, or if the entire version directory (including nuspec) is in the wrong folder.

I haven't seen any other customers reporting similar issues, and I'm not sure what "mechanism" could theoretically cause NuGet to write a package's contents to the wrong folder. Our code could really use a refactor to reduce duplication (I think one method is used for packages.config, the other duplicate is used by PackageReference), but picking one at random (the first one), PackageExtractor.InstallPackageFromSourceAsync is what extracts downloaded nupkgs into the global packages folder. It calculates the target path using a helper function, and then passes that to the method that extracts all the files. I just don't see how it's possible for this code to accidentally write package 2.0.0 contents into the 1.0.0 folder.

Even if the nuget server provided a nupkg whose nuspec says version 1.0.0, even though NuGet asked to download version 2.0.0, I believe NuGet will use the version it thinks it downloaded (2.0.0), not the version from the nuspec. If I'm wrong about that, that points to an error on the nuget server, not the client. Although there could be an opportunity here for the client to validate and report a friendly error message, rather than silently doing something unexpected.

In short, without being able to reproduce the error, or having more information, I'm not sure what actions we can take.

Perhaps something you could do is to 1. ensure that your build scripts do a restore separately from build, and then 2. write a program that runs after restore, but before build. What the program should validate depends on exactly what's going wrong on your machines. For example, if the nuspec shows that the package was extracted to entirely the wrong directory (directory is 1.0.0, but nuspec says 2.0.0), then the program could search all nuspec files under the global packages directory, parse the package id and version out of the nuspec, and compare to the full directory path they were found in. However, if the nuspec is correct, but only the contents of a lib/net48/*.dll is wrong, then your program might need to somehow compare the lib/ file to the contents of the *.nupkg file in the package root. Start simple and compare file length and/or date (zip does not contain timezones, and NuGet is supposed to assume it's UTC, so you'll need to adjust the timestamp for your machine's timezone).
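The first check suggested above (does each nuspec agree with the directory it sits in?) could be sketched roughly as follows. This is a minimal sketch, not an official tool. It assumes the usual global packages layout of `<id>/<version>/<id>.nuspec`, uses the `NUGET_PACKAGES` environment variable or `~/.nuget/packages` as the root, and only approximates NuGet's version normalization. The function names are made up for illustration.

```python
import os
import xml.etree.ElementTree as ET
from pathlib import Path


def default_packages_root() -> Path:
    """Global packages folder: NUGET_PACKAGES if set, else ~/.nuget/packages."""
    return Path(os.environ.get("NUGET_PACKAGES") or Path.home() / ".nuget" / "packages")


def local_name(tag: str) -> str:
    """Strip the XML namespace; nuspec files use several schema versions."""
    return tag.rsplit("}", 1)[-1]


def read_identity(nuspec: Path) -> tuple[str, str]:
    """Return (id, version) from the nuspec's <metadata> element."""
    for child in ET.parse(nuspec).getroot():
        if local_name(child.tag) == "metadata":
            fields = {local_name(e.tag): (e.text or "").strip() for e in child}
            return fields.get("id", ""), fields.get("version", "")
    return "", ""


def normalize(version: str) -> str:
    """Approximation (an assumption) of NuGet's normalized, lowercased version."""
    version = version.split("+", 1)[0].lower()  # drop build metadata
    core, sep, prerelease = version.partition("-")
    parts = core.split(".")
    if not all(p.isdigit() for p in parts):
        return version
    parts = [str(int(p)) for p in parts]  # drop leading zeros
    parts += ["0"] * (3 - len(parts))  # 1.0 -> 1.0.0
    if len(parts) == 4 and parts[3] == "0":
        parts.pop()  # 1.0.0.0 -> 1.0.0
    return ".".join(parts) + sep + prerelease


def find_mismatches(root: Path) -> list[tuple[Path, str, str]]:
    """List (directory, nuspec id, nuspec version) where nuspec and path disagree."""
    mismatches = []
    for nuspec in sorted(root.glob("*/*/*.nuspec")):
        version_dir = nuspec.parent
        pkg_id, version = read_identity(nuspec)
        if (pkg_id.lower() != version_dir.parent.name.lower()
                or normalize(version) != version_dir.name.lower()):
            mismatches.append((version_dir, pkg_id, version))
    return mismatches


if __name__ == "__main__":
    for directory, pkg_id, version in find_mismatches(default_packages_root()):
        print(f"{directory}: nuspec says {pkg_id} {version}")
```

Running this after restore and before build would flag a package extracted into an entirely wrong directory. It would not catch a single wrong file inside an otherwise correct directory.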

@zivkan zivkan added the WaitingForCustomer Applied when a NuGet triage person needs more info from the OP label Aug 27, 2022
@manfred-brands
Author

@zivkan Thanks for your reply.

Our builds are about 500 projects targeting .NET Framework 4.8. We use dotnet build, which builds several projects in parallel. All outputs go to a single Binaries folder. Unfortunately, this means that 3rd party DLLs are copied to that directory multiple times, and occasionally we see MSBuild retries on those copies. We also use Central Package Management, with the PackageVersion items in a single Directory.Packages.props and PackageReference items in all .csproj files.

We have seen this phenomenon more on developer machines. One of the projects gets changed and we either do a dotnet rebuild or a build in VS2022 (with the solution partially loaded, so only a few projects). As a total rebuild takes up to 7 minutes, developers sometimes switch to a second directory and do some work there, e.g. two branches of the same repository checked out to different directories. Those builds are independent, but share the same NuGet cache.

> Something else that might help to investigate is if you can check the *.nuspec file in the package version's root directory, to see if it's only the dll under lib\net48 that is incorrect, or if the entire version directory (including nuspec) is in the wrong folder.

All files except the .dll are from the correct version, this includes the .xml file in the lib\net48 folder.

I also looked at the nuget source code and couldn't find anything there.
Maybe it is not nuget, but msbuild copying in the wrong direction.

I will develop that NuGet cache verifier tool. At least when the issue occurs it will find the offending dll straight away.
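Since only the .dll differs while the nuspec and .xml are correct, a verifier needs to compare each extracted file with the matching entry in the .nupkg that sits in the package's version directory. The sketch below is one possible shape for that check, not the tool actually written for this issue. It compares size and CRC-32 (a stricter check than the length and date comparison suggested earlier), assumes the .nupkg is still present next to the extracted files, and skips entries that were never extracted. `find_corrupt_files` is a hypothetical name.

```python
import zipfile
import zlib
from pathlib import Path
from urllib.parse import unquote


def crc32_of(path: Path, chunk_size: int = 1 << 20) -> int:
    """CRC-32 of a file on disk, read in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF


def find_corrupt_files(version_dir: Path) -> list[str]:
    """Return package-relative paths whose on-disk content differs from the nupkg."""
    corrupt = []
    for nupkg in version_dir.glob("*.nupkg"):
        with zipfile.ZipFile(nupkg) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                # Entry names in a nupkg may be percent-encoded.
                relative = unquote(info.filename)
                target = version_dir / relative
                if not target.is_file():
                    continue  # not extracted (e.g. package metadata); skipped here
                if (target.stat().st_size != info.file_size
                        or crc32_of(target) != info.CRC):
                    corrupt.append(relative)
    return corrupt
```

Calling `find_corrupt_files` on every `<id>/<version>` directory under the global packages folder would point at the offending DLL directly.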

@ghost ghost added WaitingForClientTeam Customer replied, needs attention from client team. Do not apply this label manually. and removed WaitingForCustomer Applied when a NuGet triage person needs more info from the OP labels Aug 29, 2022
@zivkan
Member

zivkan commented Aug 29, 2022

> We use dotnet build which build several projects in parallel.

dotnet build does a restore first, unless you also provide --no-restore. It sounds to me like you do a single dotnet build to restore & build your repo, which is ideal from NuGet's perspective. Some customers think they can speed up their builds by implementing their own parallel build solution, but NuGet does not support two different processes restoring the same project at the same time, so they often experience problems (usually writes to the obj\ folder fail because another process has the same file open). Furthermore, NuGet is faster in solution restore, since there are more opportunities for in-memory caching and efficient blocking when a single process restores the entire solution. So, just in case you're running one restore per project, I strongly suggest that you do not.

> Maybe it is not nuget, but msbuild copying in the wrong direction.

In case you're not aware already, MSBuild (and therefore commands like dotnet build, dotnet restore, dotnet publish) support a -bl argument to write a binary log that can be read using https://msbuildlog.com. You can try doing a full, clean build with -bl, then open the msbuild.binlog file and search for anywhere the global packages folder path is used as an output, rather than an input.

@manfred-brands
Author

I created the tool and found several "misplaced" files, which the tool then repaired by extracting the correct files from the .nupkg. I will keep an eye on it to see when the problem reappears.
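The repair step described here, re-extracting a known-bad file from the .nupkg, might look something like this. It is a sketch under assumptions, not the actual tool: `restore_from_nupkg` is a hypothetical helper, and the .nupkg is assumed to still be in the version directory. The file is written to a temporary name and then swapped in with `os.replace`, so a build running at the same time does not see a half-written DLL.

```python
import os
import tempfile
import zipfile
from pathlib import Path
from urllib.parse import unquote


def restore_from_nupkg(version_dir: Path, relative_path: str) -> bool:
    """Re-extract one package-relative file from the nupkg; True if restored."""
    for nupkg in version_dir.glob("*.nupkg"):
        with zipfile.ZipFile(nupkg) as zf:
            for info in zf.infolist():
                if unquote(info.filename) != relative_path:
                    continue
                target = version_dir / relative_path
                target.parent.mkdir(parents=True, exist_ok=True)
                # Write next to the target, then replace it in one step.
                fd, tmp_name = tempfile.mkstemp(dir=target.parent)
                with os.fdopen(fd, "wb") as out:
                    out.write(zf.read(info))
                os.replace(tmp_name, target)
                return True
    return False
```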

@erdembayar
Contributor

> I created the tool and found several "misplaced" files, which the tool then repaired by extracting the correct files from the .nupkg. I will keep an eye on it to see when the problem reappears.

It looks like the issue hasn't reproduced for about 3 weeks, so I'm closing it for now, considering that no other customer has reported the same problem.
But definitely let us know if you can still reproduce this problem on the latest version of NuGet. Thank you for your feedback and happy coding!

@ghost ghost removed the WaitingForClientTeam Customer replied, needs attention from client team. Do not apply this label manually. label Oct 4, 2022
@marcin-krystianc

@manfred-brands Do you use hard links (i.e.: /p:CreateHardLinksForCopyLocalIfPossible=true /p:CreateHardLinksForCopyFilesToOutputDirectoryIfPossible=true /p:CreateHardLinksForCopyAdditionalFilesIfPossible=true /p:CreateHardLinksForPublishFilesIfPossible=true) by any chance? I'm investigating a similar problem in our company. We also see sporadic NuGet cache corruption ("misplaced" files), but it is almost certainly caused by using hard links.

@manfred-brands
Author

@marcin-krystianc Yes we do use hard-links.
I haven't been able to track it down to specific situations, but I created a tool and it regularly reports NuGet cache corruption.
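To see which cache files are exposed when hard links are enabled, one option is to list the files under the global packages folder that have more than one directory entry. This is only a rough sketch: `hard_linked_files` is a hypothetical name, and a link count above one says the file is shared with some other path, not which path.

```python
from pathlib import Path


def hard_linked_files(root: Path) -> list[Path]:
    """Files under root that are reachable through more than one hard link."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and not p.is_symlink() and p.stat().st_nlink > 1
    )
```

Any file this returns can be modified through its other name, for example from a build output directory.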

@marcin-krystianc

FYI: I discovered that it is not a problem with NuGet itself. It is a problem with MSBuild and the use of hard or symbolic links. I've opened dotnet/msbuild#8273 with a detailed description.
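For readers unfamiliar with why hard links can lead to "misplaced" files in the cache, the following sketch shows the general property involved. It does not reproduce the exact MSBuild code path, which is described in dotnet/msbuild#8273. The file names and the `demonstrate` function are invented for illustration: a file in an output directory is hard-linked to a cached file, and an in-place overwrite of the output file silently changes the cached file too.

```python
import os
from pathlib import Path


def demonstrate(workdir: Path) -> tuple[bytes, bytes]:
    """Return the cached file's content after an in-place write and after unlink-then-write."""
    cached = workdir / "cache" / "1.0.0" / "My.Lib.dll"
    output = workdir / "bin" / "My.Lib.dll"
    cached.parent.mkdir(parents=True)
    output.parent.mkdir(parents=True)

    cached.write_bytes(b"1.0.0")
    os.link(cached, output)  # both names now refer to the same file data

    # Overwriting the output in place writes through the link into the cache.
    output.write_bytes(b"2.0.0")
    after_in_place = cached.read_bytes()

    # Reset the shared content, then delete the output name before writing.
    cached.write_bytes(b"1.0.0")
    output.unlink()  # removes only the output name; the link is broken
    output.write_bytes(b"2.0.0")
    after_unlink = cached.read_bytes()

    return after_in_place, after_unlink


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        print(demonstrate(Path(tmp)))  # (b'2.0.0', b'1.0.0')
```

The first value shows the cache entry for 1.0.0 now holding 2.0.0 content, which matches the symptom reported in this issue. The second shows that removing the destination before writing leaves the cache untouched.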

@manfred-brands
Author

@marcin-krystianc Thanks for finding the real cause.
