Default to mmap-based I/O on Windows only #4186
Conversation
On my own machine (a desktop with NVMe storage), nuget.exe restore of NuGet.Client was nearly twice as fast as HEAD~2 of your branch.
This is clearly more nuanced and more investigation is warranted. @zkat is on leave at the moment, but when they're back we'll figure out who should lead an investigation and when we can schedule it.
FWIW, I validated that mmap appears to be much slower on Linux. Using an Azure Standard_D4as_v4, restoring Orleans with mmap took ~40 seconds, and without mmap it took ~15 seconds.
I've run one more benchmark. Raw data:
I have one more set of results which shows that NuGet without mmap is faster. Raw data:
I just wanted to say that we really appreciate all the hard work you have done here.
Version 6.0 is out; it would be nice if someone had a look at this change now.
Sorry, it's indeed in my backlog. We were prioritizing things that we couldn't land in 6.0.* after 6.0.0 went out, but this is something I'll be looking into myself in the next sprint or two. Sorry for the wait, and thanks for the contribution!

I basically want to understand why we're seeing so much variance in our own benchmarks vs what you're showing, and hopefully come up with a more nuanced solution than a binary "mmap or nah". So I don't know if we'll land this patch as-is, but I also doubt we're going to keep the mmap code intact, because, as you're showing, it's definitely worse in some scenarios.
@zkat Thanks for prioritising this. I don't see how file writing with mmap can be faster than regular file streams, but I might have overlooked something.
@marcin-krystianc sorry we haven't gotten to this earlier. Since your own benchmarks contradict my own and @zkat's perf measurements on Windows, can you use an environment variable instead of completely removing it? Default it to using mmap I/O, since that's what the current product code does.

I think (I have not talked to my team about this) this will allow us to accept the PR more quickly and allow customers to choose which behaviour they want. Whereas if it's not configurable, we need to make a decision that affects everyone, and it would appear that this would harm perf for Windows devs on physical machines according to the information we have so far. So, no configuration means we must do a lot of testing to understand the perf trade-offs before we're comfortable accepting the change, whereas making it configurable allows us to make the option available more quickly, and later we can do the deep perf analysis and then change the defaults for everyone.
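For reference, a minimal sketch of the kind of environment-variable switch being suggested here; the variable name and parsing rules below are hypothetical, not necessarily what the PR actually uses:

```csharp
using System;

static class ExtractionSettings
{
    // Hypothetical opt-out switch: mmap stays the default unless the user disables it.
    public static bool UseMmap()
    {
        string value = Environment.GetEnvironmentVariable("NUGET_EXTRACTION_USE_MMAP"); // hypothetical name
        if (string.IsNullOrEmpty(value))
        {
            return true; // keep the current product behaviour (mmap) by default
        }

        return value != "0" && !string.Equals(value, "false", StringComparison.OrdinalIgnoreCase);
    }
}
```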
"Writing" to mmap io memory does not synchronously write to disk. Therefore, I feel like writing the file content is more asynchronous that an async stream is, since from the program's perspective, it's just copying/writing memory (a byte array). There are no API calls to either the .NET runtime or the kernel. Therefore, no userspace-kernel context switches either. My guess, and it's just a guess, is that stream copying might also involve additional memory copies (copying from app memory buffer to stream buffer), whereas mmap io only writes to a single memory buffer (the mapped memory). However, flush with mmap io is synchronous (at least in .NET), whereas with stream IO there is a Therefore, the more quickly the operating system and hardware can flush the mmap io buffers and close the file, the less impactful mmap file close has compared to file streams. When the sync flush impact is lower than the content copy/write benefit, mmap has better perf than streams (which is why I suspect that mmap has better perf on large files). I started off writing that my theory is that mmap io is blocking ThreadPool threads, since we close the file as soon as we finish writing the contents, and therefore the operating system doesn't have time to asynchronously flush the contents to disk in the background (whereas other apps like databases would keep the mmap io open even after writing). However, since NuGet isn't using async file IO, I just don't know. For what it's worth, the ThreadPool only spins up new threads when it hasn't made progress for 500ms (I'm unsure on the exact duration or algorithm, but it's trivial to validate it's a slow incremental scale-up). Therefore, when making blocking calls on background threads, we just end up with a lot of thread pool starvation, causing the perf issue. I recently noticed VS telemetry flagging NuGet as a potential thread pool starvation problem, and when I look at the stack traces, it's exactly this mmap file io. But if extraction is much faster, then even if we cause thread pool starvation, customers have a better experience. Maybe we need to move package extraction to dedicated threads, rather than the thread pool 🤷♂️ All this makes me wonder if we'll get a perf improvement making file writing async, and whether So, there's a lot of things to investigate if we want to hyper-optimize extraction perf, which means it will take a lot of time. The broad perf investigations that we've done so far don't show an obvious winner (I know your own measurements show stream IO always faster than mmap, but as mentioned this was not my experience on Windows). From a selfish point of view, this investigation would probably be an interesting topic for my personal blog, so maybe I'll start during my free time, but I make no promises. |
@zivkan thanks for your reply. Your idea about using async I/O to avoid thread pool contention was very interesting, so I've decided to try it. Regarding the configurability of file streams vs. memory maps, I've updated my PR. There is a remaining question, though: which one should be the default behaviour?
For what it's worth, on Monday I wrote some benchmarks of sync vs async vs mmap writes. I've only tested on my desktop's NVMe drive, my desktop's SATA SSD drive, and my personal laptop. I found that mmap was faster only on my desktop's NVMe drive. On the same computer but using the SATA SSD, mmap was slower, and on my laptop, despite it also having NVMe storage, mmap was also slower. All this on Windows.

I also wrote another app/benchmark that takes all ~230 packages that OrchardCore restores, pre-downloads the nupkgs, and then extracts all those packages as quickly as possible (no other logic, just extracting the zip files), using sync, async and mmap, with a few different levels of parallelism. However, in my limited testing so far, all their performances (on Windows) are within measurement error, with no meaningful differences (the number of parallel extractions did change performance, but I haven't done enough testing to draw solid conclusions).

I'm on vacation, so I'm not super motivated to work on this at the moment (I started hoping it would be easy and have a clear conclusion that I could blog about), plus it's not my scheduled work, so I won't be able to dedicate a lot of time to this once I return to work. But I find it surprising, and even more confusing, that I see big differences between mmap and sync streams in nuget.exe on my machine, but not in the benchmarks. I know that nuget.exe is doing a lot more work: writing the intermediate files at the end of restore, doing all the graph walking, and downloading packages. But it's still not intuitive to me what might be going on. I also need to test on CI agents and possibly some Azure VMs with different types of storage (HDD storage, if they still offer that, vs SSD).
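A rough sketch of the "extract pre-downloaded nupkgs in parallel" measurement described above; the paths and degree of parallelism are placeholders, not the actual benchmark code:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

class ExtractBenchmark
{
    static void Main()
    {
        // Assumes the ~230 nupkgs have already been downloaded into this folder.
        string[] nupkgs = Directory.GetFiles(@"C:\bench\nupkgs", "*.nupkg");

        var sw = Stopwatch.StartNew();
        Parallel.ForEach(
            nupkgs,
            new ParallelOptions { MaxDegreeOfParallelism = 8 },
            nupkg =>
            {
                string dest = Path.Combine(@"C:\bench\extracted", Path.GetFileNameWithoutExtension(nupkg));
                ZipFile.ExtractToDirectory(nupkg, dest); // nupkg files are just zip archives
            });

        Console.WriteLine($"Extracted {nupkgs.Length} packages in {sw.Elapsed}");
    }
}
```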
At a glance, it looks fine. I hate that there are so many PublicAPI entries being added, but it looks like it's just adding one new CopyAsync method to an interface, and then every class that implements it gets the public APIs too. So, nothing I can reasonably ask to change.
My gut feeling, and certainly what I would have proposed before my Monday benchmarks, would be to keep mmap as the default. Now, I'm not so sure. The "write only" microbenchmarks seem to indicate that the penalty from mmap writes on slower disks is worse than the benefit where mmap is faster. Hence, if disk distribution were equal, it would be a net loss. But hardware distribution is absolutely not equal and is always changing, and I haven't thought about it long enough to come to any opinion.

However, my previous point remains that if you keep the current default (use mmap by default), then we can theoretically accept this quickly and make decisions about changes to the default later. Your current change turns off mmap by default, which is therefore a change in behaviour from what we have now, and therefore needs discussion with the team, given that an analysis of Visual Studio telemetry appears to say that extraction performance improved for our Visual Studio customers since mmap was added (the analysis is complicated, since different versions of VS contain other perf changes, not just mmap), and therefore this PR would (probably?) have a negative impact on Visual Studio customers. Since we don't have telemetry from our CLI tools, and the community is generally hostile towards telemetry, we don't have large-scale data other than from Visual Studio customers (as previously mentioned, despite your benchmarks earlier in this thread showing mmap is always worse, my experience was the opposite, hence data from a large number of machines would be extremely beneficial).
I just re-ran my single-file write perf benchmarks, this time testing both net6.0 and net48. I was expecting the async version to be slower than the sync version, but being async, parallel writes not blocking threads might make it OK. However, I found that on .NET Framework, FileStream.WriteAsync is about 5x slower than FileStream.Write, at least on small files (below about 100k). Obviously this would need more testing, but from an initial test it would appear that this change would not, in fact, be worthwhile there. On .NET 5 and earlier, FileStream's async implementation carries significant overhead; it was only rewritten for performance in .NET 6.

However, microbenchmarks are not representative of real performance impact, so we could do more benchmarks of msbuild and dotnet cli restore. But it's obvious that we need to test both .NET Framework and .NET (Core) separately, since they have increasingly diverging performance characteristics. Visual Studio and MSBuild run on .NET Framework, so even though people are migrating to SDK-style projects and building on CI with the dotnet cli, there are still far too many customers using NuGet on the .NET Framework runtime to discount performance impacts there.
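A simplified sync-vs-async write comparison along the lines of the microbenchmark described above; the file size and iteration count are arbitrary placeholders, and a real measurement would use a proper benchmarking harness rather than a Stopwatch:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;

class SyncVsAsyncWrite
{
    const int Iterations = 1000;
    static readonly byte[] Payload = new byte[64 * 1024]; // in the "small file" range

    static async Task Main()
    {
        var syncTimer = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            using (var fs = new FileStream("sync.bin", FileMode.Create, FileAccess.Write))
            {
                fs.Write(Payload, 0, Payload.Length);
            }
        }
        syncTimer.Stop();

        var asyncTimer = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            using (var fs = new FileStream("async.bin", FileMode.Create, FileAccess.Write,
                                           FileShare.None, bufferSize: 4096, useAsync: true))
            {
                await fs.WriteAsync(Payload, 0, Payload.Length);
            }
        }
        asyncTimer.Stop();

        Console.WriteLine($"Write: {syncTimer.Elapsed}, WriteAsync: {asyncTimer.Elapsed}");
    }
}
```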
FWIW, I'm rerunning my "pre-download Orchard Core's packages, then extract them in parallel" benchmark on .NET Framework 4.8. Maybe we use mmap for .NET Framework (but then we have the question about Mono), and normal streams for .NET (Core). I still need to run all my benchmarks on Linux (and Mac, ideally) to see how any of it differs from Windows, but we're certainly not getting the same behaviour on all runtimes.
Just a quick update from my side. I've been running some benchmarks today and I was able to reproduce the performance difference.
Good find! Yes, I also see that when I tell Windows Defender to exclude my benchmark program's folder, .NET Framework CopyTo performance matches mmap writes.

However, I'm not sure what to do with this information. I would be horrified if anyone excluded NuGet's global packages folder from their AV of choice. NuGet's entire point of existence is to download files from other machines. Countless "supply chain" attacks, including, but not limited to, the recent SolarWinds attack, show that you can't trust anything, not even company-internal servers. Stuxnet shows that you can't even trust fully isolated networks. The concept of a trusted network or trusted host doesn't really exist.

I'm preparing my benchmark apps to put them in a public repo (I always put them in github.com/zivkan/Benchmarks), and once I do that, I'll start discussions internally to see if I can get any advice from experts on whether there's some way to improve Stream.CopyTo or Stream.Write performance without turning off AV.
@marcin-krystianc sorry about the delay. I've found internal contacts for Windows Defender, and instructions on how to provide them with perf traces in their preferred format, but all this is taking time, and everything else I'm working on is taking a lot longer than I estimated (as usual), so I don't have time to dedicate to this.

Given that my own very limited testing on Linux agrees with you that there's a big perf regression there, and I'll trust you on Mac, I suggest that we make mmap the default on Windows and FileStream the default on everything else. Personally, I no longer feel strongly about making it user-controllable with the environment variable, but I'll accept keeping it as long as the defaults are per-platform.
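In code, the per-platform default being proposed here could look roughly like this; as in the earlier sketch, the environment-variable name is hypothetical:

```csharp
using System;
using System.Runtime.InteropServices;

static class ExtractionDefaults
{
    public static bool UseMmap()
    {
        // An explicit user override wins, if set (hypothetical variable name).
        string value = Environment.GetEnvironmentVariable("NUGET_EXTRACTION_USE_MMAP");
        if (!string.IsNullOrEmpty(value))
        {
            return value != "0" && !string.Equals(value, "false", StringComparison.OrdinalIgnoreCase);
        }

        // Otherwise: mmap on Windows, regular file streams everywhere else.
        return RuntimeInformation.IsOSPlatform(OSPlatform.Windows);
    }
}
```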
@zivkan I've finally completed re-running my benchmarks.
I think that the solution you have proposed is a good compromise. It makes sense to use mmap on Windows because Windows machines will most likely have Defender enabled. I am of the opinion that we should leave users the possibility to override the default behaviour, so I've updated my branch accordingly.
Test Hardware
AWS VM:
GitHub-hosted runners:
Benchmarks
Results
Thanks @marcin-krystianc! Your patience with us and the investigations you've helped us with have been fantastic. I still have a low-priority task to follow up with the Windows Defender team, to try to understand why FileStream writing is so slow on Windows, but where this PR ended up is the best compromise right now. And I love that you kept the environment variable, since that makes it very easy for us to test mmap vs FileStream performance without recompiling.
@NuGet/nuget-client I plan on merging before the end of the week. Please review if you want to provide feedback before it's merged.
We're having some pretty serious CI infrastructure issues at the moment, so it might take us a few days to work around/fix them, unless we just give up and start merging PRs with CI builds that are not green.
Bug
Fixes: NuGet/Home#11031
Regression? Last working version: dotnet 3.x
Description
As noted in NuGet/Home#11031, restore operations which involve downloading and installing packages into the global-packages folder became slow in dotnet 5.x (when compared to dotnet 3.x). It turns out that the problem was introduced by the mmap-based package extraction (#3524), which was supposed to make package extraction faster. According to my testing the opposite is true, and it is better to avoid memory-mapped I/O for package extraction.
What is the evidence?
dotnet restore is significantly faster without mmap-based file extraction; NuGet.exe is also faster (the effect is not as strong, but it is still visible).
Test Hardware
AWS VM
Laptop
Tested solutions
Test scenarios
Test results:
clean restores ("arctic") with dotnet restore on Windows:
clean restores ("arctic") with NuGet.exe restore on Windows:
clean restores with dotnet restore on Linux:
Results outside NuGet, from the test application. The tables below show how much data was written after 120 seconds of writing files of a particular size. On Linux, mmap I/O is slower when working with files up to 1 MB. On Windows, the mmap approach is particularly slow for 1 KB–1 MB files.
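For context, a minimal sketch of the sort of standalone write-throughput test described above: write files of a fixed size for 120 seconds and report how much data was written. The output path and file size here are placeholders, not the actual test application.

```csharp
using System;
using System.Diagnostics;
using System.IO;

class WriteThroughputTest
{
    static void Main()
    {
        const int fileSize = 64 * 1024;          // one of the tested file sizes, e.g. 64 KB
        var payload = new byte[fileSize];
        var deadline = TimeSpan.FromSeconds(120);
        long bytesWritten = 0;
        int fileIndex = 0;

        var sw = Stopwatch.StartNew();
        while (sw.Elapsed < deadline)
        {
            string path = Path.Combine(@"C:\bench\out", $"file-{fileIndex++}.bin");
            using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write))
            {
                fs.Write(payload, 0, payload.Length);
            }
            bytesWritten += payload.Length;
        }

        Console.WriteLine($"Wrote {bytesWritten / (1024.0 * 1024.0):F1} MB in {sw.Elapsed}");
    }
}
```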
Raw data
Comparison of disk and CPU utilisation
Both graphs show CPU and disk utilisation when running dotnet nuget locals all --clear && dotnet restore for the Orleans solution.
with mmap:
without mmap:
It's clear from the graphs above that CPU utilisation is much higher when the regular file streams API is being used. With mmap, the restore operation is rather I/O bound, whereas with file streams it becomes CPU bound.
PR Checklist
PR has a meaningful title
PR has a linked issue.
Described changes
Tests
- [ ] Automated tests added
- OR
- [ ] Test exception
- OR
- [x] N/A
Documentation
- [ ] Documentation PR or issue filed
- OR
- [x] N/A