
Default to mmap-based I/O on Windows only #4186

Merged
merged 9 commits on Jan 24, 2022

Conversation

marcin-krystianc
Contributor

@marcin-krystianc marcin-krystianc commented Aug 6, 2021

Bug

Fixes: NuGet/Home#11031

Regression? Last working version: dotnet 3.x

Description

As it was noted in the NuGet/Home#11031, restore operations which involve downloading and installing packages into the global-packages folder became slow in dotnet 5.x (when compared to dotnet 3.x).
It turns out that the problem was introduced by the mmap-based package extraction (#3524), which was supposed to make package extraction faster. According to my testing the opposite is true, and it is better to avoid memory-mapped I/O for package extraction.

What is the evidence?

  • dotnet restore is significantly faster without mmap-based file extraction; NuGet.exe is also faster (the effect is weaker, but still visible)
  • Tested with 4 different solutions
  • Tested on Windows and Linux systems
  • Tested on physical machine (Windows) and VMs (Windows, Linux)
  • Tested with a physical SSD and SSD-based EBS storage on AWS
  • Results reproduced outside NuGet (Windows and Linux) with a test application https://github.com/marcin-krystianc/nuget_11031/tree/master/TestFileWriting

Test Hardware

AWS VM

  • m5.xlarge instance with EBS storage (General Purpose SSD - gp3 3000/125 ("non-burstable"))
  • CPU Intel(R) Xeon 8259CL (4 virtual cores - "non-burstable"), 16 GB RAM
  • Windows Server 2019 / Ubuntu 20.04

Laptop

  • Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (Cores 4/8)
  • 16 GB RAM
  • Windows 10 + Docker for testing Linux

Tested solutions

Test scenarios

  • To avoid noise in the results from HTTP traffic, all tests were run with locally defined package sources:
<add key="LocalNuGetPackages" value="d:\LocalNuGetPackages" />
  • All runs were repeated 30 times to improve confidence in the results
  • Since we are interested in the package extraction process, the "arctic" scenario from the perf scripts was used. For tests on Linux the following command was used (the perf scripts are written in PowerShell and I couldn't make them work on Linux):
dotnet nuget locals all --clear && \
dotnet restore -clp:PerformanceSummary -clp:summary --force /p:RestoreUseStaticGraphEvaluation=true
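For context, a local package source like the one used above is declared in a NuGet.Config entry along these lines (a sketch; the path and the `<clear />` element are illustrative, not taken from the author's setup):

```xml
<configuration>
  <packageSources>
    <!-- Use only the local feed so no HTTP traffic skews the timings -->
    <clear />
    <add key="LocalNuGetPackages" value="d:\LocalNuGetPackages" />
  </packageSources>
</configuration>
```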

Test results:

  • clean restores ("arctic") with dotnet restore on Windows:
    dotnet_restore

  • clean restores ("arctic") with NuGet.exe restore on Windows:
    nuget_restore

  • clean restores with dotnet restore on Linux:
    dotnet_restore_linux

  • results outside NuGet from the test application. The tables below show how much data was written after 120 seconds of writing files of a particular size. On Linux, mmap I/O is slower when working with files up to 1 MB. On Windows, the mmap approach is particularly slow for 1KB-1MB files.

| Windows AWS | 1B | 10B | 100B | 1KB | 10KB | 100KB | 1MB | 10MB | 100MB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| filestreams | 174.81KB | 2.63MB | 26.95MB | 241.69MB | 2.22GB | 14.09GB | 15.86GB | 17.54GB | 14.84GB |
| memorymaps | 177.63KB | 2.14MB | 23.06MB | 57.32MB | 569.08MB | 4.52GB | 14.56GB | 14.75GB | 14.55GB |

| Windows Laptop | 1B | 10B | 100B | 1KB | 10KB | 100KB | 1MB | 10MB | 100MB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| filestreams | 204.78KB | 3.59MB | 39.66MB | 353.12MB | 3.19GB | 18.15GB | 33.59GB | 46.75GB | 32.42GB |
| memorymaps | 282.23KB | 3.05MB | 29.10MB | 172.38MB | 1.51GB | 7.81GB | 20.08GB | 33.23GB | 35.74GB |

| Linux AWS | 1B | 10B | 100B | 1KB | 10KB | 100KB | 1MB | 10MB | 100MB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| filestreams | 345.08KB | 3.29MB | 32.84MB | 342.13MB | 3.29GB | 14.42GB | 14.71GB | 14.76GB | 14.75GB |
| memorymaps | 41.39KB | 410.77KB | 3.96MB | 42.06MB | 394.16MB | 3.42GB | 14.21GB | 14.82GB | 14.84GB |
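The methodology behind these tables can be sketched in a few lines. This is a Python stand-in for the linked C# test application, not its actual code; the duration is shortened here, and the real app measures many sizes over 120 seconds:

```python
import mmap
import os
import tempfile
import time

def bytes_written(use_mmap: bool, size: int = 1024, duration: float = 0.2) -> int:
    """Write `size`-byte files for `duration` seconds; return total bytes written."""
    payload = b"x" * size
    root = tempfile.mkdtemp()
    total, i = 0, 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        path = os.path.join(root, f"f{i}")
        if use_mmap:
            # mmap path: pre-size the file, map it, copy the payload in.
            with open(path, "w+b") as f:
                f.truncate(size)
                with mmap.mmap(f.fileno(), size) as mm:
                    mm[:] = payload
        else:
            # file-stream path: a plain buffered write through the page cache.
            with open(path, "wb") as f:
                f.write(payload)
        total += size
        i += 1
    return total

print(bytes_written(use_mmap=False), bytes_written(use_mmap=True))
```

Running this for each file size and comparing the two totals reproduces the shape of the tables above.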

Raw data

Comparison of disk and CPU utilisation

Both graphs show CPU and disk utilisation when running dotnet nuget locals all --clear && dotnet restore for the Orleans solution.

  • with mmap: (Performance Monitor screenshot)

  • without mmap: (Performance Monitor screenshot)

It's clear from the graphs above that CPU utilisation is much higher when the regular file streams API is used. With mmap, the restore operation is rather I/O bound, whereas with file streams it becomes CPU bound.

PR Checklist

  • PR has a meaningful title

  • PR has a linked issue.

  • Described changes

  • Tests
      - [ ] Automated tests added
      - OR
      - [ ] Test exception
      - OR
      - [x] N/A

  • Documentation
      - [ ] Documentation PR or issue filed
      - OR
      - [x] N/A

@ghost ghost added the Community PRs created by someone not in the NuGet team label Aug 6, 2021
@marcin-krystianc marcin-krystianc marked this pull request as ready for review August 6, 2021 09:28
@marcin-krystianc marcin-krystianc requested a review from a team as a code owner August 6, 2021 09:28
Member

@zivkan zivkan left a comment


On my own machine (desktop with nvme storage), nuget.exe restore of NuGet.Client was nearly twice as fast as HEAD~2 from your branch.

This is clearly more nuanced and more investigation is warranted. @zkat is on leave at the moment, but when they're back we'll figure out who should lead an investigation and when we can schedule it.

@zivkan
Member

zivkan commented Aug 6, 2021

FWIW, I validated that mmap appears to be much slower on Linux. Using an Azure Standard_D4as_v4, restoring Orleans with mmap took ~40 seconds, and without mmap it was ~15 seconds.

@marcin-krystianc
Contributor Author

I've run one more benchmark; this time I've tested the performance of dotnet/NuGet when the global-packages folder is configured on a RAM disk on Windows (I used ImDisk Toolkit to set up the RAM disk).
Because the latency of a RAM disk is very low and the throughput is very high, the difference between the mmap and no-mmap versions is greatly reduced. However, no-mmap is still the winner.

(graphs)

Raw data:

@marcin-krystianc
Contributor Author

I have one more set of results which shows that NuGet without mmap is faster.
This time I've been running restores on the GitHub-hosted runners. The code I've been using is publicly available (https://github.com/marcin-krystianc/nuget_11031_perf) so my results can be confirmed independently.

dotnet_linux
dotnet_windows
nuget_windows

Raw data:

@nkolev92
Member

hey @marcin-krystianc

I just wanted to say that we really appreciate all the hard work you have done here.
The team is working hard prepping for the new release, but we are paying attention to this change as well!

@marcin-krystianc
Contributor Author

Version 6.0 is out; it would be nice if someone had a look at this change now.

@zkat
Contributor

zkat commented Oct 7, 2021

sorry, it's indeed in my backlog. We were prioritizing things that we couldn't land in 6.0.* after 6.0.0 went out, but this is something I'll be looking into myself in the next sprint or two. Sorry for the wait, and thanks for the contribution!

I basically want to understand why we're seeing so much variance in our own benchmarks vs what you're showing, and hopefully come up with a more nuanced solution than a binary "mmap or nah". So I don't know if we'll land this patch as-is, but also I doubt we're going to keep the mmap code intact, because, as you're showing, it's definitely worse in some scenarios.

@marcin-krystianc
Contributor Author

@zkat Thanks for prioritising this.
My understanding is that writing files with mmap is slower because it is synchronous, i.e. it blocks the thread until the entire file content is written to disk, whereas no-mmap file writing is asynchronous, i.e. it doesn't block because it leverages the file system cache (page cache). If you look at the Performance Monitor screenshots (in the PR description) you can see that the no-mmap version wins because it has better CPU utilization, and also a lot of disk writes happen even after the restore has already finished (which shows that the page cache is being used).

I don't see how file writing with mmap can be faster than regular file streams, but I might have overlooked something.
If you have in mind any missing scenarios to try, or if there is anything else I can do to help with your analysis, please let me know.

@zivkan
Member

zivkan commented Nov 30, 2021

@marcin-krystianc sorry we haven't got to this earlier. Since your own benchmarks contradict my own and @zkat's perf measurements on Windows, can you use an environment variable instead of removing the mmap code entirely? Default it to mmap I/O, since that's what the current product code does.

I think (I have not talked to my team about this) this will allow us to more quickly accept the PR and allow customers to choose which behavior they want. Whereas if it's not configurable, we need to make a decision that affects everyone, and it would appear that this would harm perf for Windows devs on physical machines according to the information we have so far. So, no configuration means we must do a lot of testing to understand the perf tradeoffs before we're comfortable to accept the change. Whereas configurable allows us to make the option available more quickly, and later we can do the deep perf analysis and then change defaults for everyone.

> My understanding is that writing files with mmap is slower because it is synchronous
> I don't see how file writing with mmap can be faster than regular file streams

"Writing" to mmap'd memory does not synchronously write to disk. Therefore, I feel like writing the file content is more asynchronous than an async stream is, since from the program's perspective it's just copying/writing memory (a byte array). There are no API calls to either the .NET runtime or the kernel, and therefore no userspace-kernel context switches either. My guess, and it's just a guess, is that stream copying might also involve additional memory copies (copying from the app's buffer to the stream's buffer), whereas mmap I/O only writes to a single memory buffer (the mapped memory).

However, flush with mmap I/O is synchronous (at least in .NET), whereas with stream I/O there is a FlushAsync. Given that the code in question isn't even async, I don't know if this even matters. For what it's worth, as far as I can tell, neither Windows nor Linux has async file close (or open) APIs. If either does, .NET doesn't expose it; I investigated async file open/create years ago and couldn't find any operating system that supported async file open in the kernel. However, file close is fast as long as the buffers are flushed. If the buffers are not flushed, then closing will synchronously block while the flush happens, and the file is eventually closed.

Therefore, the more quickly the operating system and hardware can flush the mmap io buffers and close the file, the less impactful mmap file close has compared to file streams. When the sync flush impact is lower than the content copy/write benefit, mmap has better perf than streams (which is why I suspect that mmap has better perf on large files).
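The write path described above can be illustrated, approximately, in Python rather than .NET: filling the mapping is a plain memory copy, and only the flush (or close) forces the data to disk.

```python
import mmap
import os
import tempfile

size = 1 << 20  # 1 MiB
path = os.path.join(tempfile.mkdtemp(), "demo.bin")

with open(path, "w+b") as f:
    f.truncate(size)            # pre-size the file, as mmap writing requires
    with mmap.mmap(f.fileno(), size) as mm:
        mm[:] = b"a" * size     # no per-chunk syscall: just a memory copy
        mm.flush()              # synchronous: returns after pages are written
```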

I started off writing that my theory is that mmap io is blocking ThreadPool threads, since we close the file as soon as we finish writing the contents, and therefore the operating system doesn't have time to asynchronously flush the contents to disk in the background (whereas other apps like databases would keep the mmap io open even after writing). However, since NuGet isn't using async file IO, I just don't know.

For what it's worth, the ThreadPool only spins up new threads when it hasn't made progress for 500ms (I'm unsure on the exact duration or algorithm, but it's trivial to validate it's a slow incremental scale-up). Therefore, when making blocking calls on background threads, we just end up with a lot of thread pool starvation, causing the perf issue. I recently noticed VS telemetry flagging NuGet as a potential thread pool starvation problem, and when I look at the stack traces, it's exactly this mmap file io. But if extraction is much faster, then even if we cause thread pool starvation, customers have a better experience. Maybe we need to move package extraction to dedicated threads, rather than the thread pool 🤷‍♂️

All this makes me wonder if we'll get a perf improvement making file writing async, and whether CopyToAsync does FlushAsync internally, or if we'll gain perf by doing it ourselves before closing/disposing the file.

So, there's a lot of things to investigate if we want to hyper-optimize extraction perf, which means it will take a lot of time. The broad perf investigations that we've done so far don't show an obvious winner (I know your own measurements show stream IO always faster than mmap, but as mentioned this was not my experience on Windows). From a selfish point of view, this investigation would probably be an interesting topic for my personal blog, so maybe I'll start during my free time, but I make no promises.

@marcin-krystianc
Contributor Author

@zivkan thanks for your reply.

Your idea about using async I/O to avoid thread pool contention was very interesting so I've decided to try it.
It required quite a lot of changes so I've made it on another branch: https://github.com/NuGet/NuGet.Client/compare/dev...marcin-krystianc:dev-marcink-20211214-mmapasyncio?expand=1 .
My conclusion is that there is not much difference in performance whether we use fully async I/O or not (regardless of whether we combine it with file streams or memory maps). I also didn't see any difference in the number of threads.
Anyway, I think it is still a good modification to make. So if you think that change is likely to be accepted I can submit a new PR.

Regarding the configurability of file streams vs. memory maps, I've updated my PR. There is a remaining question, though: which one should be the default behavior?

@zivkan
Member

zivkan commented Dec 17, 2021

For what it's worth, on Monday I wrote some benchmarks of sync vs async vs mmap writes. I've only tested on my desktop on my nvme drive, on my desktop on my sata ssd drive, and on my personal laptop. I found that mmap was faster only on my desktop nvme drive. Same computer, but using sata ssd, mmap was slower, and my laptop, despite also being nvme, mmap was also slower. All this on Windows.

I also wrote another app/benchmark where I took all ~230 packages that OrchardCore restores, pre-downloaded the nupkgs, and then had the app extract all those packages as quickly as possible (no other logic, just extracting the zip files), using sync, async and mmap, at a few different levels of parallelism. However, in my limited testing so far, all their performances (on Windows) are within measurement error; no meaningful differences (the number of parallel extractions did change performance, but I haven't done enough testing to draw solid conclusions).

I'm on vacation, so not super motivated to work on this at the moment (I started hoping it would be easy and have a clear conclusion that I could blog about), plus it's not my scheduled work, so I won't be able to dedicate a lot of time to this once I return. But I find it surprising, and even more confusing, that I see big differences between mmap and sync streams in nuget.exe on my machine, but not in the benchmarks. I know that nuget.exe is doing a lot more work: writing the intermediate files at the end of restore, doing all the graph walking, and downloading packages. But it's still not intuitive to me what might be going on. I also need to test on CI agents and possibly some Azure VMs with different types of storage (HDD storage, if they still offer that, vs SSD).

> So if you think that change is likely to be accepted I can submit a new PR.

At a glance, it looks fine. I hate that there are so many PublicAPIs being added, but it looks like it's just adding one new CopyAsync method to an interface, then every class that implements it gets the public APIs too. So, nothing I can reasonably ask to change.

> There is a remaining question though about which one should be a default behavior?

My gut feeling, and certainly what I would have proposed before my Monday benchmarks, would be to keep mmap as the default. Now, I'm not so sure. The "write only" microbenchmarks seem to indicate that the cost of mmap writes on slower disks outweighs the benefit where mmap is faster. Hence, if the disk distribution were equal, it would be a net loss. But hardware distribution is absolutely not equal and is always changing, and I haven't thought about it long enough to come to any opinion.

However, my previous point remains: if you keep the current default (use mmap by default), then we can theoretically accept this quickly and make decisions about changing the default later. Your current change turns off mmap by default, which is a change in behaviour from what we ship now, and therefore needs discussion with the team. An analysis of Visual Studio telemetry appears to say that extraction performance improved for our Visual Studio customers since mmap was added (the analysis is complicated, since different versions of VS contain other perf changes, not just mmap), and therefore this PR would (probably?) have a negative impact on Visual Studio customers. Since we don't have telemetry from our CLI tools, and the community is generally hostile towards telemetry, we don't have large-scale data beyond Visual Studio customers (as previously mentioned, despite your benchmarks earlier in this thread showing mmap is always worse, my experience was the opposite, hence data from a large number of machines would be extremely beneficial).

@zivkan
Member

zivkan commented Dec 17, 2021

> Anyway, I think it is still a good modification to make. So if you think that change is likely to be accepted I can submit a new PR.

I just re-ran my single-file write perf benchmarks, this time testing both net6.0 and net48. I was expecting the async version to be slower than the sync version, but hoped that, being async, parallel writes not blocking threads might make it OK.

However, I found that on .NET Framework, FileStream.WriteAsync is about 5x slower than FileStream.Write, at least on small files (below about 100 KB). Obviously this would need more testing, but from an initial test it would appear that this change would in fact not be worthwhile.

On .NET 5 and earlier, the ThreadPool spins up new threads about once a second at most, whereas with .NET 6 it is faster at detecting blocked threads and spins up threads much faster. Therefore, the threadpool starvation problem on .NET 6 is minimal, whereas WriteAsync vs Write performance on .NET Framework seems really bad. Therefore, overall I'd say that we should leave nupkg extraction as sync blocking code.

However, microbenchmarks are not representative of real performance impact, so we could do more benchmarks of msbuild and dotnet cli restore. But it's obvious that we need to test both .NET Framework and .NET (Core) separately, since they have increasingly diverging performance characteristics. Visual Studio and MSBuild runs on .NET Framework, so even though people are migrating to SDK style projects and building on CI with the dotnet cli, there's still far too many customers using NuGet on the .NET Framework runtime to discount performance impacts here.

@zivkan
Member

zivkan commented Dec 17, 2021

FWIW, I'm rerunning my "pre-download Orchard Core's packages, then extract them in parallel" benchmark on .NET Framework 4.8. Using Stream.CopyTo takes around 25 seconds, Stream.CopyToAsync takes around 32 seconds, and mmap takes just 9 seconds, which is how fast all the extraction methods are on .NET 6. I don't currently understand why mmap writing for nupkg extraction is so much faster on .NET Framework, and I've only tested on a single machine so far, but this benchmark replicates the nuget.exe restore perf relationships that I found back when this PR was first opened.

Maybe we use mmap for .NET Framework (but then we have the question about Mono), and normal streams for .NET (CoreApp).

I still need to run all my benchmarks on Linux (and Mac, ideally) to see how any of it is different to Windows, but we're certainly not getting the same behaviour on all runtimes.

@marcin-krystianc
Contributor Author

Just a quick update from my side. I've been running some benchmarks today, and I was able to reproduce the performance difference for dotnet restore (based on a modified dotnet 6.0.100) in favour of memory maps (~14s for file streams vs ~7s for memory maps) only when Windows antivirus protection was enabled.
After disabling antivirus protection the performance difference is no longer as significant (and this time file streams are slightly faster).
Still nothing conclusive, but I wanted to share my findings about the key role of antivirus protection in these benchmarks. @zivkan, can you check whether you see the same impact of AV protection on package extraction performance?

@zivkan
Member

zivkan commented Dec 18, 2021

Good find! Yes, I also see that when I tell Windows Defender to exclude my benchmark program's folder, .NET Framework CopyTo performance matches mmap writes.

However, I'm not sure what to do with this information. I would be horrified if anyone excluded NuGet's global packages folder from their AV of choice. NuGet's entire point of existence is to download files from other machines. Countless "supply chain" attacks, including, but not limited to, the recent SolarWinds attack, show that you can't trust anything, not even company-internal servers. Stuxnet shows that you can't even trust fully isolated networks. The concept of a trusted network or trusted host doesn't really exist.

I'm preparing my benchmark apps to put them in a public repo (I always put them in github.com/zivkan/Benchmarks), and once I do that, I'll start discussions internally to see if I can get any advice from experts to see if there's some way to improve Stream.CopyTo or Stream.Write performance without turning off AV.

@zivkan
Member

zivkan commented Jan 7, 2022

@marcin-krystianc sorry about the delay. I've found internal contacts for Windows Defender, and instructions on how to provide them with perf traces in their preferred format, but all this is taking time, and everything else I'm working on is taking a lot longer than I estimated (as usual), so I don't have time to dedicate to this.

Given that my own very limited testing on Linux agrees with yours that there's a big perf regression there, and I'll trust you on Mac, I suggest we make mmap the default on Windows and FileStream the default on everything else.

Personally, I no longer feel strongly about making it user controllable with the environment variable, but I'll accept keeping it as long as the defaults are per-platform.

@marcin-krystianc
Contributor Author

@zivkan I've finally completed re-running my benchmarks.
There are a few interesting findings, but in general, it confirms what we knew already:

  • Defender on Windows makes writing files with FileStreams much slower than writing them with mmap
  • When Defender is disabled, mmap can be faster or slower - it depends on the disk speed (the faster the disk the better mmap performs)
  • On other platforms (Linux and macOS), mmap is slower
  • Using async I/O generally doesn't make a significant difference, but in the case of nuget.exe (.NET Framework) writing files with regular FileStreams, it is a massive regression
  • GitHub-hosted runners have two disks, where one seems to be very slow (drive C on Windows runners) and the other very fast (drive D on Windows runners). It is unfortunate that on Windows, the user profile (including the default location of the global-packages folder for NuGet) is on the very slow C drive

I think that the solution you have proposed is a good compromise. It makes sense to use mmap on Windows because most Windows machines will likely have Defender enabled. I am of the opinion that we should leave users the possibility to override the default behavior, so I've updated my branch accordingly.
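The agreed-on policy can be sketched as follows. This is an illustration in Python, not the PR's C# code, and the environment-variable name here is a placeholder, not necessarily the one the PR uses:

```python
import os
import platform

def use_mmap_io() -> bool:
    """Per-platform default with a user override (variable name is hypothetical)."""
    override = os.environ.get("NUGET_USE_MMAP_IO")
    if override is not None:
        return override.strip().lower() in ("1", "true")
    # Default to mmap only on Windows, where Defender slows FileStream writes.
    return platform.system() == "Windows"

print(use_mmap_io())
```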

Test Hardware

AWS VM :

  • m5d.xlarge VM with SSD disk attached directly to the instance
  • 16GB RAM, 4 vCPUs

GitHub-hosted runners:

https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources:

Benchmarks

Results

graph

Member

@zivkan zivkan left a comment


Thanks @marcin-krystianc! Your patience with us, and the investigations you've helped us with have been fantastic. I still have a low priority task to follow up with the Windows Defender team, to try to understand why FileStream writing is so slow on Windows, but where this PR ended up is the best compromise right now. And I love that you kept the environment variable, since that makes it very easy for us to test mmap vs FileStream performance without recompiling.

@NuGet/nuget-client I plan on merging before the end of the week. Please review if you want to provide feedback before it's merged.

@zivkan zivkan changed the title Don't use mmap-based I/O for package extraction Default to mmap-based I/O on Windows only Jan 12, 2022
Member

@zivkan zivkan left a comment


We're having some pretty serious CI infrastructure issues at the moment, so it might take us a few days to work around/fix them, unless we just give up and start merging PRs with CI builds that are not green.

@zivkan zivkan force-pushed the dev-marcink-20210802-memorymaps branch from 15379ea to a0d657e Compare January 23, 2022 15:44
@zivkan zivkan force-pushed the dev-marcink-20210802-memorymaps branch from a0d657e to 82cc350 Compare January 24, 2022 16:13
@zivkan zivkan merged commit 8123544 into NuGet:dev Jan 24, 2022
@marcin-krystianc marcin-krystianc deleted the dev-marcink-20210802-memorymaps branch January 25, 2022 08:44
Development

Successfully merging this pull request may close these issues.

[Regression]: Performance regression for cold restores in .NET 5.0.x