Investigate Windows Commit charge for snmalloc #223
@davidchisnall mentioned: "By default, on Windows, snmalloc only decommits memory when the kernel notifies it that memory is constrained. If you've got loads of spare memory, there's no problem letting the commit size grow a lot, it's only a negative if the memory could be usefully used for something else." The test machine has 128 GB of RAM and memory is far from being constrained. I will nevertheless re-test with IS_ADDRESS_SPACE_CONSTRAINED and the latest master branch. My test was using the tree_index branch.
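For reference, the mechanism David describes can be sketched with the Win32 memory-resource-notification API. This is only an illustration of the policy, not snmalloc's actual PAL code; the class and method names are hypothetical:

```cpp
// Illustrative sketch of the policy described above: keep memory committed
// until the kernel signals pressure. Plain Win32, not snmalloc's PAL.
#include <windows.h>

class DecommitPolicy
{
  // The kernel signals this object when physical memory runs low.
  HANDLE low_memory =
    CreateMemoryResourceNotification(LowMemoryResourceNotification);

public:
  // Consulted before returning a free chunk to the committed-but-unused pool.
  bool should_decommit()
  {
    BOOL pressure = FALSE;
    if (low_memory != NULL &&
        QueryMemoryResourceNotification(low_memory, &pressure))
      return pressure != FALSE; // decommit only under memory pressure
    return true; // if the query fails, err on the side of decommitting
  }
};
```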
Figures with latest master:
It seems the latest snmalloc checkout is a tad slower, but the commit charge is now much better with David's suggestion.
That looks a lot more plausible. We should probably rename IS_ADDRESS_SPACE_CONSTRAINED.
@aganea thank you so much for running more tests. I wonder if the small regression in performance is from not using the …. I have been setting up an LLVM Windows build so I can test the link time. Just wanted to confirm I am doing the same as you.
When building this, I assume you are measuring the final step:
I am building with the latest master, but I haven't tried to apply your patch yet. Just getting the very slow version with ….
@davidchisnall I think moving to the 1MiB size as the default would make a lot of sense. 1MiB could work well with a little fiddling around huge pages, so we can put two threads into one huge page for low-memory multi-threaded scenarios. Agreed. @Licenser, @darach, @SchrodingerZhu: any thoughts on changing the default to 1MiB?
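For concreteness, the packing arithmetic behind the huge-page remark, assuming the common 2 MiB x86-64 huge page (the 2 MiB figure is an assumption; the 1 MiB and 16 MiB chunk sizes are the ones under discussion):

```cpp
// Chunk-per-huge-page arithmetic (assumes 2 MiB huge pages).
#include <cstddef>

constexpr std::size_t MiB = 1024 * 1024;
constexpr std::size_t huge_page = 2 * MiB;

// With 1 MiB chunks, two per-thread chunks fit in one huge page.
static_assert(huge_page / (1 * MiB) == 2, "two 1 MiB chunks per huge page");

// A single 16 MiB chunk instead spans eight huge pages by itself.
static_assert((16 * MiB) / huge_page == 8, "16 MiB chunk spans 8 huge pages");
```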
You're very welcome! You folks have been very helpful so far :)
In essence, you have to do a two-stage LLVM build.
So it's using the stage2 LLD to relink the stage2 clang. I use Bruce Dawson's UIforETW to take profile traces: https://github.com/google/UIforETW. Ensure 'trace to file' is checked first on the right side. Click 'Start Tracing' at the top before running the above cmd-line, then 'Save Trace Buffers' once it ends. After it is done compressing the trace, double-clicking on it will open WPA. If the traces are too big and you get out-of-memory crashes, set the following in the wpa.exe.config file next to wpa.exe:
I've attached my build script; you need to run it from a VS 2017 or 2019 x64 Native Tools Command Prompt:
GnuWin32, Python 3.8, and ninja are also needed. Please let me know if there are difficulties along the way.
We're running it with … on a small Linux OpenVZ instance (2 GiB in total), after upgrading to …. The following Rust is not very useful, but I may have time to check the problem further:
I confirmed that this resulted from ….
This is how I reproduce a similar problem on my PC:
@SchrodingerZhu I have raised an issue (#224) for what you have reported. I believe it is independent of the Windows commit issue.
So I ran our benchmark with and without 1mib, and I couldn't see any significant difference.
@Licenser thanks. Did you monitor RSS, or just throughput? If you did monitor RSS, do you have transparent huge pages enabled?
I just looked at throughput; we don't have any benchmarks that look at memory, sorry.
@Licenser thanks for doing this.
@aganea I have replicated the experiment so far. I have checked out a …. One minor tip: you can do ….
@aganea I have also got rpmalloc and mimalloc working in the way your patch describes. Initially, I am observing rpmalloc as slightly slower than snmalloc, but mimalloc is quite a bit slower. Is there anything I might be missing in building mimalloc?
…from the …. Obviously, the machines are different, so we should expect different results. As this is running in the cloud, the costs of various operations are different, and contention may occur in Hyper-V. It is definitely not running on the system heap, as it is getting up to a reasonable percentage of CPU utilization, which the system allocator does not.
@mjp41 Which Windows 10 version is the underlying cloud system running? It might well be something related to allocating hardware pages on the underlying system. mimalloc makes a lot more calls to VirtualAlloc than rpmalloc, which in turn makes more than snmalloc. Please take an ETW trace, then in WPA go to the RandomAscii inclusive view, right-click "Filter to selection" on lld-link, then add two columns, Module and Function, in this order: Process, Module, Function. You'll be able to tell pretty quickly where the bottleneck is. Normally, ntdll.dll & ntoskrnl.exe combined shouldn't take more than 0.5-0.8% of CPU, and most of the time is spent by xperf Rtl functions capturing the callstacks.
@aganea it is running in Azure, so I assume Hyper-V at the bottom, and Windows 10 version 1809 as the OS. Looking at the traces, it is spending a lot of time inside ….
If the instance is running on 1809, then the behavior you're seeing is 'normal': there is a known issue in the NT kernel, where there was contention in the page zero-out mechanism: https://stackoverflow.com/questions/45024029/windows-10-poor-performance-compared-to-windows-7-page-fault-handling-is-not-sc
Same dataset, same LLD linker:
However, after 1909 there's a new contention issue in the large-page allocation path -- I don't know if it was fixed in version 2004: https://twitter.com/alex_toresh/status/1215125422226231297
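To make the linked zero-out issue concrete: committed pages are materialized lazily, so the first write to each page takes a demand-zero page fault, and on 1809 those faults contended badly across threads. A small sketch of where those faults occur (sizes are arbitrary, not taken from this thread):

```cpp
// Sketch of where demand-zero page faults happen on Windows.
#include <windows.h>
#include <cstdio>

int main()
{
  const SIZE_T size = 256 * 1024 * 1024; // 256 MiB, arbitrary

  // Commit is cheap: the kernel only promises zeroed pages.
  char* p = static_cast<char*>(
    VirtualAlloc(nullptr, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
  if (p == nullptr)
    return 1;

  SYSTEM_INFO si;
  GetSystemInfo(&si); // page size, typically 4 KiB

  // Each first touch takes a demand-zero page fault; on 1809 these faults
  // serialized badly when many threads touched fresh pages at once.
  for (SIZE_T off = 0; off < size; off += si.dwPageSize)
    p[off] = 1;

  std::printf("touched %zu pages\n",
              static_cast<size_t>(size / si.dwPageSize));
  VirtualFree(p, 0, MEM_RELEASE);
  return 0;
}
```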
Okay, I'll try to update the VM, though Windows Update wants to go to 1909. Looking at the numbers on the machine today, rpmalloc and the 16MiB configuration were about the same, and the 1MiB was slightly slower, but all pretty close and within the level of noise, so I would actually have to do some statistics to draw a conclusion. The machine Azure gave me yesterday had rpmalloc as slightly slower; I didn't run enough tests to see if that was statistically significant though. Memory usage was approximately as you saw, but off by a factor.
So my VM upgraded to 1909 and now mimalloc is even worse. On this machine, rpmalloc is about 5% faster than snmalloc with 1MiB chunks, with the 16MiB configuration between them. Memory usage looks about the same. I am going to move to 1MiB as the default. It works much better in terms of RSS/PSW, and there are very few scenarios where the reduced throughput seems too costly.
If you are using the Rust crate, it has just been updated and now requires setting either the 1MiB or the 16MiB feature.
@aganea has enabled snmalloc, mimalloc, and rpmalloc to be the allocator for lld-link. He has benchmarked this by performing ThinLTO on a clang build. The results, taken from https://reviews.llvm.org/D71786, are:
The time is pretty comparable, but this shows snmalloc on Windows as committing considerably more memory than other allocators.
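As background for the commit figures, a minimal sketch of what "commit charge" measures and how reserve, commit, and decommit differ, in plain Win32 rather than snmalloc's PAL (sizes arbitrary):

```cpp
// Reserve vs. commit on Windows.
#include <windows.h>

int main()
{
  const SIZE_T size = SIZE_T(1) << 30; // 1 GiB, arbitrary

  // Reserving address space adds nothing to the commit charge...
  void* p = VirtualAlloc(nullptr, size, MEM_RESERVE, PAGE_NOACCESS);
  if (p == nullptr)
    return 1;

  // ...committing does: the kernel guarantees backing for these pages, and
  // the process's "Commit size" in Task Manager grows by 1 GiB.
  VirtualAlloc(p, size, MEM_COMMIT, PAGE_READWRITE);

  // Decommitting returns the charge while keeping the range reserved;
  // this is the operation at stake in this issue.
  VirtualFree(p, size, MEM_DECOMMIT);

  // Releasing frees the reservation itself.
  VirtualFree(p, 0, MEM_RELEASE);
  return 0;
}
```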
Experiments to try