Investigate Windows Commit charge for snmalloc #223

Closed
mjp41 opened this issue Jul 3, 2020 · 24 comments

@mjp41
Member

mjp41 commented Jul 3, 2020

@aganea has enabled snmalloc, mimalloc, and rpmalloc as allocators for lld-link, and has benchmarked them by running ThinLTO on a clang build. The results, taken from https://reviews.llvm.org/D71786, are:

| Allocator | Wall clock | Page ranges committed/decommitted | Total touched pages | Peak Mem |
|---|---|---|---|---|
| Windows 10 version 2004 | 38 min 47 sec | | | 14.9 GB |
| mimalloc | 2 min 22 sec | 1,449,501 | 174.3 GB | 19.8 GB |
| rpmalloc | 2 min 15 sec | 270,796 | 45.9 GB | 31.9 GB |
| snmalloc | 2 min 19 sec | 102,839 | 47.0 GB | 42.0 GB |

The time is pretty comparable, but this shows snmalloc on Windows as committing considerably more memory than other allocators.
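
To put "pretty comparable" and "considerably more" into numbers, the wall-clock strings can be converted to seconds and the peak-memory figures compared against mimalloc. This is only arithmetic on the figures quoted above; the helper names `to_seconds` and `summary` are made up for illustration.

```python
import re


def to_seconds(text):
    """Convert a string such as '2 min 19 sec' to a number of seconds."""
    m = re.fullmatch(r"(\d+) min (\d+) sec", text.strip())
    if m is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(m.group(1)) * 60 + int(m.group(2))


# Figures copied from the results above: (wall clock, peak memory in GB).
results = {
    "mimalloc": ("2 min 22 sec", 19.8),
    "rpmalloc": ("2 min 15 sec", 31.9),
    "snmalloc": ("2 min 19 sec", 42.0),
}
# Wall clock of the "Windows 10 version 2004" row, used as the baseline.
baseline = to_seconds("38 min 47 sec")


def summary():
    """Return {allocator: (speedup over baseline, peak memory relative to mimalloc)}."""
    ref_peak = results["mimalloc"][1]
    return {
        name: (round(baseline / to_seconds(clock), 1), round(peak / ref_peak, 2))
        for name, (clock, peak) in results.items()
    }


if __name__ == "__main__":
    for name, (speedup, rel_peak) in summary().items():
        print(f"{name}: {speedup}x faster than baseline, {rel_peak}x mimalloc's peak memory")
```

The three allocators land within about 5% of each other on time, while snmalloc's peak memory is roughly twice mimalloc's.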

Experiments to try

  • Different "chunk" size, 16MiB versus 1MiB.
  • Sub chunk, commit/decommit operations
@mjp41 mjp41 self-assigned this Jul 3, 2020
@aganea

aganea commented Jul 3, 2020

@davidchisnall mentioned: "By default, on Windows, snmalloc only decommits memory when the kernel notifies it that memory is constrained. If you've got loads of spare memory, there's no problem letting the commit size grow a lot, it's only a negative if the memory could be usefully used for something else."

The test machine has 128 GB of RAM and memory is far from being constrained. I will nevertheless re-test with IS_ADDRESS_SPACE_CONSTRAINED and the latest master branch. My test was using the tree_index branch.

@aganea

aganea commented Jul 3, 2020

Figures with latest master:

| Allocator | Wall clock | Page ranges committed/decommitted | Total touched pages | Peak Mem |
|---|---|---|---|---|
| default | 2 min 21 sec | 73,611 | 43.8 GB | 42.6 GB |
| +IS_ADDRESS_SPACE_CONSTRAINED | 2 min 21 sec | 48,836 | 92.7 GB | 21.6 GB |

It seems the latest snmalloc checkout is a tad slower but the commit is now much better with David's suggestion.

[Image: snmalloc_is_constrained]

@davidchisnall
Collaborator

That looks a lot more plausible. We should probably rename IS_ADDRESS_SPACE_CONSTRAINED: it's a bit misleading. Since @mjp41's recent work, we rarely see a performance advantage from using 16MiB super slabs, so I wonder if we should consider adjusting the default to 1MiB (or 2MiB, which would play nicely with superpages on x86).
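
A rough way to see why the chunk size can matter for commit charge is a toy model in which every thread's live data is rounded up to whole chunks. This is an assumption made purely for illustration, not a description of how snmalloc actually accounts for memory; the function `committed` and the workload numbers (72 threads, 3 MiB live per thread) are hypothetical.

```python
MiB = 1 << 20


def committed(live_per_thread, threads, chunk_size):
    """Bytes committed if each thread rounds its live data up to whole chunks."""
    chunks = -(-live_per_thread // chunk_size)  # ceiling division
    return threads * chunks * chunk_size


if __name__ == "__main__":
    live = 3 * MiB  # hypothetical live data per thread
    for chunk in (1 * MiB, 2 * MiB, 16 * MiB):
        total = committed(live, threads=72, chunk_size=chunk)
        print(f"{chunk // MiB:>2} MiB chunks: {total // MiB} MiB committed")
```

In this model the rounding overhead grows with both the chunk size and the thread count, which is why a smaller default is attractive for heavily threaded programs with little live data per thread.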

@mjp41
Member Author

mjp41 commented Jul 4, 2020

@aganea thank you so much for running more tests. I wonder if the small regression in performance comes from not using the tree_index branch, which should have improved Windows performance a bit. I have rebased the tree_index branch onto master so we can test those changes again and see if they account for the regression.

I have been setting up an LLVM Windows build, so I can test the link time. Just wanted to confirm I am doing the same as you

cmake ..\llvm-project\llvm \
   -DLLVM_ENABLE_LTO=On \
   -DCMAKE_LINKER=c:/src/malloc-llvm/build/lld-link.exe \
   -DLLVM_ENABLE_PROJECTS=clang \
   -DCMAKE_BUILD_TYPE=Release \
   -DCMAKE_C_COMPILER=c:/src/malloc-llvm/build/bin/clang-cl.exe \
   -DCMAKE_CXX_COMPILER=c:/src/malloc-llvm/build/bin/clang-cl.exe \
   -G Ninja

When building this, I assume you are measuring the final step

Linking CXX executable bin\clang.exe  

I am building with the latest master, but I haven't tried to apply your patch yet. Just getting the very slow version with lld-link so far.

@mjp41
Member Author

mjp41 commented Jul 4, 2020

@davidchisnall I think moving to the 1MiB size as default would make a lot of sense. I think 1MiB could work well with a little fiddling around huge pages, so we can put two threads into one huge page, for low-memory multi-threaded scenarios.

Agreed, IS_ADDRESS_SPACE_CONSTRAINED is a terrible name. I named it after why I needed it originally, rather than what it does.

@Licenser, @darach, @SchrodingerZhu any thoughts on changing the default to 1MiB?

@mjp41
Member Author

mjp41 commented Jul 4, 2020

This comment contains some benchmarking using the microbenchmarks for mimalloc for the different chunk sizes.

@plietar any thoughts on changing the default to 1MiB?

@aganea

aganea commented Jul 4, 2020

> @aganea thank you so much for running more tests.

You're very welcome! You folks have been very helpful so far :)

> Just wanted to confirm I am doing the same as you

In essence, you have to do a two-stage LLVM build.

  1. git clone https://github.com/llvm/llvm-project (or git pull in an existing checkout), then apply the patch from https://reviews.llvm.org/D71786.
  2. The first stage builds LLVM with the bootstrap compiler (any compiler). You could use the allocator at this point if you wish.
  3. The second stage builds LLVM with the first-stage compiler. At this point cmake uses ThinLTO, the allocator, O3, and -march=skylake (or whatever your CPU is) to ensure maximum performance.
  4. Once everything is built, delete buildninjaStage2\bin\clang.exe, then re-run ninja clang -v. While it's linking (it should last a bit), go into the folder buildninjaStage2\CMakeFiles and copy clang.rsp (a temp file created during the link) to clang2.rsp. You can cancel the link at that point.
  5. Put the following line in a new file buildninjaStage2\link.rsp: /nologo @CMakeFiles\clang2.rsp /out:bin\clang.exe /implib:lib\clang.lib /pdb:bin\clang.pdb /version:0.0 /machine:x64 -fuse-ld=lld /STACK:10000000 /DEBUG /OPT:REF /OPT:ICF /INCREMENTAL:NO /subsystem:console /opt:lldltojobs=all. All these gymnastics are needed because there's no option to disable the ThinLTO cache from cmake, nor an option to use all hardware threads (by default, only one thread per core is used).
  6. You only need to do the above steps once. You can now run the test with:
> cd buildninjaStage2
> bin\lld-link @link.rsp

So it's using the stage2 LLD to relink the stage2 clang.
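
Step 5 can be scripted so that the response file is always written the same way. A minimal sketch, with the options copied from step 5 above; the function name `write_link_rsp` is made up for illustration, and the script only creates the file (it does not run the linker).

```python
from pathlib import Path

# Options copied verbatim from step 5 above.
LINK_RSP = (
    r"/nologo @CMakeFiles\clang2.rsp /out:bin\clang.exe /implib:lib\clang.lib "
    r"/pdb:bin\clang.pdb /version:0.0 /machine:x64 -fuse-ld=lld /STACK:10000000 "
    r"/DEBUG /OPT:REF /OPT:ICF /INCREMENTAL:NO /subsystem:console "
    r"/opt:lldltojobs=all"
)


def write_link_rsp(build_dir):
    """Write link.rsp into build_dir and return the path of the new file."""
    path = Path(build_dir) / "link.rsp"
    path.write_text(LINK_RSP + "\n")
    return path


if __name__ == "__main__":
    # Assumes the stage 2 build directory already exists.
    print(write_link_rsp("buildninjaStage2"))
```

After running it once, the test is started with bin\lld-link @link.rsp from inside buildninjaStage2, as shown above.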

I use Bruce Dawson's UIforETW to take profile traces: https://github.com/google/UIforETW. Make sure to check 'trace to file' on the right side first. Click 'Start Tracing' at the top before running the above command line, then 'Save Trace Buffers' once it ends. After it has finished compressing the trace, double-clicking on it opens WPA. If the traces are too big and you get out-of-memory crashes, set the following in the wpa.exe.config file next to wpa.exe:

<configuration>
  ...
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>

I've attached my build script:
make_llvm_snmalloc.zip

You need to run it from a VS 2017 or 2019 x64 Native Tools Command Prompt:

D:\llvm-project> make_llvm_snmalloc.bat buildninjaStage1
D:\llvm-project> ninja check-all -C buildninjaStage1
D:\llvm-project> make_llvm_snmalloc.bat buildninjaStage2
D:\llvm-project> ninja check-all -C buildninjaStage2

GnuWin32, Python 3.8, ninja are also needed.

Please let me know if there're difficulties along the way.

@Licenser

Licenser commented Jul 4, 2020

We're running it with default-features = false so it won't affect us but I'll try to get in some benchmarks on Monday of the impact of 1mb vs no features :)

@SchrodingerZhu
Collaborator

On a small Linux OpenVZ instance (2 GiB in total), after upgrading to snmalloc-rs==0.2.16, I ran into "Out of memory" on initialisation, both with and without the 1mib flag.

The following backtrace from the Rust program is not very useful, but I may have time to check the problem further:

Out of memory
/bin/utopia(+0x62ff7b)[0x56062d9fdf7b]
/bin/utopia(+0x6304b3)[0x56062d9fe4b3]
/bin/utopia(+0x630511)[0x56062d9fe511]
/bin/utopia(+0x630b72)[0x56062d9feb72]
/bin/utopia(+0x6340fc)[0x56062da020fc]
/bin/utopia(+0x3c75e3)[0x56062d7955e3]
/bin/utopia(+0x17c2a7)[0x56062d54a2a7]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f27569406a3]
/bin/utopia(+0xe8ede)[0x56062d4b6ede]

I confirmed that this is caused by snmalloc, since switching to the system default malloc resolves the memory issue. The problem appeared right after upgrading to 0.2.16 (94a2ba4).

@SchrodingerZhu
Collaborator

SchrodingerZhu commented Jul 4, 2020

This is how I reproduce a similar problem on my PC:

  • firejail --noprofile --rlimit-as=2147483648 bash: no problem
  • firejail --noprofile --rlimit-as=2147483648 --env=LD_PRELOAD=/tmp/snmalloc/test/libsnmallocshim.so bash: dead
  • firejail --noprofile --env=LD_PRELOAD=/tmp/snmalloc/test/libsnmallocshim.so bash: no problem

@SchrodingerZhu
Collaborator

[schrodinger@Monad utopia]$ ulimit -Sv 500000
[schrodinger@Monad utopia]$ strace env LD_PRELOAD=/tmp/snmalloc/test/libsnmallocshim.so bash
execve("/usr/bin/env", ["env", "LD_PRELOAD=/tmp/snmalloc/test/li"..., "bash"], 0x7ffe909ace00 /* 69 vars */) = 0
brk(NULL)                               = 0x55a2b674d000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe9daa37e0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = 0
openat(AT_FDCWD, "/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
close(3)                                = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=458620, ...}) = 0
mmap(NULL, 458620, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0ca548a000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@q\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=2146832, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0ca5488000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
mmap(NULL, 1860456, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ca52c1000
mprotect(0x7f0ca52e6000, 1671168, PROT_NONE) = 0
mmap(0x7f0ca52e6000, 1363968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f0ca52e6000
mmap(0x7f0ca5433000, 303104, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x172000) = 0x7f0ca5433000
mmap(0x7f0ca547e000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bc000) = 0x7f0ca547e000
mmap(0x7f0ca5484000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ca5484000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7f0ca5489580) = 0
mprotect(0x7f0ca547e000, 12288, PROT_READ) = 0
mprotect(0x55a2b5e45000, 4096, PROT_READ) = 0
mprotect(0x7f0ca5525000, 4096, PROT_READ) = 0
munmap(0x7f0ca548a000, 458620)          = 0
brk(NULL)                               = 0x55a2b674d000
brk(0x55a2b676e000)                     = 0x55a2b676e000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=6187360, ...}) = 0
mmap(NULL, 6187360, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0ca4cda000
close(3)                                = 0
execve("/home/schrodinger/.opam/default/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/var/lib/snapd/snap/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/home/schrodinger/.idris2/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/opt/mpich/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/home/schrodinger/.local/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/opt/intel/system_studio_2020/compilers_and_libraries/linux/bin/intel64/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/opt/testa/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/home/schrodinger/.cargo/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/sbin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = 0
brk(NULL)                               = 0x557ec3ddb000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe95b408f0) = -1 EINVAL (Invalid argument)
openat(AT_FDCWD, "/tmp/snmalloc/test/libsnmallocshim.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\21\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1880712, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1659e35000
mmap(NULL, 16998432, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658dfe000
mmap(0x7f1658dff000, 57344, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7f1658dff000
mmap(0x7f1658e0d000, 147456, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xf000) = 0x7f1658e0d000
mmap(0x7f1658e31000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x32000) = 0x7f1658e31000
mmap(0x7f1658e33000, 16781344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658e33000
close(3)                                = 0
access("/etc/ld.so.preload", R_OK)      = 0
openat(AT_FDCWD, "/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
close(3)                                = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=458620, ...}) = 0
mmap(NULL, 458620, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f1658d8e000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libreadline.so.8", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 `\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=326416, ...}) = 0
mmap(NULL, 334344, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658d3c000
mmap(0x7f1658d52000, 163840, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x7f1658d52000
mmap(0x7f1658d7a000, 40960, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3e000) = 0x7f1658d7a000
mmap(0x7f1658d84000, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x47000) = 0x7f1658d84000
mmap(0x7f1658d8d000, 2568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658d8d000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\22\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=18608, ...}) = 0
mmap(NULL, 20624, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658d36000
mmap(0x7f1658d37000, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7f1658d37000
mmap(0x7f1658d39000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f1658d39000
mmap(0x7f1658d3a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f1658d3a000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@q\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=2146832, ...}) = 0
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
mmap(NULL, 1860456, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658b6f000
mprotect(0x7f1658b94000, 1671168, PROT_NONE) = 0
mmap(0x7f1658b94000, 1363968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f1658b94000
mmap(0x7f1658ce1000, 303104, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x172000) = 0x7f1658ce1000
mmap(0x7f1658d2c000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bc000) = 0x7f1658d2c000
mmap(0x7f1658d32000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658d32000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\201\0\0\0\0\0\0"..., 832) = 832
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0:(A\261\254\325W\2768O\340i9\4#\234"..., 68, 824) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=161024, ...}) = 0
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0:(A\261\254\325W\2768O\340i9\4#\234"..., 68, 824) = 68
mmap(NULL, 135600, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658b4d000
mmap(0x7f1658b54000, 65536, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7000) = 0x7f1658b54000
mmap(0x7f1658b64000, 20480, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f1658b64000
mmap(0x7f1658b69000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b000) = 0x7f1658b69000
mmap(0x7f1658b6b000, 12720, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658b6b000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libatomic.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0  \0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=167952, ...}) = 0
mmap(NULL, 36936, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658b43000
mmap(0x7f1658b45000, 12288, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f1658b45000
mmap(0x7f1658b48000, 8192, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5000) = 0x7f1658b48000
mmap(0x7f1658b4a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7f1658b4a000
mmap(0x7f1658b4c000, 72, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658b4c000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@`\t\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=20945112, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1658b41000
mmap(NULL, 1951744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658964000
mprotect(0x7f16589fa000, 1269760, PROT_NONE) = 0
mmap(0x7f16589fa000, 966656, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x96000) = 0x7f16589fa000
mmap(0x7f1658ae6000, 299008, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x182000) = 0x7f1658ae6000
mmap(0x7f1658b30000, 57344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1cb000) = 0x7f1658b30000
mmap(0x7f1658b3e000, 10240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658b3e000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\363\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1328000, ...}) = 0
mmap(NULL, 1327128, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f165881f000
mmap(0x7f165882e000, 634880, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xf000) = 0x7f165882e000
mmap(0x7f16588c9000, 626688, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xaa000) = 0x7f16588c9000
mmap(0x7f1658962000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x142000) = 0x7f1658962000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libncursesw.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 p\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=457736, ...}) = 0
mmap(NULL, 462072, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f16587ae000
mmap(0x7f16587c5000, 245760, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f16587c5000
mmap(0x7f1658801000, 98304, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x53000) = 0x7f1658801000
mmap(0x7f1658819000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6a000) = 0x7f1658819000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 0\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=595552, ...}) = 0
mmap(NULL, 103144, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658794000
mmap(0x7f1658797000, 69632, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f1658797000
mmap(0x7f16587a8000, 16384, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14000) = 0x7f16587a8000
mmap(0x7f16587ac000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f16587ac000
close(3)                                = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1658792000
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f165878f000
arch_prctl(ARCH_SET_FS, 0x7f165878f780) = 0
mprotect(0x7f1658d2c000, 12288, PROT_READ) = 0
mprotect(0x7f16587ac000, 4096, PROT_READ) = 0
mprotect(0x7f1658819000, 20480, PROT_READ) = 0
mprotect(0x7f1658962000, 4096, PROT_READ) = 0
mprotect(0x7f1658b30000, 53248, PROT_READ) = 0
mprotect(0x7f1658b69000, 4096, PROT_READ) = 0
mprotect(0x7f1658b4a000, 4096, PROT_READ) = 0
mprotect(0x7f1658d3a000, 4096, PROT_READ) = 0
mprotect(0x7f1658d84000, 12288, PROT_READ) = 0
mprotect(0x7f1658e31000, 4096, PROT_READ) = 0
mprotect(0x557ec257d000, 12288, PROT_READ) = 0
mprotect(0x7f1659e62000, 4096, PROT_READ) = 0
munmap(0x7f1658d8e000, 458620)          = 0
set_tid_address(0x7f165878fa50)         = 2446087
set_robust_list(0x7f165878fa60, 24)     = 0
rt_sigaction(SIGRTMIN, {sa_handler=0x7f1658b54bf0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7f1658b61960}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0x7f1658b54c90, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7f1658b61960}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0xb), ...}) = 0
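
The last mmap in the trace asks for 4294967296 bytes (4 GiB) of anonymous memory in a single call and fails with ENOMEM, which is the expected result when the address-space limit is smaller than the request. The same effect can be reproduced without snmalloc. A minimal sketch, assuming Linux and Python's `resource` module; the helper name `try_reserve` and the child script are made up for illustration, and the limit is applied in a child process so the calling process is unaffected.

```python
import subprocess
import sys

# Child script: cap its own address space, then try one large anonymous mapping.
CHILD = r"""
import errno, mmap, resource, sys
limit, reserve = int(sys.argv[1]), int(sys.argv[2])
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
if hard != resource.RLIM_INFINITY:
    limit = min(limit, hard)  # the soft limit may not exceed the hard limit
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
try:
    mmap.mmap(-1, reserve)  # anonymous mapping of `reserve` bytes
    print("ok")
except OSError as e:
    print(errno.errorcode.get(e.errno, "error"))
"""


def try_reserve(limit_bytes, reserve_bytes):
    """Return 'ok' or the errno name from mapping reserve_bytes under RLIMIT_AS."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, str(limit_bytes), str(reserve_bytes)],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout.strip()


if __name__ == "__main__":
    GiB = 1 << 30
    # Same limit as the firejail reproduction (2147483648 bytes), 4 GiB request.
    print(try_reserve(2 * GiB, 4 * GiB))
```

The request is rejected up front because the limit counts reserved address space, regardless of how much of the mapping would ever be touched.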

@mjp41
Member Author

mjp41 commented Jul 5, 2020

@SchrodingerZhu I have raised an issue (#224) for what you have reported. I believe it is independent of the Windows commit issue.

@Licenser

Licenser commented Jul 6, 2020

So I ran our benchmark with and without 1mib and I couldn't see any significant difference.

@mjp41
Member Author

mjp41 commented Jul 6, 2020

@Licenser thanks. Did you monitor RSS, or just throughput? If you did monitor RSS, do you have transparent huge pages enabled?

@Licenser

Licenser commented Jul 6, 2020

I just looked at throughput; we don't have any benchmarks that look at memory, sorry.

@mjp41
Member Author

mjp41 commented Jul 6, 2020

@Licenser thanks for doing this.

@mjp41
Member Author

mjp41 commented Jul 6, 2020

@aganea I have replicated the experiment so far. I have set up a Standard F72s_v2 (72 vCPUs, 144 GiB memory) Windows 10 instance on Azure, and am getting about 32GiB PWS with the 16MiB chunk size and 16GiB PWS with the 1MiB chunk size. The times look like they might be slightly faster with 16MiB, but I am not sure, so I am running some statistically meaningful tests. I also re-tested the tree_index branch.

One minor tip, you can do ninja -v -d keeprsp, then you don't have to worry about the rsp file being deleted by ninja in step 3 of your instructions.

@mjp41
Member Author

mjp41 commented Jul 6, 2020

@aganea I have also got rpmalloc and mimalloc working in the way your patch describes.

Initially, I am observing rpmalloc as slightly slower than snmalloc, but mimalloc is quite a bit slower. Is there anything I might be missing in building mimalloc? I manually applied your patch, and then ran

msbuild mimalloc.sln /m /P:Configuration=Release /t:rebuild

from the ide\vs2019 directory.

Obviously, the machines are different, so we should expect different results. As this is running in the cloud, the cost of various operations is different, and there may be contention in Hyper-V.

It is definitely not running the system heap, as it is getting up to a reasonable percentage CPU utilization, which the system allocator does not.

@aganea

aganea commented Jul 6, 2020

@mjp41 What Windows 10 version is the underlying cloud system? It might definitely be something related to allocating hardware pages on the underlying system. mimalloc makes a lot more calls to VirtualAlloc than rpmalloc, which in turn makes more than snmalloc. Please take an ETW trace, then in WPA go to the RandomAscii inclusive view, right-click lld-link and choose "Filter to selection", then add two columns, Module and Function, so that the order is: Process, Module, Function. You'll be able to tell pretty quickly where the bottleneck is. Normally, ntdll.dll and ntoskrnl.exe combined shouldn't take more than 0.5-0.8% of CPU, and most of that time is spent by xperf Rtl functions capturing the callstacks.

@mjp41
Member Author

mjp41 commented Jul 7, 2020

@aganea it is running in Azure, so I assume HyperV at the bottom, and Windows 10 version 1809 as the OS.

Looking at the traces, it is spending a lot of time inside ntoskrnl.exe in spin locks, about 45%. I haven't drilled into the traces much, but I think it seems to be around page handling.

@aganea

aganea commented Jul 7, 2020

If the instance is running on 1809, then the behavior you're seeing is 'normal'.

There's a known issue in the NT kernel, where there was contention in the page zero-out mechanism: https://stackoverflow.com/questions/45024029/windows-10-poor-performance-compared-to-windows-7-page-fault-handling-is-not-sc
This was fixed after version 1903.

Same dataset, same LLD linker:
[Image: 6140_ThinLTO_1709_vs_1909]

However after 1909 there's a new contention issue in the large page allocation -- I don't know if it was fixed in version 2004: https://twitter.com/alex_toresh/status/1215125422226231297

@mjp41
Member Author

mjp41 commented Jul 7, 2020

Okay, I'll try to update the VM. Though, Windows update wants to go to 1909.

Looking at the numbers on the machine today, rpmalloc and the 16MiB configuration were about the same, and the 1MiB configuration was slightly slower, but all were pretty close and within the level of noise, so I would actually have to do some statistics to draw a conclusion. The machine Azure gave me yesterday had rpmalloc as slightly slower, but I didn't run enough tests to see whether that was statistically significant.

Memory usage was approximately as you saw but off by a factor.

  • 16MiB configuration - 32.4 Gb
  • 1MiB configuration - 16.2 Gb
  • rpmalloc - 28 Gb (Not tried the array-cache branch yet)
  • mimalloc - 14.3 Gb
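
The "off by a factor" remark can be checked by dividing the peak-memory figures reported earlier in this thread by the ones listed here. The sketch below makes two assumptions that the thread does not confirm: that both sets of figures use the same unit, and that the +IS_ADDRESS_SPACE_CONSTRAINED run corresponds to the 1MiB configuration. The name `factors` is made up for illustration.

```python
# Peak memory per configuration: (figure reported earlier in the thread,
# figure from the list above). Pairing the +IS_ADDRESS_SPACE_CONSTRAINED run
# with the 1MiB configuration is an assumption.
peaks = {
    "snmalloc 16MiB": (42.6, 32.4),
    "snmalloc 1MiB": (21.6, 16.2),
    "rpmalloc": (31.9, 28.0),
    "mimalloc": (19.8, 14.3),
}


def factors():
    """Return {configuration: earlier figure divided by the figure above}."""
    return {name: round(a / b, 2) for name, (a, b) in peaks.items()}


if __name__ == "__main__":
    for name, factor in factors().items():
        print(f"{name}: {factor}")
```

Under those assumptions the two snmalloc configurations differ by nearly the same factor across the two machines, while rpmalloc's factor is smaller.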

@mjp41
Member Author

mjp41 commented Jul 7, 2020

So my VM upgraded to 1909 and now mimalloc is even worse. On this machine, rpmalloc is about 5% faster than snmalloc 1MiB, with the 16MiB configuration in between. Memory usage looks about the same.

I am going to move to 1MiB as the default. It works much better in terms of RSS/PWS, and there are very few scenarios where the reduced throughput seems too costly.

@SchrodingerZhu
Collaborator

SchrodingerZhu commented Jul 9, 2020

@Licenser

> We're running it with default-features = false so it won't affect us but I'll try to get in some benchmarks on Monday of the impact of 1mb vs no features :)

If you are using the Rust crate, it has just been updated and now requires setting either the 1mib feature or the 16mib feature.
This is a breaking change if you are using default-features=false.

@mjp41 mjp41 closed this as completed Nov 6, 2020