shall we make TCmalloc default (in place of JeMalloc)? #42387
A new Issue was created by @VinInn Vincenzo Innocente. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Here are the two jobs (jemalloc and TCMalloc) at some point;
then cmsRun was terminated and TCMalloc reclaimed huge pages without any impact on RSS.
Compare with |
I'm not a fan of the way JeMalloc treats address space as cheaply disposable, especially on heavily loaded systems, as @VinInn details; so, at least in principle, I support this proposal. |
TCMalloc is designed to work with "standard system settings". |
From some tests I did a few months ago, it looks like the default behaviour of jemalloc has changed between v4 (or even v3?) and v5, and jemalloc is now more conservative before releasing memory to the OS. However, this can be tuned at runtime via environment variables:
export MALLOC_CONF="dirty_decay_ms:1000,muzzy_decay_ms:1000"
cmsRun ...
The default is 10 seconds. |
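A minimal sketch of how different decay settings could be compared, assuming GNU time is available as /usr/bin/time and using a placeholder configuration name (the decay values are illustrative, not a recommendation from this thread):

```bash
#!/bin/bash
# Hypothetical sketch: run the same cmsRun configuration under different
# jemalloc decay settings and record the wall time and peak RSS of each run.
CONFIG=hlt_menu_cfg.py   # placeholder configuration file

for decay_ms in 1000 2000 5000 10000; do
  export MALLOC_CONF="dirty_decay_ms:${decay_ms},muzzy_decay_ms:${decay_ms}"
  # GNU time -v reports "Elapsed (wall clock) time" and
  # "Maximum resident set size (kbytes)" on stderr.
  /usr/bin/time -v cmsRun "$CONFIG" > /dev/null 2> "time_decay_${decay_ms}.log"
  grep -E "Elapsed|Maximum resident" "time_decay_${decay_ms}.log"
done
```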
I suggest we discuss this topic in the core software meeting tomorrow. |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
In the core software meeting yesterday it was agreed that we switch the default allocator to TCMalloc. |
Here is a first look at the performance of a recent HLT menu with CMSSW 13.0.10, using different memory allocators:
As can be seen, jemalloc with the default settings is significantly faster than glibc and tcmalloc |
Is this single process or full machine? |
It's a full machine with the HLT production setup: 8 jobs, each with 32 threads and 24 streams. |
Thanks for the ping Andrea. Of course, -12.5% in throughput at HLT is a lot. I hope that one of the intermediate solutions (jemalloc with 5s, 2s, 1s ...) could be good enough to reduce the RSS usage while maintaining a large throughput |
HLT does not have any issues at all with memory usage, so I do not see any reason why we would change the settings at HLT. |
@gartung ran some measurements at NERSC on Run 3 reconstruction (with ...). So on <= 16 threads/process the tradeoff would be between "up to 5 % higher throughput" vs "4-8 % lower RSS" (and "20-30 % lower VSIZE"). On >= 32 threads/process the tradeoff would be between "10-15 % higher throughput" vs "15 % lower RSS" (and "40-50 % lower VSIZE"). |
My conclusion is that JeMalloc remains more cost-effective if "crashes" due to RSS run-off are not disruptive. These results are interesting in general and worth sharing with a wider audience (HEPIX?) |
@fwyzard look at the companion issue in WMCORE |
I wonder if it would be worth testing ...
I also came across this page https://www.evanjones.ca/hugepages-are-a-good-idea.html, which lists several cases where transparent huge pages have caused performance problems in the past (many, but not all, in conjunction with jemalloc) |
From the tests I did in January, switching THP off (at the kernel level) makes digi and reco slower (at least with JeMalloc) |
I don't see these options documented here, maybe they were valid for jemalloc 4? |
The |
In the meantime, I've rerun the same HLT jobs with more configurations, keeping track of the VSize, RSS and PSS: |
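For reference, a minimal sketch of one way to sample these quantities from /proc (not necessarily the method used for the numbers above; the 10 s interval and the process-name match are arbitrary choices):

```bash
#!/bin/bash
# Hypothetical sketch: sample VmSize, VmRSS and PSS (all in kB) for every
# running cmsRun process every 10 seconds. PSS requires a kernel that
# exposes /proc/<pid>/smaps_rollup.
while sleep 10; do
  for pid in $(pgrep cmsRun); do
    vsz=$(awk '/^VmSize/ {print $2}' /proc/$pid/status)
    rss=$(awk '/^VmRSS/  {print $2}' /proc/$pid/status)
    pss=$(awk '/^Pss:/   {print $2}' /proc/$pid/smaps_rollup)
    echo "$(date +%s) pid=$pid VmSize=${vsz} VmRSS=${rss} PSS=${pss} (kB)"
  done
done
```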
OK, I can rerun the same HLT tests with those options. |
One can always switch off THP at the system level as root;
at least for test purposes one can also clean up the machine by defragmenting the memory before the test. |
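For example, the standard kernel knobs for this look roughly as follows (a sketch assuming the usual Linux sysfs/procfs paths; to be run as root):

```bash
# Switch transparent huge pages off system-wide:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Before a test on a fragmented machine: flush caches and ask the kernel to
# compact memory, so that contiguous (huge) pages become available again.
sync
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
```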
@gartung ran more tests on Perlmutter by adding MiniAOD+DQM on top of the Run 3 reco mentioned in #42387 (comment). The job now has two |
@fwyzard @gartung , yesterday we integrated a new version (2.11) of gperftools/tcmalloc in the 13.3.X IBs, can you please rerun the performance comparison tests? |
@smuzaffar all my tests were done in 13.0.x, I don't have a working HLT configuration for 13.3.x. Can I just |
yes I think that should work |
Here is the comparison of the old (gperftools 2.8.1, tcmalloc 4.5.9) and new (gperftools 2.11, tcmalloc 4.5.11) releases:
The difference seems negligible. |
The IBs are not available at NERSC where I ran the measurements. |
@gartung , can you install and use a local IB? If yes then I can send you the instructions |
Yes, and I know how to do a local install. |
@makortel , given that the old and new tcmalloc show nearly identical results, should we revert the tcmalloc change and go back to using jemalloc? |
I agree, let's go back to jemalloc. |
Given that the reversion to jemalloc as the default (#42671) was merged, I think we can close the issue. |
+core |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
TCmalloc seems to be more robust in conditions of high fragmentation.
This was already observed earlier this year in the case of a huge malloc/free storm
(see https://indico.cern.ch/event/1164136/contributions/5228823/attachments/2580782/4451207/Fragmentation.pdf and
#40537 )
More recently, in #40437 (in the presence of high fragmentation) it was noted how the RSS can suddenly grow by even more than 2 GB when a different process terminates. This is associated with an increase of HugePages (defragmentation).
On a large machine (not fragmented) JeMalloc and TCmalloc seem to behave similarly, with JeMalloc always allocating more virtual memory (zero pages?).
On a smaller, highly fragmented (virtual) machine (essentially no large pages available),
JeMalloc seems to use a much larger amount of VSS,
and when a different job is terminated it promptly triggers defragmentation, increasing the amount of "AnonHugePages" and hence the RSS, while a similar TCMalloc process increases its number of HugePages without affecting the RSS.
In fact, in these conditions of high fragmentation, TCMalloc seems to use HugePages even more effectively than JeMalloc.
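For reference, a minimal sketch of how these counters can be followed from /proc while the jobs run (the 5 s interval and the process-name match are illustrative assumptions):

```bash
#!/bin/bash
# Hypothetical sketch: follow the huge-page counters discussed above.
# /proc/meminfo gives the system-wide picture; the per-process AnonHugePages
# total from /proc/<pid>/smaps is the quantity that grows together with the
# RSS in the JeMalloc case described here.
while sleep 5; do
  grep -E 'AnonHugePages|HugePages_Total|HugePages_Free' /proc/meminfo
  for pid in $(pgrep cmsRun); do
    anon=$(awk '/^AnonHugePages/ {sum+=$2} END {print sum+0}' /proc/$pid/smaps)
    echo "pid $pid AnonHugePages=${anon} kB"
  done
done
```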
It should be noted that cmsRunTC is in general faster than cmsRunJE by a small amount (a few percent).
My personal conclusion is that there is no evidence of adverse effects in using TCMalloc, while there are several incidents that can be connected to "misbehavior" of the OS due to JeMalloc.