
shall we make TCmalloc default (in place of JeMalloc)? #42387

Closed
VinInn opened this issue Jul 27, 2023 · 41 comments

Comments

@VinInn
Contributor

VinInn commented Jul 27, 2023

TCmalloc seems to be more robust in conditions of high fragmentation.

This was already observed earlier this year in the case of a huge malloc/free storm
(see https://indico.cern.ch/event/1164136/contributions/5228823/attachments/2580782/4451207/Fragmentation.pdf and
#40537 )

More recently, in #40437 (in the presence of high fragmentation), it was noted that RSS can suddenly grow by more than 2 GB when a different process terminates. This is associated with an increase of HugePages (defragmentation).

On a large machine (not fragmented) JeMalloc and TCmalloc seem to behave similarly, with JeMalloc always allocating more virtual memory (zero pages?):

 741203 innocent  20   0   12.4g   8.5g 731196 R 399.0   0.8 360:38.86 cmsRun
 746752 innocent  20   0 8358992   6.8g 735316 R 399.0   0.7  45:22.00 cmsRunTC
[innocent@olspx-01 rereco]$ cat /proc/741203/smaps_rollup
Rss:             8924524 kB
Anonymous:       8193328 kB
AnonHugePages:   5349376 kB
Swap:                  0 kB
[innocent@olspx-01 rereco]$ cat /proc/746752/smaps_rollup
00400000-7ffeaed74000 ---p 00000000 00:00 0                              [rollup]
Rss:             7081552 kB
Anonymous:       6346236 kB
AnonHugePages:   3268608 kB
Swap:                  0 kB

On a smaller, highly fragmented virtual machine (essentially no large pages available),
JeMalloc seems to use a much larger amount of VSS:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+  P   SWAP    DATA nMaj nDRT WCHAN      COMMAND
 109551 innocent  20   0 8121920   7.1g 619328 R 400.0  24.6  69:21.34  3      0 7150016 1009    0 -          cmsRunTC
 111064 innocent  20   0   18.1g   7.1g 619520 R 399.7  24.5  18:29.82  8      0   17.2g   13    0 -          cmsRun

When a different job terminates it promptly triggers defragmentation: the amount of "AnonHugePages" grows and with it the RSS, while a similar TCMalloc process increases its number of HugePages without affecting RSS.
In fact, under these conditions of high fragmentation TCMalloc seems to use HugePages even more effectively than JeMalloc.

It should be noted that cmsRunTC is in general faster than cmsRunJE by a small amount (a few percent).

My personal conclusion is that there is no evidence of adverse effects in using TCMalloc, while there are several incidents that can be connected to "misbehavior" of the OS due to JeMalloc.

@cmsbuild
Contributor

A new Issue was created by @VinInn Vincenzo Innocente.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@VinInn VinInn changed the title shall we make TCmalloc default (in place of JeMalloc) shall we make TCmalloc default (in place of JeMalloc)? Jul 27, 2023
@VinInn
Contributor Author

VinInn commented Jul 27, 2023

Here are the two jobs (JE and TC) at some point:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+  P   SWAP    DATA nMaj nDRT WCHAN      COMMAND
 116744 innocent  20   0 8960512   7.2g 183360 R 399.7  24.9 424:19.94  4 224768 7988480 3667    0 -          cmsRunTC
 111064 innocent  20   0   18.1g   7.6g 182144 R 398.3  26.3 558:51.76  1   1.1g   17.2g 4473    0 -          cmsRun

Then cmsRun was terminated, and TC reclaimed huge pages without any impact on RSS:

[innocent@lxplus801 rereco]$ cat /proc/116744/smaps_rollup
Rss:             7572416 kB
Anonymous:       7387648 kB
AnonHugePages:         0 kB
Swap:             412160 kB
[innocent@lxplus801 rereco]$ cat /proc/116744/smaps_rollup
Rss:             7813568 kB
Anonymous:       7628800 kB
AnonHugePages:   1048576 kB
Swap:             175744 kB
[innocent@lxplus801 rereco]$ cat /proc/116744/smaps_rollup
Rss:             8172416 kB
Anonymous:       7987648 kB
AnonHugePages:   5767168 kB
Swap:             125824 kB
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+  P   SWAP    DATA nMaj nDRT WCHAN      COMMAND
 116744 innocent  20   0 9123968   7.8g 184640 R 396.7  26.9 835:46.57  1 126272 8151936 8794    0 -          cmsRunTC

compare with
#40437 (comment)

@dan131riley

I'm not a fan of the way JeMalloc treats address space as cheaply disposable, especially on heavily loaded systems, as @VinInn details; so, at least in principle, I support this proposal.

@VinInn
Contributor Author

VinInn commented Jul 27, 2023

TCMalloc is designed to work with "standard system settings"
(https://google.github.io/tcmalloc/tuning.html#system-level-optimizations),
so one can hope it is less likely to trigger misbehaviour.

@fwyzard
Contributor

fwyzard commented Jul 29, 2023

From some tests I did a few months ago, it looks like the default behaviour of jemalloc has changed between v4 (or even v3?) and v5, and now jemalloc is more conservative before releasing memory to the OS.

However, this can be tuned at runtime via environment variables:

export MALLOC_CONF="dirty_decay_ms:1000,muzzy_decay_ms:1000"
cmsRun ...

The default is 10 seconds.
See https://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms for more info.

@makortel
Contributor

I suggest we discuss this topic in the core software meeting tomorrow.

@makortel
Contributor

makortel commented Aug 2, 2023

assign core

@cmsbuild
Contributor

cmsbuild commented Aug 2, 2023

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Aug 2, 2023

In the core software meeting yesterday it was agreed to switch the default cmsRun in master to use TCmalloc now (done in #42440), in order to gain experience in IBs and have it validated as part of 13_3_0_pre1. @fwyzard agreed to run performance tests for HLT comparing default jemalloc, TCmalloc, glibc malloc, and jemalloc with some tuning. If no problems surface, we'll continue with TCmalloc.

@fwyzard
Contributor

fwyzard commented Aug 2, 2023

Here is a first look at the performance of a recent HLT menu with CMSSW 13.0.10, using

  • glibc malloc (RHEL 8.9, glibc 2.28)
  • tcmalloc 2.9.1
  • jemalloc 5.3.0 with the default settings, which return freed memory to the OS after about 10s
  • jemalloc, with different delays of 5s, 2s, 1s, 0.5s, 0.2s, 0.1s
  • jemalloc, with the freed memory returned immediately to the OS

[plot: HLT throughput for each memory allocator configuration]

memory manager       HLT throughput     error
glibc                535.7 ev/s         ± 1.0 ev/s
tcmalloc             556.6 ev/s         ± 1.8 ev/s
jemalloc, default    635.6 ev/s         ± 0.6 ev/s
jemalloc, 5s         631.1 ev/s         ± 2.0 ev/s
jemalloc, 2s         626.7 ev/s         ± 1.3 ev/s
jemalloc, 1s         616.6 ev/s         ± 3.1 ev/s
jemalloc, 0.5s       608.5 ev/s         ± 1.2 ev/s
jemalloc, 0.2s       592.4 ev/s         ± 1.4 ev/s
jemalloc, 0.1s       577.7 ev/s         ± 4.3 ev/s
jemalloc, immediate  519.6 ev/s         ± 0.5 ev/s

As can be seen, jemalloc with the default settings is significantly faster than both glibc and tcmalloc.

@fwyzard
Contributor

fwyzard commented Aug 2, 2023

@VinInn
Contributor Author

VinInn commented Aug 3, 2023

Is this a single process or the full machine?

@fwyzard
Contributor

fwyzard commented Aug 3, 2023

It's a full machine with the HLT production setup: 8 jobs, each with 32 threads and 24 streams.
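
For reference, a minimal sketch of how per-job thread and stream counts like those are typically expressed in a cmsRun configuration; the process name and the explicit option style here are illustrative, not the actual HLT menu:

import FWCore.ParameterSet.Config as cms

process = cms.Process("HLT")
# Illustrative values matching the setup quoted above: 32 threads and 24 streams per job.
process.options.numberOfThreads = cms.untracked.uint32(32)
process.options.numberOfStreams = cms.untracked.uint32(24)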

@silviodonato
Contributor

Thanks for the ping Andrea. Of course, -12.5% in throughput at HLT is a lot. I hope that one of the intermediate solutions (jemalloc with 5s, 2s, 1s, ...) could be good enough to reduce the RSS usage while maintaining a high throughput.

@fwyzard
Contributor

fwyzard commented Aug 3, 2023

HLT does not have any issues at all with memory usage, so I do not see any reason why we would change the settings at HLT.

@makortel
Contributor

makortel commented Aug 8, 2023

@gartung ran some measurements at NERSC on Run 3 reconstruction (with AsciiOutputModule), with a node (128 logical cores) fully loaded with N-thread processes.

Throughput
[plot]

Peak RSS / core
[plot]

Peak VSIZE / core
[plot]

So on <= 16 threads/process the tradeoff would be between "up to 5 % higher throughput" vs "4-8 % lower RSS" (and "20-30 % lower VSIZE").

On >= 32 threads/process the tradeoff would be between "10-15 % higher throughput" vs "15 % lower RSS" (and "40-50 % lower VSIZE").

@VinInn
Contributor Author

VinInn commented Aug 8, 2023

My conclusion is that JeMalloc remains more cost effective if "crashes" due to RSS run-off are not disruptive.
It would be useful to run the full production job with real ROOT I/O including DQM, and eventually a full simulation workflow
(ROOT buffers and compression are a large component of the memory game).

These results are interesting in general and worth sharing with a wider audience (HEPiX?).
Reminder: ATLAS seems to be using TCMalloc, others just glibc.

@fwyzard
Contributor

fwyzard commented Aug 8, 2023

@makortel @gartung do you have a simple way to monitor and collect the VSS, RSS and PSS usage of a process (from a Python script)?

"Simple" compared to parsing /proc/$PID/smaps etc. repeatedly.

@VinInn
Contributor Author

VinInn commented Aug 8, 2023

@fwyzard have a look at the companion issue in WMCore:
dmwm/WMCore#11667
https://github.com/dmwm/WMCore/pull/11677/files#diff-db9817d95d8582230765ec5ca335810c279d2c92544efbe4dc4f430c913b4ee9R211
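
For a quick standalone check (without pulling in WMCore), a minimal Python sketch along these lines should be enough; the function name and choice of fields are illustrative, and it reads the kernel-aggregated /proc/<pid>/smaps_rollup rather than parsing the full smaps repeatedly:

# Minimal sketch: collect VSS, RSS and PSS (in kB) for a given PID from /proc.
def memory_usage(pid):
    usage = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmSize:"):            # virtual size (VSS)
                usage["vss_kB"] = int(line.split()[1])
    with open(f"/proc/{pid}/smaps_rollup") as f:       # kernel-aggregated smaps
        for line in f:
            if line.startswith(("Rss:", "Pss:")):
                usage[line.split(":")[0].lower() + "_kB"] = int(line.split()[1])
    return usage

if __name__ == "__main__":
    import os
    print(memory_usage(os.getpid()))   # e.g. {'vss_kB': ..., 'rss_kB': ..., 'pss_kB': ...}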

@makortel
Contributor

makortel commented Aug 9, 2023

I wonder if it would be worth testing export MALLOC_CONF=thp:never?

I also came across this page, https://www.evanjones.ca/hugepages-are-a-good-idea.html, which lists several cases where transparent huge pages have caused performance problems in the past (many, but not all, in conjunction with jemalloc).

@VinInn
Contributor Author

VinInn commented Aug 9, 2023

From the tests I did in January, switching THP off (at the kernel level) makes digi and reco slower (at least with JeMalloc); see
https://indico.cern.ch/event/1164136/contributions/5228823/attachments/2580782/4451207/Fragmentation.pdf
slides 15 and 16.

@fwyzard
Contributor

fwyzard commented Aug 9, 2023

I wonder if it would be worth testing export MALLOC_CONF=thp:never?

I don't see these options documented here; maybe they were valid for jemalloc 4?

@makortel
Contributor

makortel commented Aug 9, 2023

I wonder if it would be worth testing export MALLOC_CONF=thp:never?

I don't see these options documented here; maybe they were valid for jemalloc 4?

The opt.thp option is mentioned in https://jemalloc.net/jemalloc.3.html, which claims to correspond to 5.3.0. My understanding from the TUNING section of the manpage is that thp can be set via MALLOC_CONF similarly to the other opt.* options.
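
As a concrete illustration (assuming the 5.3.0 manpage is accurate and opt.thp is honored at run time), a test job could be launched along these lines; the configuration file name is a placeholder:

# Hypothetical sketch: run a job with jemalloc's transparent huge page usage disabled.
import os
import subprocess

env = dict(os.environ, MALLOC_CONF="thp:never")
subprocess.run(["cmsRun", "hlt_test_cfg.py"], env=env, check=True)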

@fwyzard
Contributor

fwyzard commented Aug 9, 2023

In the meantime, I've rerun the same HLT jobs with more configurations, keeping track of the VSize, RSS and PSS:

executable                  throughput          peak VSize   peak RSS   peak PSS
cmsRunGlibC                 525.1 ± 0.7 ev/s    18.51 GB     10.15 GB   9.64 GB
cmsRunGlibC (4k mmap min)   522.3 ± 4.2 ev/s    18.58 GB     10.18 GB   9.81 GB

cmsRunTC (unlimited)        550.5 ± 1.6 ev/s    15.82 GB     8.70 GB    8.34 GB
cmsRunTC (0.5x release)     549.8 ± 1.8 ev/s    15.90 GB     8.77 GB    8.55 GB
cmsRunTC                    549.9 ± 1.7 ev/s    16.06 GB     8.64 GB    8.20 GB
cmsRunTC (2x release)       546.5 ± 2.4 ev/s    15.87 GB     8.66 GB    8.25 GB
cmsRunTC (5x release)       546.9 ± 1.7 ev/s    15.91 GB     8.67 GB    8.21 GB
cmsRunTC (10x release)      544.1 ± 2.9 ev/s    15.88 GB     8.74 GB    8.22 GB

cmsRunJE (unlimited)        636.5 ± 1.4 ev/s    20.57 GB     12.01 GB   11.53 GB
cmsRunJE (20s release)      631.7 ± 0.4 ev/s    19.86 GB     9.79 GB    9.24 GB
cmsRunJE                    629.6 ± 0.9 ev/s    19.12 GB     9.17 GB    8.60 GB
cmsRunJE (5s release)       626.3 ± 1.5 ev/s    19.63 GB     9.27 GB    8.68 GB
cmsRunJE (2s release)       619.8 ± 2.1 ev/s    20.25 GB     8.99 GB    8.46 GB
cmsRunJE (1s release)       611.3 ± 0.8 ev/s    20.38 GB     8.62 GB    8.15 GB
cmsRunJE (500ms release)    601.3 ± 0.9 ev/s    20.37 GB     8.53 GB    8.00 GB
cmsRunJE (200ms release)    586.0 ± 1.2 ev/s    20.01 GB     8.55 GB    7.97 GB
cmsRunJE (100ms release)    572.1 ± 1.0 ev/s    19.93 GB     8.66 GB    8.10 GB
cmsRunJE (immediate)        513.7 ± 1.7 ev/s    18.66 GB     8.08 GB    7.56 GB

@fwyzard
Contributor

fwyzard commented Aug 9, 2023

The opt.thp is mentioned in https://jemalloc.net/jemalloc.3.html that claims to correspond to 5.3.0.

OK, I can rerun the same HLT tests with those options.

@VinInn
Contributor Author

VinInn commented Aug 10, 2023

One can always switch off THP at the system level, as root:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

at least for test purposes
(check the defaults first by catting them).

One can also clean up the machine by defragmenting the memory before the test:

sync
echo 3 > /proc/sys/vm/drop_caches     # drop the page cache, dentries and inodes
echo 1 > /proc/sys/vm/compact_memory  # trigger memory compaction

@makortel
Contributor

@gartung ran more tests on Perlmutter by adding MiniAOD+DQM on top of the Run 3 reco mentioned in #42387 (comment). The job now has two PoolOutputModules and one DQM output module.

Throughput
[plot]

Peak RSS
[plot]

Peak VSIZE
[plot]

@fwyzard
Contributor

fwyzard commented Aug 22, 2023

I wonder if it would be worth testing export MALLOC_CONF=thp:never?

executable             throughput          peak VSize   peak RSS   peak PSS
cmsRunJE (default)     628.2 ± 2.4 ev/s    19.09 GB     9.09 GB    8.50 GB
cmsRunJE (THP always)  631.2 ± 0.9 ev/s    19.55 GB     9.15 GB    8.61 GB
cmsRunJE (THP never)   624.0 ± 0.2 ev/s    19.10 GB     8.92 GB    8.39 GB

@smuzaffar
Contributor

@fwyzard @gartung, yesterday we integrated a new version (2.11) of gperftools/tcmalloc in the 13.3.X IBs; can you please rerun the performance comparison tests?

@fwyzard
Contributor

fwyzard commented Aug 25, 2023

@smuzaffar all my tests were done in 13.0.x, I don't have a working HLT configuration for 13.3.x.

Can I just LD_PRELOAD the new tcmalloc library by hand and run with the old release?

@smuzaffar
Contributor

Can I just LD_PRELOAD the new tcmalloc library by hand and run with the old release?

Yes, I think that should work.
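
For concreteness, a minimal sketch of such a by-hand preload; the library path is a placeholder that depends on where the gperftools 2.11 build is installed, and the configuration file name is likewise illustrative:

# Hypothetical sketch: preload the newer tcmalloc ahead of the release's default allocator.
import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/path/to/gperftools-2.11/lib/libtcmalloc_minimal.so"  # placeholder path
subprocess.run(["cmsRun", "hlt_test_cfg.py"], env=env, check=True)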

@fwyzard
Contributor

fwyzard commented Aug 25, 2023

Here is the comparison of the old (gperftools 2.8.1, tcmalloc 4.5.9) and new (gperftools 2.11, tcmalloc 4.5.11) releases:

executable       throughput          peak VSize   peak RSS   peak PSS
cmsRunTC (old)   549.9 ± 1.3 ev/s    15.90 GB     8.68 GB    8.61 GB
cmsRunTC (new)   547.4 ± 1.1 ev/s    15.84 GB     8.74 GB    8.74 GB

The difference seems negligible.

@gartung
Member

gartung commented Aug 25, 2023

The IBs are not available at NERSC where I ran the measurements.

@smuzaffar
Contributor

@gartung, can you install and use a local IB? If yes, I can send you the instructions.

@gartung
Member

gartung commented Aug 25, 2023

Yes, and I know how to do a local install.

@smuzaffar
Contributor

@makortel, given that the old and new tcmalloc show nearly identical results, should we revert the tcmalloc change and go back to using jemalloc?

@makortel
Contributor

@makortel, given that the old and new tcmalloc show nearly identical results, should we revert the tcmalloc change and go back to using jemalloc?

I agree, let's go back to jemalloc.

@makortel
Contributor

Given that the reversion to jemalloc as the default (#42671) was merged, I think we can close the issue.

@makortel
Contributor

+core

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close
