
shall we make TCmalloc default (in place of JeMalloc)? #42387

Closed
VinInn opened this issue Jul 27, 2023 · 41 comments

Comments

@VinInn
Contributor

VinInn commented Jul 27, 2023

TCmalloc seems to be more robust in conditions of high fragmentation.

This was already observed earlier this year in the case of a huge malloc/free storm
(see https://indico.cern.ch/event/1164136/contributions/5228823/attachments/2580782/4451207/Fragmentation.pdf and
#40537 )

More recently, in #40437 (in the presence of high fragmentation), it was noted that RSS can suddenly grow by more than 2 GB when a different process terminates. This is associated with an increase of HugePages (defragmentation).

On a large machine (not fragmented) JeMalloc and TCmalloc seem to behave similarly, with JeMalloc always allocating more virtual memory (zero pages?):

 741203 innocent  20   0   12.4g   8.5g 731196 R 399.0   0.8 360:38.86 cmsRun
 746752 innocent  20   0 8358992   6.8g 735316 R 399.0   0.7  45:22.00 cmsRunTC
[innocent@olspx-01 rereco]$ cat /proc/741203/smaps_rollup
Rss:             8924524 kB
Anonymous:       8193328 kB
AnonHugePages:   5349376 kB
Swap:                  0 kB
[innocent@olspx-01 rereco]$ cat /proc/746752/smaps_rollup
00400000-7ffeaed74000 ---p 00000000 00:00 0                              [rollup]
Rss:             7081552 kB
Anonymous:       6346236 kB
AnonHugePages:   3268608 kB
Swap:                  0 kB

On a smaller, highly fragmented virtual machine (essentially no large pages available),
JeMalloc seems to use a much larger amount of VSS:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+  P   SWAP    DATA nMaj nDRT WCHAN      COMMAND
 109551 innocent  20   0 8121920   7.1g 619328 R 400.0  24.6  69:21.34  3      0 7150016 1009    0 -          cmsRunTC
 111064 innocent  20   0   18.1g   7.1g 619520 R 399.7  24.5  18:29.82  8      0   17.2g   13    0 -          cmsRun

When a different job terminates it promptly triggers defragmentation: the amount of "AnonHugePages" grows and with it the RSS, while a similar TCMalloc process increases its number of HugePages without affecting RSS.
In fact, under these conditions of high fragmentation TCMalloc seems to use HugePages even more effectively than JeMalloc.

It should be noted that cmsRunTC is in general faster than cmsRunJE by a small amount (a few percent).

My personal conclusion is that there is no evidence of adverse effects in using TCMalloc, while there are several incidents that can be connected to "misbehavior" of the OS due to JeMalloc.

@cmsbuild
Contributor

A new Issue was created by @VinInn Vincenzo Innocente.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@VinInn VinInn changed the title shall we make TCmalloc default (in place of JeMalloc) shall we make TCmalloc default (in place of JeMalloc)? Jul 27, 2023
@VinInn
Contributor Author

VinInn commented Jul 27, 2023

Here are the two jobs (JE and TC) at some point:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+  P   SWAP    DATA nMaj nDRT WCHAN      COMMAND
 116744 innocent  20   0 8960512   7.2g 183360 R 399.7  24.9 424:19.94  4 224768 7988480 3667    0 -          cmsRunTC
 111064 innocent  20   0   18.1g   7.6g 182144 R 398.3  26.3 558:51.76  1   1.1g   17.2g 4473    0 -          cmsRun

Then cmsRun was terminated, and TC reclaimed huge pages without any impact on RSS:

[innocent@lxplus801 rereco]$ cat /proc/116744/smaps_rollup
Rss:             7572416 kB
Anonymous:       7387648 kB
AnonHugePages:         0 kB
Swap:             412160 kB
[innocent@lxplus801 rereco]$ cat /proc/116744/smaps_rollup
Rss:             7813568 kB
Anonymous:       7628800 kB
AnonHugePages:   1048576 kB
Swap:             175744 kB
[innocent@lxplus801 rereco]$ cat /proc/116744/smaps_rollup
Rss:             8172416 kB
Anonymous:       7987648 kB
AnonHugePages:   5767168 kB
Swap:             125824 kB
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+  P   SWAP    DATA nMaj nDRT WCHAN      COMMAND
 116744 innocent  20   0 9123968   7.8g 184640 R 396.7  26.9 835:46.57  1 126272 8151936 8794    0 -          cmsRunTC

compare with
#40437 (comment)

@dan131riley

I'm not a fan of the way JeMalloc treats address space as cheaply disposable, especially on heavily loaded systems, as @VinInn details; so, at least in principle, I support this proposal.

@VinInn
Contributor Author

VinInn commented Jul 27, 2023

TCMalloc is designed to work with "standard system settings"
(https://google.github.io/tcmalloc/tuning.html#system-level-optimizations),
so one can hope it is less likely to trigger misbehaviour.

@fwyzard
Contributor

fwyzard commented Jul 29, 2023

From some tests I did a few months ago, it looks like the default behaviour of jemalloc has changed between v4 (or even v3?) and v5, and now jemalloc is more conservative before releasing memory to the OS.

However, this can be tuned at runtime via environment variables:

export MALLOC_CONF="dirty_decay_ms:1000,muzzy_decay_ms:1000"
cmsRun ...

The default is 10 seconds.
See https://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms for more info.

@makortel
Contributor

I suggest we discuss this topic in the core software meeting tomorrow.

@makortel
Contributor

makortel commented Aug 2, 2023

assign core

@cmsbuild
Contributor

cmsbuild commented Aug 2, 2023

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Aug 2, 2023

In the core software meeting yesterday it was agreed to switch the default cmsRun in master to use TCmalloc now (done in #42440), in order to gain experience in IBs and have it validated as part of 13_3_0_pre1. @fwyzard agreed to run performance tests for HLT comparing default jemalloc, TCmalloc, glibc malloc, and jemalloc with some tuning. If no problems surface, we'll continue with TCmalloc.

@fwyzard
Contributor

fwyzard commented Aug 2, 2023

Here is a first look at the performance of a recent HLT menu with CMSSW 13.0.10, using

  • glibc malloc (RHEL 8.9, glibc 2.28)
  • tcmalloc 2.9.1
  • jemalloc 5.3.0 with the default settings, which return freed memory to the OS after about 10s
  • jemalloc, with different delays of 5s, 2s, 1s, 0.5s, 0.2s, 0.1s
  • jemalloc, with the freed memory returned immediately to the OS

[plot: HLT throughput for each memory allocator configuration]

memory manager       HLT throughput     error
glibc                535.7 ev/s         ± 1.0 ev/s
tcmalloc             556.6 ev/s         ± 1.8 ev/s
jemalloc, default    635.6 ev/s         ± 0.6 ev/s
jemalloc, 5s         631.1 ev/s         ± 2.0 ev/s
jemalloc, 2s         626.7 ev/s         ± 1.3 ev/s
jemalloc, 1s         616.6 ev/s         ± 3.1 ev/s
jemalloc, 0.5s       608.5 ev/s         ± 1.2 ev/s
jemalloc, 0.2s       592.4 ev/s         ± 1.4 ev/s
jemalloc, 0.1s       577.7 ev/s         ± 4.3 ev/s
jemalloc, immediate  519.6 ev/s         ± 0.5 ev/s

As can be seen, jemalloc with the default settings is significantly faster than both glibc and tcmalloc.

@fwyzard
Contributor

fwyzard commented Aug 2, 2023

@VinInn
Contributor Author

VinInn commented Aug 3, 2023

Is this a single process or the full machine?

@fwyzard
Contributor

fwyzard commented Aug 3, 2023

It's a full machine with the HLT production setup: 8 jobs, each with 32 threads and 24 streams.
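
For reference, a minimal sketch of how per-job thread and stream counts like those are typically expressed in a cmsRun configuration; the process name and the explicit option style here are illustrative, not the actual HLT menu:

import FWCore.ParameterSet.Config as cms

process = cms.Process("HLT")
# Illustrative values matching the setup quoted above: 32 threads and 24 streams per job.
process.options.numberOfThreads = cms.untracked.uint32(32)
process.options.numberOfStreams = cms.untracked.uint32(24)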

@silviodonato
Contributor

Thanks for the ping Andrea. Of course, -12.5% in throughput at HLT is a lot. I hope that one of the intermediate solutions (jemalloc with 5s, 2s, 1s, ...) could be good enough to reduce the RSS usage while maintaining a high throughput.

@fwyzard
Contributor

fwyzard commented Aug 3, 2023

HLT does not have any issues at all with memory usage, so I do not see any reason why we would change the settings at HLT.

@makortel
Contributor

makortel commented Aug 8, 2023

@gartung ran some measurements at NERSC on Run 3 reconstruction (with AsciiOutputModule), with a node (128 logical cores) fully loaded with N-thread processes.

Throughput
[plot]

Peak RSS / core
[plot]

Peak VSIZE / core
[plot]

So on <= 16 threads/process the tradeoff would be between "up to 5 % higher throughput" vs "4-8 % lower RSS" (and "20-30 % lower VSIZE").

On >= 32 threads/process the tradeoff would be between "10-15 % higher throughput" vs "15 % lower RSS" (and "40-50 % lower VSIZE").

@VinInn
Contributor Author

VinInn commented Aug 8, 2023

My conclusion is that JeMalloc remains more cost effective if "crashes" due to RSS run-off are not disruptive.
It would be useful to run the full production job with real ROOT I/O including DQM, and eventually a full simulation workflow
(ROOT buffers and compression are a large component of the memory game).

These results are interesting in general and worth sharing with a wider audience (HEPiX?).
Reminder: ATLAS seems to be using TCMalloc, others just glibc.

@fwyzard
Contributor

fwyzard commented Aug 8, 2023

@makortel @gartung do you have a simple way to monitor and collect the VSS, RSS and PSS usage of a process (from a Python script)?

"Simple" compared to parsing /proc/$PID/smaps etc. repeatedly.

@VinInn
Contributor Author

VinInn commented Aug 8, 2023

@fwyzard have a look at the companion issue in WMCore:
dmwm/WMCore#11667
https://github.com/dmwm/WMCore/pull/11677/files#diff-db9817d95d8582230765ec5ca335810c279d2c92544efbe4dc4f430c913b4ee9R211
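
For a quick standalone check (without pulling in WMCore), a minimal Python sketch along these lines should be enough; the function name and choice of fields are illustrative, and it reads the kernel-aggregated /proc/<pid>/smaps_rollup rather than parsing the full smaps repeatedly:

# Minimal sketch: collect VSS, RSS and PSS (in kB) for a given PID from /proc.
def memory_usage(pid):
    usage = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmSize:"):            # virtual size (VSS)
                usage["vss_kB"] = int(line.split()[1])
    with open(f"/proc/{pid}/smaps_rollup") as f:       # kernel-aggregated smaps
        for line in f:
            if line.startswith(("Rss:", "Pss:")):
                usage[line.split(":")[0].lower() + "_kB"] = int(line.split()[1])
    return usage

if __name__ == "__main__":
    import os
    print(memory_usage(os.getpid()))   # e.g. {'vss_kB': ..., 'rss_kB': ..., 'pss_kB': ...}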

@makortel
Contributor

makortel commented Aug 9, 2023

I wonder if it would be worth testing export MALLOC_CONF=thp:never?

I also came across this page, https://www.evanjones.ca/hugepages-are-a-good-idea.html, which lists several cases where transparent huge pages have caused performance problems in the past (many, but not all, in conjunction with jemalloc).

@VinInn
Contributor Author

VinInn commented Aug 9, 2023

From the tests I did in January, switching THP off (at the kernel level) makes digi and reco slower (at least with JeMalloc); see
https://indico.cern.ch/event/1164136/contributions/5228823/attachments/2580782/4451207/Fragmentation.pdf
slides 15 and 16.

@fwyzard
Contributor

fwyzard commented Aug 9, 2023

I wonder if it would be worth testing export MALLOC_CONF=thp:never?

I don't see these options documented here; maybe they were valid for jemalloc 4?

@makortel
Contributor

makortel commented Aug 9, 2023

I wonder if it would be worth testing export MALLOC_CONF=thp:never?

I don't see these options documented here; maybe they were valid for jemalloc 4?

The opt.thp option is mentioned in https://jemalloc.net/jemalloc.3.html, which claims to correspond to 5.3.0. My understanding from the TUNING section of the manpage is that thp can be set via MALLOC_CONF similarly to the other opt.* options.
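
As a concrete illustration (assuming the 5.3.0 manpage is accurate and opt.thp is honored at run time), a test job could be launched along these lines; the configuration file name is a placeholder:

# Hypothetical sketch: run a job with jemalloc's transparent huge page usage disabled.
import os
import subprocess

env = dict(os.environ, MALLOC_CONF="thp:never")
subprocess.run(["cmsRun", "hlt_test_cfg.py"], env=env, check=True)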

@fwyzard
Contributor

fwyzard commented Aug 9, 2023

In the meantime, I've rerun the same HLT jobs with more configurations, keeping track of the VSize, RSS and PSS:

executable                  throughput          peak VSize   peak RSS   peak PSS
cmsRunGlibC                 525.1 ± 0.7 ev/s    18.51 GB     10.15 GB   9.64 GB
cmsRunGlibC (4k mmap min)   522.3 ± 4.2 ev/s    18.58 GB     10.18 GB   9.81 GB

cmsRunTC (unlimited)        550.5 ± 1.6 ev/s    15.82 GB     8.70 GB    8.34 GB
cmsRunTC (0.5x release)     549.8 ± 1.8 ev/s    15.90 GB     8.77 GB    8.55 GB
cmsRunTC                    549.9 ± 1.7 ev/s    16.06 GB     8.64 GB    8.20 GB
cmsRunTC (2x release)       546.5 ± 2.4 ev/s    15.87 GB     8.66 GB    8.25 GB
cmsRunTC (5x release)       546.9 ± 1.7 ev/s    15.91 GB     8.67 GB    8.21 GB
cmsRunTC (10x release)      544.1 ± 2.9 ev/s    15.88 GB     8.74 GB    8.22 GB

cmsRunJE (unlimited)        636.5 ± 1.4 ev/s    20.57 GB     12.01 GB   11.53 GB
cmsRunJE (20s release)      631.7 ± 0.4 ev/s    19.86 GB     9.79 GB    9.24 GB
cmsRunJE                    629.6 ± 0.9 ev/s    19.12 GB     9.17 GB    8.60 GB
cmsRunJE (5s release)       626.3 ± 1.5 ev/s    19.63 GB     9.27 GB    8.68 GB
cmsRunJE (2s release)       619.8 ± 2.1 ev/s    20.25 GB     8.99 GB    8.46 GB
cmsRunJE (1s release)       611.3 ± 0.8 ev/s    20.38 GB     8.62 GB    8.15 GB
cmsRunJE (500ms release)    601.3 ± 0.9 ev/s    20.37 GB     8.53 GB    8.00 GB
cmsRunJE (200ms release)    586.0 ± 1.2 ev/s    20.01 GB     8.55 GB    7.97 GB
cmsRunJE (100ms release)    572.1 ± 1.0 ev/s    19.93 GB     8.66 GB    8.10 GB
cmsRunJE (immediate)        513.7 ± 1.7 ev/s    18.66 GB     8.08 GB    7.56 GB

@fwyzard
Contributor

fwyzard commented Aug 9, 2023

The opt.thp is mentioned in https://jemalloc.net/jemalloc.3.html that claims to correspond to 5.3.0.

OK, I can rerun the same HLT tests with those options.

@VinInn
Contributor Author

VinInn commented Aug 10, 2023

One can always switch off THP at the system level, as root:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

at least for test purposes
(check the defaults first by catting them).

One can also clean up the machine by defragmenting the memory before the test:

sync
echo 3 > /proc/sys/vm/drop_caches     # drop the page cache, dentries and inodes
echo 1 > /proc/sys/vm/compact_memory  # trigger memory compaction

@makortel
Contributor

@gartung ran more tests on Perlmutter by adding MiniAOD+DQM on top of the Run 3 reco mentioned in #42387 (comment). The job now has two PoolOutputModules and one DQM output module.

Throughput
[plot]

Peak RSS
[plot]

Peak VSIZE
[plot]

@fwyzard
Contributor

fwyzard commented Aug 22, 2023

I wonder if it would be worth testing export MALLOC_CONF=thp:never?

executable             throughput          peak VSize   peak RSS   peak PSS
cmsRunJE (default)     628.2 ± 2.4 ev/s    19.09 GB     9.09 GB    8.50 GB
cmsRunJE (THP always)  631.2 ± 0.9 ev/s    19.55 GB     9.15 GB    8.61 GB
cmsRunJE (THP never)   624.0 ± 0.2 ev/s    19.10 GB     8.92 GB    8.39 GB

@smuzaffar
Contributor

@fwyzard @gartung, yesterday we integrated a new version (2.11) of gperftools/tcmalloc in the 13.3.X IBs; can you please rerun the performance comparison tests?

@fwyzard
Contributor

fwyzard commented Aug 25, 2023

@smuzaffar all my tests were done in 13.0.x, I don't have a working HLT configuration for 13.3.x.

Can I just LD_PRELOAD the new tcmalloc library by hand and run with the old release?

@smuzaffar
Contributor

Can I just LD_PRELOAD the new tcmalloc library by hand and run with the old release?

Yes, I think that should work.
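
For concreteness, a minimal sketch of such a by-hand preload; the library path is a placeholder that depends on where the gperftools 2.11 build is installed, and the configuration file name is likewise illustrative:

# Hypothetical sketch: preload the newer tcmalloc ahead of the release's default allocator.
import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/path/to/gperftools-2.11/lib/libtcmalloc_minimal.so"  # placeholder path
subprocess.run(["cmsRun", "hlt_test_cfg.py"], env=env, check=True)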

@fwyzard
Contributor

fwyzard commented Aug 25, 2023

Here is the comparison of the old (gperftools 2.8.1, tcmalloc 4.5.9) and new (gperftools 2.11, tcmalloc 4.5.11) releases:

executable       throughput          peak VSize   peak RSS   peak PSS
cmsRunTC (old)   549.9 ± 1.3 ev/s    15.90 GB     8.68 GB    8.61 GB
cmsRunTC (new)   547.4 ± 1.1 ev/s    15.84 GB     8.74 GB    8.74 GB

The difference seems negligible.

@gartung
Member

gartung commented Aug 25, 2023

The IBs are not available at NERSC where I ran the measurements.

@smuzaffar
Contributor

@gartung, can you install and use a local IB? If yes, I can send you the instructions.

@gartung
Member

gartung commented Aug 25, 2023

Yes, and I know how to do a local install.

@smuzaffar
Contributor

@makortel, given that the old and new tcmalloc show nearly identical results, should we revert the tcmalloc change and go back to using jemalloc?

@makortel
Contributor

@makortel, given that the old and new tcmalloc show nearly identical results, should we revert the tcmalloc change and go back to using jemalloc?

I agree, let's go back to jemalloc.

@makortel
Contributor

Given that the reversion to jemalloc as the default (#42671) was merged, I think we can close the issue.

@makortel
Contributor

+core

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close
