Investigate memory usage increase for merge jobs #44679

Closed
yanfr0818 opened this issue Apr 10, 2024 · 13 comments · Fixed by #44727

@yanfr0818

Many merge jobs are requiring more memory than usual at runtime; for example, this WF.

Merge jobs are not typically memory-demanding, so an increase in memory usage often causes the jobs to fail. Is there any tool in CMSSW that I can use to debug this issue?

@cmsbuild
Contributor

cmsbuild commented Apr 10, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @yanfr0818.

@sextonkennedy, @Dr15Jones, @smuzaffar, @makortel, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign. Thanks.

@makortel
Contributor

IgProf on slc7 is the "best" general-purpose tool for detailed memory profiling.
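
For a coarser, per-event view without a full profiler, the framework's SimpleMemoryCheck service can also be added to the job configuration. A minimal sketch, with illustrative parameter values (the process name is only a placeholder for the actual merge configuration):

# Sketch: enable coarse per-event memory reporting from the framework itself.
# In practice the Service block below would be appended to the existing merge PSet;
# the cms.Process here is only a placeholder.
import FWCore.ParameterSet.Config as cms

process = cms.Process("MERGE")
process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
    ignoreTotal = cms.untracked.int32(1),           # number of initial events to ignore in the reports
    moduleMemorySummary = cms.untracked.bool(True)  # print a per-module memory summary at end of job
)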

Could you provide logs and the PSet of a merge job that fails because of too much memory? (to make it easier for others to investigate)

The example

this WF.

is merging NanoAOD (in EDM file format). I'm somewhat surprised to see that take a lot of memory. Are the other failing merge jobs also about NanoAOD, or a mixture of data tiers?

Given the NanoAOD connection, let's tag @cms-sw/xpog-l2 anyway.

@yanfr0818
Author

yanfr0818 commented Apr 10, 2024

@makortel Thanks for looking into this.

Could you provide logs and the PSet of a merge job that fails because of too much memory?

Please check this out. This is the job log from our Unified database. Let me know if you want more information to investigate.

Are the other failing merge jobs also about NanoAOD, or a mixture of data tiers?

So far we have only seen such a pattern in DataProcessingMergeNANOEDMAODoutput tasks.
To give you more examples, I'm listing a couple of other WFs and their PSets:
cmsunified_ACDC5_r-0-Run2022F_ZeroBias_JMENano12p5_240313_165230_699
cmsunified_ACDC2_r-0-Run2022F_MinimumBias_JMENano12p5_240201_014113_6904

@vlimant
Contributor

vlimant commented Apr 11, 2024

Can someone please secure some of the unmerged files somewhere on EOS/AFS so that we can access them before they get removed/cleaned up?

@makortel
Contributor

We took a look with @Dr15Jones, and the high memory usage is caused by the serialization of ParameterSets. The underlying reason it gets this bad (i.e. many ParameterSets) lies in PromptReco; that was noticed in https://cms-talk.web.cern.ch/t/high-memory-usage-in-merge-alcareco-job-for-promptreco-run373536-zerobias17/29655 and fixed in #42833.

@Dr15Jones is looking into a workaround for NanoAODOutputModule to reduce the memory need.


A separate issue is that because of the many RECO ParameterSets being stored, even the merged file contains mostly ParameterSets. In the case of

Please check this out.

the job merges 73 files, the output file size is 144 MB, and the compressed size of all the ParameterSets is 133 MB, implying ~92 % of the file is in ParameterSets. Even if the problem were confined to 2022 and partially 2023, and maybe only the ZeroBias and MinimumBias PDs (link), we should probably look into making the ParameterSet storage smaller in this kind of "lots of repetition" case.
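
For reference, the same breakdown can be obtained with a short PyROOT script; a minimal sketch, assuming the merged output has been copied locally as merged.root (a placeholder name):

# Sketch: report how much of an EDM ROOT file's size is taken by each TTree.
# "merged.root" is a placeholder for a locally available copy of the merge output.
import ROOT

f = ROOT.TFile.Open("merged.root")
total = f.GetSize()  # file size on disk, in bytes

for key in f.GetListOfKeys():
    obj = key.ReadObj()
    if obj.InheritsFrom("TTree"):
        zipped = obj.GetZipBytes()  # compressed size of this tree's branches
        print(f"{obj.GetName():20s} {zipped / 1e6:8.1f} MB  ({100.0 * zipped / total:5.1f} % of file)")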

@Dr15Jones
Contributor

The easiest and quickest way to proceed is to submit these jobs with a higher memory requirement (say 5GB). Then we will try to find a 'fix' for any future cases where this might happen and add those fixes to the present development release.

@yanfr0818
Author

@makortel Thanks for checking. A few questions here:

many ParameterSets

How do we end up getting multiple PSets? I think one CMSSW job requires only one PSet to run, right? In this task, we saw only one PSet. Does the pkl object PSet become multiple PSets after serialization?

the compressed size of all the ParameterSets is 133 MB

Could you elaborate on how you found or estimated this number? I found ROOT-tfile-write-totalMegabytes to be around 143, roughly lining up with the 144 MB output file size you mentioned.

@makortel
Contributor

A few questions here:

many ParameterSets

How do we end up getting multiple PSets? I think one CMSSW job requires only one PSet to run, right?

We store (parts of) the ParameterSets of earlier processing steps in the provenance. In the case of merges we don't store duplicates, i.e. in the ideal case where e.g. the RECO job configuration is the same for all the LuminosityBlocks of a Run (and usually also for many Runs), only one ParameterSet from RECO would be stored. The CMS Talk thread https://cms-talk.web.cern.ch/t/high-memory-usage-in-merge-alcareco-job-for-promptreco-run373536-zerobias17/29655 should explain pretty well why, with these data, there can be many RECO ParameterSets in the provenance.
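
One quick way to see how many top-level ParameterSets a given file carries in its provenance is to count the entries of its ParameterSets tree; a minimal PyROOT sketch, assuming one of the merge inputs is locally available as input.root (a placeholder name):

# Sketch: count the stored top-level ParameterSets in an EDM file's provenance.
# "input.root" is a placeholder for a local copy of one of the merge job's inputs.
import ROOT

f = ROOT.TFile.Open("input.root")
psets = f.Get("ParameterSets")
# Each entry of this tree should correspond to one stored top-level ParameterSet,
# so a large count here is a sign of the duplication described in the CMS Talk thread.
print("stored ParameterSets:", int(psets.GetEntries()))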

the compressed size of all the ParameterSets is 133 MB

Could you elaborate on how you found or estimated this number?

I ran the merge job locally, opened the resulting ROOT file with root, called ParameterSets->Print(), and took the

File  Size =  140051678

part of the printout.
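
The same check can be scripted with PyROOT instead of the interactive root prompt; roughly (the file name is a placeholder for the locally produced merge output):

# Sketch: reproduce the check above without an interactive ROOT session.
import ROOT

f = ROOT.TFile.Open("merged.root")  # locally produced merge output (placeholder name)
f.Get("ParameterSets").Print()      # the "File  Size = ..." line is the on-disk size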

@makortel
Contributor

@Dr15Jones took a deep dive and found that the culprit for the memory hoarding is not the ParameterSet storage per se, but a "performance bug" in the framework ParameterSet code. #44727 fixes that behavior.

We probably need to backport the fix at least to some earlier release cycles.

@makortel
Contributor

We'll backport the fix down to 13_2_X.
