Investigate memory usage increase for merge jobs #44679

Closed
yanfr0818 opened this issue Apr 10, 2024 · 13 comments · Fixed by #44727

@yanfr0818

Many merge jobs are requiring more memory than usual at runtime; for example, this WF.

Merge jobs are not typically memory-demanding, so an increase in memory usage often causes the jobs to fail. Is there any tool in CMSSW that I can use to debug this issue?

@cmsbuild
Contributor

cmsbuild commented Apr 10, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @yanfr0818.

@sextonkennedy, @Dr15Jones, @smuzaffar, @makortel, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign. Thanks.

@makortel
Contributor

IgProf on slc7 is the "best" general-purpose tool for detailed memory profiling.
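
For a coarser, per-event view without a full profiler, the framework's SimpleMemoryCheck service can also be added to the job configuration. A minimal sketch, with illustrative parameter values (the process name is only a placeholder for the actual merge configuration):

# Sketch: enable coarse per-event memory reporting from the framework itself.
# In practice the Service block below would be appended to the existing merge PSet;
# the cms.Process here is only a placeholder.
import FWCore.ParameterSet.Config as cms

process = cms.Process("MERGE")
process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
    ignoreTotal = cms.untracked.int32(1),           # number of initial events to ignore in the reports
    moduleMemorySummary = cms.untracked.bool(True)  # print a per-module memory summary at end of job
)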

Could you provide logs and the PSet of a merge job that fails because of too much memory? (to make it easier for others to investigate)

The example

this WF.

is merging NanoAOD (in EDM file format). I'm somewhat surprised to see that take a lot of memory. Are the other failing merge jobs also about NanoAOD, or a mixture of data tiers?

Given the NanoAOD connection, let's tag @cms-sw/xpog-l2 anyway.

@yanfr0818
Author

yanfr0818 commented Apr 10, 2024

@makortel Thanks for looking into this.

Could you provide logs and the PSet of a merge job that fails because of too much memory?

Please check this out. This is the job log from our Unified database. Let me know if you want more information to investigate.

Are the other failing merge jobs also about NanoAOD, or a mixture of data tiers?

So far we have only seen such a pattern in DataProcessingMergeNANOEDMAODoutput tasks.
To give you more examples, I'm listing a couple of other WFs and their PSets:
cmsunified_ACDC5_r-0-Run2022F_ZeroBias_JMENano12p5_240313_165230_699
cmsunified_ACDC2_r-0-Run2022F_MinimumBias_JMENano12p5_240201_014113_6904

@vlimant
Contributor

vlimant commented Apr 11, 2024

Can someone please secure some of the unmerged files somewhere on EOS/AFS so that we can access them before they get removed/cleaned up?

@makortel
Contributor

We took a look with @Dr15Jones, and the high memory usage is caused by the serialization of ParameterSets. The underlying reason it gets this bad (i.e. many ParameterSets) lies in PromptReco; that was noticed in https://cms-talk.web.cern.ch/t/high-memory-usage-in-merge-alcareco-job-for-promptreco-run373536-zerobias17/29655 and fixed in #42833.

@Dr15Jones is looking into a workaround for NanoAODOutputModule to reduce the memory need.


A separate issue is that because of the many RECO ParameterSets being stored, even the merged file contains mostly ParameterSets. In the case of

Please check this out.

the job merges 73 files, the output file size is 144 MB, and the compressed size of all the ParameterSets is 133 MB, implying ~92 % of the file is in ParameterSets. Even if the problem were confined to 2022 and partially 2023, and maybe only the ZeroBias and MinimumBias PDs (link), we should probably look into making the ParameterSet storage smaller in this kind of "lots of repetition" case.
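
For reference, the same breakdown can be obtained with a short PyROOT script; a minimal sketch, assuming the merged output has been copied locally as merged.root (a placeholder name):

# Sketch: report how much of an EDM ROOT file's size is taken by each TTree.
# "merged.root" is a placeholder for a locally available copy of the merge output.
import ROOT

f = ROOT.TFile.Open("merged.root")
total = f.GetSize()  # file size on disk, in bytes

for key in f.GetListOfKeys():
    obj = key.ReadObj()
    if obj.InheritsFrom("TTree"):
        zipped = obj.GetZipBytes()  # compressed size of this tree's branches
        print(f"{obj.GetName():20s} {zipped / 1e6:8.1f} MB  ({100.0 * zipped / total:5.1f} % of file)")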

@Dr15Jones
Contributor

The easiest and quickest way to proceed is to submit these jobs with a higher memory requirement (say 5GB). Then we will try to find a 'fix' for any future cases where this might happen and add those fixes to the present development release.

@yanfr0818
Author

@makortel Thanks for checking. A few questions here:

many ParameterSets

How do we end up getting multiple PSets? I think one CMSSW job requires only one PSet to run, right? In this task, we saw only one PSet. Does the pkl object PSet become multiple PSets after serialization?

the compressed size of all the ParameterSets is 133 MB

Could you elaborate on how you found or estimated this number? I found ROOT-tfile-write-totalMegabytes to be around 143, roughly lining up with the 144 MB output file size you mentioned.

@makortel
Contributor

A few questions here:

many ParameterSets

How do we end up getting multiple PSets? I think one CMSSW job requires only one PSet to run, right?

We store (parts of) the ParameterSets of earlier processing steps in the provenance. In the case of merges we don't store duplicates, i.e. in the ideal case where e.g. the RECO job configuration is the same for all the LuminosityBlocks of a Run (and usually also for many Runs), only one ParameterSet from RECO would be stored. The CMS Talk thread https://cms-talk.web.cern.ch/t/high-memory-usage-in-merge-alcareco-job-for-promptreco-run373536-zerobias17/29655 should explain pretty well why, with these data, there can be many RECO ParameterSets in the provenance.
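
One quick way to see how many top-level ParameterSets a given file carries in its provenance is to count the entries of its ParameterSets tree; a minimal PyROOT sketch, assuming one of the merge inputs is locally available as input.root (a placeholder name):

# Sketch: count the stored top-level ParameterSets in an EDM file's provenance.
# "input.root" is a placeholder for a local copy of one of the merge job's inputs.
import ROOT

f = ROOT.TFile.Open("input.root")
psets = f.Get("ParameterSets")
# Each entry of this tree should correspond to one stored top-level ParameterSet,
# so a large count here is a sign of the duplication described in the CMS Talk thread.
print("stored ParameterSets:", int(psets.GetEntries()))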

the compressed size of all the ParameterSets is 133 MB

Could you elaborate on how you found or estimated this number?

I ran the merge job locally, opened the resulting ROOT file with root, called ParameterSets->Print(), and took the

File  Size =  140051678

part of the printout.
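
The same check can be scripted with PyROOT instead of the interactive root prompt; roughly (the file name is a placeholder for the locally produced merge output):

# Sketch: reproduce the check above without an interactive ROOT session.
import ROOT

f = ROOT.TFile.Open("merged.root")  # locally produced merge output (placeholder name)
f.Get("ParameterSets").Print()      # the "File  Size = ..." line is the on-disk size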

@makortel
Contributor

@Dr15Jones took a deep dive and found that the culprit for the memory hoarding is not the ParameterSet storage per se, but a "performance bug" in the framework ParameterSet code. #44727 fixes that behavior.

We probably need to backport the fix at least to some earlier release cycles.

@makortel
Contributor

We'll backport the fix down to 13_2_X.
