Investigate memory usage increase for merge jobs #44679
cms-bot internal usage
A new Issue was created by @yanfr0818. @sextonkennedy, @Dr15Jones, @smuzaffar, @makortel, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign core
New categories assigned: core. @Dr15Jones, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
IgProf on slc7 is the "best" general-purpose tool for detailed memory profiling. Could you provide the logs and the PSet of a merge job that fails because of too much memory, to make it easier for others to investigate? The example is merging NanoAOD (in EDM file format); I'm somewhat surprised to see that take a lot of memory. Are the other failing merge jobs also about NanoAOD, or a mixture of data tiers? Given the NanoAOD, let's tag @cms-sw/xpog-l2 anyway.
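If a full IgProf session is too heavy as a first step, a coarser check is to enable the framework's SimpleMemoryCheck service in the merge job's configuration, which reports RSS/VSIZE growth in the job log. A minimal sketch, using the service's commonly seen parameters; the process name is only a placeholder, and in practice the service would be appended to the existing merge-job PSet:

```python
import FWCore.ParameterSet.Config as cms

# Placeholder process so the snippet stands alone; in practice, add the
# service to the merge job's existing configuration instead.
process = cms.Process("MERGE")

process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
    ignoreTotal = cms.untracked.int32(1),         # skip the first event(s) when reporting increases
    oncePerEventMode = cms.untracked.bool(True),  # report once per event rather than per module
)
```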
@makortel Thanks for looking into this.
Please check this out. This is the job log from our Unified database. Let me know if you want more information to investigate.
So far we have only seen such a pattern in …
Can one please secure some of the unmerged files somewhere on EOS/AFS so that we can access them before they get removed/cleaned up?
We took a look with @Dr15Jones, and the high memory usage is caused by the serialization of ParameterSets. The underlying reason why it gets this bad (i.e. many ParameterSets) is in the PromptReco; that was noticed in https://cms-talk.web.cern.ch/t/high-memory-usage-in-merge-alcareco-job-for-promptreco-run373536-zerobias17/29655 and fixed in #42833. @Dr15Jones is looking into a workaround for …
A separate issue is that, because of the many RECO ParameterSets being stored, even the merged file contains mostly ParameterSets. In the case of … the job merges 73 files, the output file size is 144 MB, and the compressed size of all the ParameterSets is 133 MB, implying ~92% of the file is ParameterSets. Even if the problem is confined to 2022 and partially 2023, and maybe only the ZeroBias and MinimumBias PDs (link), we should probably look into making the ParameterSet storage smaller in this kind of "lots of repetition" case.
The easiest and quickest way to proceed is to submit these jobs with a higher memory requirement (say 5 GB). Then we will try to find a 'fix' for any future cases where this might happen and add those fixes to the present development release.
@makortel Thanks for checking. A few questions here:
How do we end up getting multiple PSets? I think one CMSSW job requires only one PSet to run, right? In this task, we saw only one PSet. Does the pkl object PSet become multiple PSets after serialization?
Could you elaborate on how you found or estimated this number? I found …
We store (parts) of the ParameterSets of earlier processing steps in the provenance. In case of merges we don't store duplicates, i.e. in the ideal case where e.g. the RECO job configuration would be the the same for all the LuminosityBlocks of a Run, usually also for many Runs, there would be only one ParameterSet from RECO stored. The CMS Talk thread https://cms-talk.web.cern.ch/t/high-memory-usage-in-merge-alcareco-job-for-promptreco-run373536-zerobias17/29655 should explain pretty well why with these data there can be many RECO ParameterSets in the provenance.
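The deduplication can be pictured with a toy registry keyed on the content of each serialized ParameterSet; this is only an illustration of the idea, not the framework's actual code or data format:

```python
import hashlib
import json

def pset_id(pset: dict) -> str:
    """Stable hash of a serialized parameter set (illustrative only)."""
    return hashlib.sha256(json.dumps(pset, sort_keys=True).encode()).hexdigest()

stored = {}  # stand-in for the provenance store: one entry per unique PSet

# Case 1: every lumi produced with the same RECO configuration -> 1 stored PSet.
for lumi in range(100):
    cfg = {"process": "RECO", "threshold": 0.5}
    stored.setdefault(pset_id(cfg), cfg)

# Case 2: the configuration differs slightly per lumi (the PromptReco problem
# fixed in #42833) -> one stored PSet per lumi.
for lumi in range(100):
    cfg = {"process": "RECO", "threshold": 0.5, "lumiDependentValue": lumi}
    stored.setdefault(pset_id(cfg), cfg)

print(len(stored))  # 101: one shared PSet plus one per varying lumi
```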
I ran the merge job locally and opened the resulting ROOT file with …
… part of the printout.
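One way to reproduce this kind of estimate is to open the merged file in PyROOT and compare the compressed size of the ParameterSets tree with the total file size. A minimal sketch, with a hypothetical file name; this is not necessarily the exact command used above, which was cut from the comment:

```python
import ROOT

# Hypothetical file name; point this at the locally produced merge output.
f = ROOT.TFile.Open("merged_NANOAOD.root")
print(f"file size on disk: {f.GetSize() / 1e6:.1f} MB")

for key in f.GetListOfKeys():
    obj = key.ReadObj()
    if isinstance(obj, ROOT.TTree):
        # GetZipBytes() is the compressed (on-disk) size of the tree's baskets
        print(f"{obj.GetName():20s} {obj.GetZipBytes() / 1e6:8.1f} MB")
```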
@Dr15Jones took a deep dive and found that the culprit for the memory hoarding is not the ParameterSet storage per se, but a "performance bug" in the framework ParameterSet code. #44727 fixes that behavior. We probably need to backport the fix at least to some earlier release cycles.
We'll backport the fix down to 13_2_X.
Many merge jobs require higher memory usage during runtime, for example this WF.
Merge jobs are typically not memory demanding, so an increase in memory usage often makes the jobs fail. Is there any tool in CMSSW that I can use to debug this issue?