Segfaults in workflow 11834.21 step 2 #36336
Comments
assign simulation |
New categories assigned: simulation @mdhildreth,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks |
FYI @cms-sw/ecal-dpg-l2 |
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
The workflow does not crash in other IB architectures or flavors, so it looks like a "random crash". |
In CMSSW_12_2_ROOT6_X_2021-12-05-0000 workflow 11834.21 step 2 segfaults in
|
(changed the title to be more general) More crashes from CMSSW_12_3_ROOT624_X_2021-12-10-2300
and from CMSSW_12_3_ROOT6_X_2021-12-10-2300
|
CMSSW_12_3_X_2021-12-13-2300 slc7_amd64_gcc10 reports a segfault, but the log has no stack trace; it just ends abruptly |
CMSSW_12_3_X_2021-12-14-2300 cs8_amd64_gcc10 workflow 11834.0 step 2 had an assertion failure that could be connected (has two instances of
|
Here is another one from 11834.13 step 2 in CMSSW_12_3_X_2021-12-21-2300 (slc7_amd64_gcc10)
|
Another one appearing from 11834.0 in IB CMSSW_12_3 2021-12-23-1100
#23 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
Current Modules:
Module: CkfTrajectoryMaker:hltL3NoFiltersTrackCandidateFromL2IOHitNoVtx (crashed) |
Adding full stack traces for future reference (the logs disappear "too quickly")
|
Another one in workflow 11834.0 in CMSSW_12_3_X_2021-12-24-1100
|
Another one in workflow 11834.21 step 2 in CMSSW_12_3_X_2021-12-28-2300
|
Another one in 11834.21 step 2 on cs8_amd64_gcc10
|
Another one in 11834.21 step 2 on cs8_amd64_gcc11
|
It looks like the crash happens in events where there is a problem of
Also, several of the crashes reported here are not always reproducible, despite using the same CMSSW version and the same TTbar GEN-SIM file. Is this expected behaviour? |
@swagata87 Good catch! I checked one random succeeding 11834.21 step 2 in the IBs and that did not show
11834 is a pileup workflow, and the pileup mixing is not fully reproducible when run with multiple threads. The MinBias files are processed per edm stream, and the assignment of input GEN events to streams is not deterministic. MixingModule has a specific playback mode that allows rerunning the pileup mixing in exactly the same way, but it needs an already mixed file (i.e. the one from step2) as input. |
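To illustrate the point above, here is a toy sketch in plain Python (not CMSSW code). The function `mix`, its arguments, and the `(stream, pileup_index)` record are hypothetical names invented for this illustration. It assumes only what the comment states: each stream reads its own pileup sequence, and which stream handles which event can change from run to run. Replaying a stored record stands in for the playback mode.

```python
from itertools import count

def mix(event_to_stream, playback=None):
    """Return {event: (stream, pileup_index)} for a toy pileup mixing.

    event_to_stream: the stream id that happened to process each event
    (in a multi-threaded job this depends on scheduling, not on the config).
    playback: a record from an earlier run; if given, it is replayed verbatim.
    """
    if playback is not None:
        return dict(playback)
    cursors = {}  # per-stream position in that stream's own pileup sequence
    record = {}
    for event, stream in enumerate(event_to_stream):
        cursor = cursors.setdefault(stream, count())
        record[event] = (stream, next(cursor))
    return record

# Same events, same configuration, different event-to-stream assignment.
run_a = mix([0, 1, 0, 1])
run_b = mix([1, 0, 1, 0])
# Playback ignores the new assignment and reuses what run_a recorded.
replay = mix([1, 1, 1, 1], playback=run_a)

print(run_a == run_b)   # False
print(replay == run_a)  # True
```

The two normal runs differ only in the assignment of events to streams, which is enough to change which pileup entry each event receives. The replay reproduces the first run exactly because it needs that run's record as input, much as playback mode needs the already mixed step2 file.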
Another one in wf 11834.0 (step2) in
As far as I understood, this is only seen for HLT, and its frequency is not so small; otoh, there seems to be no way to reproduce it. By the way, a fix/workaround is in #36669. [*]
|
Another one in wf 11834.21 (step2) in
|
Some partial progress: after a few dozen tries, I managed to reproduce the error for wf
With the steps below, I could then reproduce the seg-fault at least once:
# ssh lxplus8.cern.ch
export SCRAM_ARCH=cs8_amd64_gcc10
cmsrel CMSSW_12_3_X_2022-01-14-2300
cd CMSSW_12_3_X_2022-01-14-2300/src
cmsenv
scram b
cp -pr /afs/cern.ch/work/m/missirol/public/cmssw36336 .
pushd cmssw36336
cmsRun step2_11834p0_inPlaybackMode_cfg.py &> step2_11834p0_inPlaybackMode_cfg.log
popd
On the other hand, the issue still does not seem to be fully reproducible (I ran these same steps 4 times, and I got the seg-fault only 2 out of 4 times), so I'm certainly missing something. With #36669 included, the error message I happened to get was
|
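For an intermittent failure like the "2 out of 4" above, it can help to script the repetition and tally the exit codes. This is a minimal sketch, not part of the original recipe. The helper name `tally_exit_codes` is hypothetical, and how `cmsRun` reports a seg-fault in its exit status is not confirmed here, which is why the sketch counts every exit code rather than looking for one specific value.

```python
import subprocess
import sys
from collections import Counter

def tally_exit_codes(command, attempts):
    """Run `command` `attempts` times and count how often each exit code occurs.

    On POSIX a negative code -N means the process was killed by signal N
    (for example -11 for SIGSEGV).
    """
    tally = Counter()
    for _ in range(attempts):
        result = subprocess.run(command, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        tally[result.returncode] += 1
    return tally

# Stand-in command so the sketch runs anywhere; for the real test one would
# pass something like ["cmsRun", "step2_11834p0_inPlaybackMode_cfg.py"].
demo = tally_exit_codes([sys.executable, "-c", "import sys; sys.exit(0)"], attempts=4)
print(demo)  # Counter({0: 4})
```

Note that the sketch discards the job output. To keep the stack traces, the `stdout` and `stderr` arguments would need to point at a per-attempt log file instead.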
Maybe obvious, but... one way to (really) reproduce the crash in the HLT step is to run the latter alone using the 'new' input file (the one used as input for 'playback mode' in the previous comment), so without remaking the RAW collection on-the-fly (afaiu).
# (see recipe in previous comments for context)
cp -pr /afs/cern.ch/work/m/missirol/public/cmssw36336 .
cmsDriver.py step2 -s HLT:@relval2021 --processName HLTX --conditions auto:phase1_2021_realistic \
--datatier GEN-SIM-DIGI-RAW -n 1 --eventcontent FEVTDEBUGHLT --geometry DB:Extended \
--era Run3 --filein file:cmssw36336/step2_forPlaybackInput.root --fileout file:tmp.root \
--customise_commands "process.source.skipEvents = cms.untracked.uint32(7)" |
@missirol Thanks for looking into this!
Just to clarify further (since it is not really obvious), the pileup mixing in step 2 is not fully reproducible when run with multiple threads. This happens because each stream has its own random number sequence and source of pileup files, and the assignment of input events to streams is not deterministic. The "playback mode" of MixingModule is effectively the only way to reproduce exactly what happened in the mixing. |
Thanks, @makortel. Maybe I was not clear in my last comment. Let me clarify, and you can tell me what I'm missing (you can also check the configs in the
In that latest test (the "2 out of 4"), I think I did run in "playback mode". In order, what I did was to (1) merge #36669 hoping to avoid the seg-fault, (2) run repeatedly one of the relevant wfs until 1 job showed the NaN error message provided by #36669; #36669 proved to be a good workaround, this job completed and created a valid step2 output. Then, I used that step2 output as input for a new job with
Beyond this, and even after the discussion up to #36669 (comment), what remains unclear to me is whether the root cause of the problem is in the HLT step, or before that. |
I suggest to try
a bunch of |
Thanks, @VinInn. With that change, the crash from #36336 (comment) [*] disappears.
[*]
cmsDriver.py step2 -s HLT:@relval2021 --processName HLTX --conditions auto:phase1_2021_realistic \
--datatier GEN-SIM-DIGI-RAW -n 1 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 \
--filein file:/afs/cern.ch/work/m/missirol/public/cmssw36336/step2_forPlaybackInput.root \
--fileout file:tmp.root --customise_commands "process.source.skipEvents = cms.untracked.uint32(7)" |
New categories assigned: hlt @missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@makortel @cms-sw/simulation-l2 do I understand correctly from #36336 (comment) that this issue is now fixed? Please close it, if so. |
Workflow 11834.21 step 2 crashes in CMSSW_12_2_X_2021-12-01-2300
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_12_2_X_2021-12-01-2300/pyRelValMatrixLogs/run/11834.21_TTbar_14TeV+2021PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimINPUT+DigiPU+RecoNanoPU+MiniAODPU+NanoPU/step2_TTbar_14TeV+2021PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimINPUT+DigiPU+RecoNanoPU+MiniAODPU+NanoPU.log#/