Crash in visualization DQM clients during SPLASH test #35634

Closed
francescobrivio opened this issue Oct 12, 2021 · 13 comments · Fixed by #35639
Comments

@francescobrivio
Contributor

francescobrivio commented Oct 12, 2021

A crash in the visualization DQM clients was reported by @pmandrik during the SPLASH test. The crashing clients are:
visualization-live_cfg.py
visualization-live-secondInstance_cfg.py
The error is:

----- Begin Fatal Exception 12-Oct-2021 16:07:52 CEST-----------------------
An exception of category 'NoRecord' occurred while
   [0] Processing  Event run: 345570 lumi: 1 event: 4035 stream: 2
   [1] Running path 'FEVToutput_step'
   [2] Prefetching for module JsonWritingTimeoutPoolOutputModule/'FEVToutput'
   [3] Prefetching for module ReducedRecHitCollectionProducer/'reducedEcalRecHitsEB'
   [4] Prefetching for module InterestingDetIdCollectionProducer/'interestingEcalDetIdPFEB'
   [5] Prefetching for module PFECALSuperClusterProducer/'particleFlowSuperClusterECAL'
   [6] Calling method for module BeamSpotOnlineProducer/'offlineBeamSpot'
Exception Message:
No "BeamSpotTransientObjectsRcd" record found in the EventSetup.n
 Please add an ESSource or ESProducer that delivers such a record.
----- End Fatal Exception -------------------------------------------------
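
When triaging a message like the one above, it can help to pull out the exception category, the failing module, and the missing record automatically. The following is a minimal sketch that assumes the log layout matches this single example; the function name `summarize_fatal_exception` and the regular expressions are illustrative and not part of any CMSSW tool.

```python
import re

# Shortened copy of the message above. The parsing rules are an assumption
# based on this one example, not on a documented cmsRun log format.
LOG = """\
----- Begin Fatal Exception 12-Oct-2021 16:07:52 CEST-----------------------
An exception of category 'NoRecord' occurred while
   [0] Processing  Event run: 345570 lumi: 1 event: 4035 stream: 2
   [1] Running path 'FEVToutput_step'
   [6] Calling method for module BeamSpotOnlineProducer/'offlineBeamSpot'
Exception Message:
No "BeamSpotTransientObjectsRcd" record found in the EventSetup.n
----- End Fatal Exception -------------------------------------------------
"""


def summarize_fatal_exception(text):
    """Return the category, failing module and missing record, when present."""
    category = re.search(r"An exception of category '([^']+)'", text)
    module = re.search(r"Calling method for module (\w+)/'([^']+)'", text)
    record = re.search(r'No "([^"]+)" record found', text)
    return {
        "category": category.group(1) if category else None,
        "module_type": module.group(1) if module else None,
        "module_label": module.group(2) if module else None,
        "missing_record": record.group(1) if record else None,
    }


summary = summarize_fatal_exception(LOG)
print(summary)
```

Any field that cannot be matched is returned as None, so the helper does not fail on messages with a different shape.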

The crash can be easily reproduced by running the unit test:

cmsrel CMSSW_12_0_2_patch1
cd CMSSW_12_0_2_patch1/src
cmsenv
git cms-addpkg DQM/Integration
cmsRun DQM/Integration/python/clients/visualization-live_cfg.py unitTest=True

DQM online experts can also reproduce the crash in their playback system at P5.

This crash happens only when running with the scenario ppEra_Run3; it does not occur when running with the scenario cosmicsEra_Run3.
IIUC this is because the reco sequence differs between the two cases, and in the pp case the beamspot is required.
Looking into the configuration, I can see that the offlineBeamSpot has been swapped with the onlineBeamSpot:

(Pdb) process.offlineBeamSpot
cms.EDProducer("BeamSpotOnlineProducer",
    changeToCMSCoordinates = cms.bool(False),
    gtEvmLabel = cms.InputTag("gtEvmDigis"),
    maxRadius = cms.double(2),
    maxZ = cms.double(40),
    setSigmaZ = cms.double(-1),
    src = cms.InputTag("scalersRawToDigi"),
    useTransientRecord = cms.bool(True)
)

and the only place this swapping happens is in:

def _swapOfflineBSwithOnline(process):
    import RecoVertex.BeamSpotProducer.onlineBeamSpotESProducer_cfi as _mod
    process.BeamSpotESProducer = _mod.onlineBeamSpotESProducer.clone(
        timeThreshold = 999999 # for express allow >48h old payloads for replays. DO NOT CHANGE
    )
    from RecoVertex.BeamSpotProducer.BeamSpotOnline_cfi import onlineBeamSpotProducer
    process.offlineBeamSpot = onlineBeamSpotProducer.clone()
    return process

This function is used to customize the express processing. As a result, the sequence now looks for a BeamSpotTransientObjectsRcd, but apparently no ESSource or ESProducer provides it. This is not clear to me, because the express customization automatically adds the BeamSpotESProducer.
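
The mismatch can be pictured with a toy model: a module that asks for a record fails unless some ESProducer in the same job delivers it. The sketch below uses plain dictionaries rather than the real configuration objects. The `CONSUMES` and `PRODUCES` tables and the function `missing_records` are assumptions drawn only from the error message and the module names in this issue.

```python
# Toy model only: plain dicts stand in for the process contents.
# Both tables are assumptions based on the error message in this issue,
# not read from the CMSSW sources.
CONSUMES = {"BeamSpotOnlineProducer": "BeamSpotTransientObjectsRcd"}
PRODUCES = {"OnlineBeamSpotESProducer": "BeamSpotTransientObjectsRcd"}


def missing_records(ed_modules, es_producers):
    """Map each module label to the record it needs but nobody provides."""
    provided = {PRODUCES[t] for t in es_producers.values() if t in PRODUCES}
    missing = {}
    for label, module_type in ed_modules.items():
        record = CONSUMES.get(module_type)
        if record is not None and record not in provided:
            missing[label] = record
    return missing


ed_modules = {"offlineBeamSpot": "BeamSpotOnlineProducer"}

# EDProducer swapped, but no ESProducer added: the record is missing.
without_es = missing_records(ed_modules, {})

# Same job with the ESProducer present: nothing is missing.
with_es = missing_records(
    ed_modules, {"BeamSpotESProducer": "OnlineBeamSpotESProducer"}
)
print(without_es, with_es)
```

In this picture the crash corresponds to the first case, where the module was swapped but the matching ESProducer never reached the client configuration.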

A quick solution would then be to add the ESProducer to the DQM clients:

process.load("CondCore.CondDB.CondDB_cfi")
process.BeamSpotESProducer = cms.ESProducer("OnlineBeamSpotESProducer")

But I'm not sure whether that is the correct way of doing it, or whether it's better to understand how the customization of the visualizationProcessing works and possibly modify it.

[EDIT]
The quick solution of adding the ESProducer to the DQM clients was already used in some form in #35373.

@francescobrivio
Contributor Author

francescobrivio commented Oct 12, 2021

assign dqm,reconstruction

@cmsbuild
Contributor

New categories assigned: dqm

@jfernan2,@ahmad3213,@rvenditti,@emanueleusai,@pbo0,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

A new Issue was created by @francescobrivio .

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@cmsbuild
Contributor

New categories assigned: reconstruction

@slava77,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

@francescobrivio
Contributor Author

FYI @mmusich @ggovi @gennai

@gennai
Contributor

gennai commented Oct 13, 2021

@mmasciov you may be interested in this as well

@mmusich
Contributor

mmusich commented Oct 13, 2021

@francescobrivio
I think the customization in Configuration/DataProcessing/python/RecoTLR.py only takes care of explicitly customized workflows (such as 138.2).
For the general case, something is missing here:

import RecoVertex.BeamSpotProducer.BeamSpotOnline_cfi
_onlineBeamSpotProducer = RecoVertex.BeamSpotProducer.BeamSpotOnline_cfi.onlineBeamSpotProducer.clone()
mods.offlineToOnlineBeamSpotSwap.toReplaceWith(offlineBeamSpot, _onlineBeamSpotProducer)

Please see #35639.
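
One way to read the snippet above: the replacement takes effect only when the offlineToOnlineBeamSpotSwap modifier is active, and it touches only the EDProducer. The toy below mimics that behaviour with plain dictionaries. `ToyModifier` is a hypothetical stand-in and not the real cms.Modifier, and "BeamSpotProducer" as the default offline module type is an assumption.

```python
class ToyModifier:
    """Hypothetical stand-in for a configuration modifier; NOT cms.Modifier."""

    def __init__(self, active):
        self.active = active

    def to_replace_with(self, target, replacement):
        # Swap the target's contents in place, but only when active.
        if self.active:
            target.clear()
            target.update(replacement)


def build_offline_beamspot(swap_active):
    # Module types are plain strings here. "BeamSpotProducer" as the default
    # offline type is an assumption, not stated in this thread.
    offline = {"type": "BeamSpotProducer"}
    online = {"type": "BeamSpotOnlineProducer", "useTransientRecord": True}
    ToyModifier(swap_active).to_replace_with(offline, online)
    return offline


with_swap = build_offline_beamspot(swap_active=True)
without_swap = build_offline_beamspot(swap_active=False)
print(with_swap["type"], without_swap["type"])
```

Nothing in this toy adds an ESProducer, which is the kind of gap the comment above points at.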

I managed to (almost) run your recipe above with it:

cmsrel CMSSW_12_0_2_patch1
cd CMSSW_12_0_2_patch1/src
cmsenv
git cms-addpkg DQM/Integration
cmsRun DQM/Integration/python/clients/visualization-live_cfg.py unitTest=True

though now it crashes differently with:

----- Begin Fatal Exception 13-Oct-2021 10:41:57 CEST-----------------------
An exception of category 'NoProxyException' occurred while
   [0] Processing  Event run: 334393 lumi: 1 event: 16188 stream: 0
   [1] Running path 'FEVToutput_step'
   [2] Prefetching for module JsonWritingTimeoutPoolOutputModule/'FEVToutput'
   [3] Prefetching for module ReducedRecHitCollectionProducer/'reducedEcalRecHitsEB'
   [4] Prefetching for module EleIsoDetIdCollectionProducer/'interestingGedEleIsoDetIdEB'
   [5] Calling method for module GEDGsfElectronFinalizer/'gedGsfElectrons'
Exception Message:
No data of type "GBRForestD" with label "electron_eb_ecalOnly_1To300_0p2To2_mean" in record "GBRDWrapperRcd"
 Please add an ESSource or ESProducer to your job which can deliver this data.
----- End Fatal Exception -------------------------------------------------

somewhat consistently with this comment:

<!-- No data of type "GBRForestD" with label "electron_eb_ecalOnly_1To300_0p2To2_mean" in record "GBRDWrapperRcd" -->
<!-- <test name="TestDQMOnlineClient-visualization" command="runtest.sh visualization-live_cfg.py" /> -->
<!-- <test name="TestDQMOnlineClient-visualization_secondInstance" command="runtest.sh visualization-live-secondInstance_cfg.py" /> -->

What is the @cms-sw/alca-l2 plan to fix that?

@francescobrivio
Contributor Author

francescobrivio commented Oct 13, 2021

> @francescobrivio
> I think the customization in Configuration/DataProcessing/python/RecoTLR.py only takes care of explicitly customized workflows (such as 138.2)
> For the general case, something is missing here:
>
> import RecoVertex.BeamSpotProducer.BeamSpotOnline_cfi
> _onlineBeamSpotProducer = RecoVertex.BeamSpotProducer.BeamSpotOnline_cfi.onlineBeamSpotProducer.clone()
> mods.offlineToOnlineBeamSpotSwap.toReplaceWith(offlineBeamSpot, _onlineBeamSpotProducer)
>
> see please #35639.

Thanks a lot for the quick fix! I left a comment on your PR.

> I managed to (almost) run your recipe above with it:
>
> cmsrel CMSSW_12_0_2_patch1
> cd CMSSW_12_0_2_patch1/src
> cmsenv
> git cms-addpkg DQM/Integration
> cmsRun DQM/Integration/python/clients/visualization-live_cfg.py unitTest=True
>
> though now it crashes differently with:
>
> ----- Begin Fatal Exception 13-Oct-2021 10:41:57 CEST-----------------------
> An exception of category 'NoProxyException' occurred while
>    [0] Processing  Event run: 334393 lumi: 1 event: 16188 stream: 0
>    [1] Running path 'FEVToutput_step'
>    [2] Prefetching for module JsonWritingTimeoutPoolOutputModule/'FEVToutput'
>    [3] Prefetching for module ReducedRecHitCollectionProducer/'reducedEcalRecHitsEB'
>    [4] Prefetching for module EleIsoDetIdCollectionProducer/'interestingGedEleIsoDetIdEB'
>    [5] Calling method for module GEDGsfElectronFinalizer/'gedGsfElectrons'
> Exception Message:
> No data of type "GBRForestD" with label "electron_eb_ecalOnly_1To300_0p2To2_mean" in record "GBRDWrapperRcd"
>  Please add an ESSource or ESProducer to your job which can deliver this data.
> ----- End Fatal Exception -------------------------------------------------
>
> somewhat consistently with this comment:
>
> <!-- No data of type "GBRForestD" with label "electron_eb_ecalOnly_1To300_0p2To2_mean" in record "GBRDWrapperRcd" -->
> <!-- <test name="TestDQMOnlineClient-visualization" command="runtest.sh visualization-live_cfg.py" /> -->
> <!-- <test name="TestDQMOnlineClient-visualization_secondInstance" command="runtest.sh visualization-live-secondInstance_cfg.py" /> -->
>
> what's @cms-sw/alca-l2 plan to fix that?

This is already fixed if you force the use of the latest 120X GT.
Running the unit test, I see that the GT picked up is still the old one:

> Using hardcoded GT: "113X_dataRun3_Express_v4"

and this ends up in the error that you report.
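
A quick sanity check for this kind of mismatch is to compare the release series in the GT name with the one implied by the release name. This is a minimal sketch that assumes the naming pattern seen in this thread (a GT starting with 113X, 120X or 121X, and a release named CMSSW_12_0_2_patch1). The helper names are hypothetical, and a GT from an older series is only a hint of a problem, not proof.

```python
import re


def release_series(cmssw_release):
    """Map e.g. 'CMSSW_12_0_2_patch1' to '120X' (assumed naming convention)."""
    match = re.match(r"CMSSW_(\d+)_(\d+)_", cmssw_release)
    if match is None:
        raise ValueError(f"unrecognized release name: {cmssw_release}")
    return f"{match.group(1)}{match.group(2)}X"


def gt_series(global_tag):
    """Return the leading series of a GT name, e.g. '113X', or None."""
    match = re.match(r"(\d+X)_", global_tag)
    return match.group(1) if match else None


def gt_matches_release(global_tag, cmssw_release):
    """True when the GT series equals the series implied by the release."""
    return gt_series(global_tag) == release_series(cmssw_release)


old_gt_ok = gt_matches_release("113X_dataRun3_Express_v4", "CMSSW_12_0_2_patch1")
print(old_gt_ok)
```

For the unit test above, the hardcoded GT belongs to the 113X series while the release implies 120X, so the check flags it.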

@mmusich
Contributor

mmusich commented Oct 13, 2021

> Thanks a lot for the quick fix! I left a comment on your PR.

I don't think the comment is relevant.

> This is already fixed if you force the use of the latest 120X GT.

I am not sure.
The error also happens in 12.1.X using 121X_dataRun3_Express_v5. What is the most up-to-date GT to use?

@mmusich
Contributor

mmusich commented Oct 13, 2021

> The error also happens in 12.1.X using 121X_dataRun3_Express_v5. What is the most up-to-date GT to use?

OK, apparently #35593 got in the way extremely recently (not yet in an IB)

@mmusich
Contributor

mmusich commented Oct 13, 2021

@cms-sw/dqm-l2 please see #35642, so that the integration tests do something actually useful and catch this sort of issue earlier.

@francescobrivio
Contributor Author

@pmandrik could you run the playback again, including PR #35653 (backport to 120X kindly provided by @mmusich)?

@pmandrik
Contributor

Hello, we checked that the Event Display clients run fine with this PR for the ppEra_Run3 scenario in the playback. Thanks.
