how to skip events which throw a cms::Exception
#41512
A new Issue was created by @missirol Marino Missiroli. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Is there a specific reason for these consuming modules (whose |
I think so. The "consuming" modules that can throw a "ProductNotFound" in this case are instances of I'm thinking that for these DQM-oriented modules a |
Just to add something which is not obvious: the EndPaths that contain instances of |
It seems like you have explored the limits of the framework's exception handling mechanisms. I should note that the behavior of Another caveat is whether the relevant modules are exception safe. We typically treat (most) exceptions as fatal and shut down the framework, in which case a module that e.g. manages resources in a non-RAII way is not too bad in practice (as the OS cleans up the memory and file handles anyway). But continuing data processing after an exception thrown from a non-exception-safe module can result in undesirable behavior later in the process. (I didn't check either of the modules in question here). In the short term I'd look into making the An alternative could be to extend |
Just to clarify: the module is not destroyed, it "just" exits from
Do you mean to change |
Correct.
Yes, or produce an empty data product. |
For For
Thanks, that's an interesting suggestion (I hadn't thought about it). |
Trigger{Bx,Rates}Monitor
if L1-uGT results are unavailable [13_1_X
]
#41569
Aside from the complications related to EndPaths, I think I still need to understand better what happens in I'm trying to understand what happens in the reproducer in [1].
I tried a 2nd reproducer [4], which is like [1] except for the fact that the 2nd Path is removed (so, there is only 1 Path, and 1 EndPath). The output of the 2nd reproducer is in [5], and it is stranger.
I don't understand most of what happens in
[1]
import FWCore.ParameterSet.Config as cms
# Configuration file based on FWCore/Integration/test/testFrameworkExceptionHandling_cfg.py
badEventID = cms.untracked.EventID(1, 1, 2)
nStreams = 1
nRuns = 1
nLumisPerRun = 1
nEventsPerLumi = 3
nEventsPerRun = nLumisPerRun*nEventsPerLumi
nLumis = nRuns*nLumisPerRun
nEvents = nRuns*nEventsPerRun
process = cms.Process('TEST')
process.source = cms.Source('EmptySource',
firstRun = cms.untracked.uint32(1),
firstLuminosityBlock = cms.untracked.uint32(1),
firstEvent = cms.untracked.uint32(1),
numberEventsInLuminosityBlock = cms.untracked.uint32(nEventsPerLumi),
numberEventsInRun = cms.untracked.uint32(nEventsPerRun),
)
process.maxEvents = cms.untracked.PSet(
input = cms.untracked.int32(nEvents)
)
process.options = cms.untracked.PSet(
numberOfThreads = cms.untracked.uint32(nStreams),
numberOfStreams = cms.untracked.uint32(nStreams),
numberOfConcurrentRuns = cms.untracked.uint32(1),
numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2)
)
process.load('FWCore.PrescaleService.PrescaleService_cfi')
process.PrescaleService.lvl1Labels = ['Col0', 'Col1']
process.PrescaleService.lvl1DefaultLabel = 'Col0'
process.PrescaleService.prescaleTable = cms.VPSet(
cms.PSet(
pathName = cms.string('path1'),
prescales = cms.vuint32(1, 1)
)
)
process.throwException = cms.EDProducer('ExceptionThrowingProducer',
eventIDThrowOnEvent = badEventID
)
process.hltPrePath1 = cms.EDFilter( 'HLTPrescaler',
offset = cms.uint32( 0 ),
L1GtReadoutRecordTag = cms.InputTag( 'hltGtStage2Digis' )
)
process.filterAlwaysTrue1 = cms.EDFilter( 'HLTBool', result = cms.bool( True ))
process.filterAlwaysTrue2 = process.filterAlwaysTrue1.clone()
process.filterAlwaysTrue3 = process.filterAlwaysTrue1.clone()
process.intProducer1 = cms.EDProducer('ManyIntProducer', ivalue = cms.int32( 1 ))
process.intProducer2 = process.intProducer1.clone( ivalue = 2 )
process.path1 = cms.Path(
process.throwException
+ process.filterAlwaysTrue1
+ process.hltPrePath1
+ process.filterAlwaysTrue2
+ process.intProducer1
+ process.filterAlwaysTrue3
)
process.path2 = cms.Path(
process.throwException
+ process.intProducer2
)
process.theOutputModule = cms.OutputModule('PoolOutputModule',
fileName = cms.untracked.string( 'cmssw41512_testException1.root' ),
SelectEvents = cms.untracked.PSet( SelectEvents = cms.vstring(
'path1',
'path2',
)),
outputCommands = cms.untracked.vstring( 'keep *' )
)
process.endpath1 = cms.EndPath(
process.theOutputModule
)
process.options.wantSummary = True
process.options.SkipEvent = cms.untracked.vstring( 'IntentionalTestException' )
[2]
diff --git a/HLTrigger/HLTcore/plugins/HLTPrescaler.cc b/HLTrigger/HLTcore/plugins/HLTPrescaler.cc
index 8841b0c3024..4c7e63e6646 100644
--- a/HLTrigger/HLTcore/plugins/HLTPrescaler.cc
+++ b/HLTrigger/HLTcore/plugins/HLTPrescaler.cc
@@ -68,6 +68,9 @@ void HLTPrescaler::beginLuminosityBlock(edm::LuminosityBlock const& lb, edm::Eve
bool HLTPrescaler::filter(edm::Event& iEvent, const edm::EventSetup&) {
// during the first event of a LumiSection, read from the GT the prescale index for this
// LumiSection and get the corresponding prescale factor from the PrescaleService
+
+ edm::LogPrint("HLTPrescaler") << "HLTPrescaler -- " << iEvent.id() << ", module = " << moduleDescription().moduleLabel() << ", path = " << iEvent.moduleCallingContext()->placeInPathContext()->pathContext()->pathName();
+
if (newLumi_) {
newLumi_ = false;
diff --git a/HLTrigger/HLTfilters/plugins/BuildFile.xml b/HLTrigger/HLTfilters/plugins/BuildFile.xml
index cb29742c3a0..12f2cba6bc7 100644
--- a/HLTrigger/HLTfilters/plugins/BuildFile.xml
+++ b/HLTrigger/HLTfilters/plugins/BuildFile.xml
@@ -18,6 +18,7 @@
<use name="FWCore/Framework"/>
<use name="FWCore/MessageLogger"/>
<use name="FWCore/ParameterSet"/>
+<use name="FWCore/ServiceRegistry"/>
<use name="FWCore/Utilities"/>
<use name="HLTrigger/HLTcore"/>
<flags EDM_PLUGIN="1"/>
diff --git a/HLTrigger/HLTfilters/plugins/HLTBool.cc b/HLTrigger/HLTfilters/plugins/HLTBool.cc
index 1d229804150..dafe18de3f5 100644
--- a/HLTrigger/HLTfilters/plugins/HLTBool.cc
+++ b/HLTrigger/HLTfilters/plugins/HLTBool.cc
@@ -12,6 +12,10 @@
#include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
#include "FWCore/ParameterSet/interface/ParameterSetDescription.h"
+#include "FWCore/ServiceRegistry/interface/PathContext.h"
+#include "FWCore/ServiceRegistry/interface/PlaceInPathContext.h"
+#include "FWCore/ServiceRegistry/interface/ModuleCallingContext.h"
+
//
// constructors and destructor
//
@@ -32,4 +36,9 @@ void HLTBool::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
//
// ------------ method called to produce the data ------------
-bool HLTBool::filter(edm::StreamID, edm::Event& event, edm::EventSetup const& setup) const { return result_; }
+bool HLTBool::filter(edm::StreamID, edm::Event& event, edm::EventSetup const& setup) const {
+
+ edm::LogPrint("HLTBool") << "HLTBool -- " << event.id() << ", module = " << moduleDescription().moduleLabel() << ", path = " << event.moduleCallingContext()->placeInPathContext()->pathContext()->pathName();
+
+ return result_;
+}
[3]
[4]
import FWCore.ParameterSet.Config as cms
# Configuration file based on FWCore/Integration/test/testFrameworkExceptionHandling_cfg.py
badEventID = cms.untracked.EventID(1, 1, 2)
nStreams = 1
nRuns = 1
nLumisPerRun = 1
nEventsPerLumi = 3
nEventsPerRun = nLumisPerRun*nEventsPerLumi
nLumis = nRuns*nLumisPerRun
nEvents = nRuns*nEventsPerRun
process = cms.Process('TEST')
process.source = cms.Source('EmptySource',
firstRun = cms.untracked.uint32(1),
firstLuminosityBlock = cms.untracked.uint32(1),
firstEvent = cms.untracked.uint32(1),
numberEventsInLuminosityBlock = cms.untracked.uint32(nEventsPerLumi),
numberEventsInRun = cms.untracked.uint32(nEventsPerRun),
)
process.maxEvents = cms.untracked.PSet(
input = cms.untracked.int32(nEvents)
)
process.options = cms.untracked.PSet(
numberOfThreads = cms.untracked.uint32(nStreams),
numberOfStreams = cms.untracked.uint32(nStreams),
numberOfConcurrentRuns = cms.untracked.uint32(1),
numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2)
)
process.load('FWCore.PrescaleService.PrescaleService_cfi')
process.PrescaleService.lvl1Labels = ['Col0', 'Col1']
process.PrescaleService.lvl1DefaultLabel = 'Col0'
process.PrescaleService.prescaleTable = cms.VPSet(
cms.PSet(
pathName = cms.string('path1'),
prescales = cms.vuint32(1, 1)
)
)
process.throwException = cms.EDProducer('ExceptionThrowingProducer',
eventIDThrowOnEvent = badEventID
)
process.hltPrePath1 = cms.EDFilter( 'HLTPrescaler',
offset = cms.uint32( 0 ),
L1GtReadoutRecordTag = cms.InputTag( 'hltGtStage2Digis' )
)
process.filterAlwaysTrue1 = cms.EDFilter( 'HLTBool', result = cms.bool( True ))
process.filterAlwaysTrue2 = process.filterAlwaysTrue1.clone()
process.filterAlwaysTrue3 = process.filterAlwaysTrue1.clone()
process.intProducer1 = cms.EDProducer('ManyIntProducer', ivalue = cms.int32( 1 ))
process.intProducer2 = process.intProducer1.clone( ivalue = 2 )
process.path1 = cms.Path(
process.throwException
+ process.filterAlwaysTrue1
+ process.hltPrePath1
+ process.filterAlwaysTrue2
+ process.intProducer1
+ process.filterAlwaysTrue3
)
#process.path2 = cms.Path(
# process.throwException
# + process.intProducer2
#)
process.theOutputModule = cms.OutputModule('PoolOutputModule',
fileName = cms.untracked.string( 'cmssw41512_testException2.root' ),
SelectEvents = cms.untracked.PSet( SelectEvents = cms.vstring(
'path1',
# 'path2',
)),
outputCommands = cms.untracked.vstring( 'keep *' )
)
process.endpath1 = cms.EndPath(
process.theOutputModule
)
process.options.wantSummary = True
process.options.SkipEvent = cms.untracked.vstring( 'IntentionalTestException' )
[5]
|
The framework schedules all the producers and analyzers between two filters (say A and B) to be run concurrently, along with the latter filter (B). If the latter filter (B) accepts the event, the framework schedules the next group of producers and analyzers until (and including) the following filter, etc. The scheduled modules are run in some order that satisfies the data dependencies of the modules. In the case of Therefore in your example [1,3] the following module running order is valid
In the single thread case the module execution order gets dictated by the order the modules are scheduled, as the TBB's per-thread work queue acts like a stack. The paths are processed in the reverse order (i.e. modules in In the second example [4,5] the removal of
The |
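The grouping rule described above (producers and analyzers are scheduled together with the next filter on the Path) can be sketched in plain Python. This is only a toy model of the description in this comment, not framework code: the function name `scheduling_groups` and the `(label, kind)` representation are made up for illustration, while the module labels are those of `path1` in reproducer [1].

```python
# Toy model (not CMSSW code): split a Path into the groups of modules that,
# per the description above, are scheduled together up to and including
# the next filter.

# (label, kind) pairs mirroring path1 of reproducer [1].
PATH1 = [
    ("throwException", "producer"),
    ("filterAlwaysTrue1", "filter"),
    ("hltPrePath1", "filter"),
    ("filterAlwaysTrue2", "filter"),
    ("intProducer1", "producer"),
    ("filterAlwaysTrue3", "filter"),
]


def scheduling_groups(path):
    """Return lists of module labels; each list ends at a filter (if any)."""
    groups, current = [], []
    for label, kind in path:
        current.append(label)
        if kind == "filter":
            # A filter closes the current group: the next group is only
            # scheduled if this filter accepts the event.
            groups.append(current)
            current = []
    if current:
        # Trailing non-filter modules form a last group of their own.
        groups.append(current)
    return groups


GROUPS = scheduling_groups(PATH1)
for index, group in enumerate(GROUPS):
    print(index, group)
```

In this model `throwException` and `filterAlwaysTrue1` land in the same group, which is consistent with the statement that modules within a group may run in any order allowed by their data dependencies.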
Thanks for the clarification, @makortel ! (much needed) |
After playing a bit with a test configuration (in the context of #41645) it actually seems that a module throwing an exception with category in the |
I think I noticed that, but somehow I still couldn't get my example with For the sake of documenting, I wrote in CMSHLT-2793 what I tried in the context of the problem that triggered this issue. In this particular case, I think one possibility would have been to use Luckily, a recent update of the L1T menu seems to have fixed the root cause of the problem, and we haven't seen HLT crashes online in the last few days. |
Following up on the "conceptual design" in #41645 (comment) with the next, more concrete iteration of the plan
All the new names are preliminary (suggestions for better names are welcome). |
PR #42441 implements the
|
@Dr15Jones @makortel I found the initial description of the behaviour of the output modules a bit confusing. In particular what does it mean that an output module depends on a module that throws an exception ? Could you summarise how the output modules should behave, also in view of later changes? |
Yes. It also means any modules needed by those modules which created the data products to 'keep'.
Not exactly. A Path which sees an exception (either because a module on the Path throws the exception OR an unscheduled module needed by the module on the Path throws an exception) will be marked as having an 'error' status. The OutputModule will only run if at least one of its Paths doesn't see the exception and the Path succeeds as normal. To illustrate, I've put together a small program using some dummy test modules. (NOTE: I modified AddIntsProducer so that it would ignore any data products which are missing from the Event). The following is a trivialized representation of the HLT.
import FWCore.ParameterSet.Config as cms
process = cms.Process("TEST")
process.source = cms.Source("EmptySource")
process.maxEvents.input = 3
#this is the type thrown by FailingProducer
process.options.TryToContinue = ["NotFound"]
process.options.wantSummary = True
process.trackingHits = cms.EDProducer("FailingProducer")
process.tracks = cms.EDProducer("AddIntsProducer", labels = cms.VInputTag("trackingHits"))
process.trackFilter = cms.EDFilter("IntProductFilter",
label = cms.InputTag("tracks"),
threshold = cms.int32(0),
shouldProduce = cms.bool(True)
)
process.caloClusters = cms.EDProducer("IntProducer", ivalue = cms.int32(1))
process.caloFilter = cms.EDFilter("IntProductFilter",
label = cms.InputTag("caloClusters"),
threshold = cms.int32(0),
shouldProduce = cms.bool(True)
)
process.globalTrigger = cms.EDProducer("AddIntsProducer", labels = cms.VInputTag("trackFilter","caloFilter"))
process.globalTrigger.shouldTryToContinue()
process.trackPath = cms.Path(process.trackingHits+process.tracks+process.trackFilter)
process.caloPath = cms.Path(process.caloClusters+process.caloFilter)
process.globalTriggerPath = cms.Path(process.globalTrigger)
outputTemplate_ = cms.OutputModule("AsciiOutputModule",
outputCommands = cms.untracked.vstring("drop *", "keep edmTriggerResults_*_*_*"),
SelectEvents = cms.untracked.PSet(SelectEvents = cms.vstring()))
process.trackOut = outputTemplate_.clone(SelectEvents = dict(SelectEvents=["trackPath"]))
process.caloOut = outputTemplate_.clone(SelectEvents = dict(SelectEvents=["caloPath"]))
process.trackAndCaloOut = outputTemplate_.clone(SelectEvents=dict(SelectEvents=["trackPath","caloPath"]))
process.globalTriggerOut = outputTemplate_.clone(outputCommands = ["drop *", "keep edmTriggerResults_*_*_*","keep *_globalTrigger__TEST"])
process.exceptionOut = outputTemplate_.clone(SelectEvents=dict(SelectEvents=["exception@*"]))
process.out = cms.EndPath(process.trackOut+process.caloOut+process.trackAndCaloOut+process.exceptionOut+process.globalTriggerOut)
When run, all the OutputModules write the 3 events except for 'trackOut', which writes no events because the only Path it depends upon never succeeds (i.e. trackPath is set to the error state for each Event). From the summary we see
|
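The outcome described for this example can be reproduced with a small stand-alone model of the selection rule. This is a sketch under stated assumptions, not the framework's implementation: it assumes that a plain Path name in `SelectEvents` matches only a Path that passed, that `exception@<pattern>` matches any Path in the error state, and that an empty `SelectEvents` accepts every event. The helper `writes_event` is hypothetical; the labels and Path states are taken from the configuration above.

```python
from fnmatch import fnmatchcase

# Per-event Path states in the example above: trackPath always ends in the
# error state because FailingProducer throws; the other Paths succeed.
PATH_STATUS = {"trackPath": "error", "caloPath": "pass", "globalTriggerPath": "pass"}

# SelectEvents of each OutputModule, copied from the configuration.
SELECT_EVENTS = {
    "trackOut": ["trackPath"],
    "caloOut": ["caloPath"],
    "trackAndCaloOut": ["trackPath", "caloPath"],
    "exceptionOut": ["exception@*"],
    "globalTriggerOut": [],
}


def writes_event(criteria, path_status):
    """Toy decision: does an OutputModule with these criteria write the event?"""
    if not criteria:
        return True  # assumption: an empty SelectEvents selects every event
    for criterion in criteria:
        if criterion.startswith("exception@"):
            wanted, pattern = "error", criterion[len("exception@"):]
        else:
            wanted, pattern = "pass", criterion
        for name, status in path_status.items():
            if status == wanted and fnmatchcase(name, pattern):
                return True
    return False


DECISIONS = {label: writes_event(crit, PATH_STATUS) for label, crit in SELECT_EVENTS.items()}
for label, decision in DECISIONS.items():
    print(f"{label}: {'writes' if decision else 'skips'} the event")
```

Under these assumptions only `trackOut` skips the event, matching the behaviour reported for the real job.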
hi Chris, thanks, I think the example clarifies better the expected behaviour, at least for me. |
@fwyzard @missirol Matti and I chatted today and we are not convinced that the new Given the shouldTryToContinue is related to how a job responds to an exception (and not to how the algorithm for a module should be run) we thought that specifying the list of modules in the We were hoping to get your thoughts on either the |
Let me add explicitly @Martin-Grunewald and @mmusich to the discussion. |
Hi @Dr15Jones , thanks for these updates. I only have some naive feedback after reading the info above without testing #42441. Fwiw, I agree with
One reason is that this makes it easy to control this in
Regarding how to extend the |
I'd actually be interested to learn how the |
(Here's just what I think I know. Andrea or others can comment better.) By default, the HLT menus in If necessary, it is possible to define a global PSet named [*] One example is hltConfigFromDB --configName --adg /cdaq/test/missirol/test/2023/week18/CMSLITOPS_411/Test02/HLT/V5 > foo.py which was used to test the use of Another example is hltConfigFromDB --runNumber 355207 > bar.py # menu: /cdaq/physics/firstCollisions22/v4.0/HLT/V1 in which GPU offloading was disabled online by defining the following PSet in
|
Trying to give more concrete feedback on #42441. I took that PR, specifically
Tentative summary.
A few questions.
If you have suggestions/corrections, please let me know.
[1]
#!/bin/bash -ex
# cmsrel CMSSW_13_3_0_pre1
# cd CMSSW_13_3_0_pre1/src
# cmsenv
# git checkout -b test_cmssw42441
# git cms-merge-topic cms-sw:42441
# git cms-remote add missirol
# git cms-addpkg HLTrigger/HLTfilters
# git cp b67f107964f
# scram b -j 16
# run 366497, LS 196
INPUTFILE=root://eoscms.cern.ch//store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run366497/run366497_ls0196_index000095_fu-c2b01-26-01_pid1955211.root
RUNNUM=366497
run_test() {
rm -rf run"${RUNNUM}"*
convertToRaw -f 100 -l 100 -r "${RUNNUM}":196 -o . -- "${INPUTFILE}"
edmConfigDump "${tmpfile}" > "${1}".py
cmsRun "${1}".py &> "${1}".log || true
}
tmpfile=$(mktemp)
hltConfigFromDB --runNumber "${RUNNUM}" > "${tmpfile}"
cat <<@EOF >> "${tmpfile}"
process.load('run${RUNNUM}_cff')
# customisations to adapt 13_0_X HLT menus to CMSSW_13_3_0_pre1
from HLTrigger.Configuration.customizeHLTforCMSSW import customizeHLTforCMSSW
process = customizeHLTforCMSSW(process)
# remove prescales, and set GlobalTag
del process.PrescaleService
process.GlobalTag.globaltag = '130X_dataRun3_HLT_v2'
# show statistics on decisions of modules and Paths
process.options.wantSummary = True
# number of threads/streams used online by HLT
process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF
# hlt1: reproduce online crash seen in run 366497 (LS 196)
jobLabel="hlt1"
run_test "${jobLabel}"
# hlt2: hlt1 + "try to continue" upon exceptions of type 'InvalidGlobalAlgBlkBxCollection'
jobLabel="hlt2"
cat <<@EOF >> "${tmpfile}"
process.options.TryToContinue = cms.untracked.vstring( 'InvalidGlobalAlgBlkBxCollection' )
@EOF
run_test "${jobLabel}"
# hlt3: hlt2 + send events to a separate PrimaryDataset and stream
jobLabel="hlt3"
cat <<@EOF >> "${tmpfile}"
# New Path to select events in which at least one other Path went in Error state
process.HLTBeginSequenceAny = cms.Sequence( process.hltGtStage2Digis )
process.hltPrePathStatusError = cms.EDFilter( "HLTPrescaler",
offset = cms.uint32( 0 ),
L1GtReadoutRecordTag = cms.InputTag( "hltGtStage2Digis" )
)
process.hltPathStatusErrorFilter = cms.EDFilter( "HLTPathStatusErrorFilter",
ignoreInvalidPathNames = cms.bool(False),
# consider all Paths in the configuration ..
pathNames = cms.vstring( '*' ),
# .. except for the Path holding this module, as well as DatasetPaths
pathNamesToSkip = cms.vstring( 'HLT_PathStatusError_v*', 'Dataset_*' ),
)
process.HLT_PathStatusError_v1 = cms.Path(
process.HLTBeginSequenceAny
+ process.hltPrePathStatusError
+ process.hltPathStatusErrorFilter
+ process.HLTEndSequence
)
# "DatasetPath": Path using TriggerResultsFilter to select on other Paths
process.hltPreDatasetHLTError = cms.EDFilter( "HLTPrescaler",
offset = cms.uint32( 0 ),
L1GtReadoutRecordTag = cms.InputTag( "hltGtStage2Digis" )
)
process.hltDatasetHLTError = cms.EDFilter( "TriggerResultsFilter",
usePathStatus = cms.bool( True ),
hltResults = cms.InputTag( "" ),
l1tResults = cms.InputTag( "" ),
l1tIgnoreMaskAndPrescale = cms.bool( False ),
throw = cms.bool( True ),
triggerConditions = cms.vstring( 'HLT_PathStatusError_v1' )
)
process.Dataset_HLTError = cms.Path(
process.HLTDatasetPathBeginSequence
+ process.hltDatasetHLTError
+ process.hltPreDatasetHLTError
)
# "StreamPath": FinalPath with OutputModule selecting on DatasetPath
process.hltOutputHLTError = cms.OutputModule("GlobalEvFOutputModule",
SelectEvents = cms.untracked.PSet(
SelectEvents = cms.vstring(
'Dataset_HLTError'
)
),
compression_algorithm = cms.untracked.string('ZSTD'),
compression_level = cms.untracked.int32(3),
lumiSection_interval = cms.untracked.int32(0),
outputCommands = cms.untracked.vstring(
'drop *',
'keep FEDRawDataCollection_rawDataCollector_*_*',
'keep FEDRawDataCollection_source_*_*',
'keep GlobalObjectMapRecord_hltGtStage2ObjectMap_*_*',
'keep edmTriggerResults_*_*_*',
'keep triggerTriggerEvent_*_*_*'
),
psetMap = cms.untracked.InputTag("hltPSetMap"),
use_compression = cms.untracked.bool(True)
)
process.HLTErrorOutput = cms.FinalPath( process.hltOutputHLTError )
# update cms.Schedule adding new Path, DatasetPath, and StreamPath
process.schedule.extend([
process.HLT_PathStatusError_v1,
process.Dataset_HLTError,
process.HLTErrorOutput
])
# update the global PSets "datasets" and "streams"
# (just to mimic the db->python converter of ConfDB)
process.datasets.HLTError = cms.vstring( 'HLT_PathStatusError_v1' )
process.streams.HLTError = cms.vstring( 'HLTError' )
# prevent the Paths HLT_PathStatusError_v1 and Dataset_HLTError
# from going into Error state themselves
process.options.modulesToCallForTryToContinue = cms.untracked.vstring(
'hltPrePathStatusError',
'hltPreDatasetHLTError'
)
@EOF
run_test "${jobLabel}"
# hlt4: hlt3 but with an empty modulesToCallForTryToContinue
jobLabel="hlt4"
cat <<@EOF >> "${tmpfile}"
process.options.modulesToCallForTryToContinue = cms.untracked.vstring()
@EOF
run_test "${jobLabel}"
[2] In the current HLT menus, every
[3] (from #41512 (comment))
|
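For readers unfamiliar with the `pathNames` / `pathNamesToSkip` pair used in "hlt3", the intended selection can be illustrated with shell-style wildcards in plain Python. This is only a reading of the comments in the configuration above, not the code of `HLTPathStatusErrorFilter`; the function `paths_to_monitor` and the two `HLT_Example*` Path names are hypothetical.

```python
from fnmatch import fnmatchcase


def paths_to_monitor(all_paths, path_names, path_names_to_skip):
    """Keep Paths matching any pattern in path_names, then drop those
    matching any pattern in path_names_to_skip."""
    selected = [
        path for path in all_paths
        if any(fnmatchcase(path, pattern) for pattern in path_names)
    ]
    return [
        path for path in selected
        if not any(fnmatchcase(path, pattern) for pattern in path_names_to_skip)
    ]


ALL_PATHS = [
    "HLT_ExamplePathA_v1",     # hypothetical trigger Path
    "HLT_ExamplePathB_v2",     # hypothetical trigger Path
    "HLT_PathStatusError_v1",  # Path holding the filter itself
    "Dataset_HLTError",        # DatasetPath
]

MONITORED = paths_to_monitor(
    ALL_PATHS,
    path_names=["*"],
    path_names_to_skip=["HLT_PathStatusError_v*", "Dataset_*"],
)
print(MONITORED)
```

With these inputs only the two ordinary trigger Paths remain, i.e. the filter would look at every Path except its own and the DatasetPaths.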
@missirol thanks very much for your very thorough report and attempting to use the TryToContinue mechanism. Evidently we haven't made well known that the SelectEvents specification of OutputModules has supported from the very beginning the ability to trigger on paths that have an error. This is done by adding Could you try an additional case where you get rid of your error path and try using |
Sure, I tried this in [1] ("hlt5"). The job completes, and it produces (among other things) a streamer file named
which contains the one event where the exception is thrown (checked after repacking it). So, it looks to me like it's working. The corresponding
[1]
#!/bin/bash -ex
# cmsrel CMSSW_13_3_0_pre1
# cd CMSSW_13_3_0_pre1/src
# cmsenv
# git checkout -b test_cmssw42441
# git cms-merge-topic cms-sw:42441
# git cms-remote add missirol
# git cms-addpkg HLTrigger/HLTfilters
# git cp b67f107964f
# scram b -j 16
# run 366497, LS 196
INPUTFILE=root://eoscms.cern.ch//store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run366497/run366497_ls0196_index000095_fu-c2b01-26-01_pid1955211.root
RUNNUM=366497
run_test() {
rm -rf run"${RUNNUM}"*
convertToRaw -f 100 -l 100 -r "${RUNNUM}":196 -o . -- "${INPUTFILE}"
edmConfigDump "${tmpfile}" > "${1}".py
cmsRun "${1}".py &> "${1}".log || true
}
tmpfile=$(mktemp)
hltConfigFromDB --runNumber "${RUNNUM}" > "${tmpfile}"
cat <<@EOF >> "${tmpfile}"
process.load('run${RUNNUM}_cff')
# customisations to adapt 13_0_X HLT menus to CMSSW_13_3_0_pre1
from HLTrigger.Configuration.customizeHLTforCMSSW import customizeHLTforCMSSW
process = customizeHLTforCMSSW(process)
# remove prescales, and set GlobalTag
del process.PrescaleService
process.GlobalTag.globaltag = '130X_dataRun3_HLT_v2'
# show statistics on decisions of modules and Paths
process.options.wantSummary = True
# number of threads/streams used online by HLT
process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF
# hlt5: send events with at least one Path in error state to a separate output file, using the syntax "exception@*"
jobLabel="hlt5"
cat <<@EOF >> "${tmpfile}"
process.options.TryToContinue = cms.untracked.vstring( 'InvalidGlobalAlgBlkBxCollection' )
# "StreamPath": FinalPath with OutputModule selecting on DatasetPath
process.hltOutputHLTError = cms.OutputModule("GlobalEvFOutputModule",
SelectEvents = cms.untracked.PSet(
SelectEvents = cms.vstring(
'exception@*'
)
),
compression_algorithm = cms.untracked.string('ZSTD'),
compression_level = cms.untracked.int32(3),
lumiSection_interval = cms.untracked.int32(0),
outputCommands = cms.untracked.vstring(
'drop *',
'keep FEDRawDataCollection_rawDataCollector_*_*',
'keep FEDRawDataCollection_source_*_*',
'keep GlobalObjectMapRecord_hltGtStage2ObjectMap_*_*',
'keep edmTriggerResults_*_*_*',
'keep triggerTriggerEvent_*_*_*'
),
psetMap = cms.untracked.InputTag("hltPSetMap"),
use_compression = cms.untracked.bool(True)
)
process.HLTErrorOutput = cms.FinalPath( process.hltOutputHLTError )
# update cms.Schedule adding (Final)Path with new OutputModule
process.schedule.append( process.HLTErrorOutput )
@EOF
run_test "${jobLabel}" |
@missirol Thanks! [I am trying to build an area based on your recipe from earlier but compilation will take a long time :) ] Is the |
@missirol to better understand the |
No, the module
Yes, with the caveat that I'm not doing any detailed checks, but rather just quickly scanning thousands of lines by eye looking for anything odd.
I'm mostly looking at lines that contain [1]
|
Another tiny example below. I think all are correct except for "hlt3", which is clearly off.
> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt1.log | wc -l
89
> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt2.log | wc -l
89
> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt3.log | wc -l
2
> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt4.log | wc -l
90
> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt5.log | wc -l
89 |
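The same check can be scripted, which makes it easier to compare many log files at once. The sketch below is a plain-Python equivalent of the `grep ... | wc -l` commands above; the function name `count_dataset_path_reports` and the sample log text are made up for illustration, and only the marker string is taken from the commands.

```python
# Marker string copied from the grep commands above.
MARKER = "TrigReport ---------- Modules in Path: Dataset_"


def count_dataset_path_reports(log_text, marker=MARKER):
    """Count the lines containing the marker, like `grep marker | wc -l`."""
    return sum(1 for line in log_text.splitlines() if marker in line)


# Hypothetical log excerpt, only to exercise the function.
SAMPLE_LOG = "\n".join([
    "TrigReport ---------- Modules in Path: Dataset_HLTError ------------",
    "TrigReport ---------- Modules in Path: HLT_PathStatusError_v1 ------------",
    "TrigReport ---------- Modules in Path: Dataset_Example ------------",
    "some unrelated message",
])

SAMPLE_COUNT = count_dataset_path_reports(SAMPLE_LOG)
print(SAMPLE_COUNT)
```

To apply it to a real log, one could read the file first, for instance `count_dataset_path_reports(open("hlt3.log").read())`, and compare the result across jobs.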
So for the weirdness in hlt3.log, the only way I can even begin to think how that printout could happen is if multiple threads were simultaneously running I'll definitely look into it. Be that as it may, does the configuration in hlt5 suit your needs? |
On a quick look, another thing different between hlt3.log and hlt5.log is the number of message logger messages that were skipped. For hlt3.log it was around 57,000 while hlt5.log only 150. It is possible that Maybe adding the Tracer service to the job might uncover the behavior. |
Not fully, in the sense that "hlt5" uses As I was trying to explain in #41512 (comment), "hlt3" follows the HLT "rules", it uses |
@missirol what are the odds that the ConfDB limitations could be lifted? Seems like those might be more of a hindrance than a help for dealing with special cases. |
In general, bad odds (as usual with The GUI and the db can often be limiting factors when it comes to supporting new features of CMSSW inside HLT menus, but here My current take is that what the framework will provide after #42441 is sufficient for HLT.
Let me add that, based on discussions had in TSG, the likelihood of HLT deciding to bypass exceptions in production is very low. Skipping events at HLT is seen as an extreme measure, because doing so removes pressure from finding the actual solution to the problem (where the problem is usually outside of HLT, like this year in the case of corrupted data from L1T causing crashes when unpacked at HLT). |
From internal discussions with the framework team, we also see any use of |
I think I probably determined why the TrigReport had a problem. It looks like messages are being dropped while the summary is being printed. I found that if I turned on INFO reporting, then It looks like sometimes the The message appears to be coming from here cmssw/EventFilter/Utilities/src/FastMonitoringService.cc Lines 821 to 838 in 0c9536a
|
So after modifying |
+core I think this issue can be closed by now |
@cmsbuild, please close |
cms-bot internal usage |
This issue is fully signed and ready to be closed. |
The reproducer in [1] (CMSSW_13_0_5_patch1, input file on lxplus) tries to use options.skipEvent (documented, for example, in SWGuideEdmExceptionUse#Framework_Exception_Handling) in order to skip an event which is known to throw an exception of type "InvalidGlobalAlgBlkBxCollection" from the module hltStage2GtDigis.
Naively, I was expecting the job to skip the event and succeed. Instead, I see that the job fails because a different module on one EndPath throws a different exception ("ProductNotFound") while attempting to access the products of hltStage2GtDigis (which are likely not produced because hltStage2GtDigis fails due to "InvalidGlobalAlgBlkBxCollection"). The error message of the reproducer is in [2]. @fwyzard spotted that the message quotes "Begin IgnoreCompletely", and does not quote "Begin SkipEvent". A simple search leads me to this:
One workaround is to include ProductNotFound in options.skipEvent.
Question: are there "better" ways ?
Context : this issue is related to #41489 (comment), as we look into the feasibility of using options.skipEvent to avoid the frequent HLT crashes seen online these days due to the L1T unpacker (CMSLITOPS-411).
FYI: @silviodonato @cms-sw/hlt-l2
[1]
[2]
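The failure mode and the workaround described in this report can be summarised with a deliberately simplified model. It is not how the framework handles exceptions internally; it only assumes, as reported above, that a job survives an event only if every exception category thrown while processing it is listed in the skip list. The function `job_outcome` and its return strings are hypothetical.

```python
def job_outcome(thrown_categories, skip_event):
    """Toy model: the job continues only if every thrown exception
    category is listed in skip_event."""
    for category in thrown_categories:
        if category not in skip_event:
            return f"job fails: unhandled {category}"
    return "event skipped, job continues"


# Categories seen for the bad event, per the report: the unpacker throws
# first, then a consumer on an EndPath fails to find its products.
THROWN = ["InvalidGlobalAlgBlkBxCollection", "ProductNotFound"]

ONLY_FIRST = job_outcome(THROWN, {"InvalidGlobalAlgBlkBxCollection"})
WITH_WORKAROUND = job_outcome(
    THROWN, {"InvalidGlobalAlgBlkBxCollection", "ProductNotFound"}
)
print(ONLY_FIRST)
print(WITH_WORKAROUND)
```

The first call mirrors the reproducer (the secondary "ProductNotFound" is not covered), the second mirrors the workaround of listing both categories.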