opportunity to clean up LogErrors and LogWarnings in 2023 data promptReco processing #41456
A new Issue was created by @slava77 Slava Krutelyov. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign reconstruction,dqm,l1,alca |
New categories assigned: dqm,alca,reconstruction,l1 @epalencia,@micsucmed,@rvenditti,@mandrenguyen,@emanueleusai,@syuvivida,@clacaputo,@aloeliger,@francescobrivio,@saumyaphor4252,@tvami,@cecilecaillol,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks |
an example log from one job in ZeroBias can be found here |
concerning
the message is of the type:
that happens because the prompt GT doesn't have the
for reference, express has:
so it always enters this clause (RecoVertex/BeamSpotProducer/plugins/OnlineBeamSpotESProducer.cc, lines 155 to 158 at ca3234a), which in turn triggers the warning here (RecoVertex/BeamSpotProducer/plugins/BeamSpotOnlineProducer.cc, lines 106 to 110 at ca3234a). |
I forgot that LogWarnings:
So, L1T actually "wins": 51223 from LogErrors:
|
Since this is particularly noisy, I was having a look at the log file you posted above, but I don't seem to be able to find occurrences of it there...
(which I think is expected, given that in Run-3 the luminosity from the scalers is not available). |
Hi @bsunanda, can you please have a look at |
These show up in Muon0 |
thanks, and indeed:
|
@cms-sw/ctpps-dpg-l2 please take note of the prevalence of
(and similar) warnings in the recent prompt. |
@cms-sw/l1-l2
logs are littered with these. Is it possible to do something about it? |
Curiously, the first
The warnings are kind of similar, though, in the Muon0 log
|
I looked at some of the linked logs, but I can't see any Warning related to the CTPPSPixelDataFormatter module, although @slava77 quotes a few thousand (can you please point me to one of them?). |
There's plenty of them here: https://slava77sk.web.cern.ch/slava77sk/reco/log/PromptReco_Run366498_ZeroBias/WMTaskSpace/logCollect1/11735be0-80ac-4b03-9427-9428d624bfe3-1-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log |
> I looked at some of the linked logs, but I can't see any Warning related to the CTPPSPixelDataFormatter module, although @slava77 quotes a few thousand (can you please point me to one of them?).

Still, I see only errors related to CTPPSPixelDataFormatter, no warnings. As I said, it's correct that they are reported. If there is any warning (MSG-w, not MSG-e) issued by CTPPSPixelDataFormatter, please cut and paste the line, because I can't actually see them. |
ehm, apparently |
I now used Python to analyze this in FWLite. The updated top list is:
Warnings
Errors
```python
import ROOT

f = open("zb.ews.pyscan.txt", "a")
en = ROOT.TChain("Events")
n = en.Add("root://cms-xrd-global.cern.ch/LFN")  # "LFN" stands in for the skim file's logical file name
for ev in range(en.GetEntries()):
    n = en.GetEntry(ev)
    # per-event list of LogError/LogWarning summaries harvested by logErrorHarvester
    mv = en.edmErrorSummaryEntrys_logErrorHarvester__RECO.product()
    e = en.EventAuxiliary
    for mm in mv:
        print(e.run(), e.luminosityBlock(), e.event(),
              mm.severity.getName().data(), mm.count, mm.module, mm.category, file=f)
f.close()
```
|
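A minimal sketch (not from the thread) of the post-processing step that turns these per-event lines into totals, assuming the column layout written by the script above:

```python
from collections import Counter

# Sum per-message counts into totals per (severity, module, category);
# input columns: run lumi event sevLev count module category
totals = Counter()
with open("zb.ews.pyscan.txt") as fin:
    for line in fin:
        run, lumi, event, sev, count, module, category = line.split()
        totals[(sev, module, category)] += int(count)

# print the 20 most frequent message sources
for (sev, module, category), n in totals.most_common(20):
    print(f"{n:8d}  {sev:10s}  {module}  {category}")
```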
@cms-sw/alca-l2 this looks potentially worrisome. You might want to contact ECAL DPG about it. |
@cms-sw/dqm-l2 is it possible to reach out to the Top DQM developers to silence this warning? |
These appear to be from changes that were made in #41972 but don't appear to have been backported. Ultimately I think some consideration was requested about how to get around the changing menu, but I can make the backport if you want. |
I have a question about #41972. |
Actually, #41972 (comment) says it all... I am not sure #41972 was the right approach. |
There are still thousands of LogError emissions per run from the CTPPS unpacker in recent runs (see #41456 (comment)); please comment on the status of the investigations into the data corruption. |
The bits hardcoded into the uGT timing config before the change were never correct for 2023; they did not exist in any 2023 L1 menu. If we reprocess 2023 data now, the currently extant bits should be present in all 2023 menus, I think. Ultimately, I think we were not following the right bits (and, without a backport, are currently not following them online) in the first part of the data that we took this year (they were likely just ignored in the DQM in this case?), but I do not believe this particular test is a make-or-break decider for L1T certification. It is likely just a uGT debugging tool and hardware health check, by the sound of it. I opened #41972 to clean up the warnings, but if we want changes beyond that, I think the requested solution is not under my purview? If it is, I can take a look at what it would require to change anything. |
Based on #41972 (comment), did #41972 create new warnings in other years? |
Very probably. I don't know that it has been reported to me though. |
Is there any L1T DQM developers team you might reach out to in order to suggest or implement a better fix? |
The investigation into the hardware instabilities that are generating (also) these errors has been ongoing since the start of data taking and is not yet over. Some problems have been fixed, some others not (yet). We are struggling to keep as many planes as possible online; this, of course, is being balanced against the disruption that those not-perfectly-working planes cause to the whole CMS DAQ and offline processing. Maybe @AndreaBellora may add some more details, but my understanding so far is that we will very likely have to live with these error messages for the rest of this year's data taking. |
Given their prevalence, and the fact that at this point they are very well known in the CT-PPS community, would it be possible to limit the emission to a few times per run? |
Indeed. I can only add that unfortunately there is currently no way to fully remove this data corruption by acting on the hardware (nor is it strictly detrimental to the detector performance, as far as we know). I'd suggest looking for a way to suppress these messages after they reach a certain number of occurrences per run, for example. |
After (also) @AndreaBellora's comment, I've nothing against limiting the error messages to a "certain number" per run. Whether the data is corrupted or not is spotted in the EventFilter code that issues the LogError message. Practically speaking, I have no idea how to set a maximum number of issued messages. Any help or suggestions? |
@fabferro and @AndreaBellora I would suggest you study the way the SiStrip unpacker deals with this issue:
|
Otherwise, wouldn't it be possible to define a limit parameter in the MessageLogger for this module? I'm no expert on that, but I managed to do something similar in the past with some plugins... |
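A minimal sketch of what such a MessageLogger limit could look like, assuming the messages are issued under a category named CTPPSPixelDataFormatter (the actual category name would need to be checked in the unpacker):

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")  # stand-in for the existing reco process
process.load("FWCore.MessageService.MessageLogger_cfi")

# Cap a noisy category at 10 emissions; the category name
# "CTPPSPixelDataFormatter" is an assumption here.
process.MessageLogger.cerr.CTPPSPixelDataFormatter = cms.untracked.PSet(
    limit = cms.untracked.int32(10)
)
```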
I am not sure you have control over that at the level of the Tier-0 configuration. |
RecoTLR.customisePrompt and customiseExpress can be used
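A sketch of how that could be wired in, reusing the limit from the previous snippet inside a customisation function of the RecoTLR kind (the function body is an assumption, not the existing implementation):

```python
import FWCore.ParameterSet.Config as cms

# Hypothetical customisation in the style of RecoTLR.customisePrompt;
# the category name "CTPPSPixelDataFormatter" is again an assumption.
def customisePrompt(process):
    if hasattr(process, "MessageLogger"):
        process.MessageLogger.cerr.CTPPSPixelDataFormatter = cms.untracked.PSet(
            limit = cms.untracked.int32(10)
        )
    return process
```

This would throttle the messages only in prompt processing while leaving other workflows untouched.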
But wouldn't these be needed in reprocessing as well? In that case, I'd configure limited emission of errors by default and enable full emission elsewhere for tests. Similar to the tracker, it may be more practical to produce error objects and rely on analysis of these data products instead of text log files. |
Here are some details from the Errors (including average frequency per 1000 events):
FailedPropagation is like:
seems like a candidate for a warning.
Warnings (including average frequency per 1000 events, listing only those with a rate per 1K events above 0.1):
|
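As a side note, the per-1000-events rates quoted in these lists are simple arithmetic; a worked example, assuming the roughly 25K events per skim mentioned below and the 51223 L1T LogError total quoted earlier:

```python
# Average frequency per 1000 events for one message source (worked example;
# 51223 and 25000 are counts quoted elsewhere in this thread).
n_events = 25000
count = 51223
rate_per_1k = 1000.0 * count / n_events
print(f"{rate_per_1k:.1f} per 1000 events")  # about 2048.9
```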
TopMonitor is solved in #42808 |
This may be just for taxonomy, but it could be used as a basis for investigating the corresponding modules.
I was looking at message logger details in 2023 data for run 366498
in Muon0 and ZeroBias LogErrorMonitor skim files
/store/data/Run2023B/Muon0/USER/LogErrorMonitor-PromptReco-v1/000/366/498/00000/0e204272-49fb-42de-a373-65747ff240b4.root
and /store/data/Run2023B/ZeroBias/USER/LogErrorMonitor-PromptReco-v1/000/366/498/00000/3f9f69f5-7933-47bb-a5a4-d1f9ffbf7331.root
have about 25K events (the run is fairly short)
Warnings
Errors
plain-text printout in FWLite with (later post-processed to get totals):
Full outputs with
run lumi event sevLev count module category
are available for Muon0 and ZeroBias. These exclude MemoryCheck.
an example log from one job in ZeroBias can be found here
an example of Muon0 job log is here