Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional patch based on 13_0_3 for avoiding both crashes and memory issues in prompt reco #41489

Closed
mandrenguyen opened this issue May 2, 2023 · 20 comments

Comments

@mandrenguyen
Copy link
Contributor

As just discussed in the ORP, and in light of the ongoing memory issue in prompt reconstruction, one possibility is to build a new release based on 13_0_3_patch1, including any bug fixes needed to avoid the crashes in prompt reco reported here:
#41397
#41442

I'm aware of the following two PRs:
#41473
#41454

Is anyone aware of anything else?

@cmsbuild
Copy link
Contributor

cmsbuild commented May 2, 2023

A new Issue was created by @mandrenguyen Matthew Nguyen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

makortel commented May 2, 2023

assign reconstruction, operations

@cmsbuild
Copy link
Contributor

cmsbuild commented May 2, 2023

New categories assigned: operations,reconstruction

@fabiocos,@davidlange6,@rappoccio,@mandrenguyen,@clacaputo,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks

@tvami
Copy link
Contributor

tvami commented May 2, 2023

One more PR that I think we should add is
#41484
If we add it now, before we collected any collision data under 2023C, that means there is no need for proc version change.

@emanueleusai
Copy link
Member

While not specifically related to the memory issues, could we please add this PR needed for nanoDQMIO production?
#41400

@epalencia
Copy link
Contributor

This is the L1 related PR mentiones by @missirol : #41313

@epalencia
Copy link
Contributor

Currently discussing within L1T (@eyigitba , @elfontan , @aloeliger ) if it is really needed here

@malbouis
Copy link
Contributor

malbouis commented May 2, 2023

We would like to have this PR that introduces the new pp Scenario with updated HB thresholds to be used for Physics data-taking: PR #41368

@youyingli
Copy link
Contributor

Not relative to the memory issue, could we also add #41146 #41167 #41311 for 2023 promptReco skim production? I'm not sure if it's suitable here.

@missirol
Copy link
Contributor

missirol commented May 3, 2023

#41402 by @cms-sw/l1-l2 is not an urgent fix, but I think it would be useful to include it in this upcoming patch. (I should have included it in 13_0_3_patch1)

#41402 is related to the remaining HLT crashes currently happening online. It replaces one the crashes [1] with a proper exception message. It will not change the number of failed jobs, but it will help debugging the problem.

[1]

message: [2] Calling method for module L1TRawToDigi/'hltGtStage2Digis'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)

@missirol
Copy link
Contributor

missirol commented May 3, 2023

It will not change the number of failed jobs, but it will help debugging the problem.

There is one more reason to add #41402 in this next patch release.

@fwyzard explained to me that, since #41402 provides a proper exception for the current HLT crashes, we might be able to configure HLT to skip the problematic events (as opposed to crashing and losing all the events of that job, most of which do not crash). This could be done using process.options.skipEvent as documented in SWGuideEdmExceptionUse#Framework_Exception_Handling.

So, it is possible that having #41402 in the next patch release could give us a way to reduce HLT crashes.

@perrotta
Copy link
Contributor

perrotta commented May 3, 2023

I have prepared the PR #41497 on the branch CMSSW_13_0_3_patchHLT which includes the PRs that I believe more urgent to be included in the proposed patch release:

The other PRs listed in this issue cannot, in my opinion, be considered that urgent; or alternatively they imply some larger amount of updates than what is supposed for a quick fix that allows running at P5 during the (hopefully) few days needed to find a solution for the so far blocking issue #41457

If any one of the proposers of the other PRs listed in this issue believes that some more should be really and urgently added, please comment here: we can always update that PR

@saumyaphor4252
Copy link
Contributor

@perrotta
Just to mention here that as pointed out by some recent investigations (See Comment1 and Comment2 in #41457) that the memory issue may be popping up in the replays due to the choice of the run, we have launched another replay dmwm/T0#4812 to test CMSSW_13_0_5_patch1 with different run numbers and see if the memory issue still persists.

In case the replay turns out to be successful, we might consider switching to CMSSW_13_0_5_patch1, and CMSSW_13_0_3_patch2 may not be required in that case.

@perrotta
Copy link
Contributor

perrotta commented May 3, 2023

Thank you @saumyaphor4252 , that would be great!
Let keep this PR as a possible "life jacket": but the main and preferred option would certainly be to switch to the main 13_0_5_patchSomething instead

@perrotta
Copy link
Contributor

perrotta commented May 3, 2023

We would like to have this PR that introduces the new pp Scenario with updated HB thresholds to be used for Physics data-taking: PR #41368

Added also #41368: in the end, it only defines a new era, but it does not switch to it yet. It will be up to the users to decide whether to use it or not.

@perrotta
Copy link
Contributor

perrotta commented May 3, 2023

While not specifically related to the memory issues, could we please add this PR needed for nanoDQMIO production? #41400

@emanueleusai DQM PRs can be added by hand to the online DQM release: let do so, if needed, in order not to add too many updates to this temporary patch

@emanueleusai
Copy link
Member

@perrotta this is not a PR for DQM online, it is needed for NanoDQMIO production, which happens at T0. We need this in a release for it to be used in official production on NanoDQMIO.

@emanueleusai
Copy link
Member

The reason for the urgency of #41400 is that we need to start the official production of NanoDQMIO for data certification.

@perrotta
Copy link
Contributor

perrotta commented May 3, 2023

The reason for the urgency of #41400 is that we need to start the official production of NanoDQMIO for data certification.

Ok, added also the backport of #41400 to #41497
Hopefully that branch will not have to be merged, though

@mandrenguyen
Copy link
Contributor Author

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests