Implement support for concurrent runs in the Framework #38801
Conversation
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31156
|
A new Pull Request was created by @wddgit (W. David Dagenhart) for master. It involves the following packages:
@smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test |
-1 Failed Tests: ClangBuild
Clang Build: I found compilation warning while trying to compile with clang. Command used:
See details on the summary page. |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31160
|
Pull request #38801 was updated. @smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please check and sign again. |
please test |
At least at first glance, the other 34 relVal failures look unrelated to this PR. |
Thanks David. I found that crash in two jobs so far (over all the flavors of two IBs):
- el8_amd64_gcc10/CMSSW_12_6_CLANG_X_2022-10-04-2300 workflow 10808.0 step 6 (other threads were just waiting for work)
- el8_amd64_gcc10/CMSSW_12_6_X_2022-10-05-1100 workflow 139.005 step 3 (the other threads are just waiting for work) |
I was finally able to catch the (correct) exception in gdb, here is the stack trace (for relevant threads)
|
The exception would be thrown from cmssw/CondCore/CondDB/src/DbConnectionString.cc, lines 52 to 53 (at e572ac7)
(although this detail probably doesn't matter much towards fixing the problem) |
I have prepared a revert PR just in case the issue reported in #38801 (comment) cannot get fixed quickly and we want to cut 12_6_0_pre3 |
Full stack trace of #38801 (comment) at the request of @dan131riley
|
A segfault in el9_amd64_gcc11/CMSSW_12_6_X_2022-10-05-2300 workflow 136.851 step 2 (other TBB threads are in …).
Another one in slc7_amd64_gcc10/CMSSW_12_6_X_2022-10-05-2300 workflow 516.0 step 1
|
From UBSAN:
followed by
and finally
|
Thanks Dan! The crash occurs on cmssw/FWCore/Framework/src/EventProcessor.cc, line 2274 (at e120f54).
The … |
I see cmssw/FWCore/Framework/src/EventProcessor.cc, line 1380 (at e120f54).
So somehow … |
I think the problem is that beginLumiAsync (which calls handleNextEventForStreamAsync) is started at the same time as streamBeginRunAsync, and there is a race to see which puts values into … |
Should cmssw/FWCore/Framework/src/EventProcessor.cc, lines 1315 to 1319 (at e120f54) …
(handleNextItemAfterMergingRunEntries() can call beginLumiAsync(), and streamBeginRunAsync() is queued a few lines below)
|
I think maybe Chris is correct (not 100% sure, but probably). My first thought is to add a chain with the loop adding things to streamQueuesInserter_ in the first link of the chain and the call to handleNextItemAfterMergingRunEntries in the second link of the chain. The problem with that is that in the single-threaded case streamBeginRun does not run before globalBeginLumi. That is not actually incorrect by the multi-threading design, but it is a change in behavior and at the least will break some single-threaded unit tests (which are fixable...). Still thinking. |
@Dr15Jones @makortel @dan131riley Below is a proposed fix. What do you think? Core tests pass. Changing the order so that handleNextItemAfterMergingRunEntries is … Mildly annoying that we have to pause here, but it is much better than it was
|
Thanks David, I think your proposal makes sense (and I can't think of how to do it better). I have one question though. Should … |
If there is an exception, we want handleNextItemAfterMergingRunEntries() to be called. That function will notice an exception was thrown and saved in the WaitingTaskHolder. In that case, it will not call beginRunAsync; instead it will call endRunAsync, which will try to cleanly shut down everything, calling stream and global end run and eventually closing the file... This morning I'll generate the new PR with the commit from #38801 and the two bug fixes. |
Actually, I take it back, it will only call global end run because stream end run was never called. |
Sorry for the noise, I meant because stream begin run was never called. |
I'm still reviewing David's proposal for the fix. However, I think we might have another difficulty. It seems to me that if we had a failure while doing … |
@Dr15Jones It is an interesting question. If there is an exception in streamBeginRun on one stream, then on other streams streamBeginLumi might or might not already have run and completed. I think that is the desired behavior. Events might or might not have already been processed in that lumi. I think that is the desired behavior. The interesting question is whether we want streamBeginLumi to check for an exception and bail out, or whether we want consistency and run all streamBeginLumi's, since some of them might have run already. Just reading the code, I think you are right that currently it will run all the streamBeginLumi's. It will not stop until it hits an event after the streamBeginRun exception occurs (which might not be the first event in the lumi; other streams might even finish all the events). I will run some tests to double check the behavior is what I think it is. I think I did that already, but I will try them again. |
@Dr15Jones The modified PR is ready. Should I submit it and we move the discussion there or would you rather I make some modifications to the fix before I submit the new PR? |
@wddgit |
I concur |
PR description:
Implement support for concurrent runs in the Framework.
The design is, as much as possible, analogous to what was done
to implement concurrent luminosity blocks, although
there are unavoidable differences.
One can configure how many concurrent runs are allowed.
The default is one. With that setting, almost nothing externally
visible should change in the behavior of cmsRun. There
are significant and complex changes in the Framework
implementation to support this new ability.
Even with the number of concurrent runs configured to 1, the
Framework will be able to execute some transitions concurrently
that could not be executed concurrently before:
If the number of concurrent runs is configured greater than one,
then global end run can run concurrently with any transitions from
another run and global begin run can run concurrently with any transitions
from another run except global begin run and global begin lumi.
This pull request does NOT upgrade modules and services outside
the Framework to support concurrent runs. We expect many of them
will fail if the number of concurrent runs is configured to be
more than one in an existing production configuration. We have not
surveyed existing code to see which modules and services cannot
support concurrent runs. Most should be OK because they do not
depend on run transitions. But, for example, a module designed
to create per-run histograms might have problems with concurrent
runs.
One configures "numberOfConcurrentRuns" in the top-level
options parameter set. If it is 0 or greater than the number
of concurrent lumis, then it will be reset to equal the
number of concurrent lumis.
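As a rough illustration (not taken from the PR itself), a minimal configuration sketch; it assumes the standard top-level options PSet and the existing numberOfConcurrentLuminosityBlocks parameter, with numberOfConcurrentRuns as introduced by this PR:

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("TEST")

# Allow two runs in flight. The value is capped by the number of
# concurrent luminosity blocks, so that setting must be at least as large.
process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(4),
    numberOfStreams = cms.untracked.uint32(4),
    numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2),
    numberOfConcurrentRuns = cms.untracked.uint32(2)
)
```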
If an EventSetup IOV changes at a run boundary, then one also
would need to configure concurrent IOVs for that record to two
to actually have the runs on both sides of that run boundary process
concurrently. Without that, cmsRun would execute properly, but the
IOVs would block concurrent execution. In addition, it is technically
possible for the sequence of transitions beginLumi, endRun,
beginRun, and beginLumi to all have different IOVs. An EventSetup
record with such IOVs would need to be configured to allow
4 concurrent IOVs to process both runs concurrently across
such a run boundary.
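For illustration only, a sketch of how the concurrent-IOV limit might be raised for a particular record, assuming the pre-existing eventSetup options (numberOfConcurrentIOVs and forceNumberOfConcurrentIOVs); the record name below is a placeholder:

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("TEST")

process.options = cms.untracked.PSet(
    numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2),
    numberOfConcurrentRuns = cms.untracked.uint32(2),
    eventSetup = cms.untracked.PSet(
        # Default limit on concurrent IOVs for EventSetup records
        numberOfConcurrentIOVs = cms.untracked.uint32(2),
        # Per-record override; "MyExampleRecord" is a hypothetical record name.
        # A record whose IOV changes across the beginLumi, endRun, beginRun,
        # beginLumi sequence at a run boundary would need 4 here.
        forceNumberOfConcurrentIOVs = cms.untracked.PSet(
            MyExampleRecord = cms.untracked.uint32(4)
        )
    )
)
```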
It is the design intent that the rest of the changes are transparent
to the user (beyond what is discussed above).
PR validation:
There are a few new unit tests. Existing unit tests pass. In fact,
existing unit tests covered most of the features one might be
concerned about with this pull request. Of the new tests, these two
configurations are the most significant:
FWCore/Integration/test/testConcurrentIOVsAndRuns_cfg.py
FWCore/Integration/test/testConcurrentIOVsAndRunsRead_cfg.py
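As orientation only (not the contents of those test files), a self-contained configuration in the same spirit, using EmptySource to generate several short runs so that transitions from different runs can overlap; the source parameter names are assumed from existing framework test configurations:

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("CONCURRENTRUNSTEST")

process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(4),
    numberOfStreams = cms.untracked.uint32(4),
    numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2),
    numberOfConcurrentRuns = cms.untracked.uint32(2)
)

process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(40))

# Four runs of 10 events each, two lumis per run.
process.source = cms.Source("EmptySource",
    numberEventsInRun = cms.untracked.uint32(10),
    numberEventsInLuminosityBlock = cms.untracked.uint32(5)
)
```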