In Concurrent*Filters, upgrade to allow a stream to skip a lumi #43816
Conversation
cms-bot internal usage

+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43816/38628
A new Pull Request was created by @wddgit (W. David Dagenhart) for master. It involves the following packages:
@bbilin, @menglu21, @GurpreetSinghChahal, @SiewYan, @cmsbuild, @alberto-sanchez, @mkirsano can you please review it and eventually sign? Thanks. cms-bot commands are listed here

enable threading

please test
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-644838/37118/summary.html Comparison Summary
@makortel @Dr15Jones Since this is more a technical concurrency change than a real generator issue, it might be worth one or both of you reviewing this one in addition to the generator expert. It involves only about 20 lines of code...
@@ -158,12 +158,13 @@ namespace edm {
    void initLumi(gen::GenStreamCache<HAD, DEC>* cache, LuminosityBlock const& index, EventSetup const& es) const;
    ParameterSet config_;
    mutable std::atomic<gen::GenStreamCache<HAD, DEC>*> useInLumi_{nullptr};
    mutable std::atomic<unsigned long long> greatestNStreamEndLumis_{0};
    mutable std::atomic<unsigned long long> nextNGlobalBeginLumis_{1};
    mutable std::atomic<bool> streamEndRunComplete_{true};
    // The next two data members are thread safe and can be safely mutable because
Should this become

- // The next two data members are thread safe and can be safely mutable because
+ // The next three data members are thread safe and can be safely mutable because

?
Yes. Fixed the comment in both files. Thanks. I will push the changes momentarily.
if (greatestNStreamEndLumis_.compare_exchange_strong(expected, streamCachePtr->nStreamEndLumis_)) {
unsigned long long expected = this->luminosityBlockCache(lumi.index())->nGlobalBeginLumis_;
unsigned long long nextValue = expected + 1;
if (nextNGlobalBeginLumis_.compare_exchange_strong(expected, nextValue)) {
  streamEndRunComplete_ = false;
  useInLumi_ = streamCachePtr;
}
Just to make sure I understood, is the idea that the first streamEndLuminosityBlock (of a Lumi) to arrive here would change the streamEndRunComplete_ and useInLumi_?
Correct
I added a comment at this point in both files.
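To spell out the pattern under discussion, here is a minimal standalone sketch in plain C++ (hypothetical names and signature, simplified from the snippet above; not the actual CMSSW code):

#include <atomic>

struct StreamCache {};  // stands in for gen::GenStreamCache<HAD, DEC>

std::atomic<unsigned long long> nextNGlobalBeginLumis{1};
std::atomic<StreamCache*> useInLumi{nullptr};
std::atomic<bool> streamEndRunComplete{true};

// Called from each stream's end-lumi transition. Every stream finishing a
// given lumi computes the same 'expected' value, so the compare-exchange
// succeeds exactly once per lumi: the first caller publishes its cache for
// the next global begin-lumi transition, and all later callers see the
// compare-exchange fail and do nothing.
void onStreamEndLuminosityBlock(StreamCache* streamCachePtr,
                                unsigned long long lumisNGlobalBeginLumis) {
  unsigned long long expected = lumisNGlobalBeginLumis;
  unsigned long long nextValue = expected + 1;
  if (nextNGlobalBeginLumis.compare_exchange_strong(expected, nextValue)) {
    streamEndRunComplete = false;
    useInLumi = streamCachePtr;
  }
}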
@@ -158,12 +158,13 @@ namespace edm {
    void initLumi(gen::GenStreamCache<HAD, DEC>* cache, LuminosityBlock const& index, EventSetup const& es) const;
    ParameterSet config_;
    mutable std::atomic<gen::GenStreamCache<HAD, DEC>*> useInLumi_{nullptr};
    mutable std::atomic<unsigned long long> greatestNStreamEndLumis_{0};
    mutable std::atomic<unsigned long long> nextNGlobalBeginLumis_{1};
    mutable std::atomic<bool> streamEndRunComplete_{true};
    // The next two data members are thread safe and can be safely mutable because
    // they are only modified/read in globalBeginRun and globalBeginLuminosityBlock.
I'm beginning to wonder if this is actually a safe assumption going forward. Previously, because every stream had to see every luminosity block, we could never have concurrent global LuminosityBlock and/or Run transitions: the global transition had to fully finish, and the stream transitions had to start, before we would ask the Source what the next transition is.
But now that we allow skipping a stream begin LuminosityBlock transition, it might be possible to change the scheduler to start the global begin LuminosityBlock transition, check the source, see there are no Events in that LuminosityBlock, and then start another global begin LuminosityBlock transition concurrently.
I came to this conclusion after re-reading our global module documentation
https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkGlobalModuleInterface#edm_LuminosityBlockCacheT
where no guarantees are made that concurrent calls to global* transitions will not happen (except that globalBegin always finishes before streamBegin happens, and the reverse for the End transitions, as well as some guarantees related to the Summary calls).
My first thought on this comment is that we should go ahead with this PR as is. This PR is relatively small and accomplishes the immediate goal with minimal changes.
If we decide to add the ability to handle concurrent global begin transitions, then we should handle that as a separate effort. Supporting that additional concurrency in this module would require significantly more changes than the 20 or so lines changed here. For example, the useInLumi_ pointer would not work as is, because multiple global lumi transitions could be using it concurrently. So it would be significantly more development work to allow this module to support concurrent global begin lumi transitions.
And I think that assumption is likely built into other things as well. I know that while implementing run concurrency in the EventProcessor I was relying on it. Just figuring out what we would need to modify would take some non-trivial effort.
I also think we would need to consider whether that additional concurrency gives us a performance improvement significant enough to justify the effort to develop it.
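To make the useInLumi_ point concrete, a small sketch (assumed shape with a hypothetical claimCacheForLumi helper; not the actual module code) of why a single published pointer can serve only one global begin-lumi transition at a time:

#include <atomic>

struct GenStreamCache {};  // generator state owned by one stream

std::atomic<GenStreamCache*> useInLumi{nullptr};

// A global begin-lumi transition claims the published stream cache by
// swapping the slot back to nullptr. If two such transitions ran
// concurrently, only one would receive the pointer; the other would get
// nullptr and have no stream cache to initialize its lumi with. Supporting
// concurrent global begin-lumi transitions would therefore need per-lumi
// bookkeeping instead of this single slot.
GenStreamCache* claimCacheForLumi() { return useInLumi.exchange(nullptr); }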
At the least I think the code should document its assumptions about what it expects the framework's behavior to be.
I added a comment on top of that section of six member variables in a row and removed the two-line comment that started this discussion. Let me know if there is a wording that you would like better. Pushed and squashed.
Note that this module forces globalBeginLuminosityBlock to wait until one stream has proceeded all the way through the end stream lumi transition for the previous lumi before it starts...
Also there is a long comment at the top of the file describing the unusual behavior, why it is necessary, and a couple of alternatives.
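A rough sketch of that forced ordering (assumed shape, hypothetical helper name; the real logic lives in the long comment and code mentioned above): globalBeginLuminosityBlock effectively spins until some stream's end-lumi transition for the previous lumi has published its cache:

#include <atomic>
#include <thread>

struct GenStreamCache {};

std::atomic<GenStreamCache*> useInLumi{nullptr};

// globalBeginLuminosityBlock blocks here until one stream has run its
// end-lumi transition for the previous lumi and published its cache pointer.
GenStreamCache* waitForPreviousLumiStreamCache() {
  GenStreamCache* cache = nullptr;
  while ((cache = useInLumi.load(std::memory_order_acquire)) == nullptr) {
    std::this_thread::yield();  // previous lumi's stream-end has not run yet
  }
  return cache;
}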
Force-pushed from e1ace0a to 249f066
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43816/38688

+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43816/38690
Pull request #43816 was updated. @mkirsano, @SiewYan, @GurpreetSinghChahal, @menglu21, @bbilin, @cmsbuild, @alberto-sanchez can you please check and sign again.

please test
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-644838/37194/summary.html Comparison Summary
Milestone for this pull request has been moved to CMSSW_14_1_X. Please open a backport if it should also go into CMSSW_14_0_X.

+1

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

+1
PR description:
Improve ConcurrentHadronizerFilter and ConcurrentGeneratorFilter to handle the new Framework behavior where a stream can skip a lumi if, before that stream starts the lumi, at least one other stream has already processed it and other streams have already started processing all the events in the lumi.
PR #43522 implements the new Framework behavior. It is waiting for this and other pull requests that prepare modules outside the Framework for this behavior to be merged.
PR validation:
Relies on existing tests. The behavior of the modules should not change. The unit test
TestGeneratorInterfacePythia8ConcurrentGeneratorFilter
is the most relevant because it has multiple lumis. It passes, and I stepped through it with a debugger; the correct things appear to be happening. These modules are also used in many runTheMatrix tests (although usually with one lumi, which is less useful as a test). The changes are almost identical in the two modules.