
Fix multithreading in Geant4 and HitManager #694

Merged: 5 commits, Mar 27, 2023


sethrj
Member

@sethrj sethrj commented Mar 21, 2023

This adds thread safety to the HitCollector and fixes #613 .

After this fix, here's a comparison of 18 GeV $\pi^+$ in the ATLAS TileCal model:
[Figure: 18gev-piplus]

@sethrj sethrj added bug Something isn't working external Dependencies and framework-oriented features labels Mar 21, 2023
@sethrj sethrj requested a review from whokion March 21, 2023 22:18
@sethrj
Member Author

sethrj commented Mar 22, 2023

Sorry, I shouldn't have requested the review yet because this is still in draft and includes changes from #693 that don't need review here.

@whokion
Contributor

whokion commented Mar 22, 2023

No problem. Let me know when it is ready for review.

@sethrj sethrj force-pushed the stream-hitmanager branch from 6bbccf4 to cff8463 Compare March 24, 2023 12:13
@sethrj sethrj requested a review from whokion March 24, 2023 12:16
@sethrj sethrj marked this pull request as ready for review March 24, 2023 12:16
@sethrj sethrj force-pushed the stream-hitmanager branch from cff8463 to 0a0731d Compare March 24, 2023 12:17
@sethrj sethrj requested a review from amandalund March 24, 2023 14:18
@sethrj
Member Author

sethrj commented Mar 24, 2023

@whokion This is ready to review whenever you have a moment. Specifically could you make sure that the way I'm using G4Threading is compatible with what you know about tasking in CMSSW? Given my cursory look (CMSSW changes the thread-local ID in geant to values in [0, N)) it should be compatible. Thanks!
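
To make the compatibility question concrete, here is a minimal sketch of mapping a thread ID onto a stream index under the stated assumption that worker IDs lie in [0, N). The helper name `thread_to_stream` and the `invalid_stream` sentinel are hypothetical; in a real application the ID would come from `G4Threading::G4GetThreadId()`, and negative IDs are assumed here to mean "not a worker thread".

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sentinel meaning "this thread has no stream".
constexpr std::size_t invalid_stream = static_cast<std::size_t>(-1);

// Hypothetical helper: map a Geant4-style thread ID onto a stream index.
// The caller would pass the value of G4Threading::G4GetThreadId().
std::size_t thread_to_stream(int thread_id, std::size_t num_streams)
{
    assert(num_streams > 0);
    if (thread_id < 0)
    {
        // Assumed convention: non-worker threads report a negative ID
        return invalid_stream;
    }
    std::size_t stream = static_cast<std::size_t>(thread_id);
    if (stream >= num_streams)
    {
        // Outside [0, N): no per-stream state was allocated for this ID
        return invalid_stream;
    }
    return stream;
}
```

If a tasking framework reassigns IDs so that they stay inside [0, N), this mapping is unaffected; anything outside that range is rejected rather than silently indexed.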

src/accel/LocalTransporter.cc
src/accel/LocalTransporter.hh
/*!
* Ensure the local hit processor exists, and return it.
*/
HitProcessor& HitManager::get_local_hit_processor()
Contributor

If HitManager is thread-local, the HitProcessor can be a private member of HitManager (related to the earlier comment) instead of a vector<HitProcessor*>. Why is the current approach better?

Member Author

All the major event loop components are "global". Only the incoming state objects are "thread-local" (see footnote 1), and normally their stream IDs should be used to access per-stream state data that (for now) has to live in the shared object. Here, instead of using G4Threading, I could (and probably should) add the StreamId to the StepState so we can access it directly...

Footnotes

  1. I put global and thread-local in quotes above because nothing in Celeritas inherently requires streams and threads to match. Streams could be shared within a thread pool, or you could have multiple streams per CPU thread, etc. Similarly you could have two entirely different event loops running simultaneously within the same application.
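
The "shared object holding per-stream state" pattern can be sketched as follows. This is an illustration only: `SharedManager` and `Processor` are hypothetical stand-ins, not the Celeritas classes, and the sketch assumes that each stream ID is used by at most one thread at a time and that the number of streams is fixed at construction.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for a per-stream worker object (hypothetical).
struct Processor
{
    explicit Processor(std::size_t id) : stream_id(id) {}
    std::size_t stream_id;
};

// Shared across streams; owns one lazily created Processor per stream.
class SharedManager
{
  public:
    explicit SharedManager(std::size_t num_streams) : processors_(num_streams)
    {
        assert(num_streams > 0);
    }

    // Ensure the processor for this stream exists, and return it.
    // No locking: the vector is never resized, and each slot is assumed
    // to be touched only by the thread currently driving that stream.
    Processor& get_local(std::size_t stream_id)
    {
        assert(stream_id < processors_.size());
        std::unique_ptr<Processor>& slot = processors_[stream_id];
        if (!slot)
        {
            slot = std::make_unique<Processor>(stream_id);
        }
        return *slot;
    }

  private:
    std::vector<std::unique_ptr<Processor>> processors_;
};
```

Indexing by an explicit stream ID, rather than by a thread ID queried inside the manager, keeps the shared object independent of how streams are mapped to threads.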

src/accel/detail/HitProcessor.cc
* action manager.
* generating hit objects. It \b must be thread-local because the sensitive
* detectors it stores are thread-local, and additionally Geant4 makes
* assumptions about object allocations that cause crashes if the HitProcessor
Contributor

Each worker has its own event loop. So this is not an assumption but a consequence of Geant4 MT being event-level multithreading.

Member Author

The "assumption" is that objects like G4Navigator, G4Step, and G4Track can only be allocated by a worker thread, and that they're deallocated by the same worker thread. Those objects aren't necessarily associated with a specific event (in the G4 fast simulation some of those are reused across multiple events), nor are they inherently associated with a specific thread (except by the invisible implementation of the thread allocator).
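
One way to make such an allocation assumption visible in code is to record the creating thread and check it before releasing the object. The wrapper below is a hypothetical sketch (the name `ThreadBound` is invented here); it is not how Geant4 or Celeritas implement this, only an illustration of "allocate and deallocate on the same thread" as an explicit invariant.

```cpp
#include <cassert>
#include <memory>
#include <thread>

// Hypothetical wrapper: owns a T that must be freed by the creating thread.
template<class T>
class ThreadBound
{
  public:
    ThreadBound()
        : owner_(std::this_thread::get_id()), value_(std::make_unique<T>())
    {
    }

    // True if called from the thread that constructed this object
    bool on_owner_thread() const
    {
        return std::this_thread::get_id() == owner_;
    }

    T* get() const { return value_.get(); }

    // Release the object; only valid on the creating thread
    void reset()
    {
        assert(this->on_owner_thread());
        value_.reset();
    }

  private:
    std::thread::id owner_;
    std::unique_ptr<T> value_;
};
```

With a check like this, a shared container that outlives its worker threads fails loudly in debug builds instead of crashing inside the allocator.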

@@ -230,26 +273,21 @@ void HitProcessor::operator()(DetectorStepOutput const& out) const

if (navi_)
{
CELER_ASSERT(out.detector[i] < detector_volumes_.size());
CELER_ASSERT(out.detector[i] < detectors_.size());
bool success = this->update_touchable(
Contributor

This seems very expensive. Do we really need this check?

Member Author

The touchable update is necessary for detectors like tilecal and hgcal that use the detailed navigation state to determine a subdetector identifier. Since their sensitive detectors require the navigator to be initialized and correct, we don't really have a choice if we're calling back to those routines. My hope is that for our initial implementation this won't completely kill the performance, and doing it this way will give us a "baseline" performance number to justify the effort of implementing the actual detector logic on GPU.

Contributor

I still do not understand why the HitProcessor is responsible for updating the navigation state (touchable), since a hit is processed at the end of stepping (only once per step). Isn't the subdetector a physical/logical volume (so it can be a tracking volume with a boundary)? Isn't it an independent readout channel? I guess there may be some confusion between what simulation information needs to be known (for an MC "study") and what needs to be recorded as a hit (to be used for digitization).

Member Author

I might be misunderstanding what the confusion is too, so let me try to restate from a high level what's going on.

  • At the end of an event on a local thread, the LocalTransporter steps through all the offloaded EM tracks until they die.
  • Each step calls the (shared) explicit actions using the shared "params" state and the (thread-local) track states.
  • The celeritas::StepCollector gathers the position and volume ID at the pre-step and the energy deposition at post-step (though the exact selection of outputs can be changed by the user).
  • At post-step, the (shared) HitManager takes the thread-local gathered step data, and calls the thread-local HitProcessor.
  • The HitProcessor loops through all the hits, translates them into thread-local G4Step objects, uses the VecGeom logical volume ID to look up the G4VSensitiveDetector, and calls sd->Hit(step). If the SD needs to query step->GetPreStepPoint()->GetTouchableHandle(), then before calling Hit we use a thread-local (i.e., owned by HitProcessor) navigator to update the thread-local touchable for the thread-local step.

When I said "subdetector" I meant "readout channel": I guess HGCal is the subdetector, and the code needs the navigation state and volume information to figure out what channel it's in.
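
The last step of the list above (loop over hits, look up the detector, optionally refresh the navigation state, call the detector) can be sketched with simplified stand-in types. Everything here is hypothetical: `Step`, `SensitiveDetector`, `StepOutput`, `Processor`, `ChannelSum`, and `locate_channel` only mimic the roles of the Geant4 and Celeritas classes, and the "touchable" is reduced to a single channel index.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Simplified stand-ins, not the Geant4 or Celeritas classes.
struct Step
{
    double energy_deposition = 0;
    int channel = -1;  // stands in for the touchable / navigation state
};

struct SensitiveDetector
{
    virtual ~SensitiveDetector() = default;
    virtual void Hit(Step const& step) = 0;
};

// Step data for one batch of tracks, stored as parallel arrays.
struct StepOutput
{
    std::vector<std::size_t> detector;  // index into the detector list
    std::vector<double> energy_deposition;
    std::vector<double> position;  // 1D position, for brevity
};

class Processor
{
  public:
    using Locator = int (*)(double);

    // A null locator means no detector needs the navigation state.
    Processor(std::vector<SensitiveDetector*> detectors, Locator locate)
        : detectors_(std::move(detectors)), locate_(locate)
    {
    }

    void operator()(StepOutput const& out)
    {
        assert(out.energy_deposition.size() == out.detector.size());
        assert(out.position.size() == out.detector.size());
        for (std::size_t i = 0; i != out.detector.size(); ++i)
        {
            assert(out.detector[i] < detectors_.size());
            step_.energy_deposition = out.energy_deposition[i];
            if (locate_)
            {
                // Comparatively expensive: re-locate the point for this hit
                step_.channel = locate_(out.position[i]);
            }
            detectors_[out.detector[i]]->Hit(step_);
        }
    }

  private:
    std::vector<SensitiveDetector*> detectors_;
    Locator locate_;
    Step step_;  // reused for every hit; local to this processor
};

// Example detector: sums energy per readout channel.
struct ChannelSum : SensitiveDetector
{
    std::vector<double> total = std::vector<double>(4, 0.0);
    void Hit(Step const& step) override
    {
        assert(step.channel >= 0 && step.channel < 4);
        total[static_cast<std::size_t>(step.channel)] += step.energy_deposition;
    }
};

// Example locator: unit-width channels along the axis.
int locate_channel(double x)
{
    return static_cast<int>(x);
}
```

The detector only sees a fully populated step, so detector code that derives a channel from the navigation state keeps working unchanged; the cost is one locate call per hit whenever a locator is set.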

@whokion
Contributor

whokion commented Mar 25, 2023

Can you add a code snippet to demo-geant-integration showing how to access the collected hit information and retrieve something from it (for example, the total energy deposition or the size of the hit collection) at the end of EventAction (or after each Flush?)? For CMS integration, we need to access the hit (StepInfo) collection from the GPU at least at the event level, to merge it with hits from the CPU and send them to the next stage of the simulation pipeline (i.e., digitization) or write them out to disk in event units.

@sethrj
Member Author

sethrj commented Mar 25, 2023

@whokion For our initial CMS implementation, we're not going to merge the hits on the GPU; we're using the HitProcessor to call back to the existing sensitive detectors. That way we don't have to reimplement the complex hit processing in CMSSW, and because we can call to the other SDs the same way, we can track EM particles everywhere rather than in a particular region. The existing demo-geant-integration does this too: the sensitive detector is getting hits from the GPU via the HitProcessor.

I am working on implementing a GPU-based calorimeter which will actually do the accumulation on GPU—is this what you're asking for?

@whokion
Contributor

whokion commented Mar 25, 2023

I am working on implementing a GPU-based calorimeter which will actually do the accumulation on GPU—is this what you're asking for?

No, the current implementation/workflow is good enough for a demonstration/integration. What I am asking for is to pipeline the output from the GPU (to create a collection of equivalent G4Steps, which will be merged through CPU sensitive detectors into the final hit data) into the demo-geant-integration workflow once the transporter completes stepping for the set of EM tracks on the GPU (i.e., after Flush is called). At the least, an example of how to access the output hit information from the GPU in EventAction or TrackingAction would be good enough.

@sethrj
Member Author

sethrj commented Mar 25, 2023

But this is already happening automatically through the HitManager during the stepping loop. The current demo app will have Steps sent back from GPU.

@whokion
Contributor

whokion commented Mar 26, 2023

Then I definitely missed something. So, are we calling demo_geant::SensitiveDetector::ProcessHits(G4Step* step, G4TouchableHistory*) with converted steps from the GPU? In demo-geant-integration, where is the code that receives the output of HitManager::selection(), converts StepSelection::points to G4Steps, and processes them with SensitiveDetector::ProcessHits inside a loop over the number of hits from the GPU?

@sethrj
Member Author

sethrj commented Mar 26, 2023

We are. There are a lot of moving pieces, so I'll draw up a diagram of who's constructing what and how the tracks and hits move between CPU and GPU 😄

@sethrj
Member Author

sethrj commented Mar 27, 2023

@whokion Here's my attempt at a UML diagram, with background colors indicating the different libraries/layers of code, and a few classes that operate on device in green.

[Figure: hit-processor-uml]

The user tracking action (local) sends hits via G4Track to the LocalTransporter and the user event action (local) flushes it, which initiates the ActionSequence (stepping loop), managed by the Stepper (local). The ActionSequence is just a topologically sorted graph of ExplicitActionInterface (shared), each of which usually represents a single kernel.

During the stepping loop (which uses the stream-local CoreStateData and shared CoreParamsData) the StepGatherAction (shared) fills up a StepStateData (local) with track properties like position, logical volume, etc. At the end of the step, all of the StepInterface classes (shared) originally registered with the StepCollector (shared) are called with the StepStateData. The HitManager (shared) is a StepInterface class that has stream-local HitProcessor instances. The HitProcessor copies the GPU hits to CPU DetectorStepOutput (local) hits, converts each hit into a G4Step, and calls the associated (local) G4VSensitiveDetector.
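
The registration-and-callback structure in the paragraph above can be sketched as follows. The names `StepState`, `StepCallback`, `Collector`, and `EnergyTally` are hypothetical stand-ins chosen for this illustration, and the sketch assumes the state carries its own stream ID so that a shared callback can find its per-stream data.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Per-stream step data (stand-in type; field names are assumptions).
struct StepState
{
    std::size_t stream_id = 0;
    std::vector<double> energy_deposition;
};

// Shared callback interface, invoked at the end of each step iteration.
class StepCallback
{
  public:
    virtual ~StepCallback() = default;
    virtual void process_steps(StepState const& state) = 0;
};

// Shared collector: owns the registered callbacks, holds no stream data.
class Collector
{
  public:
    void add(std::shared_ptr<StepCallback> cb)
    {
        assert(cb);
        callbacks_.push_back(std::move(cb));
    }

    // Forward the stream-local state to every registered callback
    void operator()(StepState const& state) const
    {
        for (auto const& cb : callbacks_)
        {
            cb->process_steps(state);
        }
    }

  private:
    std::vector<std::shared_ptr<StepCallback>> callbacks_;
};

// Shared callback that keeps one accumulator per stream. Each stream
// writes only to its own element, so no locking is used here.
class EnergyTally final : public StepCallback
{
  public:
    explicit EnergyTally(std::size_t num_streams) : totals_(num_streams, 0.0)
    {
    }

    void process_steps(StepState const& state) override
    {
        assert(state.stream_id < totals_.size());
        for (double e : state.energy_deposition)
        {
            totals_[state.stream_id] += e;
        }
    }

    double total(std::size_t stream_id) const
    {
        assert(stream_id < totals_.size());
        return totals_[stream_id];
    }

  private:
    std::vector<double> totals_;
};
```

The collector and the callbacks are constructed once and shared; only the `StepState` argument differs between streams, which is the same split between "params" and "state" described in the text.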

@sethrj sethrj requested a review from whokion March 27, 2023 17:26
Contributor

@amandalund amandalund left a comment


Nice! This looks good to me @sethrj.

@sethrj sethrj merged commit a0e5a90 into celeritas-project:develop Mar 27, 2023
@sethrj sethrj deleted the stream-hitmanager branch March 27, 2023 17:28
@whokion
Contributor

whokion commented Mar 27, 2023

The user tracking action (local) sends hits via G4Track to the LocalTransporter and the user event action (local) flushes it, which initiates the ActionSequence (stepping loop), managed by the Stepper (local).

Technically, the tracking action actually flushes as well (through LocalTransporter::Push(G4Track), then ::Flush()), and the event action flushes the remaining tracks at the end of the event.
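
A minimal sketch of this push-then-flush behavior, assuming that a push triggers a flush once a buffer threshold is reached. The class `Transporter`, the `auto_flush` threshold, and the counters are hypothetical; the actual transport loop is replaced by bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for an offloaded track (hypothetical).
struct Track
{
    double energy;
};

class Transporter
{
  public:
    // auto_flush: number of buffered tracks that triggers a flush
    explicit Transporter(std::size_t auto_flush) : auto_flush_(auto_flush)
    {
        assert(auto_flush_ > 0);
    }

    // Called per track (the role of the tracking action)
    void push(Track const& track)
    {
        buffer_.push_back(track);
        if (buffer_.size() >= auto_flush_)
        {
            this->flush();
        }
    }

    // Called at end of event to transport whatever is left
    void flush()
    {
        if (buffer_.empty())
        {
            return;
        }
        // The stepping loop would run here until all tracks die
        num_transported_ += buffer_.size();
        ++num_flushes_;
        buffer_.clear();
    }

    std::size_t num_buffered() const { return buffer_.size(); }
    std::size_t num_flushes() const { return num_flushes_; }
    std::size_t num_transported() const { return num_transported_; }

  private:
    std::size_t auto_flush_;
    std::vector<Track> buffer_;
    std::size_t num_flushes_ = 0;
    std::size_t num_transported_ = 0;
};
```

With a threshold of three, pushing four tracks causes one flush from within `push` and leaves one track for the end-of-event flush.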

Nevertheless, I am still not fully convinced why HitManager is globally shared while the Stepper/LocalTransporter are local, even though HitManager behaves somewhat like the Geant4 split-class mechanism. I find nothing wrong with it either; it may be a design choice. We may learn more about how to directly use the HitManager/HitProcessor for the CMS integration as necessary.

Anyway, the diagram helps a lot and enlightens inter-connectivity and workflow with specific ownership/relations. Thanks for the nice work.

@sethrj
Member Author

sethrj commented Mar 27, 2023

OK @whokion a minor update that I think improves it even further:

[Figure: hit-processor-uml]

For clarity I switched StepCollector and StepStateData and fixed the ownership relationship between HitProcessor and DetectorStepOutput. The dotted region contains the classes that are shared across threads/tasks/streams because they operate only on the problem setup/parameter data rather than any state information. The HitManager is a StepInterface-derived class that is managed by the (shared) step gather action, so it has to be shared. The HitProcessor operates on the state data, so it has to be stream-local.

Due to the way Geant4 currently manages memory, we have to make sure that our streams correspond to Geant4 threads. There's a hidden "references" link here between the LocalTransporter and the HitManager: it ensures that the hit processor's temporary data is deallocated by the thread-local LocalTransporter. That link is made necessary by the weird Geant4 allocation, and it is part of the weirdness in the design that you may have noticed.

@sethrj sethrj added performance Changes for performance optimization and removed performance Changes for performance optimization labels Nov 14, 2023
Development

Successfully merging this pull request may close these issues.

G4Navigator can only be used from within the thread that created it
3 participants