Fix multithreading in Geant4 and HitManager #694
Conversation
Sorry, I shouldn't have requested the review yet because this is still in draft and includes changes from #693 that don't need review here.

No problem. Let me know when it is ready to review.
@whokion This is ready to review whenever you have a moment. Specifically, could you make sure that the way I'm using
```cpp
/*!
 * Ensure the local hit processor exists, and return it.
 */
HitProcessor& HitManager::get_local_hit_processor()
```
If HitManager is thread-local, HitProcessor can be private to HitManager (related to the earlier comment) instead of a `vector<HitProcessor*>`. Why is this the better approach?
All the major event loop components are "global". Only the incoming state objects are "thread-local"[^1], and normally their stream IDs should be used to access per-stream state data that (for now) has to live in the shared object. Here, instead of using G4Threading, I could (and probably should) add the `StreamId` to the `StepState` so we can access it directly...
[^1]: I put "global" and "thread-local" in quotes above because nothing in Celeritas inherently requires streams and threads to match. Streams could be shared within a thread pool, or you could have multiple streams per CPU thread, etc. Similarly, you could have two entirely different event loops running simultaneously within the same application.
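As a rough illustration of the stream-indexed lookup proposed above, here is a minimal sketch. `HitManager`, `HitProcessor`, and `StreamId` are simplified stand-ins for the Celeritas classes, not the actual API; the point is only that a shared manager can own lazily created per-stream processors and select them by stream ID rather than by querying `G4Threading`:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-ins for the classes discussed above.
struct StreamId { std::size_t value; };
struct HitProcessor { int num_hits = 0; };

// Shared across streams: owns one lazily created processor per stream.
class HitManager {
  public:
    explicit HitManager(std::size_t num_streams) : processors_(num_streams) {}

    // Look up (and lazily create) the processor for the given stream.
    // Locking is only needed because creation mutates shared storage;
    // each stream subsequently uses only its own slot.
    HitProcessor& get_local_hit_processor(StreamId sid) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto& slot = processors_.at(sid.value);
        if (!slot) slot = std::make_unique<HitProcessor>();
        return *slot;
    }

  private:
    std::mutex mutex_;
    std::vector<std::unique_ptr<HitProcessor>> processors_;
};
```

Repeated calls with the same stream ID return the same processor, while different streams never share one.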
```cpp
 * action manager.
 * generating hit objects. It \b must be thread-local because the sensitive
 * detectors it stores are thread-local, and additionally Geant4 makes
 * assumptions about object allocations that cause crashes if the HitProcessor
```
A worker has its own event loop, so this is not an assumption but a consequence of Geant4 MT being event-level multithreading.
The "assumption" is that objects like G4Navigator, G4Step, and G4Track can only be allocated by a worker thread, and that they're deallocated by the same worker thread. Those objects aren't necessarily associated with a specific event (in the G4 fast simulation some of those are reused across multiple events), nor are they inherently associated with a specific thread (except by the invisible implementation of the thread allocator).
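The same-thread allocation rule described here can be made concrete with a toy wrapper. This is illustrative only (`ThreadOwned` is not a Geant4 class): it records the creating thread and checks ownership later, whereas in real Geant4 violating the rule crashes inside the invisible per-thread allocator rather than failing cleanly:

```cpp
#include <cassert>
#include <thread>

// Illustrative only: enforces the implicit Geant4 rule that
// worker-allocated objects (G4Step, G4Navigator, ...) must be
// deallocated by the same worker thread that created them.
class ThreadOwned {
  public:
    ThreadOwned() : owner_(std::this_thread::get_id()) {}

    ~ThreadOwned() {
        // In real Geant4 this violation is a crash, not an assertion,
        // because the per-thread allocator is hidden from the user.
        assert(owned_by_current_thread());
    }

    bool owned_by_current_thread() const {
        return std::this_thread::get_id() == owner_;
    }

  private:
    std::thread::id owner_;
};
```

A worker-created `ThreadOwned` reports ownership only on its creating thread, mirroring why the HitProcessor must live and die on its Geant4 worker thread.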
```diff
@@ -230,26 +273,21 @@ void HitProcessor::operator()(DetectorStepOutput const& out) const
     if (navi_)
     {
-        CELER_ASSERT(out.detector[i] < detector_volumes_.size());
+        CELER_ASSERT(out.detector[i] < detectors_.size());
         bool success = this->update_touchable(
```
This seems very expensive. Do we really need this check?
The touchable update is necessary for detectors like tilecal and hgcal that use the detailed navigation state to determine a subdetector identifier. Since their sensitive detectors require the navigator to be initialized and correct, we don't really have a choice if we're calling back to those routines. My hope is that for our initial implementation this won't completely kill the performance, and doing it this way will give us a "baseline" performance number to justify the effort of implementing the actual detector logic on GPU.
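To show what the touchable update buys us, here is a heavily simplified mock: a one-dimensional "navigator" that locates the segment containing a hit position and points the touchable at it, returning `false` when navigation fails (paralleling the `bool success = this->update_touchable(...)` check above). `Touchable` and `SegmentNavigator` are illustrative stand-ins, not the Geant4/Celeritas API:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy "touchable": just remembers which volume the hit is in.
struct Touchable {
    std::size_t volume_id = static_cast<std::size_t>(-1);
};

// Toy "navigator" over a 1D segmented detector.
class SegmentNavigator {
  public:
    explicit SegmentNavigator(std::vector<double> boundaries)
        : boundaries_(std::move(boundaries)) {}

    // Update the touchable to the volume containing `pos`; return
    // false if the position lies outside the world volume.
    bool update_touchable(double pos, Touchable& touch) const {
        if (pos < boundaries_.front() || pos >= boundaries_.back())
            return false;
        for (std::size_t i = 0; i + 1 < boundaries_.size(); ++i) {
            if (pos < boundaries_[i + 1]) {
                touch.volume_id = i;
                return true;
            }
        }
        return false;
    }

  private:
    std::vector<double> boundaries_;
};
```

An SD that derives a channel ID from the navigation state (as TileCal and HGCal do) only sees a consistent answer if this update runs before the hit callback.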
I still do not understand why HitProcessor is responsible for updating the navigation state (touchable) when a hit is processed at the end of stepping (only once per step). Isn't the subdetector a physical/logical volume (so it can be a tracking volume with a boundary)? Isn't it an independent readout channel? I guess there may be confusion between what simulation information needs to be known (for MC "study") and what needs to be recorded as a hit (to be used for digitization).
I might be misunderstanding what the confusion is too, so let me try to restate from a high level what's going on.
- At the end of an event on a local thread, the `LocalTransporter` steps through all the offloaded EM tracks until they die.
- Each step calls the (shared) explicit actions using the shared "params" state and the (thread-local) track states.
- The `celeritas::StepCollector` gathers the position and volume ID at the pre-step and the energy deposition at the post-step (though the exact selection of outputs can be changed by the user).
- At post-step, the (shared) HitManager takes the thread-local gathered step data and calls the thread-local HitProcessor.
- The HitProcessor loops through all the hits, translates them into thread-local `G4Step` objects, uses the VecGeom logical volume ID to look up the `G4VSensitiveDetector`, and calls `sd->Hit(step)`. If the SD needs to query `hit->GetPreStepPoint()->GetTouchableHandle()`, then before calling `Hit` we use a thread-local (i.e., owned by HitProcessor) navigator to update the thread-local touchable for the thread-local step.
When I said "subdetector" I meant "readout channel": I guess HGCal is the subdetector, and the code needs the navigation state and volume information to figure out what channel it's in.
Can you add a code snippet to demo-geant-integration showing how to access the collected hit information and retrieve some of it (for example, the total energy deposition or the size of the hit collection) at the end of EventAction (or after each Flush?)? For the CMS integration, we need to access the hit (StepInfo) collection from the GPU at least at the event level to merge hits from the CPU, and send them to the next stage of the simulation pipeline (i.e., digitization) or write them out to disk in event units.
@whokion For our initial CMS implementation, we're not going to merge the hits on the GPU; we're using the

I am working on implementing a GPU-based calorimeter which will actually do the accumulation on GPU. Is this what you're asking for?
No, the current implementation/workflow is good enough for a demonstration/integration. What I am asking is to pipeline the output from the GPU (to create a collection of equivalent G4Steps, which will be merged through CPU sensitive detectors into the final hit data) into the demo-geant-integration workflow once the transporter completes stepping for the set of EM tracks on the GPU (i.e., after Flush is called). At a minimum, an example of how to access the output hit information from the GPU in EventAction or TrackingAction would be good enough.
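The kind of end-of-event access being requested could look roughly like this. `HitInfo` and `EventAction` here are hypothetical stand-ins, not the demo-geant-integration API: the idea is just that once the SDs have been filled (during stepping and after the final Flush), an end-of-event hook can walk the accumulated collection and report per-event totals:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-hit record accumulated by the sensitive detectors.
struct HitInfo {
    double edep;
};

// Hypothetical end-of-event hook: by this point the transporter has
// flushed all offloaded tracks and the collection is complete.
struct EventAction {
    std::vector<HitInfo> collection;  // filled during stepping/Flush

    // e.g. report the total energy deposition for the event
    double end_of_event_total() const {
        double total = 0;
        for (auto const& hit : collection)
            total += hit.edep;
        return total;
    }

    // e.g. report the size of the hit collection
    std::size_t hit_count() const { return collection.size(); }
};
```

In a real integration the collection would be a Geant4 hits collection owned by the SDs, and this hook would hand the merged hits to digitization or persist them per event.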
But this is already happening automatically through the HitManager during the stepping loop. The current demo app will have Steps sent back from the GPU.
Then, I definitely missed something. So, are we calling
We are. There are a lot of moving pieces, so I'll draw up a diagram of who's constructing what and how the tracks and hits move between CPU and GPU 😄
@whokion Here's my attempt at a UML diagram, with background colors indicating the different libraries/layers of code, and a few classes that operate on device in green. The user tracking action (local) sends hits via G4Track to the

During the stepping loop (which uses the stream-local
Nice! This looks good to me @sethrj.
Technically, the tracking action actually flushes (through LocalTransporter::Push(G4Track), then ::Flush()), and the event action flushes the remaining tracks at the end of the event. Nevertheless, I am still not fully convinced why HitManager is globally shared while the Stepper/LocalTransporter are local, even though HitManager behaves like something similar to the Geant4 split-class mechanism; but I find nothing wrong either, so maybe it's a design choice. We may learn more about how to directly use the HitManager/HitProcessor for the CMS integration as necessary. Anyway, the diagram helps a lot and illuminates the interconnectivity and workflow with specific ownership relations. Thanks for the nice work.
OK @whokion, a minor update that I think improves things even further: for clarity I switched StepCollector and StepStateData and fixed the ownership relationship between HitProcessor and DetectorStepOutput. The dotted region contains the classes that are shared across threads/tasks/streams because they operate only on the problem setup/parameter data rather than on any state information. The HitManager is-a StepInterface derived class that is managed by the (shared) step gather action, so it has to be shared. The HitProcessor operates on the state data, so it has to be stream-local. Due to the way Geant4 currently manages memory, we have to make sure that our streams correspond to Geant4 threads. There's a hidden "references" link here between the LocalTransporter and the HitManager, which is necessary to ensure that the hit processor's temporary data is deallocated by the thread-local
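The shared-vs-stream-local split in the diagram can be condensed into a small sketch. `SharedParams` and `StreamState` here are illustrative stand-ins (not the actual Celeritas classes): the shared object holds only immutable setup data, while each stream owns its mutable state and keeps a read-only view of the params:

```cpp
#include <cstddef>
#include <vector>

// "Dotted region" of the diagram: setup-only data, immutable after
// construction, so it can be safely shared across all streams.
struct SharedParams {
    std::vector<double> detector_thresholds;
};

// One per stream (and, given current Geant4 memory management, one
// per Geant4 worker thread): mutable state, never shared.
struct StreamState {
    SharedParams const* params;      // read-only view of shared setup
    std::vector<double> local_edep;  // per-stream accumulation buffer

    explicit StreamState(SharedParams const& p)
        : params(&p), local_edep(p.detector_thresholds.size(), 0.0) {}
};
```

Two streams constructed from the same params share the setup data but never each other's accumulation buffers, which is the invariant the HitManager/HitProcessor split enforces.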
This adds thread safety to the HitCollector and fixes #613.
After this fix, here's a comparison of 18 GeV $\pi^+$ in the ATLAS TileCal model: