Support multithreaded CPU using single GPU in demo loop #774
Conversation
app/demo-loop/Runner.cc (Outdated)

```cpp
// TODO: partition primaries among streams
CELER_ASSERT(stream_id == StreamId{0});
return (*transport)(make_span(primaries_));
return (*transport)(make_span(events_[ids.event.get()]));
```
For backward compatibility (so that we can keep comparing against our old regression results), can you add the ability to transport all events simultaneously? Maybe another `operator()` with no arguments.
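A minimal sketch of what such an overload pair could look like. This is illustrative only: the `Runner` class, its member names, and the `transport` placeholder are hypothetical stand-ins, not the actual Celeritas API.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the demo-loop runner, showing one operator()
// that transports a single event and a no-argument overload that
// transports all events at once for backward compatibility.
class Runner
{
  public:
    using Primaries = std::vector<int>;  // placeholder primary type

    explicit Runner(std::vector<Primaries> events)
        : events_(std::move(events))
    {
    }

    // Transport a single event (the per-stream path)
    std::size_t operator()(std::size_t event_id)
    {
        return this->transport(events_.at(event_id));
    }

    // Transport all events simultaneously (backward-compatible overload)
    std::size_t operator()()
    {
        std::size_t total = 0;
        for (auto const& e : events_)
        {
            total += this->transport(e);
        }
        return total;
    }

  private:
    std::vector<Primaries> events_;

    // Placeholder for the real transport call; returns the number of
    // primaries "transported" so the sketch is checkable
    std::size_t transport(Primaries const& p) { return p.size(); }
};
```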
Was thinking the same, will do.
Looks great @amandalund ! Thanks.
@amandalund Ever since this pull request I'm getting out-of-memory errors for the cms2018 demo problems. It looks like the default "max streams" is 1, so I think it should still be creating one state and one thread, right?

Also, a probably unrelated question: CUDA's device context is thread-local, so they recommend resetting the device inside OpenMP parallel for loops, so maybe we need:

```diff
diff --git a/app/demo-loop/demo-loop.cc b/app/demo-loop/demo-loop.cc
index 8ce6ca63..8daaee18 100644
--- a/app/demo-loop/demo-loop.cc
+++ b/app/demo-loop/demo-loop.cc
@@ -108,6 +108,10 @@ void run(std::istream* is, std::shared_ptr<celeritas::OutputRegistry> output)
 #endif
     for (size_type event = 0; event < run_stream.num_events(); ++event)
     {
+        // Make sure cudaSetDevice is called on the local thread
+        using namespace celeritas;
+        activate_device(Device{device().device_id()});
+
         // Run a single event on a single thread
         CELER_TRY_HANDLE(result.events[event] = run_stream(
                              StreamId(get_openmp_thread()), EventId(event)),
```
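The underlying issue is that CUDA's current-device setting is per-thread state, so a `cudaSetDevice` call on the main thread does not carry over to worker threads. A CPU-only sketch of that behavior, modeling the per-thread context with a `thread_local` variable; `activate_device` and `device_seen_by_new_thread` here are hypothetical helpers, not CUDA or Celeritas calls:

```cpp
#include <thread>

// Models CUDA's thread-local "current device": each thread starts with
// no device selected (-1), regardless of what other threads have set.
thread_local int current_device = -1;

// Hypothetical analogue of calling cudaSetDevice on the calling thread
void activate_device(int device_id)
{
    current_device = device_id;
}

// Spawn a fresh thread and report which device it observes, optionally
// calling activate_device inside the thread (the fix from the diff above)
int device_seen_by_new_thread(bool set_inside)
{
    int seen = -2;
    std::thread t([&] {
        if (set_inside)
        {
            activate_device(0);
        }
        seen = current_device;
    });
    t.join();
    return seen;
}
```

Even after the main thread calls `activate_device(0)`, a new thread still sees `-1` unless it makes its own call, which is why the device must be (re)activated inside the parallel loop body.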
Hmm, I'm not seeing an out-of-memory error for that case (and with …). And good point about the thread-local device context.
Aha, yes, it is indeed being set in the regression suite driver. I'll update it. Thanks!
This adds OpenMP multithreading to a loop over events in the demo loop, with each event processed by a separate thread and running on a single GPU (see #553). For multithreaded CPU-only runs, nested parallel regions are disabled until we can understand why the performance there is so poor. The GPU performance is definitely not as good as running all events simultaneously on a single thread (about a factor of two slower for cms2018+field+msc-vecgeom-gpu), but we should get some of that back when we're able to launch kernels on different CUDA streams.
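The event-level parallelism described above can be sketched as follows. This is a schematic, not the actual demo-loop code: `run_all_events` and its squared-value placeholder body stand in for the real per-event transport call, and the nested-parallelism guard shows one common way (`omp_set_max_active_levels`) to disable nested regions, under the assumption that is acceptable for this illustration.

```cpp
#include <vector>
#ifdef _OPENMP
#    include <omp.h>
#endif

// Sketch: process each event on its own OpenMP thread, with nested
// parallel regions disabled. The loop body is a placeholder for the
// real run_stream(StreamId, EventId) call.
std::vector<int> run_all_events(int num_events)
{
#ifdef _OPENMP
    // Disable nested parallel regions (performance there was poor)
    omp_set_max_active_levels(1);
#endif
    std::vector<int> results(num_events, 0);
#pragma omp parallel for
    for (int event = 0; event < num_events; ++event)
    {
        // Run a single event on a single thread; placeholder work
        results[event] = event * event;
    }
    return results;
}
```

The `#ifdef _OPENMP` guards let the same code compile serially when OpenMP is not enabled, since the pragma is then simply ignored.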