
Support multithreaded CPU using single GPU in demo loop #774

Merged
merged 6 commits into celeritas-project:develop on May 26, 2023

Conversation

amandalund
Contributor

This adds OpenMP multithreading to the loop over events in the demo loop: each event is processed by a separate CPU thread, with all threads sharing a single GPU (see #553). For multithreaded CPU-only runs, nested parallel regions are disabled until we can understand why their performance is so poor. The GPU performance is definitely not as good as running all events simultaneously on a single thread (about a factor of two slower for cms2018+field+msc-vecgeom-gpu), but we should get some of that back once we're able to launch kernels on different CUDA streams.
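
A minimal, self-contained sketch of that pattern, with hypothetical transport_event and EventResult placeholders standing in for the actual demo-loop transporter call and result type:

#include <omp.h>

#include <cstddef>
#include <vector>

struct EventResult
{
    double edep{0};  // hypothetical per-event tally
};

// Placeholder for launching transport of one event on the given stream
EventResult transport_event(int stream_id, std::size_t event_id)
{
    return EventResult{static_cast<double>(stream_id + event_id)};
}

std::vector<EventResult> run_all_events(std::size_t num_events)
{
    std::vector<EventResult> results(num_events);
    // One CPU thread per event; in GPU mode every thread currently submits
    // work to the same device
#pragma omp parallel for
    for (std::size_t event = 0; event < num_events; ++event)
    {
        results[event] = transport_event(omp_get_thread_num(), event);
    }
    return results;
}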

@amandalund amandalund added enhancement New feature or request core Software engineering infrastructure labels May 25, 2023
@amandalund amandalund requested a review from sethrj May 25, 2023 01:37
app/CMakeLists.txt
app/demo-loop/Runner.hh
// TODO: partition primaries among streams
CELER_ASSERT(stream_id == StreamId{0});
return (*transport)(make_span(primaries_));
return (*transport)(make_span(events_[ids.event.get()]));
Member

For backward compatibility (so that we can keep comparing against our old regression results), can you add the ability to transport all events simultaneously? Maybe another operator() with no arguments.

Contributor Author

Was thinking the same, will do.
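
A rough sketch of what that overload could look like (the events_ member is taken from the snippet above, while RunnerResult, Primary, and get_transporter are assumed names, not the merged interface):

// Transport every event's primaries together on a single stream, matching the
// pre-multithreading behavior for regression comparisons
RunnerResult Runner::operator()()
{
    std::vector<Primary> all_primaries;
    for (auto const& event : events_)
    {
        all_primaries.insert(all_primaries.end(), event.begin(), event.end());
    }
    auto transport = this->get_transporter(StreamId{0});  // hypothetical accessor
    return (*transport)(make_span(all_primaries));
}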

app/demo-loop/demo-loop.cc
src/celeritas/user/ActionDiagnostic.cc
src/celeritas/user/ActionDiagnostic.hh
@sethrj sethrj left a comment

Looks great @amandalund! Thanks.

@sethrj sethrj merged commit b590ae9 into celeritas-project:develop May 26, 2023
@amandalund amandalund deleted the mt-demo-loop branch May 26, 2023 11:06
@sethrj sethrj linked an issue May 27, 2023 that may be closed by this pull request
sethrj commented May 30, 2023

@amandalund Ever since this pull request I've been getting out-of-memory errors for the cms2018 demo problems. It looks like the default "max streams" is 1, so it should still be creating one state and one thread, right?

sethrj commented May 30, 2023

Also, a probably unrelated question: CUDA's device context is thread-local, so they recommend resetting the device inside OpenMP parallel for loops; maybe we need:

diff --git a/app/demo-loop/demo-loop.cc b/app/demo-loop/demo-loop.cc
index 8ce6ca63..8daaee18 100644
--- a/app/demo-loop/demo-loop.cc
+++ b/app/demo-loop/demo-loop.cc
@@ -108,6 +108,10 @@ void run(std::istream* is, std::shared_ptr<celeritas::OutputRegistry> output)
 #endif
         for (size_type event = 0; event < run_stream.num_events(); ++event)
         {
+            // Make sure cudaSetDevice is called on the local thread
+            using namespace celeritas;
+            activate_device(Device{device().device_id()});
+
             // Run a single event on a single thread
             CELER_TRY_HANDLE(result.events[event] = run_stream(
                                  StreamId(get_openmp_thread()), EventId(event)),

@amandalund
Contributor Author

Hmm, I'm not seeing an out-of-memory error for that case (with "initializer_capacity": 67108864 and "max_num_tracks": 1048576). Is OMP_NUM_THREADS set (since that's how we're currently setting the number of streams)?

And good point about the thread-local device context.
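
For reference, a sketch of how the stream count could be tied to the OpenMP thread count (get_num_streams is a hypothetical helper, not celeritas API); since each stream gets its own state allocation, an inherited OMP_NUM_THREADS can multiply the GPU memory footprint:

#ifdef _OPENMP
#    include <omp.h>
#endif

int get_num_streams()
{
#ifdef _OPENMP
    // Honors OMP_NUM_THREADS when it is set in the environment
    return omp_get_max_threads();
#else
    return 1;
#endif
}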

sethrj commented May 31, 2023

Aha, yes, it is indeed being set in the regression suite driver. I'll update it. Thanks!

@sethrj sethrj added performance Changes for performance optimization and removed core Software engineering infrastructure labels Nov 14, 2023
Successfully merging this pull request may close these issues.

Add multithreaded CPU to single GPU in demo-loop