
Support multithreaded CPU using single GPU in demo loop #774

Merged
merged 6 commits into celeritas-project:develop on May 26, 2023

Conversation

amandalund
Contributor

This adds OpenMP multithreading to the loop over events in the demo loop: each event is processed by a separate CPU thread, with all threads sharing a single GPU (see #553). For multithreaded CPU-only runs, nested parallel regions are disabled until we can understand why their performance is so poor. The GPU performance is definitely not as good as running all events simultaneously on a single thread (about a factor of two slower for cms2018+field+msc-vecgeom-gpu), but we should get some of that back once we're able to launch kernels on different CUDA streams.
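
A minimal, self-contained sketch of that pattern, with hypothetical transport_event and EventResult placeholders standing in for the actual demo-loop transporter call and result type:

#include <omp.h>

#include <cstddef>
#include <vector>

struct EventResult
{
    double edep{0};  // hypothetical per-event tally
};

// Placeholder for launching transport of one event on the given stream
EventResult transport_event(int stream_id, std::size_t event_id)
{
    return EventResult{static_cast<double>(stream_id + event_id)};
}

std::vector<EventResult> run_all_events(std::size_t num_events)
{
    std::vector<EventResult> results(num_events);
    // One CPU thread per event; in GPU mode every thread currently submits
    // work to the same device
#pragma omp parallel for
    for (std::size_t event = 0; event < num_events; ++event)
    {
        results[event] = transport_event(omp_get_thread_num(), event);
    }
    return results;
}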

@amandalund amandalund added enhancement New feature or request core Software engineering infrastructure labels May 25, 2023
@amandalund amandalund requested a review from sethrj May 25, 2023 01:37
app/CMakeLists.txt
app/demo-loop/Runner.hh
// TODO: partition primaries among streams
CELER_ASSERT(stream_id == StreamId{0});
return (*transport)(make_span(primaries_));
return (*transport)(make_span(events_[ids.event.get()]));
Member

For backward compatibility (so that we can keep comparing against our old regression results), can you add the ability to transport all events simultaneously? Maybe another operator() with no arguments.

Contributor Author

Was thinking the same, will do.
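
A rough sketch of what that overload could look like (the events_ member is taken from the snippet above, while RunnerResult, Primary, and get_transporter are assumed names, not the merged interface):

// Transport every event's primaries together on a single stream, matching the
// pre-multithreading behavior for regression comparisons
RunnerResult Runner::operator()()
{
    std::vector<Primary> all_primaries;
    for (auto const& event : events_)
    {
        all_primaries.insert(all_primaries.end(), event.begin(), event.end());
    }
    auto transport = this->get_transporter(StreamId{0});  // hypothetical accessor
    return (*transport)(make_span(all_primaries));
}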

app/demo-loop/demo-loop.cc
src/celeritas/user/ActionDiagnostic.cc
src/celeritas/user/ActionDiagnostic.hh
@sethrj sethrj left a comment

Looks great @amandalund! Thanks.

@sethrj sethrj merged commit b590ae9 into celeritas-project:develop May 26, 2023
@amandalund amandalund deleted the mt-demo-loop branch May 26, 2023 11:06
@sethrj sethrj linked an issue May 27, 2023 that may be closed by this pull request
sethrj commented May 30, 2023

@amandalund Ever since this pull request I've been getting out-of-memory errors for the cms2018 demo problems. It looks like the default "max streams" is 1, so it should still be creating one state and one thread, right?

sethrj commented May 30, 2023

Also, a probably unrelated question: CUDA's device context is thread-local, so they recommend resetting the device inside OpenMP parallel for loops; maybe we need:

diff --git a/app/demo-loop/demo-loop.cc b/app/demo-loop/demo-loop.cc
index 8ce6ca63..8daaee18 100644
--- a/app/demo-loop/demo-loop.cc
+++ b/app/demo-loop/demo-loop.cc
@@ -108,6 +108,10 @@ void run(std::istream* is, std::shared_ptr<celeritas::OutputRegistry> output)
 #endif
         for (size_type event = 0; event < run_stream.num_events(); ++event)
         {
+            // Make sure cudaSetDevice is called on the local thread
+            using namespace celeritas;
+            activate_device(Device{device().device_id()});
+
             // Run a single event on a single thread
             CELER_TRY_HANDLE(result.events[event] = run_stream(
                                  StreamId(get_openmp_thread()), EventId(event)),

@amandalund
Contributor Author

Hmm, I'm not seeing an out-of-memory error for that case (with "initializer_capacity": 67108864 and "max_num_tracks": 1048576). Is OMP_NUM_THREADS set (since that's how we're currently setting the number of streams)?

And good point about the thread-local device context.
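
For reference, a sketch of how the stream count could be tied to the OpenMP thread count (get_num_streams is a hypothetical helper, not celeritas API); since each stream gets its own state allocation, an inherited OMP_NUM_THREADS can multiply the GPU memory footprint:

#ifdef _OPENMP
#    include <omp.h>
#endif

int get_num_streams()
{
#ifdef _OPENMP
    // Honors OMP_NUM_THREADS when it is set in the environment
    return omp_get_max_threads();
#else
    return 1;
#endif
}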

sethrj commented May 31, 2023

Aha, yes, it is indeed being set in the regression suite driver. I'll update it. Thanks!

@sethrj sethrj added performance Changes for performance optimization and removed core Software engineering infrastructure labels Nov 14, 2023
Successfully merging this pull request may close these issues.

Add multithreaded CPU to single GPU in demo-loop