Fix resetting of CUDA streams when running through accel #927
Conversation
P.S. The unrelated changes are to get wildstyle builds to work and be quiet... I can move those to the release branch from here.
Great catch! Alternatively, would calling `activate_device_local()` on the worker threads (like we do in celer-sim) instead of overwriting the device work?
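For reference, the pattern being suggested looks roughly like this (a sketch only: `activate_device`, `set_num_streams`, and `activate_device_local` are names from this discussion, but their exact signatures here are assumptions):

```cpp
// Main thread: select the global device and size its stream pool exactly once.
void master_setup(unsigned int num_threads)
{
    celeritas::activate_device();  // set the global device
    celeritas::device().set_num_streams(num_threads);  // assumed spelling
}

// Geant4 worker threads: bind this thread to the already-active device
// (cudaSetDevice under the hood) instead of constructing a new Device,
// which would destroy the streams created above.
void worker_begin()
{
    celeritas::activate_device_local();
}
```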
Actually that's a good point... why aren't we doing that? 🤔
I'm guessing we added support for streams/local device activation after the accel stuff and missed that we needed to update it... this should have been done in #805.
Thanks @sethrj and @esseivaju! A 20% improvement sounds about right... that's about the max speedup I saw using multiple streams vs. the default stream when I originally tested with celer-sim.
```diff
@@ -101,7 +110,7 @@ void ActionSequence::execute(CoreParams const& params, CoreState<M>& state)
         actions_[i]->execute(params, state);
         if (M == MemSpace::device)
         {
-            CELER_DEVICE_CALL_PREFIX(DeviceSynchronize());
+            CELER_DEVICE_CALL_PREFIX(StreamSynchronize(stream));
```
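For context, `CELER_DEVICE_CALL_PREFIX(Foo(...))` prepends the runtime prefix (`cuda` or `hip`) and checks the returned error code, so on a CUDA build the change amounts to the sketch below (raw runtime calls, with Celeritas's stream wrapper type elided):

```cpp
#include <cuda_runtime.h>

// Before: a device-wide barrier that waits for work on *every* stream.
void sync_device() { cudaDeviceSynchronize(); }

// After: waits only for work previously enqueued on this state's stream,
// so other worker threads' streams keep running undisturbed.
void sync_stream(cudaStream_t stream) { cudaStreamSynchronize(stream); }
```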
Have you looked at how this impacts performance? I recall actually seeing slightly better performance with a device sync than without when running with multiple streams, and in a quick test just now (celer-sim with cms2018 + field + msc, 8 events/threads) I see an even bigger improvement with the stream sync (about 1.2x faster). Curious if you see something similar, because it's not obvious to me why this would be.
Interesting, I don't see a difference (tested with 32 threads and 32/64 events); the sync doesn't affect the wall time.
Weird... these are the timing results I'm getting with no sync:

```console
real    3m58.476s
user    7m30.713s
sys     0m2.825s

$ jq '.result["runner"]["time"]' no-sync.out.json
{
  "actions": {},
  "setup": 31.052430465,
  "steps": [],
  "total": 206.873054614
}
```
and with sync:

```console
real    3m2.136s
user    14m19.799s
sys     0m3.653s

$ jq '.result["runner"]["time"]' sync.out.json
{
  "actions": {},
  "setup": 30.891581001,
  "steps": [],
  "total": 150.659916742
}
```
and the input I'm using, with `OMP_NUM_THREADS=8`:
```json
{
  "geometry_filename": "/home/alund/celeritas_project/regression/input/cms2018.gdml",
  "physics_filename": "/home/alund/celeritas_project/regression/input/cms2018.gdml",
  "primary_gen_options": {
    "seed": 0,
    "direction": {"distribution": "isotropic", "params": []},
    "energy": 10000,
    "num_events": 8,
    "pdg": 11,
    "position": [0, 0, 0],
    "primaries_per_event": 1300
  },
  "geant_options": {
    "eloss_fluctuation": false,
    "em_bins_per_decade": 56,
    "msc": "urban_extended",
    "physics": "em_basic"
  },
  "mag_field": [0.0, 0.0, 1.0],
  "initializer_capacity": 8388608,
  "max_events": 8,
  "num_track_slots": 131072,
  "max_steps": 1000000,
  "secondary_stack_factor": 3.0,
  "seed": 20220904,
  "merge_events": false,
  "sync": false,
  "use_device": true
}
```
@amandalund Interesting that the user time doubled even though the real time decreased. Is this on consumer hardware or HPC? It could be that the sync is causing the CPU threads to spinlock rather than context switch while waiting...
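One way to test that hypothesis (outside this PR) would be CUDA's device scheduling flags, which control exactly this spin-versus-block behavior:

```cpp
#include <cuda_runtime.h>

int main()
{
    // Must be set before the CUDA context is created on this device.
    // ScheduleBlockingSync: host threads sleep while waiting on the GPU
    // (low user time). ScheduleSpin: host threads busy-wait (high user
    // time, lower wake-up latency). The default, ScheduleAuto, chooses
    // heuristically.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaFree(nullptr);  // force context creation with the chosen flags
    return 0;
}
```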
Yeah, I thought that was odd... this is on HPC (Intel Xeon Gold 6152, 22 cores at 2.10 GHz, plus an NVIDIA Tesla V100 SXM2 with 32 GB HBM2).
It could also be some weird interaction between OpenMP and CUDA? Even though CUDA nominally supports OpenMP, the latter prohibits any kind of thread interaction outside of OpenMP, and CUDA definitely uses pthreads under the hood. Maybe that's why @esseivaju didn't see any difference: he's not using OpenMP? Still, I would expect the net effect to be a slowdown rather than a speedup.
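For concreteness, the mixing in question is roughly this (a sketch of the event-parallel pattern, not code from this PR): each OpenMP thread drives its own stream, while the CUDA runtime spawns its own pthreads internally, which the OpenMP standard says nothing about.

```cpp
#include <cuda_runtime.h>

void run_events(int num_events)
{
#pragma omp parallel
    {
        // One stream per OpenMP thread, used for that thread's events
        cudaStream_t stream;
        cudaStreamCreate(&stream);
#pragma omp for
        for (int e = 0; e < num_events; ++e)
        {
            // ... launch kernels for event e on `stream` ...
            cudaStreamSynchronize(stream);
        }
        cudaStreamDestroy(stream);
    }
}
```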
Interesting, I also see it with celer-g4.
no sync:

```console
real    2m7.980s
user    3m43.651s
sys     0m1.806s
```

sync:

```console
real    1m46.656s
user    8m40.084s
sys     0m2.068s
```

with `G4FORCENUMBEROFTHREADS=8` and input:
```json
{
  "geometry_file": "/home/alund/celeritas_project/regression/input/cms2018.gdml",
  "output_file": "cms2018-celer-g4.out.json",
  "primary_options": {
    "seed": 0,
    "direction": {"distribution": "isotropic", "params": []},
    "energy": 10000,
    "num_events": 8,
    "pdg": 11,
    "position": [0, 0, 0],
    "primaries_per_event": 1300
  },
  "physics_list": "geant_physics_list",
  "physics_options": {
    "eloss_fluctuation": false,
    "em_bins_per_decade": 56,
    "msc": "urban_extended",
    "physics": "em_basic"
  },
  "field_type": "uniform",
  "field": [0.0, 0.0, 1.0],
  "write_sd_hits": false,
  "initializer_capacity": 8388608,
  "max_events": 128,
  "num_track_slots": 131072,
  "max_steps": 1000000,
  "secondary_stack_factor": 3.0,
  "seed": 20220904,
  "sync": true
}
```
Here’s a relevant question with an answer that I think gives a nice example/explanation as to why we might see better performance with stream synchronization when using async copies with pageable memory: https://forums.developer.nvidia.com/t/performances-of-multi-thread-vs-multi-process-with-mps/64236/2
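The gist of the linked answer, in code form (a sketch, not code from this PR): an "async" copy is only truly asynchronous when the host buffer is pinned, so pageable copies can quietly serialize multi-stream work.

```cpp
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

void copy_to_device(float* device_buf, std::size_t n, cudaStream_t stream)
{
    // Pageable host memory: the runtime stages the copy through an internal
    // pinned buffer, synchronizing with the host and serializing streams.
    std::vector<float> pageable(n);
    cudaMemcpyAsync(device_buf, pageable.data(), n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // Pinned (page-locked) host memory: the copy is truly asynchronous and
    // can overlap with kernels and with copies on other streams.
    float* pinned = nullptr;
    cudaMallocHost(&pinned, n * sizeof(float));
    cudaMemcpyAsync(device_buf, pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
    cudaFreeHost(pinned);
}
```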
@amandalund #910 is still kinda broken (it only compiles with CUDA), but I've done some profiling on it and it significantly reduces the number of async memcpys to pageable memory. Before I had many ~25 kB transfers; now the only async copies to pageable memory left come from Thrust (`exclusive_scan_counts` and `remove_if_alive` in TrackInitAlgorithms, and `copy_if` in DetectorSteps) and are 4 bytes each (probably the return value).
If you can compile that branch, I'd be curious to know whether it helps reduce the timing for the non-sync version.
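Those remaining 4-byte transfers are consistent with Thrust algorithms computing their return value (the new end iterator) on the host. A sketch of the kind of call involved, with a hypothetical predicate and buffer:

```cpp
#include <cstddef>
#include <thrust/remove.h>
#include <thrust/system/cuda/execution_policy.h>

// Hypothetical predicate: a negative slot marks a dead track.
struct IsDead
{
    __device__ bool operator()(int track) const { return track < 0; }
};

// Even when run on a dedicated stream via thrust::cuda::par.on(stream),
// remove_if must report the new logical end back to the host, which costs
// one small device-to-host copy per call.
std::size_t compact_alive(int* tracks, std::size_t n, cudaStream_t stream)
{
    int* end = thrust::remove_if(thrust::cuda::par.on(stream),
                                 tracks, tracks + n, IsDead{});
    return static_cast<std::size_t>(end - tracks);
}
```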
@esseivaju I tried out the pinned allocator branch, but it looks like the time is still about the same as with no stream synchronization.
* Add output for creating and destroying streams
* Fix accidental destruction of streams when setting device locally
* Implement a hackish way to reset streams at the end of a Geant4 run
* Use async copies for detector step collection and track sorting
* Use `activate_device_local` instead of initializing another device
* Require that `activate_device` be called only once
* Use stream synchronize in action sequence for diagnostics
* Mark stream as possibly unused
@esseivaju's recent speedup diagram of Celeritas with Tilecal showed very poor multithread scaling. He tracked this down to the fact that all the kernels were using the default stream.

Further digging revealed that the stream IDs are passed correctly through accel and down to the kernels, but the CUDA streams themselves were being reset. It turns out this is because we were calling `activate_device` on the Geant4 worker threads after calling `set_num_streams` on the main device: the subsequent calls overwrote the global device and destroyed its streams.

This additionally changes the behavior of `activate_device` so that it can be called at most once. There's too much code in Celeritas that uses the global `device`, so changing or resetting it is likely to break something, as it did with the streams here. The `accel` setup code now uses `activate_device_local`.

After fixing the default stream, 16-core performance improved by about 20%.
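The "at most once" contract might look like this (a minimal standalone sketch of the described behavior, not the literal Celeritas implementation):

```cpp
#include <cassert>
#include <utility>

// Stand-in for celeritas::Device; the real class also owns the streams.
struct Device
{
    int id{-1};
    explicit operator bool() const { return id >= 0; }
};

Device& global_device()
{
    static Device d;
    return d;
}

// May be called exactly once, so later threads cannot clobber the global
// device (and destroy its streams) the way the accel worker threads did.
void activate_device(Device&& device)
{
    assert(!global_device() && "activate_device may be called at most once");
    global_device() = std::move(device);
    // ... cudaSetDevice, stream creation, etc. would follow here
}

// Worker threads instead bind themselves to the existing global device.
void activate_device_local()
{
    assert(global_device());
    // cudaSetDevice(global_device().id);  // per-thread binding only
}
```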