Make CPU-GPU memory copy highly asynchronous #1082

Conversation

@AlexanderSinn (Member) commented on Mar 14, 2024

This PR adds the ability to make the CPU-to-GPU and GPU-to-CPU memory copies of the MPI communication data highly asynchronous, so that up to one slice of computation can run while the memory copy is in flight. This is achieved by switching the GPU stream, queuing the memory copy on it, and then switching back to the original stream. The new behavior is enabled with the following input parameters:

comms_buffer.on_gpu = false
comms_buffer.async_memcpy = true 
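
For illustration, below is a minimal sketch of the stream-switching idea using the AMReX GPU API. This is not the code from this PR; the helper function, buffer names, and the choice of stream index are placeholders, and the host buffer is assumed to be pinned memory so the copy can actually run asynchronously.

#include <AMReX_GpuDevice.H>
#include <cstddef>

// Hypothetical helper: queue a device-to-host copy of one slice of the
// communication buffer on a secondary stream, so that compute kernels
// launched afterwards on the default stream can overlap with it.
void queue_async_dtoh_copy (void* slice_buf_pinned,       // pinned host memory
                            void const* slice_buf_device, // device memory
                            std::size_t nbytes)
{
    // Switch to a secondary GPU stream so the copy does not serialize
    // with compute kernels on the default stream.
    amrex::Gpu::Device::setStreamIndex(1);

    // Queue the asynchronous device-to-host copy on that stream.
    amrex::Gpu::dtoh_memcpy_async(slice_buf_pinned, slice_buf_device, nbytes);

    // Switch back so subsequent kernels use the default stream again.
    amrex::Gpu::Device::setStreamIndex(0);
}

// Before the buffer is handed to MPI_Isend, the copy stream has to be
// synchronized, for example:
//     amrex::Gpu::Device::setStreamIndex(1);
//     amrex::Gpu::streamSynchronize();
//     amrex::Gpu::Device::setStreamIndex(0);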

Test:
Note that even with the async memcpy, the buffer-on-CPU version is not as fast as the buffer-on-GPU version. This is because the MPI data transfer between ranks on the same node is quite slow, with a profile suggesting only 5 GB/s. Furthermore, this transfer is performed inside the MPI_Isend call itself rather than afterwards or in MPI_Wait, resulting in blocking behavior. Of the three HPC platforms available to me, this one had the least broken CPU-CPU and GPU-GPU MPI data transfer.

Buffer on CPU, with async memcopy:

TinyProfiler total time across processes [min...avg...max]: 138.6 ... 140.6 ... 142.1

--------------------------------------------------------------------------------------------------
Name                                               NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------
MultiBuffer::put_data()                              2000      21.57      50.67         68  47.84%
MultiBuffer::get_data()                              2000    0.01044      18.55      51.87  36.49%
hpmg::MultiGrid::solve1()                            2000       14.7      14.96      15.34  10.79%
hpmg::MultiGrid::solve2()                            2000      13.94      14.18      14.35  10.10%
DepositCurrent_PlasmaParticleContainer()             2001      4.459      6.528      9.424   6.63%
AnyDST::Execute()                                   12000      9.129      9.141      9.163   6.45%
ExplicitDeposition()                                 2000      7.431      7.479      7.515   5.29%
AdvancePlasmaParticles()                             2000      5.102      5.116      5.124   3.60%

Buffer on CPU, no async memcopy:

TinyProfiler total time across processes [min...avg...max]: 186.4 ... 189.9 ... 191.6

--------------------------------------------------------------------------------------------------
Name                                               NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------
MultiBuffer::get_data()                              2000    0.01113      44.87      122.8  64.10%
MultiBuffer::put_data()                              2000      1.953      76.85      113.6  59.25%
hpmg::MultiGrid::solve1()                            2000      14.53      14.83      15.17   7.92%
hpmg::MultiGrid::solve2()                            2000      13.94      14.17      14.36   7.49%
AnyDST::Execute()                                   12000      9.102      9.127      9.159   4.78%
ExplicitDeposition()                                 2000      7.432      7.448      7.469   3.90%
AdvancePlasmaParticles()                             2000      5.103      5.108      5.114   2.67%
main()                                                  1   0.001078      1.221      4.661   2.43%

Buffer on GPU:

TinyProfiler total time across processes [min...avg...max]: 69.21 ... 69.42 ... 69.64

--------------------------------------------------------------------------------------------------
Name                                               NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------
hpmg::MultiGrid::solve1()                            2000      14.54      14.83      15.16  21.78%
hpmg::MultiGrid::solve2()                            2000      13.94      14.18      14.35  20.61%
AnyDST::Execute()                                   12000      9.165      9.188      9.224  13.25%
ExplicitDeposition()                                 2000      7.429       7.46      7.483  10.75%
AdvancePlasmaParticles()                             2000        5.1      5.104       5.11   7.34%
DepositCurrent_PlasmaParticleContainer()             2001      4.435      4.443      4.451   6.39%
MultiLaser::ShiftLaserSlices()                       2000      2.876      2.951       2.99   4.29%
MultiBuffer::get_data()                              2000   0.008478      1.718      2.792   4.01%


  • Small enough (< few 100s of lines), otherwise it should probably be split into smaller PRs
  • Tested (describe the tests in the PR description)
  • Runs on GPU (basic: the code compiles and runs well with the new module)
  • Contains an automated test (checksum and/or comparison with theory)
  • Documented: all elements (classes and their members, functions, namespaces, etc.) are documented
  • Constified (All that can be const is const)
  • Code is clean (no unwanted comments)
  • Style and code conventions (listed at the bottom of https://github.com/Hi-PACE/hipace) are respected
  • Proper label and GitHub project, if applicable

@AlexanderSinn added labels on Mar 14, 2024: GPU (Related to GPU acceleration), performance (optimization, benchmark, profiling, etc.), Parallelization (Longitudinal and transverse MPI decomposition)

@MaxThevenet (Member) left a comment:

Looks great, thanks for this PR!

@MaxThevenet merged commit 2c22a0f into Hi-PACE:development on Mar 20, 2024
10 checks passed