Make CPU-GPU memory copy highly asynchronous #1082

Conversation

@AlexanderSinn (Member) commented on Mar 14, 2024

This PR adds the ability to make the CPU-to-GPU and GPU-to-CPU memory copies of the MPI communication data highly asynchronous, so that up to one slice of computation can run while the memory copy is in flight. This is achieved by switching the GPU stream, queuing the memory copy on it, and then switching back to the original stream. The new behavior is enabled with the following input parameters:

comms_buffer.on_gpu = false
comms_buffer.async_memcpy = true 
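
For illustration, below is a minimal sketch of the stream-switching idea using the AMReX GPU API. This is not the code from this PR; the helper function, buffer names, and the choice of stream index are placeholders, and the host buffer is assumed to be pinned memory so the copy can actually run asynchronously.

#include <AMReX_GpuDevice.H>
#include <cstddef>

// Hypothetical helper: queue a device-to-host copy of one slice of the
// communication buffer on a secondary stream, so that compute kernels
// launched afterwards on the default stream can overlap with it.
void queue_async_dtoh_copy (void* slice_buf_pinned,       // pinned host memory
                            void const* slice_buf_device, // device memory
                            std::size_t nbytes)
{
    // Switch to a secondary GPU stream so the copy does not serialize
    // with compute kernels on the default stream.
    amrex::Gpu::Device::setStreamIndex(1);

    // Queue the asynchronous device-to-host copy on that stream.
    amrex::Gpu::dtoh_memcpy_async(slice_buf_pinned, slice_buf_device, nbytes);

    // Switch back so subsequent kernels use the default stream again.
    amrex::Gpu::Device::setStreamIndex(0);
}

// Before the buffer is handed to MPI_Isend, the copy stream has to be
// synchronized, for example:
//     amrex::Gpu::Device::setStreamIndex(1);
//     amrex::Gpu::streamSynchronize();
//     amrex::Gpu::Device::setStreamIndex(0);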

Test:
Note that even with the async memcpy, the buffer-on-CPU version is not as fast as the buffer-on-GPU version. This is because the MPI data transfer between ranks on the same node is quite slow, with a profile suggesting only 5 GB/s. Furthermore, this transfer is performed inside the MPI_Isend call itself rather than afterwards or in MPI_Wait, resulting in blocking behavior. Of the three HPC platforms available to me, this one had the least broken CPU-CPU and GPU-GPU MPI data transfer.

Buffer on CPU, with async memcopy:

TinyProfiler total time across processes [min...avg...max]: 138.6 ... 140.6 ... 142.1

--------------------------------------------------------------------------------------------------
Name                                               NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------
MultiBuffer::put_data()                              2000      21.57      50.67         68  47.84%
MultiBuffer::get_data()                              2000    0.01044      18.55      51.87  36.49%
hpmg::MultiGrid::solve1()                            2000       14.7      14.96      15.34  10.79%
hpmg::MultiGrid::solve2()                            2000      13.94      14.18      14.35  10.10%
DepositCurrent_PlasmaParticleContainer()             2001      4.459      6.528      9.424   6.63%
AnyDST::Execute()                                   12000      9.129      9.141      9.163   6.45%
ExplicitDeposition()                                 2000      7.431      7.479      7.515   5.29%
AdvancePlasmaParticles()                             2000      5.102      5.116      5.124   3.60%

Buffer on CPU, no async memcopy:

TinyProfiler total time across processes [min...avg...max]: 186.4 ... 189.9 ... 191.6

--------------------------------------------------------------------------------------------------
Name                                               NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------
MultiBuffer::get_data()                              2000    0.01113      44.87      122.8  64.10%
MultiBuffer::put_data()                              2000      1.953      76.85      113.6  59.25%
hpmg::MultiGrid::solve1()                            2000      14.53      14.83      15.17   7.92%
hpmg::MultiGrid::solve2()                            2000      13.94      14.17      14.36   7.49%
AnyDST::Execute()                                   12000      9.102      9.127      9.159   4.78%
ExplicitDeposition()                                 2000      7.432      7.448      7.469   3.90%
AdvancePlasmaParticles()                             2000      5.103      5.108      5.114   2.67%
main()                                                  1   0.001078      1.221      4.661   2.43%

Buffer on GPU:

TinyProfiler total time across processes [min...avg...max]: 69.21 ... 69.42 ... 69.64

--------------------------------------------------------------------------------------------------
Name                                               NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------------
hpmg::MultiGrid::solve1()                            2000      14.54      14.83      15.16  21.78%
hpmg::MultiGrid::solve2()                            2000      13.94      14.18      14.35  20.61%
AnyDST::Execute()                                   12000      9.165      9.188      9.224  13.25%
ExplicitDeposition()                                 2000      7.429       7.46      7.483  10.75%
AdvancePlasmaParticles()                             2000        5.1      5.104       5.11   7.34%
DepositCurrent_PlasmaParticleContainer()             2001      4.435      4.443      4.451   6.39%
MultiLaser::ShiftLaserSlices()                       2000      2.876      2.951       2.99   4.29%
MultiBuffer::get_data()                              2000   0.008478      1.718      2.792   4.01%


  • Small enough (< few 100s of lines), otherwise it should probably be split into smaller PRs
  • Tested (describe the tests in the PR description)
  • Runs on GPU (basic: the code compiles and runs well with the new module)
  • Contains an automated test (checksum and/or comparison with theory)
  • Documented: all elements (classes and their members, functions, namespaces, etc.) are documented
  • Constified (All that can be const is const)
  • Code is clean (no unwanted comments)
  • Style and code conventions (listed at the bottom of https://github.com/Hi-PACE/hipace) are respected
  • Proper label and GitHub project, if applicable

@AlexanderSinn added labels on Mar 14, 2024: GPU (Related to GPU acceleration), performance (optimization, benchmark, profiling, etc.), Parallelization (Longitudinal and transverse MPI decomposition)

@MaxThevenet (Member) left a comment:

Looks great, thanks for this PR!

@MaxThevenet merged commit 2c22a0f into Hi-PACE:development on Mar 20, 2024
10 checks passed