
Optimize packing and unpacking particles to pinned memory buffers. #308

Merged: 4 commits merged into development from optimize_pack_unpack on Jan 13, 2021

Conversation

atmyers (Contributor) commented on Jan 12, 2021:

This PR makes two changes that speed up packing and unpacking of the particle data. First, when a transposition from SoA to AoS happens, it is done in shared memory, which is better than doing unordered reads/writes to global memory. Second, when doing a threaded copy from shared to global memory, we make sure adjacent threads access memory addresses that are 8 bytes apart, so the global-memory accesses are coalesced.
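
To make the two ideas concrete, here is a minimal, generic CUDA sketch (this is not the code in this PR; NReal, np_per_block, and all other names are illustrative assumptions about the particle layout):

#include <cuda_runtime.h>

constexpr int NReal = 7;          // assumed number of real components per particle
constexpr int np_per_block = 128; // particles staged per thread block

// soa[c][i] is component c of particle i; buffer receives the packed AoS data,
// NReal doubles per particle. Launched with np_per_block threads per block.
__global__ void pack_particles (const double* const* soa, double* buffer, int np)
{
    __shared__ double stage[np_per_block * NReal];

    const int base = blockIdx.x * np_per_block;
    const int ip   = base + threadIdx.x;

    // (1) Transpose SoA -> AoS in shared memory: the scattered writes happen
    //     in fast shared memory instead of global memory.
    if (ip < np) {
        for (int c = 0; c < NReal; ++c) {
            stage[threadIdx.x * NReal + c] = soa[c][ip];
        }
    }
    __syncthreads();

    // (2) Copy the staged block to the output buffer: thread t writes words
    //     t, t + blockDim.x, ...; adjacent threads touch addresses 8 bytes
    //     apart, so the global-memory writes are coalesced.
    const int nwords = min(np - base, np_per_block) * NReal;
    for (int w = threadIdx.x; w < nwords; w += blockDim.x) {
        buffer[base * NReal + w] = stage[w];
    }
}

In this sketch, unpacking would do the reverse: a coalesced read from the buffer into shared memory, followed by the AoS -> SoA transpose out of shared memory.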

I have verified that this branch gives the expected error norm for rho on the 2-rank version of the linear wake test w/ GPUs.
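
As background on the pinned buffers named in the title: packing into page-locked (pinned) host memory is what typically allows the device-to-host copy to run asynchronously. A generic CUDA sketch of the idea, not taken from HiPACE (all names here are illustrative):

#include <cuda_runtime.h>
#include <cstddef>

// Allocate a page-locked (pinned) host buffer. Pinned memory lets
// cudaMemcpyAsync overlap the transfer with kernels and other copies;
// pageable memory would force an extra synchronous staging copy.
double* alloc_pinned_buffer (std::size_t n_doubles)
{
    double* ptr = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&ptr), n_doubles * sizeof(double));
    return ptr;
}

// Copy the packed device buffer into the pinned host buffer on a stream.
// The caller synchronizes the stream before using h_pinned (e.g. for MPI).
void copy_packed_to_host (const double* d_packed, double* h_pinned,
                          std::size_t n_doubles, cudaStream_t stream)
{
    cudaMemcpyAsync(h_pinned, d_packed, n_doubles * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);
}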

Without these changes, on the debugging_particle_communication branch:

TinyProfiler total time across processes [min...avg...max]: 18.43 ... 20.64 ... 22.05

--------------------------------------------------------------------------------------------
Name                                         NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------
Hipace::Wait()                                   10  2.087e-05      5.299      16.69  75.66%
DEBUGGING: loop over ptd.packParticleData         7          0      8.907      13.43  60.90%
Hipace::Notify()                                 10  3.942e-05      1.957      7.769  35.23%
DEBUGGING: Loop over ptd.unpackParticleData       7          0      3.685       5.02  22.76%
main()                                            1   0.003096      0.288      1.062   4.81%

With these changes:

TinyProfiler total time across processes [min...avg...max]: 1.241 ... 1.301 ... 1.34

--------------------------------------------------------------------------------------------
Name                                         NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------
Hipace::Wait()                                   10  2.068e-05     0.4756     0.7457  55.63%
Hipace::Notify()                                 10  2.544e-05     0.1646     0.5969  44.53%
AnyDST::CreatePlan()                              2      0.411     0.4182     0.4266  31.83%
DEBUGGING: Loop over ptd.unpackParticleData       7          0    0.06304    0.08475   6.32%
DEBUGGING: loop over ptd.packParticleData         7          0    0.05925    0.07905   5.90%
main()                                            1   0.002925    0.03015    0.05634   4.20%
PR checklist:
  • Small enough (< few 100s of lines), otherwise it should probably be split into smaller PRs
  • Tested (describe the tests in the PR description)
  • Runs on GPU (basic: the code compiles and run well with the new module)
  • Contains an automated test (checksum and/or comparison with theory)
  • Documented: all elements (classes and their members, functions, namespaces, etc.) are documented
  • Constified (All that can be const is const)
  • Code is clean (no unwanted comments)
  • Style and code conventions (see the bottom of https://github.com/Hi-PACE/hipace) are respected
  • Proper label and GitHub project, if applicable

atmyers requested a review from MaxThevenet on January 12, 2021 at 16:29.
MaxThevenet changed the title from "Optimize packing and unpacking particles to pinned memory buffers." to "[DO NOT MERGE] Optimize packing and unpacking particles to pinned memory buffers." on Jan 12, 2021.
atmyers changed the title back to "Optimize packing and unpacking particles to pinned memory buffers." on Jan 12, 2021.
MaxThevenet (Member) left a comment:

Looks great, thanks for this PR! Looking forward to seeing production tests. I added a few suggestions:

  1. Should we add a few const?
  2. I tried to add a few comments. Could you check they are accurate, and fix them if they aren't?

Thanks!

src/Hipace.cpp (outdated diff excerpt):
    amrex::ParallelFor(np, [=] AMREX_GPU_DEVICE (int i) noexcept
#ifdef AMREX_USE_GPU
    if (amrex::Gpu::inLaunchRegion()) {
        int np_per_block = 128;
Review comment (Member):
Could you give a hint of where this number comes from? Is this roughly the largest number of plasma particles that can fit in shared memory?

atmyers (Contributor, author) replied:
Roughly, I think 256 would also work. In principle we could try to tune this.
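
For context on how np_per_block relates to the shared-memory budget, here is a back-of-the-envelope check; the per-particle component counts and the ~48 KB per-block limit below are assumptions, not numbers taken from HiPACE:

#include <cstdio>
#include <initializer_list>

int main ()
{
    const double limit_kb = 48.0;        // typical static shared-memory limit per block on NVIDIA GPUs
    for (int np_per_block : {128, 256}) {
        for (int n_comp : {8, 16, 32}) { // assumed number of doubles staged per particle
            const double kb = np_per_block * n_comp * 8 / 1024.0;
            std::printf("np_per_block=%3d  n_comp=%2d  ->  %5.1f KB  %s\n",
                        np_per_block, n_comp, kb,
                        kb <= limit_kb ? "(fits)" : "(exceeds 48 KB)");
        }
    }
    return 0;
}

Under these assumptions both 128 and 256 particles per block fit for modest component counts, so the value looks more like a tuning knob (occupancy vs. shared-memory use) than a hard limit.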

Several additional review threads on src/Hipace.cpp were marked resolved.
atmyers and others added 2 commits January 12, 2021 12:19
Co-authored-by: MaxThevenet <maxence.thevenet@desy.de>
SeverinDiederichs (Member) commented:
I ran a scaling test with production run parameters (1024**3 grid points, 1 plasma particle per cell).
[Screenshot: scaling plot]
Additionally, the time spent packing and unpacking is now drastically shorter:

TinyProfiler total time across processes [min...avg...max]: 178.7 ... 178.7 ... 178.7

---------------------------------------------------------------------------------------------------
Name                                                NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
---------------------------------------------------------------------------------------------------
Hipace::Wait()                                         100  0.0001306      27.49      51.71  28.93%
FFTPoissonSolverDirichlet::SolvePoissonEquation()    64000      49.82      50.23      50.47  28.24%
UpdateForcePushParticles_PlasmaParticleContainer()   51200      14.77      21.89      38.81  21.71%
DepositCurrent_PlasmaParticleContainer()             25612      15.58      19.99      25.97  14.53%
Hipace::Evolve()                                         1      1.005      6.799      12.69   7.10%
Fields::ComputeRelBFieldError()                      25600      7.615      8.016      8.964   5.01%
Hipace::Notify()                                       100  0.0001339      2.892      7.127   3.99%
FillBoundary_nowait()                                76800      4.165      5.266      6.552   3.67%
DEBUGGING: Loop over ptd.unpackParticleData             87          0      2.738      3.956   2.21%
Fields::TransverseDerivative()                      102400      3.767      3.838       3.95   2.21%
DEBUGGING: Loop over ptd.packParticleData               87          0      3.037      3.905   2.18%
MultiFab::LinComb()                                  76800      3.327      3.376      3.467   1.94%
FabArray::setVal()                                   64507      3.293      3.346      3.411   1.91%
MultiFab::Subtract()                                 51200      2.239      2.277      2.347   1.31%
Hipace::PredictorCorrectorLoopToSolveBxBy()          12800      1.961      2.063      2.236   1.25%
Fields::SolveExmByAndEypBx()                         12800      1.967      2.003      2.071   1.16%
ResetPlasmaParticles()                               12900      1.808      1.908      2.004   1.12%

SeverinDiederichs added labels on Jan 13, 2021: GPU (Related to GPU acceleration), Parallelization (Longitudinal and transverse MPI decomposition), performance (optimization, benchmark, profiling, etc.).
MaxThevenet (Member) left a comment:

Awesome, thanks!

MaxThevenet merged commit 90f0ab3 into development on Jan 13, 2021.
MaxThevenet deleted the optimize_pack_unpack branch on January 13, 2021 at 16:58.