
Optimize packing and unpacking particles to pinned memory buffers. #308

Merged: 4 commits merged into development from optimize_pack_unpack on Jan 13, 2021

Conversation

atmyers (Contributor) commented on Jan 12, 2021:

This PR makes two changes that speed up packing and unpacking of the particle data. First, when a transposition from SoA to AoS happens, it is done in shared memory, which is better than doing unordered reads/writes to global memory. Second, when doing a threaded copy from shared to global memory, we make sure adjacent threads access memory addresses that are 8 bytes apart, so the global-memory accesses are coalesced.
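
To make the two ideas concrete, here is a minimal, generic CUDA sketch (this is not the code in this PR; NReal, np_per_block, and all other names are illustrative assumptions about the particle layout):

#include <cuda_runtime.h>

constexpr int NReal = 7;          // assumed number of real components per particle
constexpr int np_per_block = 128; // particles staged per thread block

// soa[c][i] is component c of particle i; buffer receives the packed AoS data,
// NReal doubles per particle. Launched with np_per_block threads per block.
__global__ void pack_particles (const double* const* soa, double* buffer, int np)
{
    __shared__ double stage[np_per_block * NReal];

    const int base = blockIdx.x * np_per_block;
    const int ip   = base + threadIdx.x;

    // (1) Transpose SoA -> AoS in shared memory: the scattered writes happen
    //     in fast shared memory instead of global memory.
    if (ip < np) {
        for (int c = 0; c < NReal; ++c) {
            stage[threadIdx.x * NReal + c] = soa[c][ip];
        }
    }
    __syncthreads();

    // (2) Copy the staged block to the output buffer: thread t writes words
    //     t, t + blockDim.x, ...; adjacent threads touch addresses 8 bytes
    //     apart, so the global-memory writes are coalesced.
    const int nwords = min(np - base, np_per_block) * NReal;
    for (int w = threadIdx.x; w < nwords; w += blockDim.x) {
        buffer[base * NReal + w] = stage[w];
    }
}

In this sketch, unpacking would do the reverse: a coalesced read from the buffer into shared memory, followed by the AoS -> SoA transpose out of shared memory.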

I have verified that this branch gives the expected error norm for rho on the 2-rank version of the linear wake test w/ GPUs.
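
As background on the pinned buffers named in the title: packing into page-locked (pinned) host memory is what typically allows the device-to-host copy to run asynchronously. A generic CUDA sketch of the idea, not taken from HiPACE (all names here are illustrative):

#include <cuda_runtime.h>
#include <cstddef>

// Allocate a page-locked (pinned) host buffer. Pinned memory lets
// cudaMemcpyAsync overlap the transfer with kernels and other copies;
// pageable memory would force an extra synchronous staging copy.
double* alloc_pinned_buffer (std::size_t n_doubles)
{
    double* ptr = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&ptr), n_doubles * sizeof(double));
    return ptr;
}

// Copy the packed device buffer into the pinned host buffer on a stream.
// The caller synchronizes the stream before using h_pinned (e.g. for MPI).
void copy_packed_to_host (const double* d_packed, double* h_pinned,
                          std::size_t n_doubles, cudaStream_t stream)
{
    cudaMemcpyAsync(h_pinned, d_packed, n_doubles * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);
}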

Without these changes, on the debugging_particle_communication branch:

TinyProfiler total time across processes [min...avg...max]: 18.43 ... 20.64 ... 22.05

--------------------------------------------------------------------------------------------
Name                                         NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------
Hipace::Wait()                                   10  2.087e-05      5.299      16.69  75.66%
DEBUGGING: loop over ptd.packParticleData         7          0      8.907      13.43  60.90%
Hipace::Notify()                                 10  3.942e-05      1.957      7.769  35.23%
DEBUGGING: Loop over ptd.unpackParticleData       7          0      3.685       5.02  22.76%
main()                                            1   0.003096      0.288      1.062   4.81%

With these changes:

TinyProfiler total time across processes [min...avg...max]: 1.241 ... 1.301 ... 1.34

--------------------------------------------------------------------------------------------
Name                                         NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
--------------------------------------------------------------------------------------------
Hipace::Wait()                                   10  2.068e-05     0.4756     0.7457  55.63%
Hipace::Notify()                                 10  2.544e-05     0.1646     0.5969  44.53%
AnyDST::CreatePlan()                              2      0.411     0.4182     0.4266  31.83%
DEBUGGING: Loop over ptd.unpackParticleData       7          0    0.06304    0.08475   6.32%
DEBUGGING: loop over ptd.packParticleData         7          0    0.05925    0.07905   5.90%
main()                                            1   0.002925    0.03015    0.05634   4.20%
PR checklist:
  • Small enough (< few 100s of lines), otherwise it should probably be split into smaller PRs
  • Tested (describe the tests in the PR description)
  • Runs on GPU (basic: the code compiles and run well with the new module)
  • Contains an automated test (checksum and/or comparison with theory)
  • Documented: all elements (classes and their members, functions, namespaces, etc.) are documented
  • Constified (All that can be const is const)
  • Code is clean (no unwanted comments)
  • Style and code conventions (see the bottom of https://github.com/Hi-PACE/hipace) are respected
  • Proper label and GitHub project, if applicable

atmyers requested a review from MaxThevenet on January 12, 2021 at 16:29.
MaxThevenet changed the title from "Optimize packing and unpacking particles to pinned memory buffers." to "[DO NOT MERGE] Optimize packing and unpacking particles to pinned memory buffers." on Jan 12, 2021.
atmyers changed the title back to "Optimize packing and unpacking particles to pinned memory buffers." on Jan 12, 2021.
MaxThevenet (Member) left a comment:

Looks great, thanks for this PR! Looking forward to seeing production tests. I added a few suggestions:

  1. Should we add a few const?
  2. I tried to add a few comments. Could you check they are accurate, and fix them if they aren't?

Thanks!

src/Hipace.cpp (outdated diff excerpt):
    amrex::ParallelFor(np, [=] AMREX_GPU_DEVICE (int i) noexcept
#ifdef AMREX_USE_GPU
    if (amrex::Gpu::inLaunchRegion()) {
        int np_per_block = 128;
Review comment (Member):
Could you give a hint of where this number comes from? Is this roughly the largest number of plasma particles that can fit in shared memory?

atmyers (Contributor, author) replied:
Roughly, I think 256 would also work. In principle we could try to tune this.
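
For context on how np_per_block relates to the shared-memory budget, here is a back-of-the-envelope check; the per-particle component counts and the ~48 KB per-block limit below are assumptions, not numbers taken from HiPACE:

#include <cstdio>
#include <initializer_list>

int main ()
{
    const double limit_kb = 48.0;        // typical static shared-memory limit per block on NVIDIA GPUs
    for (int np_per_block : {128, 256}) {
        for (int n_comp : {8, 16, 32}) { // assumed number of doubles staged per particle
            const double kb = np_per_block * n_comp * 8 / 1024.0;
            std::printf("np_per_block=%3d  n_comp=%2d  ->  %5.1f KB  %s\n",
                        np_per_block, n_comp, kb,
                        kb <= limit_kb ? "(fits)" : "(exceeds 48 KB)");
        }
    }
    return 0;
}

Under these assumptions both 128 and 256 particles per block fit for modest component counts, so the value looks more like a tuning knob (occupancy vs. shared-memory use) than a hard limit.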

Several additional review threads on src/Hipace.cpp were marked resolved.
atmyers and others added 2 commits January 12, 2021 12:19
Co-authored-by: MaxThevenet <maxence.thevenet@desy.de>
SeverinDiederichs (Member) commented:
I ran a scaling test with production run parameters (1024**3 grid points, 1 plasma particle per cell).
[Screenshot: scaling plot]
Additionally, the time spent packing and unpacking is now drastically shorter:

TinyProfiler total time across processes [min...avg...max]: 178.7 ... 178.7 ... 178.7

---------------------------------------------------------------------------------------------------
Name                                                NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
---------------------------------------------------------------------------------------------------
Hipace::Wait()                                         100  0.0001306      27.49      51.71  28.93%
FFTPoissonSolverDirichlet::SolvePoissonEquation()    64000      49.82      50.23      50.47  28.24%
UpdateForcePushParticles_PlasmaParticleContainer()   51200      14.77      21.89      38.81  21.71%
DepositCurrent_PlasmaParticleContainer()             25612      15.58      19.99      25.97  14.53%
Hipace::Evolve()                                         1      1.005      6.799      12.69   7.10%
Fields::ComputeRelBFieldError()                      25600      7.615      8.016      8.964   5.01%
Hipace::Notify()                                       100  0.0001339      2.892      7.127   3.99%
FillBoundary_nowait()                                76800      4.165      5.266      6.552   3.67%
DEBUGGING: Loop over ptd.unpackParticleData             87          0      2.738      3.956   2.21%
Fields::TransverseDerivative()                      102400      3.767      3.838       3.95   2.21%
DEBUGGING: Loop over ptd.packParticleData               87          0      3.037      3.905   2.18%
MultiFab::LinComb()                                  76800      3.327      3.376      3.467   1.94%
FabArray::setVal()                                   64507      3.293      3.346      3.411   1.91%
MultiFab::Subtract()                                 51200      2.239      2.277      2.347   1.31%
Hipace::PredictorCorrectorLoopToSolveBxBy()          12800      1.961      2.063      2.236   1.25%
Fields::SolveExmByAndEypBx()                         12800      1.967      2.003      2.071   1.16%
ResetPlasmaParticles()                               12900      1.808      1.908      2.004   1.12%

SeverinDiederichs added labels on Jan 13, 2021: GPU (Related to GPU acceleration), Parallelization (Longitudinal and transverse MPI decomposition), performance (optimization, benchmark, profiling, etc.).
MaxThevenet (Member) left a comment:

Awesome, thanks!

MaxThevenet merged commit 90f0ab3 into development on Jan 13, 2021.
MaxThevenet deleted the optimize_pack_unpack branch on January 13, 2021 at 16:58.