Optimize packing and unpacking particles to pinned memory buffers. #308
Conversation
Looks great, thanks for this PR! Looking forward to seeing production tests. I added a few suggestions:
- Should we add a few `const`?
- I tried to add a few comments. Could you check they are accurate, and fix them if they aren't?

Thanks!
src/Hipace.cpp (outdated)

```cpp
amrex::ParallelFor(np, [=] AMREX_GPU_DEVICE (int i) noexcept
#ifdef AMREX_USE_GPU
if (amrex::Gpu::inLaunchRegion()) {
    int np_per_block = 128;
```
Could you give a hint of where this number comes from? Is this roughly the largest number of plasma particles that can fit in shared memory?
Roughly, I think 256 would also work. In principle we could try to tune this.
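(As a rough sanity check, with illustrative numbers not taken from this thread: staging, say, 7 double-precision components per particle costs 128 × 7 × 8 B ≈ 7 KiB of shared memory per block at `np_per_block = 128`, and ≈ 14 KiB at 256, both comfortably under the 48 KiB typically available per block on NVIDIA GPUs, so either value fits and the choice between them is a tuning question.)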
Co-authored-by: MaxThevenet <maxence.thevenet@desy.de>
Awesome, thanks!
This PR makes two changes that increase the speed of packing and unpacking the particle data. First, when a transposition from SoA to AoS happens, it is done in shared memory, which is better than doing unordered reads/writes to global memory. Second, when doing a threaded copy from shared to global memory, we make sure each thread accesses memory addresses that are 8 bytes apart.
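As a rough illustration of that staging pattern, here is a minimal standalone CUDA sketch. This is not the PR's actual kernel; the names `pack_soa_to_aos`, `NREAL`, and `NP_PER_BLOCK`, and the buffer layout are assumptions made for the example:

```cpp
// Hypothetical sketch: stage an SoA -> AoS transpose through shared memory
// so that both the global-memory reads and the global-memory writes are
// coalesced. NREAL, NP_PER_BLOCK, and the buffer layout are illustrative.
#include <cuda_runtime.h>

constexpr int NREAL = 7;          // assumed number of double components per particle
constexpr int NP_PER_BLOCK = 128; // particles staged per block, as in the PR

__global__ void pack_soa_to_aos(const double* const* soa, // soa[c][ip]: component c of particle ip
                                double* aos,              // aos[ip*NREAL + c]: packed output buffer
                                int np)
{
    __shared__ double tile[NP_PER_BLOCK * NREAL];
    const int block_begin = blockIdx.x * NP_PER_BLOCK;

    // Phase 1: coalesced reads. Consecutive threads read consecutive
    // particles from each component array, i.e. addresses 8 bytes apart.
    // The transpose lands in shared memory, where the stride-NREAL writes
    // are cheap compared to uncoalesced global traffic.
    const int ip = block_begin + threadIdx.x;
    if (ip < np) {
        for (int c = 0; c < NREAL; ++c) {
            tile[threadIdx.x * NREAL + c] = soa[c][ip];
        }
    }
    __syncthreads();

    // Phase 2: coalesced writes. Consecutive threads write consecutive
    // doubles of the already-transposed tile to the AoS buffer.
    const int tile_vals = min(NP_PER_BLOCK, np - block_begin) * NREAL;
    for (int k = threadIdx.x; k < tile_vals; k += blockDim.x) {
        aos[block_begin * NREAL + k] = tile[k];
    }
}
```

Launched as `pack_soa_to_aos<<<(np + NP_PER_BLOCK - 1) / NP_PER_BLOCK, NP_PER_BLOCK>>>(soa, aos, np)`, both global-memory phases stay coalesced: phase 1 reads each component array with consecutive threads touching consecutive particles, and phase 2 streams the transposed tile out linearly, with the reordering cost absorbed by shared memory.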
I have verified that this branch gives the expected error norm for `rho` on the 2-rank version of the linear wake test with GPUs.

Without these changes, on the `debugging_particle_communication` branch: [profiler output]

With these changes: [profiler output]