-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New Discrete Sine Transformation #534
New Discrete Sine Transformation #534
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great PR, well done! In my tests I get a speedup > 2 for the DST.
See minor comments below
@@ -1,13 +1,16 @@ | |||
#include "AnyDST.H" | |||
#include "CuFFTUtils.H" | |||
#include "utils/HipaceProfilerWrapper.H" | |||
#include "utils/Constants.H" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include "utils/Constants.H" |
I think this may have been used at some point, but is not used in the final implementation. Please correct me, if I am wrong.
if ( result != CUFFT_SUCCESS ) { | ||
amrex::Print() << " cufftplan failed! Error: " << | ||
CuFFTUtils::cufftErrorToString(result) << "\n"; | ||
amrex::ParmParse pp("fields"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amrex::ParmParse pp("fields"); | |
amrex::ParmParse pp("hipace"); |
I would prefer to parse for "hipace"
. We do this for other field related things like the bxby_solver
as well and this would be the only parsing for fields
. Maybe we can separate these at some point, but for convenience I would stick with hipace
for now.
docs/source/run/parameters.rst
Outdated
@@ -92,6 +92,10 @@ Modeling ion motion is not yet supported by the explicit solver | |||
Which solver to use. | |||
Possible values: ``predictor-corrector`` and ``explicit``. | |||
|
|||
* ``fields.use_small_dst`` (`bool`) optional (default `0`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* ``fields.use_small_dst`` (`bool`) optional (default `0`) | |
* ``hipace.use_small_dst`` (`bool`) optional (default `0`) |
see comment at the parsing call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks a lot for this PR! You will find a few minor comments below.
As discussed offline, it would be great if the code could propose a good default behaviour for parameter hipace.use_small_dst
, so users don't have to exactly understand this point and still get the best performance. A solution would be to run a simulation with n=5
to n=12
and calculate the time spent per call to the Poisson solver with the following options:
- box size
nx = ny = 2^n
or2^n-1
- single or double precision
and compare the small and large dst. Then you could store, for each configuration, the minimal number of grid points for which the small dst is faster, and use it to decide at runtime whether to use the small dst. Of course the user could still specify hipace.use_small_dst
, which would override the code defaults. Of course we can add any relevant parameter to this list. Note that this should probably be in another PR, not in this one.
const int n_data, const int n_batch) | ||
{ | ||
HIPACE_PROFILE("AnyDST::Transpose()"); | ||
constexpr int tile_dim = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a suggestion to make it a bit safer, in case we need to play with this parameter in the future.
constexpr int tile_dim = 32; | |
constexpr int tile_dim_power = 5; | |
const int tile_dim = std::pow(2, tile_dim_power); |
You would probably need to add
#include <cmath>
at the end of the includes at the beginning of this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at Compiler Explorer, I found that with constexpr int i = 32;
num>>5
and num/i
are exactly equivalent for gcc, clang and nvcc, as long as num
is an unsigned int
.
With default settings, gcc was also able to do the same thing with const int i = std::pow(2,5);
, but not clang which would actually call std::pow.
I think it’s best to replace the >>5
with a divide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me, thanks for this great PR!
This PR introduces an alternative GPU Discrete Sine Transformation for PoissonSolverDirichlet. Instead of R2C fft that is applied to an array 4 times the original problem size, a C2R fft is used that has about the same size as the problem, however in exchange some pre- and postprocessing is required. This makes the new dst much faster for large resolutions, but also slightly slower for low resolutions as more GPU kernels have to be launched.
Use
hipace.use_small_dst = 1
to use the new dst.old.txt
new.txt
This test is from a fp64 GPU simulation with n_cell = 2047 2047 4096. The new AnyDST::Execute() only takes about half the time compared to the old one while using 500MB less GPU memory. The new dst calculation is very accrued as can be seen by the relative B field error being almost the same as with the old dst.
The E-Field result along z of another Simulation with 1023 1023 10000 Cells shows again that between the old and new DSTs there is no observable difference in fp64, while in fp32 a small deviation can be seen when zooming in, however both transformations are still about equally distant from the fp64 result.
Currently there seems to be a performance issue with ToComplex() where for low resolutions its first call in Execute() takes significantly more time than the second one.
const
isconst
)