Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Discrete Sine Transformation #534

Merged
merged 10 commits into from
Jun 30, 2021

Conversation

AlexanderSinn
Copy link
Member

@AlexanderSinn AlexanderSinn commented Jun 16, 2021

This PR introduces an alternative GPU Discrete Sine Transformation for PoissonSolverDirichlet. Instead of R2C fft that is applied to an array 4 times the original problem size, a C2R fft is used that has about the same size as the problem, however in exchange some pre- and postprocessing is required. This makes the new dst much faster for large resolutions, but also slightly slower for low resolutions as more GPU kernels have to be launched.

Use hipace.use_small_dst = 1 to use the new dst.

old.txt
new.txt

This test is from a fp64 GPU simulation with n_cell = 2047 2047 4096. The new AnyDST::Execute() only takes about half the time compared to the old one while using 500MB less GPU memory. The new dst calculation is very accrued as can be seen by the relative B field error being almost the same as with the old dst.

new_dst_test

The E-Field result along z of another Simulation with 1023 1023 10000 Cells shows again that between the old and new DSTs there is no observable difference in fp64, while in fp32 a small deviation can be seen when zooming in, however both transformations are still about equally distant from the fp64 result.

Currently there seems to be a performance issue with ToComplex() where for low resolutions its first call in Execute() takes significantly more time than the second one.

  • Small enough (< few 100s of lines), otherwise it should probably be split into smaller PRs
  • Tested (describe the tests in the PR description)
  • Runs on GPU (basic: the code compiles and run well with the new module)
  • Contains an automated test (checksum and/or comparison with theory)
  • Documented: all elements (classes and their members, functions, namespaces, etc.) are documented
  • Constified (All that can be const is const)
  • Code is clean (no unwanted comments, )
  • Style and code conventions are respected at the bottom of https://github.com/Hi-PACE/hipace
  • Proper label and GitHub project, if applicable

@AlexanderSinn AlexanderSinn added component: fields About 3D fields and slices, field solvers etc. GPU Related to GPU acceleration labels Jun 16, 2021
Copy link
Member

@SeverinDiederichs SeverinDiederichs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR, well done! In my tests I get a speedup > 2 for the DST.

See minor comments below

@@ -1,13 +1,16 @@
#include "AnyDST.H"
#include "CuFFTUtils.H"
#include "utils/HipaceProfilerWrapper.H"
#include "utils/Constants.H"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "utils/Constants.H"

I think this may have been used at some point, but is not used in the final implementation. Please correct me, if I am wrong.

if ( result != CUFFT_SUCCESS ) {
amrex::Print() << " cufftplan failed! Error: " <<
CuFFTUtils::cufftErrorToString(result) << "\n";
amrex::ParmParse pp("fields");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
amrex::ParmParse pp("fields");
amrex::ParmParse pp("hipace");

I would prefer to parse for "hipace". We do this for other field related things like the bxby_solver as well and this would be the only parsing for fields. Maybe we can separate these at some point, but for convenience I would stick with hipace for now.

@@ -92,6 +92,10 @@ Modeling ion motion is not yet supported by the explicit solver
Which solver to use.
Possible values: ``predictor-corrector`` and ``explicit``.

* ``fields.use_small_dst`` (`bool`) optional (default `0`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* ``fields.use_small_dst`` (`bool`) optional (default `0`)
* ``hipace.use_small_dst`` (`bool`) optional (default `0`)

see comment at the parsing call

Copy link
Member

@MaxThevenet MaxThevenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks a lot for this PR! You will find a few minor comments below.
As discussed offline, it would be great if the code could propose a good default behaviour for parameter hipace.use_small_dst, so users don't have to exactly understand this point and still get the best performance. A solution would be to run a simulation with n=5 to n=12 and calculate the time spent per call to the Poisson solver with the following options:

  • box size nx = ny = 2^n or 2^n-1
  • single or double precision

and compare the small and large dst. Then you could store, for each configuration, the minimal number of grid points for which the small dst is faster, and use it to decide at runtime whether to use the small dst. Of course the user could still specify hipace.use_small_dst, which would override the code defaults. Of course we can add any relevant parameter to this list. Note that this should probably be in another PR, not in this one.

const int n_data, const int n_batch)
{
HIPACE_PROFILE("AnyDST::Transpose()");
constexpr int tile_dim = 32;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion to make it a bit safer, in case we need to play with this parameter in the future.

Suggested change
constexpr int tile_dim = 32;
constexpr int tile_dim_power = 5;
const int tile_dim = std::pow(2, tile_dim_power);

You would probably need to add

#include <cmath>

at the end of the includes at the beginning of this file.

Copy link
Member Author

@AlexanderSinn AlexanderSinn Jun 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at Compiler Explorer, I found that with constexpr int i = 32; num>>5 and num/i are exactly equivalent for gcc, clang and nvcc, as long as num is an unsigned int.
With default settings, gcc was also able to do the same thing with const int i = std::pow(2,5);, but not clang which would actually call std::pow.
I think it’s best to replace the >>5 with a divide.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

src/fields/fft_poisson_solver/fft/WrapCuDST.cpp Outdated Show resolved Hide resolved
src/fields/fft_poisson_solver/fft/AnyDST.H Outdated Show resolved Hide resolved
Copy link
Member

@MaxThevenet MaxThevenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me, thanks for this great PR!

@MaxThevenet MaxThevenet merged commit 768b098 into Hi-PACE:development Jun 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: fields About 3D fields and slices, field solvers etc. GPU Related to GPU acceleration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants