Add OpenMP for FFTW #541

Merged

Conversation

@SeverinDiederichs (Member) commented Jun 25, 2021

Add a heuristic that also works with PkgConfig to query OpenMP support in FFTW. Enable it by default if we build with the OpenMP compute backend, unless explicitly disabled.

Add a macro (HIPACE_FFTW_OMP) to control the source code, since FFTW does not offer a public define for this.
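
For context, a minimal sketch of how such a macro guard around the FFTW threading setup can look (the helper name InitFFTWThreads and the single-precision switch AMREX_USE_FLOAT are assumptions for illustration; the actual code in this PR may differ, see the excerpt quoted in the review below):

#include <fftw3.h>
#if defined(HIPACE_FFTW_OMP)
#   include <omp.h>
#endif

// Initialize FFTW's OpenMP threading only if the build detected support for it.
void InitFFTWThreads ()
{
#if defined(HIPACE_FFTW_OMP)
#   if defined(AMREX_USE_FLOAT)
    fftwf_init_threads();                               // single precision
    fftwf_plan_with_nthreads(omp_get_max_threads());
#   else
    fftw_init_threads();                                // double precision
    fftw_plan_with_nthreads(omp_get_max_threads());
#   endif
#endif
}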

Thanks to @ax3l, FFTW with OpenMP now works.
On my laptop, running the beam_in_vacuum example without I/O and without a beam at amr.n_cell = 1024 1024 50, I get the following run times:

OMP_NUM_THREADS=1
AnyDST::Execute()                                     500      29.68      29.68      29.68  80.17%
OMP_NUM_THREADS=2
AnyDST::Execute()                                     500      14.86      14.86      14.86  68.45%
OMP_NUM_THREADS=3
AnyDST::Execute()                                     500      10.05      10.05      10.05  59.75%
OMP_NUM_THREADS=4
AnyDST::Execute()                                     500      7.707      7.707      7.707  53.07%
OMP_NUM_THREADS=5
AnyDST::Execute()                                     500       6.23       6.23       6.23  47.70%
OMP_NUM_THREADS=6
AnyDST::Execute()                                     500      6.097      6.097      6.097  46.36%

This implementation seems to give the correct results. Using 2 threads, all tests pass locally:

      Start  1: blowout_wake.2Rank
 1/20 Test  #1: blowout_wake.2Rank ......................   Passed    6.49 sec
      Start  2: ionization.2Rank
 2/20 Test  #2: ionization.2Rank ........................   Passed    2.40 sec
      Start  3: from_file.normalized.1Rank
 3/20 Test  #3: from_file.normalized.1Rank ..............   Passed    3.12 sec
      Start  4: from_file.SI.1Rank
 4/20 Test  #4: from_file.SI.1Rank ......................   Passed    3.00 sec
      Start  5: restart.normalized.1Rank
 5/20 Test  #5: restart.normalized.1Rank ................   Passed    0.88 sec
      Start  6: blowout_wake_explicit.2Rank
 6/20 Test  #6: blowout_wake_explicit.2Rank .............   Passed    2.52 sec
      Start  7: beam_evolution.1Rank
 7/20 Test  #7: beam_evolution.1Rank ....................   Passed    3.04 sec
      Start  8: adaptive_time_step.1Rank
 8/20 Test  #8: adaptive_time_step.1Rank ................   Passed    5.32 sec
      Start  9: grid_current.1Rank
 9/20 Test  #9: grid_current.1Rank ......................   Passed    1.82 sec
      Start 10: linear_wake.normalized.1Rank
10/20 Test #10: linear_wake.normalized.1Rank ............   Passed    2.92 sec
      Start 11: linear_wake.SI.1Rank
11/20 Test #11: linear_wake.SI.1Rank ....................   Passed    2.91 sec
      Start 12: gaussian_linear_wake.normalized.1Rank
12/20 Test #12: gaussian_linear_wake.normalized.1Rank ...   Passed    3.00 sec
      Start 13: gaussian_linear_wake.SI.1Rank
13/20 Test #13: gaussian_linear_wake.SI.1Rank ...........   Passed    3.12 sec
      Start 14: reset.2Rank
14/20 Test #14: reset.2Rank .............................   Passed    3.08 sec
      Start 15: beam_in_vacuum.SI.1Rank
15/20 Test #15: beam_in_vacuum.SI.1Rank .................   Passed    5.50 sec
      Start 16: beam_in_vacuum.normalized.1Rank
16/20 Test #16: beam_in_vacuum.normalized.1Rank .........   Passed    4.94 sec
      Start 17: next_deposition_beam.2Rank
17/20 Test #17: next_deposition_beam.2Rank ..............   Passed   27.33 sec
      Start 18: slice_IO.1Rank
18/20 Test #18: slice_IO.1Rank ..........................   Passed    3.15 sec
      Start 19: gaussian_weight.1Rank
19/20 Test #19: gaussian_weight.1Rank ...................   Passed    5.00 sec
      Start 20: beam_in_vacuum.normalized.2Rank
20/20 Test #20: beam_in_vacuum.normalized.2Rank .........   Passed    5.35 sec

100% tests passed, 0 tests failed out of 20

For comparison, here are the run times on the development branch on my laptop:

      Start  1: blowout_wake.2Rank
1/20 Test  #1: blowout_wake.2Rank ......................   Passed    6.95 sec
     Start  2: ionization.2Rank
2/20 Test  #2: ionization.2Rank ........................   Passed    2.61 sec
     Start  3: from_file.normalized.1Rank
3/20 Test  #3: from_file.normalized.1Rank ..............   Passed    4.15 sec
     Start  4: from_file.SI.1Rank
4/20 Test  #4: from_file.SI.1Rank ......................   Passed    3.65 sec
     Start  5: restart.normalized.1Rank
5/20 Test  #5: restart.normalized.1Rank ................   Passed    1.28 sec
     Start  6: blowout_wake_explicit.2Rank
6/20 Test  #6: blowout_wake_explicit.2Rank .............   Passed    2.71 sec
     Start  7: beam_evolution.1Rank
7/20 Test  #7: beam_evolution.1Rank ....................   Passed    2.88 sec
     Start  8: adaptive_time_step.1Rank
8/20 Test  #8: adaptive_time_step.1Rank ................   Passed    5.30 sec
     Start  9: grid_current.1Rank
9/20 Test  #9: grid_current.1Rank ......................   Passed    1.95 sec
     Start 10: linear_wake.normalized.1Rank
10/20 Test #10: linear_wake.normalized.1Rank ............   Passed    3.10 sec
     Start 11: linear_wake.SI.1Rank
11/20 Test #11: linear_wake.SI.1Rank ....................   Passed    4.18 sec
     Start 12: gaussian_linear_wake.normalized.1Rank
12/20 Test #12: gaussian_linear_wake.normalized.1Rank ...   Passed    3.83 sec
     Start 13: gaussian_linear_wake.SI.1Rank
13/20 Test #13: gaussian_linear_wake.SI.1Rank ...........   Passed    3.03 sec
     Start 14: reset.2Rank
14/20 Test #14: reset.2Rank .............................   Passed    3.12 sec
     Start 15: beam_in_vacuum.SI.1Rank
15/20 Test #15: beam_in_vacuum.SI.1Rank .................   Passed    6.73 sec
     Start 16: beam_in_vacuum.normalized.1Rank
16/20 Test #16: beam_in_vacuum.normalized.1Rank .........   Passed    5.32 sec
     Start 17: next_deposition_beam.2Rank
17/20 Test #17: next_deposition_beam.2Rank ..............   Passed    6.73 sec
     Start 18: slice_IO.1Rank
18/20 Test #18: slice_IO.1Rank ..........................   Passed    3.58 sec
     Start 19: gaussian_weight.1Rank
19/20 Test #19: gaussian_weight.1Rank ...................   Passed    6.52 sec
     Start 20: beam_in_vacuum.normalized.2Rank
20/20 Test #20: beam_in_vacuum.normalized.2Rank .........   Passed   10.55 sec

The next_deposition_beam.2Rank test takes much longer than when running serially. The reason lies in its extremely low resolution of 16 transverse grid points: obviously, it does not make any sense to use more than 1 CPU thread for the FFT there. With 32 grid points, 2 threads are still slightly slower; increasing the number of grid points to 64 yields a speedup with more threads again. Therefore, a threshold of nx > 32 && ny > 32 was added; if it is not met, the FFT is executed with a single thread.
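
A minimal sketch of this threshold (the helper name FFTWThreads is made up for illustration; the actual code in this PR may differ):

#include <omp.h>

// Number of threads to plan the FFT with: use all available OpenMP threads
// only if the transverse grid is larger than 32 x 32 cells, otherwise 1.
int FFTWThreads (int nx, int ny)
{
    return (nx > 32 && ny > 32) ? omp_get_max_threads() : 1;
}

// e.g. used as: fftw_plan_with_nthreads(FFTWThreads(nx, ny));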

  • Small enough (< few 100s of lines), otherwise it should probably be split into smaller PRs
  • Tested (describe the tests in the PR description)
  • Runs on GPU (basic: the code compiles and runs well with the new module)
  • Contains an automated test (checksum and/or comparison with theory)
  • Documented: all elements (classes and their members, functions, namespaces, etc.) are documented
  • Constified (All that can be const is const)
  • Code is clean (no unwanted comments)
  • Style and code conventions (see the bottom of https://github.com/Hi-PACE/hipace) are respected
  • Proper label and GitHub project, if applicable

@ax3l ax3l force-pushed the topic-openmp_fftw branch from a145e94 to 79a53cf on June 25, 2021 20:16
@ax3l ax3l changed the title [WIP] add OpenMP for FFTW Add OpenMP for FFTW Jun 25, 2021
@ax3l ax3l force-pushed the topic-openmp_fftw branch 3 times, most recently from 3f3fce4 to 8b4ba1c on June 25, 2021 20:20
@ax3l ax3l mentioned this pull request Jun 25, 2021
@ax3l ax3l force-pushed the topic-openmp_fftw branch from 8b4ba1c to 6762eb5 on June 25, 2021 20:37
@SeverinDiederichs SeverinDiederichs requested a review from ax3l June 25, 2021 21:57
@ax3l (Member) commented Jun 28, 2021

Just documenting here:

For best results, use close-by pinning, especially with MPI, and avoid oversubscription of cores, especially if no hyperthreading is available:

export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=1 # 1,2,4,...

@ax3l ax3l added the performance (optimization, benchmark, profiling, etc.) label on Jun 28, 2021
@ax3l (Member) commented Jun 28, 2021

@SeverinDiederichs I just realized I did not try single precision, did that work as well for you?

Comment on lines +27 to +30
fftwf_plan_with_nthreads(omp_get_max_threads());
# else
fftw_init_threads();
fftw_plan_with_nthreads(omp_get_max_threads());
@ax3l (Member) commented Jun 28, 2021:

We could also add, just to expose even more control, a runtime parameter from the inputs file that can override the value passed to ..._nthreads().

The default would be the heuristic you already added (1 if fewer than 32**2 cells, omp_get_max_threads() otherwise), but it could add a useful intermediate layer of control in case we want to set the FFT parallelism independently of the rest of the simulation, which is controlled by OMP_NUM_THREADS.
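
For illustration, one possible shape of such an override (not part of this PR): a hypothetical inputs parameter fftw.nthreads read with AMReX's ParmParse, falling back to the existing heuristic when it is not set.

#include <AMReX_ParmParse.H>
#include <omp.h>

// Hypothetical sketch: FFTW thread count, overridable from the inputs file.
int QueryFFTWThreads (int nx, int ny)
{
    int nthreads = (nx > 32 && ny > 32) ? omp_get_max_threads() : 1;  // default heuristic
    amrex::ParmParse pp("fftw");     // hypothetical prefix
    pp.query("nthreads", nthreads);  // optional override, e.g. fftw.nthreads = 2
    return nthreads;
}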

@SeverinDiederichs (Member Author) replied:

You are right, this will be an interesting addition. After an offline discussion with @MaxThevenet, I will merge this PR as is and add this feature as soon as we have other OpenMP acceleration. As it is the only function using OpenMP, we currently have full control with OMP_NUM_THREADS.

@ax3l (Member) replied:

Oh right, I forgot this is the first OpenMP accelerated part 😅

@MaxThevenet (Member) left a comment:

Looks great, thanks for this PR!

Comment on lines +11 to +14
message(STATUS "FFTW: Found OpenMP support")
target_compile_definitions(HiPACE::thirdparty::FFT INTERFACE HIPACE_FFTW_OMP=1)
else()
message(STATUS "FFTW: Could NOT find OpenMP support")

lovely!

@SeverinDiederichs (Member Author) replied:

> @SeverinDiederichs I just realized I did not try single precision, did that work as well for you?

I also tested single precision; it works well and shows the same behaviour 👍

@SeverinDiederichs SeverinDiederichs merged commit 087c1ad into Hi-PACE:development Jun 28, 2021
@SeverinDiederichs SeverinDiederichs deleted the topic-openmp_fftw branch June 28, 2021 14:30
@ax3l ax3l mentioned this pull request Jul 6, 2021