Add OpenMP for FFTW #541

Merged

Conversation

@SeverinDiederichs (Member) commented Jun 25, 2021

Add a heuristic that also works with PkgConfig to query OpenMP support in FFTW. Enable it by default if we build with the OpenMP compute backend, unless explicitly disabled.

Add a macro (HIPACE_FFTW_OMP) to control the source code, since FFTW does not offer a public define for this.
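
For context, a minimal sketch of how such a macro guard around the FFTW threading setup can look (the helper name InitFFTWThreads and the single-precision switch AMREX_USE_FLOAT are assumptions for illustration; the actual code in this PR may differ, see the excerpt quoted in the review below):

#include <fftw3.h>
#if defined(HIPACE_FFTW_OMP)
#   include <omp.h>
#endif

// Initialize FFTW's OpenMP threading only if the build detected support for it.
void InitFFTWThreads ()
{
#if defined(HIPACE_FFTW_OMP)
#   if defined(AMREX_USE_FLOAT)
    fftwf_init_threads();                               // single precision
    fftwf_plan_with_nthreads(omp_get_max_threads());
#   else
    fftw_init_threads();                                // double precision
    fftw_plan_with_nthreads(omp_get_max_threads());
#   endif
#endif
}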

Thanks to @ax3l, FFTW with OpenMP now works.
On my laptop, running the beam_in_vacuum example without I/O and without a beam at amr.n_cell = 1024 1024 50, I get the following run times:

OMP_NUM_THREADS=1
AnyDST::Execute()                                     500      29.68      29.68      29.68  80.17%
OMP_NUM_THREADS=2
AnyDST::Execute()                                     500      14.86      14.86      14.86  68.45%
OMP_NUM_THREADS=3
AnyDST::Execute()                                     500      10.05      10.05      10.05  59.75%
OMP_NUM_THREADS=4
AnyDST::Execute()                                     500      7.707      7.707      7.707  53.07%
OMP_NUM_THREADS=5
AnyDST::Execute()                                     500       6.23       6.23       6.23  47.70%
OMP_NUM_THREADS=6
AnyDST::Execute()                                     500      6.097      6.097      6.097  46.36%

This implementation seems to give the correct results. Using 2 threads, all tests pass locally:

      Start  1: blowout_wake.2Rank
 1/20 Test  #1: blowout_wake.2Rank ......................   Passed    6.49 sec
      Start  2: ionization.2Rank
 2/20 Test  #2: ionization.2Rank ........................   Passed    2.40 sec
      Start  3: from_file.normalized.1Rank
 3/20 Test  #3: from_file.normalized.1Rank ..............   Passed    3.12 sec
      Start  4: from_file.SI.1Rank
 4/20 Test  #4: from_file.SI.1Rank ......................   Passed    3.00 sec
      Start  5: restart.normalized.1Rank
 5/20 Test  #5: restart.normalized.1Rank ................   Passed    0.88 sec
      Start  6: blowout_wake_explicit.2Rank
 6/20 Test  #6: blowout_wake_explicit.2Rank .............   Passed    2.52 sec
      Start  7: beam_evolution.1Rank
 7/20 Test  #7: beam_evolution.1Rank ....................   Passed    3.04 sec
      Start  8: adaptive_time_step.1Rank
 8/20 Test  #8: adaptive_time_step.1Rank ................   Passed    5.32 sec
      Start  9: grid_current.1Rank
 9/20 Test  #9: grid_current.1Rank ......................   Passed    1.82 sec
      Start 10: linear_wake.normalized.1Rank
10/20 Test #10: linear_wake.normalized.1Rank ............   Passed    2.92 sec
      Start 11: linear_wake.SI.1Rank
11/20 Test #11: linear_wake.SI.1Rank ....................   Passed    2.91 sec
      Start 12: gaussian_linear_wake.normalized.1Rank
12/20 Test #12: gaussian_linear_wake.normalized.1Rank ...   Passed    3.00 sec
      Start 13: gaussian_linear_wake.SI.1Rank
13/20 Test #13: gaussian_linear_wake.SI.1Rank ...........   Passed    3.12 sec
      Start 14: reset.2Rank
14/20 Test #14: reset.2Rank .............................   Passed    3.08 sec
      Start 15: beam_in_vacuum.SI.1Rank
15/20 Test #15: beam_in_vacuum.SI.1Rank .................   Passed    5.50 sec
      Start 16: beam_in_vacuum.normalized.1Rank
16/20 Test #16: beam_in_vacuum.normalized.1Rank .........   Passed    4.94 sec
      Start 17: next_deposition_beam.2Rank
17/20 Test #17: next_deposition_beam.2Rank ..............   Passed   27.33 sec
      Start 18: slice_IO.1Rank
18/20 Test #18: slice_IO.1Rank ..........................   Passed    3.15 sec
      Start 19: gaussian_weight.1Rank
19/20 Test #19: gaussian_weight.1Rank ...................   Passed    5.00 sec
      Start 20: beam_in_vacuum.normalized.2Rank
20/20 Test #20: beam_in_vacuum.normalized.2Rank .........   Passed    5.35 sec

100% tests passed, 0 tests failed out of 20

For comparison, here are the run times on the development branch on my laptop:

      Start  1: blowout_wake.2Rank
1/20 Test  #1: blowout_wake.2Rank ......................   Passed    6.95 sec
     Start  2: ionization.2Rank
2/20 Test  #2: ionization.2Rank ........................   Passed    2.61 sec
     Start  3: from_file.normalized.1Rank
3/20 Test  #3: from_file.normalized.1Rank ..............   Passed    4.15 sec
     Start  4: from_file.SI.1Rank
4/20 Test  #4: from_file.SI.1Rank ......................   Passed    3.65 sec
     Start  5: restart.normalized.1Rank
5/20 Test  #5: restart.normalized.1Rank ................   Passed    1.28 sec
     Start  6: blowout_wake_explicit.2Rank
6/20 Test  #6: blowout_wake_explicit.2Rank .............   Passed    2.71 sec
     Start  7: beam_evolution.1Rank
7/20 Test  #7: beam_evolution.1Rank ....................   Passed    2.88 sec
     Start  8: adaptive_time_step.1Rank
8/20 Test  #8: adaptive_time_step.1Rank ................   Passed    5.30 sec
     Start  9: grid_current.1Rank
9/20 Test  #9: grid_current.1Rank ......................   Passed    1.95 sec
     Start 10: linear_wake.normalized.1Rank
10/20 Test #10: linear_wake.normalized.1Rank ............   Passed    3.10 sec
     Start 11: linear_wake.SI.1Rank
11/20 Test #11: linear_wake.SI.1Rank ....................   Passed    4.18 sec
     Start 12: gaussian_linear_wake.normalized.1Rank
12/20 Test #12: gaussian_linear_wake.normalized.1Rank ...   Passed    3.83 sec
     Start 13: gaussian_linear_wake.SI.1Rank
13/20 Test #13: gaussian_linear_wake.SI.1Rank ...........   Passed    3.03 sec
     Start 14: reset.2Rank
14/20 Test #14: reset.2Rank .............................   Passed    3.12 sec
     Start 15: beam_in_vacuum.SI.1Rank
15/20 Test #15: beam_in_vacuum.SI.1Rank .................   Passed    6.73 sec
     Start 16: beam_in_vacuum.normalized.1Rank
16/20 Test #16: beam_in_vacuum.normalized.1Rank .........   Passed    5.32 sec
     Start 17: next_deposition_beam.2Rank
17/20 Test #17: next_deposition_beam.2Rank ..............   Passed    6.73 sec
     Start 18: slice_IO.1Rank
18/20 Test #18: slice_IO.1Rank ..........................   Passed    3.58 sec
     Start 19: gaussian_weight.1Rank
19/20 Test #19: gaussian_weight.1Rank ...................   Passed    6.52 sec
     Start 20: beam_in_vacuum.normalized.2Rank
20/20 Test #20: beam_in_vacuum.normalized.2Rank .........   Passed   10.55 sec

The next_deposition_beam.2Rank test takes much longer than when running serially. The reason lies in its extremely low resolution of 16 transverse grid points: obviously, it does not make any sense to use more than 1 CPU thread for the FFT there. With 32 grid points, 2 threads are still slightly slower; increasing the number of grid points to 64 yields a speedup with more threads again. Therefore, a threshold of nx > 32 && ny > 32 was added; if it is not met, the FFT is executed with a single thread.
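
A minimal sketch of this threshold (the helper name FFTWThreads is made up for illustration; the actual code in this PR may differ):

#include <omp.h>

// Number of threads to plan the FFT with: use all available OpenMP threads
// only if the transverse grid is larger than 32 x 32 cells, otherwise 1.
int FFTWThreads (int nx, int ny)
{
    return (nx > 32 && ny > 32) ? omp_get_max_threads() : 1;
}

// e.g. used as: fftw_plan_with_nthreads(FFTWThreads(nx, ny));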

  • Small enough (< few 100s of lines), otherwise it should probably be split into smaller PRs
  • Tested (describe the tests in the PR description)
  • Runs on GPU (basic: the code compiles and runs well with the new module)
  • Contains an automated test (checksum and/or comparison with theory)
  • Documented: all elements (classes and their members, functions, namespaces, etc.) are documented
  • Constified (All that can be const is const)
  • Code is clean (no unwanted comments)
  • Style and code conventions (see the bottom of https://github.com/Hi-PACE/hipace) are respected
  • Proper label and GitHub project, if applicable

@ax3l ax3l force-pushed the topic-openmp_fftw branch from a145e94 to 79a53cf on June 25, 2021 20:16
@ax3l ax3l changed the title [WIP] add OpenMP for FFTW Add OpenMP for FFTW Jun 25, 2021
@ax3l ax3l force-pushed the topic-openmp_fftw branch 3 times, most recently from 3f3fce4 to 8b4ba1c on June 25, 2021 20:20
@ax3l ax3l mentioned this pull request Jun 25, 2021
@ax3l ax3l force-pushed the topic-openmp_fftw branch from 8b4ba1c to 6762eb5 on June 25, 2021 20:37
@SeverinDiederichs SeverinDiederichs requested a review from ax3l June 25, 2021 21:57
@ax3l (Member) commented Jun 28, 2021

Just documenting here:

For best results, use close-by pinning, especially with MPI, and avoid oversubscription of cores, especially if no hyperthreading is available:

export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=1 # 1,2,4,...

@ax3l ax3l added the performance (optimization, benchmark, profiling, etc.) label on Jun 28, 2021
@ax3l (Member) commented Jun 28, 2021

@SeverinDiederichs I just realized I did not try single precision, did that work as well for you?

Comment on lines +27 to +30
fftwf_plan_with_nthreads(omp_get_max_threads());
# else
fftw_init_threads();
fftw_plan_with_nthreads(omp_get_max_threads());
@ax3l (Member) commented Jun 28, 2021:

We could also add, just to expose even more control, a runtime parameter from the inputs file that can override the value passed to ..._nthreads().

The default would be the heuristic you already added (1 if fewer than 32**2 cells, omp_get_max_threads() otherwise), but it could add a useful intermediate layer of control in case we want to set the FFT parallelism independently of the rest of the simulation, which is controlled by OMP_NUM_THREADS.
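
For illustration, one possible shape of such an override (not part of this PR): a hypothetical inputs parameter fftw.nthreads read with AMReX's ParmParse, falling back to the existing heuristic when it is not set.

#include <AMReX_ParmParse.H>
#include <omp.h>

// Hypothetical sketch: FFTW thread count, overridable from the inputs file.
int QueryFFTWThreads (int nx, int ny)
{
    int nthreads = (nx > 32 && ny > 32) ? omp_get_max_threads() : 1;  // default heuristic
    amrex::ParmParse pp("fftw");     // hypothetical prefix
    pp.query("nthreads", nthreads);  // optional override, e.g. fftw.nthreads = 2
    return nthreads;
}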

@SeverinDiederichs (Member Author) replied:

You are right, this will be an interesting addition. After an offline discussion with @MaxThevenet, I will merge this PR as is and add this feature as soon as we have other OpenMP acceleration. As it is the only function using OpenMP, we currently have full control with OMP_NUM_THREADS.

@ax3l (Member) replied:

Oh right, I forgot this is the first OpenMP accelerated part 😅

@MaxThevenet (Member) left a comment:

Looks great, thanks for this PR!

Comment on lines +11 to +14
message(STATUS "FFTW: Found OpenMP support")
target_compile_definitions(HiPACE::thirdparty::FFT INTERFACE HIPACE_FFTW_OMP=1)
else()
message(STATUS "FFTW: Could NOT find OpenMP support")

lovely!

@SeverinDiederichs (Member Author) replied:

> @SeverinDiederichs I just realized I did not try single precision, did that work as well for you?

I also tested single precision; it works well and shows the same behaviour 👍

@SeverinDiederichs SeverinDiederichs merged commit 087c1ad into Hi-PACE:development Jun 28, 2021
@SeverinDiederichs SeverinDiederichs deleted the topic-openmp_fftw branch June 28, 2021 14:30
@ax3l ax3l mentioned this pull request Jul 6, 2021