This repository has been archived by the owner on Oct 23, 2020. It is now read-only.

New and faster OpenMP implementation in atm_srk3 / threading of physics_get_tend #1259

Open · climbfuji wants to merge 5 commits into atmosphere/develop from atmosphere-threading-konwihr

Conversation

@climbfuji (Contributor) commented Feb 28, 2017

This PR changes the OpenMP threading in atm_srk3 to the method developed during my KONWIHR project. In brief, it creates the OpenMP threads once at the beginning of atm_srk3 and keeps them alive until the end of the routine; non-threaded parts of the code are enclosed in master-only blocks. At the same time, OpenMP support is added to the routine physics_get_tend. As is the case for some other parts of the atmosphere code, multiple blocks per MPI task are not supported (this was discussed and deemed acceptable a while ago).
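
To make the pattern concrete, here is a minimal, hedged sketch of the threading scheme described above (not the actual MPAS code; routine, variable, and loop names are illustrative only): one parallel region is opened at the start of the routine, worksharing loops reuse that existing team, and serial work is wrapped in master-only blocks.

```fortran
! Minimal sketch of the persistent-team pattern; names are illustrative only.
subroutine srk3_threading_sketch(nCells, tend)
   implicit none
   integer, intent(in)    :: nCells
   real,    intent(inout) :: tend(nCells)
   integer :: iCell

   !$omp parallel default(shared) private(iCell)

   ! Threaded section: the loop is shared among the already-existing team.
   !$omp do
   do iCell = 1, nCells
      tend(iCell) = 0.0
   end do
   !$omp end do

   ! Non-threaded section: only the master thread executes this block;
   ! in the real code this would be, e.g., a halo exchange or other serial work.
   !$omp master
   tend(1) = tend(1) + 1.0
   !$omp end master
   !$omp barrier   ! master has no implied barrier, so synchronize explicitly

   !$omp end parallel
end subroutine srk3_threading_sketch
```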

The code in this PR produces bitwise identical results, independent of the number of MPI tasks and OpenMP threads. On my MacBook, I find that the following combinations of code and MPI/OpenMP configuration produce bitwise identical results:
- original atmosphere/develop code, 2 MPI x 1 OpenMP tasks
- original atmosphere/develop code, 2 MPI x 2 OpenMP tasks
- new KONWIHR code, 2 MPI x 1 OpenMP tasks
- new KONWIHR code, 2 MPI x 2 OpenMP tasks
- new KONWIHR code, 2 MPI x 3 OpenMP tasks

However, the following run produces different results!
- original atmosphere/develop code, 2 MPI x 3 OpenMP tasks

I compared the restart.*.nc files after 6 hours of integration using a 6-minute time step. Without diving deeper into it, it looks like the original code in atmosphere/develop does not always give the same results (results start to deviate after about 5 h of model integration in my test). The differences are tiny initially, but then build up quickly (as usual).

For small numbers of OpenMP tasks, the new implementation is only marginally faster (mainly thanks to the threading of physics_get_tend); for large numbers of threads, in particular across a few KNLs, it is substantially faster (by a factor of three when using 4 KNL nodes with 1 or 2 MPI tasks and 256 or 128 OpenMP threads each).

edit: The above paragraph is not correct and should say "For small numbers of OpenMP tasks, the new implementation is only marginally faster (mainly thanks to the threading of physics_get_tend); for large numbers of threads, in particular on many-core architectures such as the KNL, it is substantially faster (by a factor of three when using 1 MPI task and 256 OpenMP threads on one KNL node)."

The new implementation also uses more timers (I call them subtimers) in atm_srk3, which allow for detailed profiling. They are only used if the preprocessor directive -DMPAS_TIME_INTEGRATION_SUBTIMERS is set, because using too many timers can slow down the entire routine, which is something neither we (when not developing/profiling) nor ordinary users (who don't care about profiling) want.

edit 2: These timers are now enabled all the time; the need to specify the preprocessor directive has been removed.

@mgduda (Contributor) commented Feb 28, 2017

@climbfuji Can you give some specific performance numbers showing the improvements that this PR yields? Why are the improvements in performance particularly notable when spanning multiple KNL nodes (since OpenMP works purely within a node)?

edit: The numbers in Table 2 of the KONWIHR report are what I was after. For posterity, would it be alright if I included a few of the numbers in one of the comments in this PR, or perhaps in one of the commit messages?

@mgduda (Contributor) commented Feb 28, 2017

I'd advocate for leaving the detailed timers in the code. With the 40962-cell mesh with 41 layers, running dynamics only, the time integration times (in seconds) that I got for three runs of 240 timesteps each, without the timers, were:

57.91
57.27
57.97

With the timers re-enabled, the times that I got were:

57.51
57.89
57.84

These runs used the Intel 15.0.1 compilers on Yellowstone.

@climbfuji Unless you're seeing measurable performance hits from enabling the detailed timers, I'd propose keeping them. We use these in many of our experiments here at NCAR to see how expensive, e.g., different physics options are. I'd also argue that a preprocessor macro to remove the timers is unnecessary given their apparently negligible impact (based on my test).

@mgduda (Contributor) commented Feb 28, 2017

Since the removal of the physics_addtend routine has nothing to do with the OpenMP changes in this PR, I'd suggest we avoid removing that routine here. Right now there are roughly 1900 changed lines in this PR, and anything we can do to reduce the change-set will help in reviewing the changes that are fundamental to this PR.

@mgduda (Contributor) commented Mar 1, 2017

The compilers have been updated on our KNL nodes, and I'm in the (slow) process of rebuilding all the I/O libraries there. Hopefully tomorrow I can do a little more testing of the new OpenMP scheme. I suspect that some physics schemes still have problems with threading, but I would fully expect dynamics-only runs to give bit-identical results for any combination of MPI task and OpenMP thread count, provided we compile with -fp-model precise (for ifort) or -mfpmath=sse (for gfortran). Perhaps the physics are the cause of the non-bit-identical results?

@climbfuji force-pushed the atmosphere-threading-konwihr branch from dc0118a to e8bd2b5 on March 1, 2017 09:34
@climbfuji (Contributor, Author) commented Mar 1, 2017

@mgduda I have reverted the removal of the addtend routine; however, the code no longer compiles, as the interface to subroutine tend_toEdges has changed with the implementation of threading. If you want to keep the addtend routine in the code, then the least intrusive option would be to overload the tend_toEdges routine to provide backward compatibility with the calls in the addtend routine. I don't really want to modify the calls to tend_toEdges in the addtend routine itself, as this would require quite a few changes to a routine that isn't used and that I don't want to change at all. What is your opinion on this? Also, is it possible to reply on a comment-by-comment basis to address the individual concerns raised?

edit 1: This issue was the reason why I removed the routine addtend in the first place.

edit 2: I have added a legacy routine tend_toEdges_legacy to allow the code to compile.
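
For illustration, here is a hedged sketch of what the overloading approach could look like (the argument lists and names are invented for this sketch and are not the actual MPAS interfaces): a generic name resolves either to the new threaded routine or to a legacy wrapper with the old argument list, so the unused physics_addtend code keeps compiling untouched.

```fortran
! Sketch only: argument lists are invented to illustrate the overload idea.
module tend_toEdges_overload_sketch
   implicit none

   interface tend_toEdges
      module procedure tend_toEdges_threaded   ! new interface with per-thread bounds
      module procedure tend_toEdges_legacy     ! old interface, kept for physics_addtend
   end interface tend_toEdges

contains

   subroutine tend_toEdges_threaded(nEdges, tend_cell, tend_edge, edgeStart, edgeEnd)
      integer, intent(in)  :: nEdges, edgeStart, edgeEnd
      real,    intent(in)  :: tend_cell(nEdges)
      real,    intent(out) :: tend_edge(nEdges)
      integer :: iEdge
      do iEdge = edgeStart, edgeEnd            ! caller supplies this thread's range
         tend_edge(iEdge) = tend_cell(iEdge)
      end do
   end subroutine tend_toEdges_threaded

   ! Legacy wrapper with the original argument list, forwarding to the new routine.
   subroutine tend_toEdges_legacy(nEdges, tend_cell, tend_edge)
      integer, intent(in)  :: nEdges
      real,    intent(in)  :: tend_cell(nEdges)
      real,    intent(out) :: tend_edge(nEdges)
      call tend_toEdges_threaded(nEdges, tend_cell, tend_edge, 1, nEdges)
   end subroutine tend_toEdges_legacy

end module tend_toEdges_overload_sketch
```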

@climbfuji (Contributor, Author) commented:

Attachment: konwihr_report_mpas_table2_threading.pdf
The attached PDF is an extract from the report of my KONWIHR project and contains numbers on the measured speedup of the new versus the original hybrid code in the time integration routine atm_srk3.

@climbfuji (Contributor, Author) commented:

@mgduda Regarding your question "Why are the improvements in performance particularly notable when spanning multiple KNL nodes (since OpenMP works purely within a node)?": the statement I made in the PR description is actually nonsense. The numbers in Table 2 attached above show that the differences were only investigated on one node, not across nodes. As you said, OpenMP works within a single node only. The correct statement should be "For small numbers of OpenMP tasks, the new implementation is only marginally faster (mainly thanks to the threading of physics_get_tend); for large numbers of threads, in particular on many-core architectures such as the KNL, it is substantially faster (by a factor of three when using 1 MPI task and 256 OpenMP threads on one KNL node)." I will add this as an edit to the PR description.

@climbfuji (Contributor, Author) commented:

Regarding the question of the impact of the subtimers on performance: I did a test on a KNL with 1 MPI task and 256 OpenMP threads for a lengthy-ish run with 576 calls to the time integration routine, with all the physics switched on. The differences with/without the timers are around 3% in this test, i.e. removing the "subtimers" reduces the runtime by about 3%.

[Screenshot attachment: screen shot 2017-03-02 at 8 26 04 am]

@climbfuji force-pushed the atmosphere-threading-konwihr branch 2 times, most recently from ccadfcc to f35ad21 on March 2, 2017 08:10
@mgduda (Contributor) commented Mar 3, 2017

Regarding the performance impact of the timers, I've done some tests on our KNL node with the 40962-cell mesh using the code in this PR. The mean time for time integration across six runs with 1 MPI task and 256 OpenMP threads was 69.636667 s with a standard deviation of 0.172240 s with the detailed timers enabled. With the detailed timers disabled, the mean for six runs was 69.073333 s with a standard deviation of 0.111116 s. From my testing, then, the performance hit due to the timers amounts to 0.8% of the dynamics time; note that my runs were dynamics only (to emphasize any performance penalty of the timers in the dynamics), and the average time for a timestep was only ~0.575 s (compared with ~2.75 s in your test).

Is it possible that there was some other factor in your testing that led to a 3% (of a 2.75 s timestep) performance penalty when enabling the timers? Which compiler version and KNL parts are you using for your tests? As I mentioned, we do find the timers useful, and I would really prefer to have these be "opt-out" rather than "opt-in". Additionally, to avoid the need for #if directives throughout the code, I'd propose simply defining a pair of macros at the top of mpas_atm_time_integration.F for the timer start/stop calls.
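
As a hedged sketch of the macro idea (assuming timer calls of the form mpas_timer_start('name') / mpas_timer_stop('name'); the exact timer API and the timer names shown are not confirmed here), the pair could be defined once at the top of mpas_atm_time_integration.F:

```fortran
! Sketch only: confines any future opt-out to one place instead of scattered #ifdefs.
#define MPAS_SUBTIMER_START(name) call mpas_timer_start(name)
#define MPAS_SUBTIMER_STOP(name)  call mpas_timer_stop(name)

! Usage inside atm_srk3 would then look like:
!   MPAS_SUBTIMER_START('atm_rk_dynamics_substep')
!   ... dynamics work ...
!   MPAS_SUBTIMER_STOP('atm_rk_dynamics_substep')
```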

@climbfuji (Contributor, Author) commented:

I am not at all against keeping the timers and removing the ifdefs (also, the macro definition instead of repeated guarded timer calls makes a lot of sense). In my tests I had all the physics on, and I used the standard Intel compiler options ("-O3 -convert big_endian -FR" for Fortran, "-O3" for C/C++); otherwise I can't think of any differences. Let's remove the ifdefs altogether and keep the individual timers in there for simplicity. Will do that later tonight ...

…is changes the interface and the call to it in mpas_atm_time_integration.F. This also includes a threaded version of tend_toEdges in mpas_atm_todynamics.F. In order to keep the old, unused physics_addtend routine in the code, a legacy interface to the new, threaded version of tend_toEdges is provided.

A new OpenMP implementation is used in atm_srk3, which creates the threads only once for the entire time integration step and uses them on demand; non-threaded sections are enclosed in master-only blocks. Additional timers in atm_srk3 are added for most of the routines called to provide a finer granularity when profiling.
…be vectorised in physics_get_tend; not all of them might be needed, but they don't hurt.
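
For context, here is a hedged sketch of the kind of vectorisation hint referred to above (the actual loops and directives in physics_get_tend are not reproduced here; the routine and array names are illustrative only):

```fortran
! Sketch only: an OpenMP SIMD hint on a simple tendency-accumulation loop.
subroutine vectorisation_hint_sketch(nCells, tend_theta, rthblten)
   implicit none
   integer, intent(in)    :: nCells
   real,    intent(inout) :: tend_theta(nCells)
   real,    intent(in)    :: rthblten(nCells)
   integer :: iCell

   !$omp simd   ! tells the compiler the loop is safe to vectorise
   do iCell = 1, nCells
      tend_theta(iCell) = tend_theta(iCell) + rthblten(iCell)
   end do
end subroutine vectorisation_hint_sketch
```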
@climbfuji force-pushed the atmosphere-threading-konwihr branch from f35ad21 to 8416d50 on March 3, 2017 21:58
@climbfuji (Contributor, Author) commented:

Ok, timers now on all the time, directive removed.
