New and faster OpenMP implementation in atm_srk3 / threading of physics_get_tend #1259
base: atmosphere/develop
Conversation
@climbfuji Can you give some specific performance numbers showing the improvements that this PR yields? Why are the improvements in performance particularly notable when spanning multiple KNL nodes (since OpenMP works purely within a node)? edit: The numbers in Table 2 of the KONWIHR report are what I was after. For posterity, would it be alright if I included a few of the numbers in one of the comments in this PR, or perhaps in one of the commit messages?
I'd advocate for leaving the detailed timers in the code. With the 40962-cell mesh with 41 layers, when I ran with dynamics only, the
With the timers re-enabled, the times that I got were:
These were using the Intel 15.0.1 compilers on Yellowstone. @climbfuji Unless you're seeing measurable performance hits from enabling the detailed timers, I'd propose to keep them. We use these in many of our experiments here at NCAR to see how expensive, e.g., different physics options are. I'd also argue that having a preprocessor macro to remove the timers is unnecessary, given their apparently negligible impact (based on my test).
Since the removal of the
The compilers have been updated on our KNL nodes, and I'm in the (slow) process of rebuilding all I/O libraries there. Hopefully tomorrow, I can do a little more testing of the new OpenMP scheme. I suspect that some physics schemes still have problems with threading, but I would fully expect that dynamics-only runs would get bit-identical results for any combination of MPI task and OpenMP thread count, provided we compile with
Force-pushed from dc0118a to e8bd2b5
@mgduda I have reverted the removal of the addtend routine; however, the code no longer compiles, as the interface to subroutine tend_toEdges has changed with the implementation of threading. If you want to keep the addtend routine in the code, the least intrusive change would be to overload the tend_toEdges routine to provide backward compatibility with the calls in the addtend routine. I don't really want to modify the calls to tend_toEdges in the addtend routine itself, as this would require quite a few changes to a routine that isn't used and that I don't want to change at all. What is your opinion on this? Also, is it possible to reply on a comment-by-comment basis to address the individual concerns raised? edit 1: This issue was the reason why I removed the addtend routine in the first place. edit 2: I have added a legacy routine tend_toEdges_legacy to allow the code to compile.
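To illustrate the backward-compatibility idea, here is a minimal sketch of such a legacy wrapper. The argument lists are hypothetical placeholders and do not match the real tend_toEdges interface; the point is simply that a thin wrapper can forward the old-style call to the threaded routine over the full cell range, so the unchanged calls in physics_addtend still compile.

```fortran
! Sketch only: the argument lists below are hypothetical and do not match the
! real tend_toEdges interface; they just illustrate the wrapper idea.
subroutine tend_toEdges_legacy(nCells, nVertLevels, tend_cell, tend_edge)
   implicit none
   integer, intent(in) :: nCells, nVertLevels
   real, dimension(nVertLevels, nCells), intent(in)    :: tend_cell
   real, dimension(:,:),                 intent(inout) :: tend_edge

   ! Forward to the new, threaded tend_toEdges over the full cell range
   ! (the last two arguments stand for a cellStart/cellEnd pair), which
   ! reproduces the old, unthreaded behaviour for the legacy call sites.
   call tend_toEdges(nCells, nVertLevels, tend_cell, tend_edge, 1, nCells)
end subroutine tend_toEdges_legacy
```

An overloaded generic interface that resolves to either the legacy or the threaded entry point based on the argument list would achieve the same effect.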
konwihr_report_mpas_table2_threading.pdf
@mgduda Regarding your question "Why are the improvements in performance particularly notable when spanning multiple KNL nodes (since OpenMP works purely within a node)?": that statement I made in the PR description is actually nonsense. The numbers in Table 2 attached above show that the differences were only investigated on one node, not across nodes. As you said, OpenMP works only within a node. The correct statement should be "For small numbers of OpenMP threads, the new implementation is only marginally faster (mainly thanks to the threading of physics_get_tend); for large numbers of threads, in particular on many-core architectures such as the KNL, it is substantially faster (by a factor of three when using 1 MPI task and 256 OpenMP threads on one KNL node)." I will add this as an edit to the PR description.
Regarding the question of the impact of the subtimers on performance: I did a test on a KNL with 1 MPI task and 256 OpenMP threads for a lengthy-ish run with 576 calls to the time integration routine, with physics etc. all on. The differences with/without timers are around 3% in this test, i.e. removing the "subtimers" reduces the runtime by about 3%.
Force-pushed from ccadfcc to f35ad21
Regarding the performance impact of the timers, I've done some tests on our KNL node with the 40962-cell mesh with the code in this PR. The mean time for Is it possible that there was some other factor in your testing that led to a 3% (of a 2.75 s timestep) performance penalty when enabling the timers? Which compiler version and KNL parts are you using for your tests? As I mentioned, we do find the timers useful, and I would really prefer to have these be "opt-out" rather than "opt-in". Additionally, to avoid the need for
I am not at all against keeping the timers and removing the ifdefs (also, the macro definition instead of repeated ifdefs around the timers makes a lot of sense). In my tests, I had the physics all on, and I used the standard Intel compiler options ("-O3 -convert big_endian -FR" for Fortran, "-O3" for C/C++); otherwise I can't think of any differences. Let's remove the ifdefs altogether and keep the individual timers in there for simplicity. Will do that later tonight ...
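For illustration, the kind of macro being discussed might look roughly like the sketch below. The MPAS_SUBTIMER_* names and the example call's argument list are placeholders, and this is not the form that was ultimately merged (the timers ended up enabled unconditionally); it assumes the existing mpas_timer_start/mpas_timer_stop interface that takes a timer name.

```fortran
#ifdef MPAS_TIME_INTEGRATION_SUBTIMERS
#define MPAS_SUBTIMER_START(name) call mpas_timer_start(name)
#define MPAS_SUBTIMER_STOP(name)  call mpas_timer_stop(name)
#else
#define MPAS_SUBTIMER_START(name)
#define MPAS_SUBTIMER_STOP(name)
#endif

      ! Usage inside atm_srk3: the timer calls compile away when the
      ! preprocessor directive is not set, without repeating the #ifdef
      ! around every single timer.
      MPAS_SUBTIMER_START('atm_compute_dyn_tend')
      call atm_compute_dyn_tend(tend, state, diag, mesh)   ! placeholder argument list
      MPAS_SUBTIMER_STOP('atm_compute_dyn_tend')
```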
…is changes the interface and the call to it in mpas_atm_time_integration.F. This also includes a threaded version of tend_toEdges in mpas_atm_todynamics.F. In order to keep the old, unused physics_addtend routine in the code, a legacy interface to the new, threaded version of tend_toEdges is provided. A new OpenMP implementation is used in atm_srk3, which creates the threads only once for the entire time integration step and uses them on demand; non-threaded sections are enclosed in master-only blocks. Additional timers in atm_srk3 are added for most of the routines called to provide a finer granularity when profiling.
…be vectorised in physics_get_tend; not all of them might be needed, but they don't hurt.
…tually physics_get_tend is used
Force-pushed from f35ad21 to 8416d50
OK, the timers are now enabled all the time and the preprocessor directive has been removed.
This PR changes the OpenMP threading in atm_srk3 to the method developed during my KONWIHR project. In brief, it creates the OpenMP threads once at the beginning of atm_srk3 and keeps them alive until the end of the routine. Non-threaded parts of the code are enclosed in master-only blocks. At the same time, OpenMP support is added to the routine physics_get_tend. As is the case for some other parts of the atmosphere code, multiple blocks per MPI task are not supported (this was discussed a while ago and deemed acceptable).
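The overall shape of this threading pattern is roughly the following; this is a minimal sketch with placeholder names and loop bodies, not the actual atm_srk3 code.

```fortran
! Minimal sketch of the pattern only: routine name, variables and loop body
! are placeholders, not the actual atm_srk3 code.
subroutine srk3_threading_sketch(nCells, nVertLevels, tend)
   implicit none
   integer, intent(in) :: nCells, nVertLevels
   real, dimension(nVertLevels, nCells), intent(inout) :: tend
   integer :: iCell

   !$omp parallel default(shared) private(iCell)
   ! The parallel region spans the whole routine, so the threads are created
   ! once and reused, instead of being spawned for every threaded loop.

   !$omp do schedule(static)
   do iCell = 1, nCells
      tend(:, iCell) = 0.0          ! threaded work, e.g. accumulating tendencies
   end do
   !$omp end do
   ! (the implicit barrier at 'end do' synchronises the threads here)

   ! Non-threaded sections, e.g. halo exchanges, run on the master thread
   ! only, followed by a barrier so that all threads see the updated data.
   !$omp master
   ! call non-threaded work here
   !$omp end master
   !$omp barrier

   !$omp end parallel
end subroutine srk3_threading_sketch
```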
The code in the PR produces bitwise-identical results, independent of the number of MPI tasks and OpenMP threads. On my MacBook, I find that the following combinations of code / MPI tasks / OpenMP threads produce bitwise-identical results:
- original atmosphere/develop code, 2 MPI x 1 OpenMP tasks
- original atmosphere/develop code, 2 MPI x 2 OpenMP tasks
- new KONWIHR code, 2 MPI x 1 OpenMP tasks
- new KONWIHR code, 2 MPI x 2 OpenMP tasks
- new KONWIHR code, 2 MPI x 3 OpenMP tasks
However, the following run produces different results!
- original atmosphere/develop code, 2 MPI x 3 OpenMP tasks
I compared the restart.*.nc files after 6 hours of integration using a 6-minute time step. Without diving deeper into it, it looks like the original code in atmosphere/develop does not always give the same results (they start to deviate after about 5 h of model integration in my test). The differences are tiny initially, but then build up quickly (as usual).
For small numbers of OpenMP threads, the new implementation is only marginally faster (mainly thanks to the threading of physics_get_tend); for large numbers of threads, in particular across a few KNLs, it is substantially faster (by a factor of three when using 4 KNL nodes with 1 or 2 MPI tasks and 256 or 128 OpenMP threads each).
edit: The above paragraph is not correct and should say "For small numbers of OpenMP threads, the new implementation is only marginally faster (mainly thanks to the threading of physics_get_tend); for large numbers of threads, in particular on many-core architectures such as the KNL, it is substantially faster (by a factor of three when using 1 MPI task and 256 OpenMP threads on one KNL node)."
The new implementation also uses more timers (I call them subtimers) in atm_srk3, which allow for more detailed profiling. They are only used if the preprocessor directive -DMPAS_TIME_INTEGRATION_SUBTIMERS is set, because using too many timers can slow down the entire routine, which is something we (when not developing or profiling) and ordinary users (who don't care about profiling) don't want.
edit 2: These timers are now enabled all the time; the need to specify the preprocessor directive has been removed.