CLUBB/pdf_closure call optimizations #1031
Comments
Hi Pat, your suggested short-term modifications make sense to me, and I am very comfortable with OKing them. For the longer term, I think there are a few other issues to consider. First, we need to establish that CLUBB will be part of the next generations of the ACME model. @davidcbaderatllnl has indicated that Version 2 will be primarily a performance enhancement, so if a rewrite of this code is important to get that performance enhancement, then we should go for it. Second, for Version 3, other issues might be important to consider, and we may need guidance from the executive council (adding @lrleung, @mt5555), since the ESM program is supporting research on other codes that may be candidates for turbulence and convection, including MMF, EDMF, and probably others that I am not aware of. Finally, some of the CMDV activities may be important to consider, including the CLUBB rewrites that the @agsalin team is doing. I don't know how to proceed for the long run, so I have mentioned some of the other people who may have opinions on this.
@philrasch, thank you for the comments. I will go ahead and create a PR with these changes, and let the atmosphere group (and @amametjanov and @ndkeen, for performance at ALCF and NERSC) look at them and pass judgement. I will ask again if other optimization opportunities present themselves - my target is primarily V1, in particular to help determine what is affordable for V1.
@philrasch, just to reiterate: my question is about whether there are any issues specific to modifying CLUBB, and my concern is with V1. If NCAR or others already have optimizations that we might consider, is there any mechanism to evaluate and adopt those now, rather than spending ACME resources on optimizing code that has already been optimized? Issues for V2 and V3 are conversations for others and in a different setting. Also, CLUBB is far from being the only target for optimization; it was at the top of my list because the PGI compiler in particular has issues with it.
Thanks Pat, now I get it. There are two issues I can think of associated with adopting CESM modifications to CLUBB.
On a related note, I am also thinking about these issues with the suggestion by @cecilehannay that we adopt a more recent tag of CLUBB to address a bug in the supersaturation identified here [https://acme-climate.atlassian.net/browse/AG-444]. I would have preferred it if we could maintain consistency with NCAR from the get-go, but maybe it is too late for that this time around. If you or Balwinder have advice on this, I am open to suggestion. I realize my comments are probably not helping.
Working around implicit array copies in CLUBB subroutine calls

In the routine advance_clubb_core (in cam/src/physics/clubb/advance_clubb_core_module.F90) there are loops of the form:

   do k = 1, gr%nz, 1
      call pdf_closure &
         ( ... zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:),   & ! intent(in)
           wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),   & ! intent(in)
           sclrpthlp_zt(k,:), k,                                & ! intent(in)
           wphydrometp_zt(k,:), wp2hmp(k,:),                    & ! intent(in)
           ... rtphmp_zt(k,:), thlphmp_zt(k,:),                 & ! intent(in)
           wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),   & ! intent(out)
           wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),   & ! intent(out)
           ... )
   end do

Each of the 15 arrays of the form XXX(k,:) is declared internally as an array of size XXX(:), and the compilers are apparently creating local temporaries and copying into and out of these. This is pretty low level (inside loops over chunks, then local columns, and then nadv). Explicitly allocating temporary arrays of the correct dimensions, and copying into them (for intent(in)) and out of them (for intent(out)) outside the call to pdf_closure, improves performance. For the Intel compiler on Titan, this drops the cost by around 15%. For the PGI compiler on Titan, it decreases the cost by a factor of 6.

This modification changes only two of the loops containing calls to pdf_closure, as these are the only two that are exercised in the current ACME test cases. There are two others that should be modified in analogous ways if l_use_ice_latent is true.

Fixes #1031 [BFB]

* worleyph/cam/pdf_closure_call_opt:
  Rearranged code so that it passes Skybridge testing
  Added an if condition for fixing ONE TRIP behavior of some compilers.
  Working around implicit array segment copies
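The "ONE TRIP" commit above refers to legacy Fortran semantics, still selectable in some compilers via compatibility options, under which a DO loop body executes at least once even when its trip count is zero. The guard below is only a generic illustration of that kind of fix, using made-up names (sclr_dim, sclrm_col); it is not the actual commit:

   ! Generic illustration only (hypothetical names): guard a copy loop so that
   ! a compiler with one-trip DO-loop behavior does not execute the body when
   ! the trip count is zero (e.g., when sclr_dim == 0).
   if ( sclr_dim > 0 ) then
      do i = 1, sclr_dim
         sclrm_col(i) = sclrm(k,i)
      end do
   end if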
I'm creating an issue so that there is something to refer to in a PR, and to ask about the process here.
I am pretty sure that NCAR has been making optimizations in the atmospheric physics, so anything we do may very well be duplicative. If we could address coordination in the near future, that would help in allocating scarce resources.
I also do not know what the procedure here is for making changes to CLUBB, since it is also an external package.
In any case, optimization number 1:
In the routine advance_clubb_core (in cam/src/physics/clubb/advance_clubb_core_module.F90) there are loops of the form:
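   do k = 1, gr%nz, 1
      call pdf_closure &
         ( ... zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:),   & ! intent(in)
           wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),   & ! intent(in)
           sclrpthlp_zt(k,:), k,                                & ! intent(in)
           wphydrometp_zt(k,:), wp2hmp(k,:),                    & ! intent(in)
           ... rtphmp_zt(k,:), thlphmp_zt(k,:),                 & ! intent(in)
           wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),   & ! intent(out)
           wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),   & ! intent(out)
           ... )
   end do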
Each of the 15 arrays of the form XXX(k,:) is declared internally as an array of size XXX(:), and the compilers are apparently creating local temporaries and copying into and out of these. This is pretty low level (inside loops over chunks, then local columns, and then nadv).
Explicitly allocating temporary arrays of the correct dimensions, and copying into them (for intent(in)) and out of them (for intent(out)) outside the call to pdf_closure, improves performance.
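As a self-contained toy sketch of that pattern and the workaround (all names here are made up; this is not CLUBB code), assuming an explicit-shape dummy argument, which is the case where the hidden copy-in/copy-out is forced:

   program slice_copy_demo
     implicit none
     integer, parameter :: nz = 4, n = 3
     real :: a(nz,n), tmp(n)
     integer :: k

     a = 1.0

     ! Original pattern: a(k,:) is non-contiguous (stride nz), so the compiler
     ! builds a hidden temporary and copies in/out on every call.
     do k = 1, nz
        call scale_col( n, a(k,:) )
     end do

     ! Workaround pattern: make the copy explicit with a contiguous temporary,
     ! copying in before the call and back out afterwards (mirroring the
     ! intent(in)/intent(out) handling described above).
     do k = 1, nz
        tmp(:) = a(k,:)
        call scale_col( n, tmp )
        a(k,:) = tmp(:)
     end do

     print *, a

   contains

     subroutine scale_col( m, x )
       integer, intent(in)    :: m
       real,    intent(inout) :: x(m)
       x = 2.0 * x
     end subroutine scale_col

   end program slice_copy_demo

Both loops give the same answer; the difference is only where the copy happens, which is what makes the explicit version cheaper in the CLUBB case.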
(After this modification, PGI is slightly faster than Intel, for this part of the code and when using MPI-only. OpenMP/PGI still has performance problems here: 4-way OpenMP is 2X slower than MPI-only for this part of the code).
I can create a PR with this modification and let the atmosphere group look at it. Like everything else in the physics, this will not change performance dramatically - we still need to make many optimizations in many locations - but this was a taller nail than most.
Also, the "real" solution is to push the loop over gr%nz into the call to pdf_closure, for many reasons, including vectorization targeting systems like the Phi. This is much more intrusive and probably best done by a CLUBB developer.
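For illustration only, a toy version of that restructuring (made-up names, not an actual pdf_closure interface): the callee takes the full arrays and loops over the vertical index itself, so the caller no longer slices per level.

   ! Toy sketch of a level-dimensioned interface (hypothetical, not CLUBB code).
   subroutine closure_all_levels( nz, n, x_in, y_out )
     integer, intent(in)  :: nz, n
     real,    intent(in)  :: x_in(nz,n)
     real,    intent(out) :: y_out(nz,n)
     integer :: j
     ! The k loop now lives inside the routine: whole contiguous columns are
     ! processed at once, removing per-level call overhead and slice copies
     ! and exposing the vertical dimension for vectorization.
     do j = 1, n
        y_out(:,j) = 2.0 * x_in(:,j)
     end do
   end subroutine closure_all_levels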
(@singhbalwinder, please reassign to the appropriate person.)