CLUBB/pdf_closure call optimizations #1031
Comments
Hi Pat, your suggested short-term modifications make sense to me, and I am very comfortable with OKing them. For the longer term, I think there are a few other issues to consider. First, we need to establish that CLUBB will be part of the next generations of the ACME model. @davidcbaderatllnl has indicated that Version 2 will be primarily a performance enhancement, so if a rewrite of this code is important to get that performance enhancement, then we should go for it. Second, for Version 3, other issues might be important to consider, and we may need guidance from the executive council (adding @lrleung, @mt5555), since the ESM program is supporting research on other codes that may be candidates for turbulence and convection, including MMF, EDMF, and probably others that I am not aware of. Finally, some of the CMDV activities may be important to consider, including the CLUBB rewrites that the @agsalin team is doing. I don't know how to proceed for the long run, so I have mentioned some of the other people who may have opinions on this.
@philrasch, thank you for the comments. I will go ahead and create a PR with these changes, and let the atmosphere group (and @amametjanov and @ndkeen, for performance at ALCF and NERSC) look at them and pass judgement. I will ask again if other optimization opportunities present themselves - my target is primarily V1, in particular to help determine what is affordable for V1.
@philrasch, just to reiterate: my question is about whether there are any issues specific to modifying CLUBB, and my concern is with V1. If NCAR or others already have optimizations that we might consider, is there any mechanism to evaluate and adopt those now, rather than spending ACME resources on optimizing code that has already been optimized? Issues for V2 and V3 are conversations for others and in a different setting. Also, CLUBB is far from being the only target for optimization; it was at the top of my list because the PGI compiler in particular has issues with it.
Thanks Pat, now I get it. There are two issues I can think of associated with adopting CESM modifications to CLUBB.
On a related note, I am also thinking about these issues with the suggestion by @cecilehannay that we adopt a more recent tag of CLUBB to address a bug in the supersaturation identified here [https://acme-climate.atlassian.net/browse/AG-444]. I would have preferred it if we could maintain consistency with NCAR from the get-go, but maybe it is too late for that this time around. If you or Balwinder have advice on this, I am open to suggestion. I realize my comments are probably not helping.
Working around implicit array copies in CLUBB subroutine calls

In the routine advance_clubb_core (in cam/src/physics/clubb/advance_clubb_core_module.F90) there are loops of the form:

   do k = 1, gr%nz, 1
      call pdf_closure &
         ( ... zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:),   & ! intent(in)
           wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),   & ! intent(in)
           sclrpthlp_zt(k,:), k,                                & ! intent(in)
           wphydrometp_zt(k,:), wp2hmp(k,:),                    & ! intent(in)
           ... rtphmp_zt(k,:), thlphmp_zt(k,:),                 & ! intent(in)
           wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),   & ! intent(out)
           wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),   & ! intent(out)
           ... )
   end do

Each of the 15 arrays of the form XXX(k,:) is declared internally as an array of size XXX(:), and the compilers are apparently creating local temporaries and copying into and out of these. This is pretty low level (inside loops over chunks, then local columns, and then nadv). Explicitly allocating temporary arrays of the correct dimensions, and copying into them (for intent(in)) and out of them (for intent(out)) outside the call to pdf_closure, improves performance. For the Intel compiler on Titan, this drops the cost by around 15%. For the PGI compiler on Titan, it decreases the cost by a factor of 6.

This modification changes only two of the loops containing calls to pdf_closure, as these are the only two that are exercised in the current ACME test cases. There are two others that should be modified in analogous ways if l_use_ice_latent is true.

Fixes #1031 [BFB]

* worleyph/cam/pdf_closure_call_opt:
  Rearranged code so that it passes Skybridge testing
  Added an if condition for fixing ONE TRIP behavior of some compilers.
  Working around implicit array segment copies
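The "ONE TRIP" commit above refers to legacy Fortran semantics, still selectable in some compilers via compatibility options, under which a DO loop body executes at least once even when its trip count is zero. The guard below is only a generic illustration of that kind of fix, using made-up names (sclr_dim, sclrm_col); it is not the actual commit:

   ! Generic illustration only (hypothetical names): guard a copy loop so that
   ! a compiler with one-trip DO-loop behavior does not execute the body when
   ! the trip count is zero (e.g., when sclr_dim == 0).
   if ( sclr_dim > 0 ) then
      do i = 1, sclr_dim
         sclrm_col(i) = sclrm(k,i)
      end do
   end if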
I'm creating an issue so that there is something to refer to in a PR, and to ask about the process here.
I am pretty sure that NCAR has been making optimizations in the atmospheric physics, so anything we do may very well be duplicative. If we could address coordination in the near future, that would help in allocating scarce resources.
I also do not know what the procedure here is for making changes to CLUBB, since it is also an external package.
In any case, optimization number 1:
In the routine advance_clubb_core (in cam/src/physics/clubb/advance_clubb_core_module.F90) there are loops of the form:
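   do k = 1, gr%nz, 1
      call pdf_closure &
         ( ... zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:),   & ! intent(in)
           wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),   & ! intent(in)
           sclrpthlp_zt(k,:), k,                                & ! intent(in)
           wphydrometp_zt(k,:), wp2hmp(k,:),                    & ! intent(in)
           ... rtphmp_zt(k,:), thlphmp_zt(k,:),                 & ! intent(in)
           wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),   & ! intent(out)
           wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),   & ! intent(out)
           ... )
   end do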
Each of the 15 arrays of the form XXX(k,:) is declared internally as an array of size XXX(:), and the compilers are apparently creating local temporaries and copying into and out of these. This is pretty low level (inside loops over chunks, then local columns, and then nadv).
Explicitly allocating temporary arrays of the correct dimensions, and copying into them (for intent(in)) and out of them (for intent(out)) outside the call to pdf_closure, improves performance.
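As a self-contained toy sketch of that pattern and the workaround (all names here are made up; this is not CLUBB code), assuming an explicit-shape dummy argument, which is the case where the hidden copy-in/copy-out is forced:

   program slice_copy_demo
     implicit none
     integer, parameter :: nz = 4, n = 3
     real :: a(nz,n), tmp(n)
     integer :: k

     a = 1.0

     ! Original pattern: a(k,:) is non-contiguous (stride nz), so the compiler
     ! builds a hidden temporary and copies in/out on every call.
     do k = 1, nz
        call scale_col( n, a(k,:) )
     end do

     ! Workaround pattern: make the copy explicit with a contiguous temporary,
     ! copying in before the call and back out afterwards (mirroring the
     ! intent(in)/intent(out) handling described above).
     do k = 1, nz
        tmp(:) = a(k,:)
        call scale_col( n, tmp )
        a(k,:) = tmp(:)
     end do

     print *, a

   contains

     subroutine scale_col( m, x )
       integer, intent(in)    :: m
       real,    intent(inout) :: x(m)
       x = 2.0 * x
     end subroutine scale_col

   end program slice_copy_demo

Both loops give the same answer; the difference is only where the copy happens, which is what makes the explicit version cheaper in the CLUBB case.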
(After this modification, PGI is slightly faster than Intel, for this part of the code and when using MPI-only. OpenMP/PGI still has performance problems here: 4-way OpenMP is 2X slower than MPI-only for this part of the code).
I can create a PR with this modification and let the atmosphere group look at it. Like everything else in the physics, this will not change performance dramatically - we still need to make many optimizations in many locations - but this was a taller nail than most.
Also, the "real" solution is to push the loop over gr%nz into the call to pdf_closure, for many reasons, including vectorization targeting systems like the Phi. This is much more intrusive and probably best done by a CLUBB developer.
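For illustration only, a toy version of that restructuring (made-up names, not an actual pdf_closure interface): the callee takes the full arrays and loops over the vertical index itself, so the caller no longer slices per level.

   ! Toy sketch of a level-dimensioned interface (hypothetical, not CLUBB code).
   subroutine closure_all_levels( nz, n, x_in, y_out )
     integer, intent(in)  :: nz, n
     real,    intent(in)  :: x_in(nz,n)
     real,    intent(out) :: y_out(nz,n)
     integer :: j
     ! The k loop now lives inside the routine: whole contiguous columns are
     ! processed at once, removing per-level call overhead and slice copies
     ! and exposing the vertical dimension for vectorization.
     do j = 1, n
        y_out(:,j) = 2.0 * x_in(:,j)
     end do
   end subroutine closure_all_levels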
(@singhbalwinder, please reassign to the appropriate person.)