CLUBB/pdf_closure call optimizations #1031

Closed
worleyph opened this issue Aug 26, 2016 · 4 comments

worleyph commented Aug 26, 2016

I'm creating this issue so that there is something to refer to in a PR, and to ask about the process here.

  1. I am pretty sure that NCAR has been making optimizations in the atmospheric physics, so anything we do may very well be duplicative. If we could coordinate in the near future, it would help in allocating scarce resources.

  2. I also do not know what the procedure here is for making changes to CLUBB, since it is also an external package.

In any case, optimization number 1:

In the routine advance_clubb_core (in cam/src/physics/clubb/advance_clubb_core_module.F90) there are loops of the form:

     do k = 1, gr%nz, 1
       call pdf_closure &
         ( ...
           zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:),              & ! intent(in)
           wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),          & ! intent(in)
           sclrpthlp_zt(k,:), k,                                       & ! intent(in)
           wphydrometp_zt(k,:), wp2hmp(k,:),                           & ! intent(in)
           ...
           rtphmp_zt(k,:), thlphmp_zt(k,:),                            & ! intent(in)
           wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),          & ! intent(out)
           wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),          & ! intent(out)
          ...
       )

Each of the 15 array arguments of the form XXX(k,:) corresponds to a dummy argument declared inside pdf_closure as a rank-1 array XXX(:), and the compilers appear to be creating local temporaries and copying into and out of them at every call (since k is the first dimension, each XXX(k,:) slice is non-contiguous in memory). This is pretty low level, sitting inside loops over chunks, then local columns, and then nadv.

Explicitly allocating temporary arrays of the correct dimensions, and copying into them (for intent(in)) and out of them (for intent(out)) outside of the call to pdf_closure, improves performance.

  • For the Intel compiler on Titan, this drops the cost by around 15%.
  • For the PGI compiler on Titan, this decreases the cost by a factor of six!

(After this modification, PGI is slightly faster than Intel for this part of the code when using MPI only. OpenMP/PGI still has performance problems here: 4-way OpenMP is 2X slower than MPI-only.)
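
For concreteness, here is a minimal, self-contained sketch of the explicit-copy workaround. The names (fake_closure, sclrm_col, nsclr, and so on) are hypothetical stand-ins, and the real pdf_closure interface has far more arguments; this only illustrates the pattern of copying into contiguous scratch arrays before the call and back out afterwards.

    ! Sketch only: hypothetical names and sizes, not the actual CLUBB code.
    module slice_copy_sketch
      implicit none
      integer, parameter :: nz = 72, nsclr = 3
    contains
      ! Stand-in for pdf_closure: rank-1 dummy arguments, like the real routine.
      subroutine fake_closure( sclrm_in, wp2sclrp_out )
        real, intent(in)  :: sclrm_in(:)
        real, intent(out) :: wp2sclrp_out(:)
        wp2sclrp_out = 2.0 * sclrm_in        ! placeholder physics
      end subroutine fake_closure

      subroutine advance_with_explicit_copies( sclrm, wp2sclrp )
        real, intent(in)  :: sclrm(nz,nsclr)
        real, intent(out) :: wp2sclrp(nz,nsclr)
        real    :: sclrm_col(nsclr), wp2sclrp_col(nsclr)
        integer :: k
        do k = 1, nz
          sclrm_col = sclrm(k,:)             ! explicit copy-in  (intent(in))
          call fake_closure( sclrm_col, wp2sclrp_col )
          wp2sclrp(k,:) = wp2sclrp_col       ! explicit copy-out (intent(out))
        end do
      end subroutine advance_with_explicit_copies
    end module slice_copy_sketch

The contiguous scratch arrays (sclrm_col, wp2sclrp_col) are what get passed, so the compiler has no reason to build its own hidden temporaries at each call.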

I can create a PR with this modification and let the atmosphere group look at it. Like everything else in the physics, this will not change performance dramatically - we still need to make many optimizations in many locations - but this was a taller nail than most.

Also, the "real" solution is to push the loop over gr%nz into pdf_closure itself, for many reasons, including vectorization on systems like the Phi. This is much more intrusive and probably best done by a CLUBB developer.
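
A rough sketch of what that restructuring might look like, again with hypothetical names and only one argument in each direction: the full 2-D arrays are passed once and the level loop lives inside the routine, so there are no per-level slices or copies at the call site and the compiler is free to vectorize over k.

    ! Hypothetical whole-column interface, not the actual CLUBB code.
    subroutine fake_closure_all_levels( nz, sclrm, wp2sclrp )
      implicit none
      integer, intent(in)  :: nz
      real,    intent(in)  :: sclrm(:,:)      ! (nz, nsclr)
      real,    intent(out) :: wp2sclrp(:,:)   ! (nz, nsclr)
      integer :: k
      do k = 1, nz
        wp2sclrp(k,:) = 2.0 * sclrm(k,:)      ! placeholder physics
      end do
    end subroutine fake_closure_all_levels

The caller would then invoke it once per chunk/column, with no k loop around the call.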

( @singhbalwinder , please reassign to the appropriate person.)

@philrasch

Hi Pat,

Your suggested short-term modifications make sense to me, and I am very comfortable with OKing them. For the longer term, I think there are a few other issues to consider.

First, we need to establish that CLUBB will be part of the next generations of the ACME model. @davidcbaderatllnl has indicated that Version 2 will be primarily a performance enhancement, so if a rewrite of this code is important to getting that performance, then we should go for it.

Second, for Version 3, other issues might be important to consider, and we may need guidance from the executive council (adding @lrleung, @mt5555), since the ESM program is supporting research on other codes that may be candidates for turbulence and convection, including MMF, EDMF, and probably others that I am not aware of.

Finally, some of the CMDV activities may be important to consider, including the CLUBB rewrites that the @agsalin team is doing.

I don't know how to proceed for the long run, so I have mentioned some of the other people who may have opinions on this.

@worleyph

@philrasch , thank you for the comments. I will go ahead and create a PR with these changes, and let the atmosphere group (and @amametjanov and @ndkeen, for performance at ALCF and NERSC) look at them and pass judgement. I will ask again if other optimization opportunities present themselves - my target is primarily V1, in particular to help determine what is affordable for V1.

@worleyph

@philrasch, just to reiterate: my question is about whether there are any issues specific to modifying CLUBB, and my concern is with V1. If NCAR or others already have optimizations that we might consider, is there any mechanism to evaluate and adopt those now, rather than spending ACME resources on optimizing code that has already been optimized? Issues for V2 and V3 are conversations for others, in a different setting. Also, CLUBB is far from being the only target for optimization; it was at the top of my list because the PGI compiler in particular has issues with it.

@philrasch

Thanks Pat, now I get it. There are two issues I can think of associated with adopting CESM modifications to CLUBB.

  1. The trivial one is that we have been modifying CLUBB to allow a bunch of tuning parameters, and we would have to merge those again.
  2. We need to ascertain the impact on the model climate if we think it might change. I suspect your suggested code revisions for optimization would not change answers if the threading, optimization, etc. are turned off, and that would be a good start in reassuring us that the code would remain solid when it is run in an HPC mode. We are not yet at the stage where we require that the answers don't change at all. But I don't have a clue whether the changes that NCAR has been making since they gave us the code are climate changing.

On a related note, I am also thinking about these issues in connection with the suggestion by @cecilehannay that we adopt a more recent tag of CLUBB to address a bug in the supersaturation identified here: https://acme-climate.atlassian.net/browse/AG-444. I would have preferred it if we could maintain consistency with NCAR from the get-go, but maybe it is too late for that this time around. If you or Balwinder have advice on this, I am open to suggestions.

I realize my comments are probably not helping.

singhbalwinder added a commit that referenced this issue Sep 9, 2016
Working around implicit array copies in CLUBB subroutine calls

In the routine advance_clubb_core
(in cam/src/physics/clubb/advance_clubb_core_module.F90) there are
loops of the form:

     do k = 1, gr%nz, 1
       call pdf_closure &
         ( ...
           zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:),              & ! intent(in)
           wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),          & ! intent(in)
           sclrpthlp_zt(k,:), k,                                       & ! intent(in)
           wphydrometp_zt(k,:), wp2hmp(k,:),                           & ! intent(in)
           ...
           rtphmp_zt(k,:), thlphmp_zt(k,:),                            & ! intent(in)
           wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),          & ! intent(out)
           wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),          & ! intent(out)
           ...
       )

Each of the 15 arrays of the form XXX(k,:) is declared internally as
an array of size XXX(:), and the compilers apparently are creating
local temporaries and copying into and out of these. This is pretty
low level (being inside loops over first chunks, then local columns,
and then nadv).

Explicitly allocating temporary arrays of the correct dimensions and
copying into (for intent(in)) and out of (for intent(out)) external to
the call to pdf_closure improves performance.

For the Intel compiler on Titan, this drops the cost by around 15%.
For the PGI compiler on Titan, this decreases the cost by a factor of
6.

This change modifies only two of the loops containing calls to
pdf_closure, as these are the only two that are exercised in the
current ACME test cases. There are two others that should be modified
in analogous ways if l_use_ice_latent is true.

Fixes #1031
[BFB]

* worleyph/cam/pdf_closure_call_opt:
  Working around implicit array segment copies
singhbalwinder added a commit that referenced this issue Feb 15, 2017
Working around implicit array copies in CLUBB subroutine calls

* worleyph/cam/pdf_closure_call_opt:
  Added an if condition for fixing ONE TRIP behavior of some compilers.
  Working around implicit array segment copies
singhbalwinder added a commit that referenced this issue Feb 21, 2017
Working around implicit array copies in CLUBB subroutine calls

* worleyph/cam/pdf_closure_call_opt:
  Rearranged code so that it passes Skybridge testing
singhbalwinder added a commit that referenced this issue Feb 23, 2017
Working around implicit array copies in CLUBB subroutine calls

* worleyph/cam/pdf_closure_call_opt:
  Rearranged code so that it passes Skybridge testing
  Added an if condition for fixing ONE TRIP behavior of some compilers.
  Working around implicit array segment copies