Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Working around implicit array copies in CLUBB subroutine calls #1033

Merged
merged 3 commits into from
Feb 23, 2017

Conversation

worleyph
Copy link
Contributor

@worleyph worleyph commented Aug 28, 2016

In the routine advance_clubb_core (in cam/src/physics/clubb/advance_clubb_core_module.F90) there are loops of the form:

do k = 1, gr%nz, 1
call pdf_closure &
(...
zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:), & ! intent(in)
wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),& ! intent(in)
sclrpthlp_zt(k,:), k, & ! intent(in)
wphydrometp_zt(k,:), wp2hmp(k,:), & ! intent(in)
...
rtphmp_zt(k,:), thlphmp_zt(k,:), & ! intent(in)
wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),& ! intent(out)
wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),& ! intent(out)
...
)

Each of the 15 arrays of the form XXX(k,:) is declared internally as
an array of size XXX(:), and the compilers apparently are creating
local temporaries and copying into and out of these. This is pretty
low level (being inside loops over first chunks, then local columns,
and then nadv).

Explicitly allocating temporary arrays of the correct dimensions and
copying into (for intent(in)) and out of (for intent(out)) external to
the call to pdf_closure improves performance.

For the Intel compiler on Titan, this drops the cost by around 15%.
For the PGI compiler on Titan, this decreases the cost by a factor of
6.

This modification only modifies two of the loops containing calls to
pdf_closure, as these are the only two that are exercized in the
current ACME test cases. There are two others that should be modified in
analogous ways if l_use_ice_latent is true.

Fixes #1031
[BFB]

@worleyph
Copy link
Contributor Author

Tested using FC5AV1C-L on TItan wth both ne30 and ne120, both MPI-only and MPI/OpenMP and both Intel and PGI, and it ran.
It was BFB for PGI and ne30 and MPI_only on Titan (5 simulated days), looking at atm.log. It should be BFB for the other configurations, but I have not confirmed it.
Need to check that this does at least no harm at ALCF and NERSC.

@golaz golaz added this to the v1.0 Alpha.8 milestone Aug 29, 2016
@singhbalwinder
Copy link
Contributor

@worleyph : This PR is scheduled to be merged to next tomorrow. I will run some tests on it today to make sure everything looks good.

@singhbalwinder
Copy link
Contributor

This branch is branched off at a point where a buggy commit was committed to the master which was later reverted. As such, running tests on this branch would break ERS tests. I am going to rebase it off of the current master.

@singhbalwinder singhbalwinder force-pushed the worleyph/cam/pdf_closure_call_opt branch from 2c4e7ac to 5e0e2d4 Compare September 8, 2016 21:46
@singhbalwinder
Copy link
Contributor

Rebased it off of the current master and ran acme developer tests. Everything looks great.

singhbalwinder added a commit that referenced this pull request Sep 9, 2016
Working around implicit array copies in CLUBB subroutine calls

In the routine advance_clubb_core
(in cam/src/physics/clubb/advance_clubb_core_module.F90) there are
loops of the form:

do k = 1, gr%nz, 1
call pdf_closure &
(...
zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:), & ! intent(in)
wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),& ! intent(in)
sclrpthlp_zt(k,:), k, & ! intent(in)
wphydrometp_zt(k,:), wp2hmp(k,:), & ! intent(in)
...
rtphmp_zt(k,:), thlphmp_zt(k,:), & ! intent(in)
wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),& ! intent(out)
wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),& ! intent(out)
...
)

Each of the 15 arrays of the form XXX(k,:) is declared internally as
an array of size XXX(:), and the compilers apparently are creating
local temporaries and copying into and out of these. This is pretty
low level (being inside loops over first chunks, then local columns,
and then nadv).

Explicitly allocating temporary arrays of the correct dimensions and
copying into (for intent(in)) and out of (for intent(out)) external to
the call to pdf_closure improves performance.

For the Intel compiler on Titan, this drops the cost by around 15%.
For the PGI compiler on Titan, this decreases the cost by a factor of
6.

This modification only modifies two of the loops containing calls to
pdf_closure, as these are the only two that are exercized in the
current ACME test cases. There are two others that should be modified in
analogous ways if l_use_ice_latent is true.

Fixes #1031
[BFB]

* worleyph/cam/pdf_closure_call_opt:
  Working around implicit array segment copies
@singhbalwinder
Copy link
Contributor

Pushed to next.

@rljacob
Copy link
Member

rljacob commented Sep 11, 2016

This is giving runtime errors when testing with intel on skybridge.

forrtl: severe (408): fort: (2): Subscript #1 of the array TMP_SCLRM has value 1 which is greater than the upper bound of 0

Image              PC                Routine            Line        Source
cesm.exe           000000000A637DB0  Unknown               Unknown  Unknown
cesm.exe           00000000044F6105  advance_clubb_cor        1067  advance_clubb_core_module.F90
cesm.exe           0000000002726F6B  clubb_intr_mp_clu        1843  clubb_intr.F90
cesm.exe           0000000000E930D7  physpkg_mp_tphysb        2362  physpkg.F90
cesm.exe           0000000000E61AA8  physpkg_mp_phys_r        1008  physpkg.F90
cesm.exe           00000000008D7C12  cam_comp_mp_cam_r         250  cam_comp.F90
cesm.exe           0000000000896AFA  atm_comp_mct_mp_a         343  atm_comp_mct.F90
cesm.exe           0000000000580B5A  component_mod_mp_         230  component_mod.F90
cesm.exe           0000000000556E70  cesm_comp_mod_mp_        1921  cesm_comp_mod.F90
cesm.exe           0000000000571188  MAIN__                    102  cesm_driver.F90
cesm.exe           0000000000537D8E  Unknown               Unknown  Unknown
libc.so.6          00002AAAB1038D5D  Unknown               Unknown  Unknown
cesm.exe           0000000000537C69  Unknown               Unknown  Unknown

Will revert from next and remove from the alpha.8 tag.

@rljacob rljacob removed this from the v1.0 Alpha.8 milestone Sep 11, 2016
singhbalwinder added a commit that referenced this pull request Sep 11, 2016
…#1033)"

This commit reverts PR #1033. This PR was causing the following runtime
failure on Skybridge next:

forrtl: severe (408): fort: (2): Subscript #1 of the array TMP_SCLRM
has value 1 which is greater than the upper bound of 0

cesm.exe 00000000044F6105 advance_clubb_cor 1067
advance_clubb_core_module.F90

This reverts commit bfcc8fc, reversing
changes made to 0ec8b8d.
@singhbalwinder
Copy link
Contributor

Hi @rljacob and @worleyph : I have reverted this PR from next. Once we have a fix we will re-merge it into next.

@worleyph
Copy link
Contributor Author

worleyph commented Oct 4, 2016

@singhbalwinder , there are no CLUBB-savvy developers who can fix this? What option leads the indicated array to have zero dimension? Looks like all of these arrays will have zero dimension if one does, i.e. that sclr_dim == 0.

It appears that the fix will look something like (already in the code):

  if ( sclr_dim > 0 ) then
    wpsclrp(1,1:sclr_dim)   = wpsclrp_sfc(1:sclr_dim)
  end if

So if the correct fix is just to wrap the code that writes to/reads from the temporary arrays with

if ( sclr_dim > 0 ) then
...
endif

we could emulate this exactly, replacing, for example,

    do i = 1, sclr_dim, 1
      tmp_sclrm_zm(i)    = sclrm_zm(k,i)    ! intent(in)
      tmp_wpsclrp(i)     = wpsclrp(k,i)     ! intent(in)
      tmp_sclrp2(i)      = sclrp2(k,i)      ! intent(in)
      tmp_sclrprtp(i)    = sclrprtp(k,i)    ! intent(in)
      tmp_sclrpthlp(i)   = sclrpthlp(k,i)   ! intent(in)
      tmp_wphydrometp(i) = wphydrometp(k,i) ! intent(in)
      tmp_wp2hmp_zm(i)   = wp2hmp_zm(k,i)   ! intent(in)
      tmp_rtphmp(i)      = rtphmp(k,i)      ! intent(in)
      tmp_thlphmp(i)     = thlphmp(k,i)     ! intent(in)
    enddo

with

    if ( sclr_dim > 0 ) then
      tmp_sclrm_zm(1:sclr_dim)    = sclrm_zm(k,1:sclr_dim)    ! intent(in)
      tmp_wpsclrp(1:sclr_dim)     = wpsclrp(k,1:sclr_dim)     ! intent(in)
      tmp_sclrp2(1:sclr_dim)      = sclrp2(k,1:sclr_dim)      ! intent(in)
      tmp_sclrprtp(1:sclr_dim)    = sclrprtp(k,1:sclr_dim)    ! intent(in)
      tmp_sclrpthlp(1:sclr_dim)   = sclrpthlp(k,1:sclr_dim)   ! intent(in)
      tmp_wphydrometp(1:sclr_dim) = wphydrometp(k,1:sclr_dim) ! intent(in)
      tmp_wp2hmp_zm(1:sclr_dim)   = wp2hmp_zm(k,1:sclr_dim)   ! intent(in)
      tmp_rtphmp(1:sclr_dim)      = rtphmp(k,1:sclr_dim)      ! intent(in)
      tmp_thlphmp(1:sclr_dim)     = thlphmp(k,1:sclr_dim)     ! intent(in)
    endif

and analogously in the other 3 locations?

Can you make this change, or do you want me to? If I do it, I'll just start over with a new pull request.

@singhbalwinder
Copy link
Contributor

Thanks @worleyph . I had something similar in my mind but wanted to get your opinion on this. I will make that change, test it and push it to next.

I did run tests on this PR before pushing it to next and the code didn't break on my end on Eos. Skybridge might have a different version of Intel compiler which caught this bug (I was using all intel debug flags on Eos).

@worleyph
Copy link
Contributor Author

worleyph commented Oct 4, 2016

In my opinion, this should not have failed - this is the "ONE TRIP" behavior that is no longer the Fortran default (and hasn't been for many years), and which we should not be enabling on our test systems. However, the new code avoids the problem with only one additional if test per loop body and the same logic also used elsewhere in this routine.

@rljacob
Copy link
Member

rljacob commented Jan 10, 2017

What is the status of this PR?

@worleyph
Copy link
Contributor Author

As far as I know, it was ready to go in in October, but @singhbalwinder needed to add the if-tests? I don't see that this was ever checked in though.

@singhbalwinder
Copy link
Contributor

Sorry this got delayed. I will work on it as soon as I resolve the non-BFB threading issue.

@worleyph
Copy link
Contributor Author

@singhbalwinder , can you finish this up now? It is (or was) BFB.

Patrick Worley and others added 2 commits February 14, 2017 12:07
In the routine advance_clubb_core (in
cam/src/physics/clubb/advance_clubb_core_module.F90) there are loops
of the form:

do k = 1, gr%nz, 1
 call pdf_closure &
  (...
   zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:),    & ! intent(in)
   wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),& ! intent(in)
   sclrpthlp_zt(k,:), k,                             & ! intent(in)
   wphydrometp_zt(k,:), wp2hmp(k,:),                 & ! intent(in)
   ...
   rtphmp_zt(k,:), thlphmp_zt(k,:),                  & ! intent(in)
   wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),& ! intent(out)
   wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),& ! intent(out)
   ...
  )

Each of the 15 arrays of the form XXX(k,:) is declared internally as
an array of size XXX(:), and the compilers apparently are creating
local temporaries and copying into and out of these. This is pretty
low level (being inside loops over first chunks, then local columns,
and then nadv).

Explicitly allocating temporary arrays of the correct dimensions and
copying into (for intent(in)) and out of (for intent(out)) external to
the call to pdf_closure improves performance.

For the Intel compiler on Titan, this drops the cost by around 15%.
For the PGI compiler on Titan, this decreases the cost by a factor of
6.

This modification only modifies two of the loops containing calls to
pdf_closure, as these are the only two that are exercized in the
current ACME test cases. There are two others that should be modified in
analogous ways if l_use_ice_latent is true.
Code is rebased to new master. This commit adds if conditions to fix
"one trip" behavior shown by some compilers during testing.

[BFB] - Bit-For-Bit
@singhbalwinder singhbalwinder force-pushed the worleyph/cam/pdf_closure_call_opt branch from 5e0e2d4 to ba2fb95 Compare February 14, 2017 22:28
@singhbalwinder
Copy link
Contributor

I have now made the necessary changes and rebased this branch off of the current master. @rljacob : Can I merge this PR to next today?

@rljacob
Copy link
Member

rljacob commented Feb 14, 2017

Yes go ahead.

@rljacob rljacob added this to the v1.0beta1 milestone Feb 14, 2017
singhbalwinder added a commit that referenced this pull request Feb 15, 2017
Working around implicit array copies in CLUBB subroutine calls

In the routine advance_clubb_core
(in cam/src/physics/clubb/advance_clubb_core_module.F90) there are loops
 of the form:

do k = 1, gr%nz, 1
call pdf_closure &
(...
zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:), & ! intent(in)
wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),& ! intent(in)
sclrpthlp_zt(k,:), k, & ! intent(in)
wphydrometp_zt(k,:), wp2hmp(k,:), & ! intent(in)
...
rtphmp_zt(k,:), thlphmp_zt(k,:), & ! intent(in)
wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),& ! intent(out)
wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),& ! intent(out)
...
)

Each of the 15 arrays of the form XXX(k,:) is declared internally as
an array of size XXX(:), and the compilers apparently are creating
local temporaries and copying into and out of these. This is pretty
low level (being inside loops over first chunks, then local columns,
and then nadv).

Explicitly allocating temporary arrays of the correct dimensions and
copying into (for intent(in)) and out of (for intent(out)) external to
the call to pdf_closure improves performance.

For the Intel compiler on Titan, this drops the cost by around 15%.
For the PGI compiler on Titan, this decreases the cost by a factor of
6.

This modification only modifies two of the loops containing calls to
pdf_closure, as these are the only two that are exercized in the
current ACME test cases. There are two others that should be modified in
analogous ways if l_use_ice_latent is true.

Fixes #1031
[BFB]

* worleyph/cam/pdf_closure_call_opt:
  Added an if condition for fixing ONE TRIP behavior of some compilers.
  Working around implicit array segment copies
@singhbalwinder
Copy link
Contributor

merged into next

Tests on skybridge was failig possibly due to "one-trip" compiler
behavior.

[BFB] - Bit-For-Bit
singhbalwinder added a commit that referenced this pull request Feb 21, 2017
Working around implicit array copies in CLUBB subroutine calls

In the routine advance_clubb_core
(in cam/src/physics/clubb/advance_clubb_core_module.F90)
there are loops of the form:

do k = 1, gr%nz, 1
call pdf_closure &
(...
zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:), & ! intent(in)
wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),& ! intent(in)
sclrpthlp_zt(k,:), k, & ! intent(in)
wphydrometp_zt(k,:), wp2hmp(k,:), & ! intent(in)
...
rtphmp_zt(k,:), thlphmp_zt(k,:), & ! intent(in)
wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),& ! intent(out)
wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),& ! intent(out)
...
)

Each of the 15 arrays of the form XXX(k,:) is declared internally as
an array of size XXX(:), and the compilers apparently are creating
local temporaries and copying into and out of these. This is pretty
low level (being inside loops over first chunks, then local columns,
and then nadv).

Explicitly allocating temporary arrays of the correct dimensions and
copying into (for intent(in)) and out of (for intent(out)) external to
the call to pdf_closure improves performance.

For the Intel compiler on Titan, this drops the cost by around 15%.
For the PGI compiler on Titan, this decreases the cost by a factor of
6.

This modification only modifies two of the loops containing calls to
pdf_closure, as these are the only two that are exercized in the
current ACME test cases. There are two others that should be modified in
analogous ways if l_use_ice_latent is true.

Fixes #1031
[BFB]

* worleyph/cam/pdf_closure_call_opt:
  Rearranged code so that it passes Skybridge testing
@singhbalwinder
Copy link
Contributor

@worleyph : I modified code thinking that it will avoid the "one-trip" behavior we encountered on Skybridge . My modified code (see commit 0a2325c) was a slight variation to your suggested modifications but, I think, achieves the same goal (unless I am missing something).

The tests were again breaking on Skybridge but running fine on other machines (Melvin, EoS, Constance are the some I tested). To address this, I revised the code (see commit ada7b79) again to completely remove the do-loops and assign the 2d arrays to 1d arrays directly. I pushed the code to next yesterday and it worked fine. I hope you are in agreement with the changes I made to get it through Skybridge testing. If not, please let me know and I will revise the code accordingly.

@worleyph
Copy link
Contributor Author

I'll run a quick test on Titan with PGI, to see if the performance is what it should be. Thanks for continuing to beat on this. I can't believe that the compiler on skybridge has this one-trip issue - that should have gone away 30 years ago, unless there is a flag in the skybridge env_mach_specific that is forcing the one-trip behavior.

@singhbalwinder
Copy link
Contributor

I was thinking on the same lines and checked the config_compiler.xml for skybridge to see if there is a flag which turns this on but found nothing which is obvious. The compiler version is also not too old so I am not sure why Skybridge is behaving this way.

@rljacob
Copy link
Member

rljacob commented Feb 21, 2017

I might be particular to the exact version of Intel, 15.0.3, that skybridge is using.

@worleyph
Copy link
Contributor Author

@singhbalwinder, please complete the PR. Your latest version has a significant performance advantage over the original code when using PGI on TItan. The 'optimization' also improves performance with Intel a little, so introduces no penalty other than the more complicated code.

@rljacob
Copy link
Member

rljacob commented Feb 23, 2017

Yes the new version passed testing on next. Please merge to master.

@singhbalwinder singhbalwinder merged commit ada7b79 into master Feb 23, 2017
singhbalwinder added a commit that referenced this pull request Feb 23, 2017
Working around implicit array copies in CLUBB subroutine calls

In the routine advance_clubb_core
(in cam/src/physics/clubb/advance_clubb_core_module.F90) there are loops
of the form:

do k = 1, gr%nz, 1
call pdf_closure &
(...
zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:), & ! intent(in)
wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:),& ! intent(in)
sclrpthlp_zt(k,:), k, & ! intent(in)
wphydrometp_zt(k,:), wp2hmp(k,:), & ! intent(in)
...
rtphmp_zt(k,:), thlphmp_zt(k,:), & ! intent(in)
wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:),& ! intent(out)
wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:),& ! intent(out)
...
)

Each of the 15 arrays of the form XXX(k,:) is declared internally as
an array of size XXX(:), and the compilers apparently are creating
local temporaries and copying into and out of these. This is pretty
low level (being inside loops over first chunks, then local columns,
and then nadv).

Explicitly allocating temporary arrays of the correct dimensions and
copying into (for intent(in)) and out of (for intent(out)) external to
the call to pdf_closure improves performance.

For the Intel compiler on Titan, this drops the cost by around 15%.
For the PGI compiler on Titan, this decreases the cost by a factor of
6.

This modification only modifies two of the loops containing calls to
pdf_closure, as these are the only two that are exercized in the
current ACME test cases. There are two others that should be modified in
analogous ways if l_use_ice_latent is true.

Fixes #1031
[BFB]

* worleyph/cam/pdf_closure_call_opt:
  Rearranged code so that it passes Skybridge testing
  Added an if condition for fixing ONE TRIP behavior of some compilers.
  Working around implicit array segment copies
agsalin pushed a commit that referenced this pull request Apr 13, 2017
The Makefile undefine directive was included in make 3.82 . Many
machines (e.g. blues, compute001,..) still use make 3.81 . This
fix uses the "undefine" directive only if the variable is already
defined (On machines that use make 3.81 we are careful not to
define the variable, PNETCDF_PATH, when building with mpi-serial).

Fixes #1033
@singhbalwinder singhbalwinder deleted the worleyph/cam/pdf_closure_call_opt branch July 25, 2017 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Atmosphere BFB PR leaves answers BFB enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants