Working around implicit array copies in CLUBB subroutine calls #1033
Conversation
Tested using FC5AV1C-L on Titan with both ne30 and ne120, both MPI-only and MPI/OpenMP, and both Intel and PGI, and it ran.
@worleyph: This PR is scheduled to be merged to next tomorrow. I will run some tests on it today to make sure everything looks good.
This branch was created at a point where a buggy commit, later reverted, was on master. As such, running tests on this branch would break ERS tests. I am going to rebase it onto the current master.
Force-pushed from 2c4e7ac to 5e0e2d4
Rebased it onto the current master and ran the ACME developer tests. Everything looks great.
Working around implicit array copies in CLUBB subroutine calls (the full PR description is reproduced at the end of this page). Fixes #1031 [BFB]
* worleyph/cam/pdf_closure_call_opt: Working around implicit array segment copies
Pushed to next.
This is giving runtime errors when testing with Intel on Skybridge.
Will revert from next and remove from the alpha.8 tag.
Revert "Working around implicit array copies in CLUBB subroutine calls (#1033)"

This commit reverts PR #1033, which was causing the following runtime failure on Skybridge next:

forrtl: severe (408): fort: (2): Subscript #1 of the array TMP_SCLRM has value 1 which is greater than the upper bound of 0
cesm.exe 00000000044F6105 advance_clubb_cor 1067 advance_clubb_core_module.F90

This reverts commit bfcc8fc, reversing changes made to 0ec8b8d.
@singhbalwinder, are there no CLUBB-savvy developers who can fix this? What option leads the indicated array to have zero dimension? It looks like all of these arrays will have zero dimension if one does, i.e., when sclr_dim == 0. It appears that the fix will look something like (already in the code):

[code block not captured]

So if the correct fix is just to wrap the code that writes to/reads from the temporary arrays with

[code block not captured]

we could emulate this exactly, replacing, for example,

[code block not captured]

with

[code block not captured]

and analogously in the other 3 locations? Can you make this change, or do you want me to? If I do it, I'll just start over with a new pull request.
Thanks @worleyph. I had something similar in mind but wanted to get your opinion on this. I will make that change, test it, and push it to next. I did run tests on this PR before pushing it to next and the code didn't break on my end on Eos. Skybridge might have a different version of the Intel compiler which caught this bug (I was using all Intel debug flags on Eos).
In my opinion, this should not have failed: this is the "one trip" behavior that is no longer the Fortran default (and hasn't been for many years), and which we should not be enabling on our test systems. However, the new code avoids the problem with only one additional if test per loop body, and the same logic is also used elsewhere in this routine.
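For readers unfamiliar with it: under the old "one trip" semantics, a counted DO loop body executes at least once even when the trip count is zero, so a loop like `do i = 1, sclr_dim` with `sclr_dim == 0` still subscripts element 1 of a zero-size array, which is exactly the TMP_SCLRM bounds error in the revert message above. A minimal Python sketch of the failure mode and of the if-test guard (variable names are illustrative, not the actual CLUBB code):

```python
sclr_dim = 0                    # no passive scalars configured
tmp_sclrm = [0.0] * sclr_dim    # zero-size array, like TMP_SCLRM
sclrm_k = [0.0] * sclr_dim

# Modern Fortran semantics: a zero-trip loop body never executes.
for i in range(sclr_dim):
    tmp_sclrm[i] = sclrm_k[i]   # never reached when sclr_dim == 0

# "One trip" semantics force at least one iteration, which then
# subscripts a zero-size array, as in the Skybridge traceback.
one_trip_failed = False
try:
    for i in range(max(sclr_dim, 1)):   # emulate one-trip behavior
        tmp_sclrm[i] = sclrm_k[i]
except IndexError:
    one_trip_failed = True

# The explicit guard adopted in the fix sidesteps the problem entirely:
if sclr_dim > 0:
    tmp_sclrm[:] = sclrm_k
```

With the guard in place, the copy is simply skipped when there are no scalars, regardless of how the compiler treats zero-trip loops.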
What is the status of this PR?
As far as I know, it was ready to be merged in October, but @singhbalwinder needed to add the if-tests? I don't see that this was ever checked in, though.
Sorry this got delayed. I will work on it as soon as I resolve the non-BFB threading issue.
@singhbalwinder, can you finish this up now? It is (or was) BFB.
Working around implicit array copies in CLUBB subroutine calls (the full PR description is reproduced at the end of this page).
Code is rebased to the new master. This commit adds if conditions to fix the "one trip" behavior shown by some compilers during testing. [BFB] - Bit-For-Bit
Force-pushed from 5e0e2d4 to ba2fb95
I have now made the necessary changes and rebased this branch onto the current master. @rljacob: Can I merge this PR to next today?
Yes, go ahead.
Working around implicit array copies in CLUBB subroutine calls (the full PR description is reproduced at the end of this page). Fixes #1031 [BFB]
* worleyph/cam/pdf_closure_call_opt:
Added an if condition for fixing ONE TRIP behavior of some compilers.
Working around implicit array segment copies
Merged into next.
Tests on Skybridge were failing, possibly due to "one-trip" compiler behavior. [BFB] - Bit-For-Bit
Working around implicit array copies in CLUBB subroutine calls (the full PR description is reproduced at the end of this page). Fixes #1031 [BFB]
* worleyph/cam/pdf_closure_call_opt:
Rearranged code so that it passes Skybridge testing
@worleyph: I modified the code thinking that it would avoid the "one-trip" behavior we encountered on Skybridge. My modified code (see commit 0a2325c) was a slight variation on your suggested modifications but, I think, achieves the same goal (unless I am missing something). The tests were again breaking on Skybridge but running fine on other machines (Melvin, EoS, and Constance are some of the ones I tested). To address this, I revised the code (see commit ada7b79) again to completely remove the do-loops and assign the 2d arrays to 1d arrays directly. I pushed the code to next yesterday and it worked fine. I hope you are in agreement with the changes I made to get it through Skybridge testing. If not, please let me know and I will revise the code accordingly.
I'll run a quick test on Titan with PGI to see if the performance is what it should be. Thanks for continuing to beat on this. I can't believe that the compiler on Skybridge has this one-trip issue; that should have gone away 30 years ago, unless there is a flag in the Skybridge env_mach_specific that is forcing the one-trip behavior.
I was thinking along the same lines and checked the config_compiler.xml for Skybridge to see if there is a flag which turns this on, but found nothing obvious. The compiler version is also not too old, so I am not sure why Skybridge is behaving this way.
It might be particular to the exact version of Intel, 15.0.3, that Skybridge is using.
@singhbalwinder, please complete the PR. Your latest version has a significant performance advantage over the original code when using PGI on Titan. The 'optimization' also improves performance with Intel a little, so it introduces no penalty other than the more complicated code.
Yes, the new version passed testing on next. Please merge to master.
Working around implicit array copies in CLUBB subroutine calls (the full PR description is reproduced below). Fixes #1031 [BFB]
* worleyph/cam/pdf_closure_call_opt:
Rearranged code so that it passes Skybridge testing
Added an if condition for fixing ONE TRIP behavior of some compilers.
Working around implicit array segment copies
The Makefile undefine directive was introduced in make 3.82. Many machines (e.g. blues, compute001, ...) still use make 3.81. This fix uses the "undefine" directive only if the variable is already defined. (On machines that use make 3.81 we are careful not to define the variable, PNETCDF_PATH, when building with mpi-serial.) Fixes #1033
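The guard described in that commit message can be sketched as a Makefile fragment. This is a sketch reconstructed from the description above, not the actual file contents:

```make
# "undefine" only exists in GNU make >= 3.82. In a false ifdef branch,
# make skips the enclosed lines, so on make 3.81 machines (where
# PNETCDF_PATH is never defined for mpi-serial builds) the undefine
# directive inside the branch is never parsed.
ifdef PNETCDF_PATH
  undefine PNETCDF_PATH
endif
```

This works because make only scans skipped conditional branches for nested conditional keywords; the unknown directive is never evaluated on older make versions as long as the variable stays undefined there.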
In the routine advance_clubb_core (in cam/src/physics/clubb/advance_clubb_core_module.F90) there are loops of the form:
```fortran
do k = 1, gr%nz, 1
  call pdf_closure &
       (...
        zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:),     & ! intent(in)
        wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:), & ! intent(in)
        sclrpthlp_zt(k,:), k,                              & ! intent(in)
        wphydrometp_zt(k,:), wp2hmp(k,:),                  & ! intent(in)
        ...
        rtphmp_zt(k,:), thlphmp_zt(k,:),                   & ! intent(in)
        wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:), & ! intent(out)
        wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:), & ! intent(out)
        ...
       )
```
Each of the 15 arrays of the form XXX(k,:) is declared internally as
an array of size XXX(:), and the compilers apparently are creating
local temporaries and copying into and out of these. This is pretty
low level (being inside loops over first chunks, then local columns,
and then nadv).
Explicitly allocating temporary arrays of the correct dimensions and
copying into (for intent(in)) and out of (for intent(out)) external to
the call to pdf_closure improves performance.
For the Intel compiler on Titan, this drops the cost by around 15%.
For the PGI compiler on Titan, this decreases the cost by a factor of
6.
This change modifies only two of the loops containing calls to
pdf_closure, as these are the only two that are exercised in the
current ACME test cases. There are two others that should be modified in
analogous ways if l_use_ice_latent is true.
Fixes #1031
[BFB]