Working around implicit array copies in CLUBB subroutine calls #1033
Conversation
Tested using FC5AV1C-L on Titan with both ne30 and ne120, both MPI-only and MPI/OpenMP, and both Intel and PGI, and it ran.
@worleyph: This PR is scheduled to be merged to next tomorrow. I will run some tests on it today to make sure everything looks good.
This branch was created at a point where a buggy commit, later reverted, was on master. As such, running tests on this branch would break ERS tests. I am going to rebase it onto the current master.
Force-pushed from 2c4e7ac to 5e0e2d4
Rebased it onto the current master and ran the ACME developer tests. Everything looks great.
Working around implicit array copies in CLUBB subroutine calls (the full PR description is reproduced at the end of this page). Fixes #1031 [BFB]
* worleyph/cam/pdf_closure_call_opt: Working around implicit array segment copies
Pushed to next.
This is giving runtime errors when testing with Intel on Skybridge.
Will revert from next and remove from the alpha.8 tag.
Revert "Working around implicit array copies in CLUBB subroutine calls (#1033)"

This commit reverts PR #1033, which was causing the following runtime failure on Skybridge next:

forrtl: severe (408): fort: (2): Subscript #1 of the array TMP_SCLRM has value 1 which is greater than the upper bound of 0
cesm.exe 00000000044F6105 advance_clubb_cor 1067 advance_clubb_core_module.F90

This reverts commit bfcc8fc, reversing changes made to 0ec8b8d.
@singhbalwinder, are there no CLUBB-savvy developers who can fix this? What option leads the indicated array to have zero dimension? It looks like all of these arrays will have zero dimension if one does, i.e., when sclr_dim == 0. It appears that the fix will look something like (already in the code):

[code block not captured]

So if the correct fix is just to wrap the code that writes to/reads from the temporary arrays with

[code block not captured]

we could emulate this exactly, replacing, for example,

[code block not captured]

with

[code block not captured]

and analogously in the other 3 locations? Can you make this change, or do you want me to? If I do it, I'll just start over with a new pull request.
Thanks @worleyph. I had something similar in mind but wanted to get your opinion on this. I will make that change, test it, and push it to next. I did run tests on this PR before pushing it to next and the code didn't break on my end on Eos. Skybridge might have a different version of the Intel compiler which caught this bug (I was using all Intel debug flags on Eos).
In my opinion, this should not have failed: this is the "one trip" behavior that is no longer the Fortran default (and hasn't been for many years), and which we should not be enabling on our test systems. However, the new code avoids the problem with only one additional if test per loop body, and the same logic is also used elsewhere in this routine.
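For readers unfamiliar with it: under the old "one trip" semantics, a counted DO loop body executes at least once even when the trip count is zero, so a loop like `do i = 1, sclr_dim` with `sclr_dim == 0` still subscripts element 1 of a zero-size array, which is exactly the TMP_SCLRM bounds error in the revert message above. A minimal Python sketch of the failure mode and of the if-test guard (variable names are illustrative, not the actual CLUBB code):

```python
sclr_dim = 0                    # no passive scalars configured
tmp_sclrm = [0.0] * sclr_dim    # zero-size array, like TMP_SCLRM
sclrm_k = [0.0] * sclr_dim

# Modern Fortran semantics: a zero-trip loop body never executes.
for i in range(sclr_dim):
    tmp_sclrm[i] = sclrm_k[i]   # never reached when sclr_dim == 0

# "One trip" semantics force at least one iteration, which then
# subscripts a zero-size array, as in the Skybridge traceback.
one_trip_failed = False
try:
    for i in range(max(sclr_dim, 1)):   # emulate one-trip behavior
        tmp_sclrm[i] = sclrm_k[i]
except IndexError:
    one_trip_failed = True

# The explicit guard adopted in the fix sidesteps the problem entirely:
if sclr_dim > 0:
    tmp_sclrm[:] = sclrm_k
```

With the guard in place, the copy is simply skipped when there are no scalars, regardless of how the compiler treats zero-trip loops.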
What is the status of this PR?
As far as I know, it was ready to be merged in October, but @singhbalwinder needed to add the if-tests? I don't see that this was ever checked in, though.
Sorry this got delayed. I will work on it as soon as I resolve the non-BFB threading issue.
@singhbalwinder, can you finish this up now? It is (or was) BFB.
Working around implicit array copies in CLUBB subroutine calls (the full PR description is reproduced at the end of this page).
Code is rebased to the new master. This commit adds if conditions to fix the "one trip" behavior shown by some compilers during testing. [BFB] - Bit-For-Bit
Force-pushed from 5e0e2d4 to ba2fb95
I have now made the necessary changes and rebased this branch onto the current master. @rljacob: Can I merge this PR to next today?
Yes, go ahead.
Working around implicit array copies in CLUBB subroutine calls (the full PR description is reproduced at the end of this page). Fixes #1031 [BFB]
* worleyph/cam/pdf_closure_call_opt:
Added an if condition for fixing ONE TRIP behavior of some compilers.
Working around implicit array segment copies
Merged into next.
Tests on Skybridge were failing, possibly due to "one-trip" compiler behavior. [BFB] - Bit-For-Bit
Working around implicit array copies in CLUBB subroutine calls (the full PR description is reproduced at the end of this page). Fixes #1031 [BFB]
* worleyph/cam/pdf_closure_call_opt:
Rearranged code so that it passes Skybridge testing
@worleyph: I modified the code thinking that it would avoid the "one-trip" behavior we encountered on Skybridge. My modified code (see commit 0a2325c) was a slight variation on your suggested modifications but, I think, achieves the same goal (unless I am missing something). The tests were again breaking on Skybridge but running fine on other machines (Melvin, EoS, and Constance are some of the ones I tested). To address this, I revised the code (see commit ada7b79) again to completely remove the do-loops and assign the 2d arrays to 1d arrays directly. I pushed the code to next yesterday and it worked fine. I hope you are in agreement with the changes I made to get it through Skybridge testing. If not, please let me know and I will revise the code accordingly.
I'll run a quick test on Titan with PGI to see if the performance is what it should be. Thanks for continuing to beat on this. I can't believe that the compiler on Skybridge has this one-trip issue; that should have gone away 30 years ago, unless there is a flag in the Skybridge env_mach_specific that is forcing the one-trip behavior.
I was thinking along the same lines and checked the config_compiler.xml for Skybridge to see if there is a flag which turns this on, but found nothing obvious. The compiler version is also not too old, so I am not sure why Skybridge is behaving this way.
It might be particular to the exact version of Intel, 15.0.3, that Skybridge is using.
@singhbalwinder, please complete the PR. Your latest version has a significant performance advantage over the original code when using PGI on Titan. The 'optimization' also improves performance with Intel a little, so it introduces no penalty other than the more complicated code.
Yes, the new version passed testing on next. Please merge to master.
Working around implicit array copies in CLUBB subroutine calls (the full PR description is reproduced below). Fixes #1031 [BFB]
* worleyph/cam/pdf_closure_call_opt:
Rearranged code so that it passes Skybridge testing
Added an if condition for fixing ONE TRIP behavior of some compilers.
Working around implicit array segment copies
The Makefile undefine directive was introduced in make 3.82. Many machines (e.g. blues, compute001, ...) still use make 3.81. This fix uses the "undefine" directive only if the variable is already defined. (On machines that use make 3.81 we are careful not to define the variable, PNETCDF_PATH, when building with mpi-serial.) Fixes #1033
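The guard described in that commit message can be sketched as a Makefile fragment. This is a sketch reconstructed from the description above, not the actual file contents:

```make
# "undefine" only exists in GNU make >= 3.82. In a false ifdef branch,
# make skips the enclosed lines, so on make 3.81 machines (where
# PNETCDF_PATH is never defined for mpi-serial builds) the undefine
# directive inside the branch is never parsed.
ifdef PNETCDF_PATH
  undefine PNETCDF_PATH
endif
```

This works because make only scans skipped conditional branches for nested conditional keywords; the unknown directive is never evaluated on older make versions as long as the variable stays undefined there.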
In the routine advance_clubb_core (in cam/src/physics/clubb/advance_clubb_core_module.F90) there are loops of the form:
```fortran
do k = 1, gr%nz, 1
  call pdf_closure &
       (...
        zm2zt( wpthlp, k ), rtpthlp_zt(k), sclrm(k,:),     & ! intent(in)
        wpsclrp_zt(k,:), sclrp2_zt(k,:), sclrprtp_zt(k,:), & ! intent(in)
        sclrpthlp_zt(k,:), k,                              & ! intent(in)
        wphydrometp_zt(k,:), wp2hmp(k,:),                  & ! intent(in)
        ...
        rtphmp_zt(k,:), thlphmp_zt(k,:),                   & ! intent(in)
        wpsclrprtp(k,:), wpsclrp2(k,:), sclrpthvp_zt(k,:), & ! intent(out)
        wpsclrpthlp(k,:), sclrprcp_zt(k,:), wp2sclrp(k,:), & ! intent(out)
        ...
       )
```
Each of the 15 arrays of the form XXX(k,:) is declared internally as
an array of size XXX(:), and the compilers apparently are creating
local temporaries and copying into and out of these. This is pretty
low level (being inside loops over first chunks, then local columns,
and then nadv).
Explicitly allocating temporary arrays of the correct dimensions and
copying into (for intent(in)) and out of (for intent(out)) external to
the call to pdf_closure improves performance.
For the Intel compiler on Titan, this drops the cost by around 15%.
For the PGI compiler on Titan, this decreases the cost by a factor of
6.
This change modifies only two of the loops containing calls to
pdf_closure, as these are the only two that are exercised in the
current ACME test cases. There are two others that should be modified in
analogous ways if l_use_ice_latent is true.
Fixes #1031
[BFB]