Add return status checks in cohort deallocations #824
This avoids errors
for E3SM tests
and
for
at time-step
during calls to
With the changes in this PR, all three runs succeed.
and on 1 Cori-KNL node, the 64-MPI run still has 11% (=4279.31/3835.31) growth, the same as in E3SM-Project/E3SM#4709
|
Yeah what is taking so long? If you don't have access to Summit for testing, get access. All E3SM developers should be on the LCFs because our code must run there. |
Az or others who already have access can test on OLCF machines as it takes a while for you to get access. Meanwhile, you probably can sign off if these changes look okay from the model dev perspective. |
The issue we had was that the changes did not address the underlying problems. If any of those deallocations are not successfully completed, that should result in catastrophic failure. It just shouldn't happen, and the model should not be allowed to proceed because it will generate a memory leak. As far as we know, this is only a problem on certain machines, ones that we do not necessarily have access to. The plan was to get access to the compilers and figure out why these deallocations are not happening; however, we only have so many resources and this has not made it to the top of the stack. We were not aware that this was a priority. Is this creating problems on your side? We could have a discussion about priorities and resources if that is true. |
Just noting that this issue presents itself on two different architectures (Power, x86_64) and with different compilers (IBM, Cray) that adhere relatively strictly to the Fortran standards. That potentially indicates a subtle bug which is masked by the default behavior of other compilers, like GNU, that you are testing with. Sometimes it may take a while to narrow down the root cause. If this fix is harmless, then it might be good to incorporate it to mitigate the impact of this issue while the root cause investigation goes on. Another issue that I wanted to check with you is the use of uninitialized variables. Could you confirm whether you explicitly check for those in your development workflow? For example, Land uses NaN initializations that cause various issues and we are trying to get rid of those. |
As an incremental solution, can we change the warning messages on the iostat checks to warnings plus graceful failures (i.e., endruns)? Would that address the original intent of the PR @amametjanov @rljacob @sarats ? |
Regarding the full fix... One suspicion Greg and I have debated a little bit is that this has to do with how we define the derived type, or possibly how we define the derived type that the deallocation is nested on. We have these chains of derived types: site%patch%cohort. Anyway, the cohort derived type is the one that is having trouble being deallocated under ibm/cray. It is defined as a pointer, and we have a pointer to the structure defined locally in the subroutine. But in the routine where the deallocation occurs, the patch is defined locally as a target. It could be that some compilers won't deallocate an object nested on a target? See lines 723 and 708 in the diff: https://github.com/NGEET/fates/pull/824/files#diff-c9ef820edf9fa01f23c5d61e6aec25fa126e07b0937379ec099cf9e14146727bR723 Declaring these as targets is legacy. Maybe it's the problem, maybe not; I'm speculating. One test that would help narrow down what is going on would be to identify whether any successful deallocations are happening. That would be the first test I would conduct: check whether iostat=0 is ever returned. If it is, even once, then it's not a problem with how the types are defined. |
Through the myOLCF portal, you should join the 'cli115' project to test on Summit. Mark Taylor is the PI. The cli133 and cli133_crusher projects are for testing on Crusher but come under ECP, so Mark has to sign off on those. |
Thanks @sarats. I'll submit an application to join and give Mark a heads up. |
@rgknox I suspect (optimistically) that you may have found the problem. Perhaps, we can construct a simple standalone example to test this out? |
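A standalone probe along these lines could look like the sketch below. It is only an illustration of the hypothesis discussed above, not the FATES code: `cohort_type`, `patch_type`, and both `free_cohort_*` routines are hypothetical stand-ins. It deallocates a pointer component through a local pointer twice, once where the patch dummy argument is declared `target` and once where it is not, and prints the returned `stat` so the two cases can be compared per compiler.

```fortran
module chain_types
  implicit none

  ! Hypothetical stand-ins for the patch%cohort part of the chain.
  type :: cohort_type
     real :: dbh = 0.0
  end type cohort_type

  type :: patch_type
     type(cohort_type), pointer :: cohort => null()
  end type patch_type

contains

  ! Variant A: the patch dummy argument carries the TARGET attribute.
  subroutine free_cohort_target(patch, istat)
    type(patch_type), intent(inout), target :: patch
    integer, intent(out) :: istat
    type(cohort_type), pointer :: ccohort

    ccohort => patch%cohort
    deallocate(ccohort, stat=istat)
    ! On success, patch%cohort no longer points at valid memory.
    if (istat == 0) nullify(patch%cohort)
  end subroutine free_cohort_target

  ! Variant B: same logic, but the dummy argument has no TARGET attribute.
  subroutine free_cohort_plain(patch, istat)
    type(patch_type), intent(inout) :: patch
    integer, intent(out) :: istat
    type(cohort_type), pointer :: ccohort

    ccohort => patch%cohort
    deallocate(ccohort, stat=istat)
    if (istat == 0) nullify(patch%cohort)
  end subroutine free_cohort_plain

end module chain_types

program dealloc_probe
  use chain_types
  implicit none

  type(patch_type), target :: patch_a
  type(patch_type)         :: patch_b
  integer :: istat

  allocate(patch_a%cohort)
  call free_cohort_target(patch_a, istat)
  print *, 'target variant:     stat =', istat

  allocate(patch_b%cohort)
  call free_cohort_plain(patch_b, istat)
  print *, 'non-target variant: stat =', istat
end program dealloc_probe
```

If the two variants report different `stat` values under the IBM or Cray compilers but not under GNU, that would point at the `target` declaration rather than at the type definitions themselves. Whether this small case actually triggers the failure seen in the full model is an open question.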
5721c20 to edebec9
…ks. Removed target designations where unnecessary.
The issue did indeed end up being due to use of the I still need to clean up this PR a bit and I think it would be good to incorporate some of the changes that Ryan made with his test branches (including the merge up to API25). Note that this PR still needs full regression testing and the land developer testing of this PR won't be available until Shijie's E3SM-Project/E3SM#5429 is integrated on the E3SM side to bring it up to date with API 25 as well. For reference, here is the IBM fortran documentation wording regarding deallocation of targets:
Note that the Fortran 2003 standard does not seem to specifically define this, so my assumption is that this is an IBM interpretation of the standard. |
That commit works for
There is still a memory leak reported for
|
…i25-notarget' into azamat/fix-ibm-dealloc-errors
Testing on Summit and Cheyenne underway |
All expected tests pass b4b on Cheyenne against baseline. Summit tests TBD. |
Everything looks good to me
The fates tests in the elm land developer tests all passed RUN. Folder location on summit: |
@glemieux Thanks for all your work on this! |
This adds checks for non-zero stat=status_code in allocates and deallocates in cohort dynamics/time-stepping. It also adds a workaround for the IBM compiler on Summit to nullify the pointer to a cohort or its PRT object if the initial deallocate attempt fails.
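As a rough illustration of the pattern described above (not the code in this PR), the sketch below checks `stat` after both `allocate` and `deallocate`, and falls back to `nullify` when the deallocate reports a non-zero status. The `cohort_type` definition and the program name are hypothetical; only the `stat=status_code` form comes from the description.

```fortran
program dealloc_check_sketch
  implicit none

  ! Hypothetical stand-in for a cohort object.
  type :: cohort_type
     real :: dbh = 0.0
  end type cohort_type

  type(cohort_type), pointer :: ccohort => null()
  integer :: status_code

  ! Check the allocation status instead of assuming success.
  allocate(ccohort, stat=status_code)
  if (status_code /= 0) then
     write(*,*) 'allocate failed, stat = ', status_code
     error stop 1
  end if

  ! Try to deallocate; with stat= present the program keeps running on failure.
  deallocate(ccohort, stat=status_code)
  if (status_code /= 0) then
     write(*,*) 'deallocate failed, stat = ', status_code
     ! Workaround: drop the association so the pointer is not reused.
     ! Any memory behind it is not reclaimed by this step.
     nullify(ccohort)
  end if

  print *, 'associated after cleanup: ', associated(ccohort)
end program dealloc_check_sketch
```

Nullifying after a failed deallocate keeps the pointer from being reused but does not free the memory, which is why the thread also discusses turning such failures into a fatal error.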
Fixes #702, E3SM-Project/E3SM#5001
[bit-for-bit]