Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add return status checks in cohort deallocations #824

Merged
merged 9 commits into from
Feb 15, 2023

Conversation

amametjanov
Copy link
Contributor

@amametjanov amametjanov commented Jan 14, 2022

This adds checks for non-zero stat=status_code in allocates and deallocates in cohort dynamics/time-stepping.
It also adds a workaround for IBM compiler on Summit to nullify the pointer to a cohort or its PRT object, if the initial deallocate attempt fails.

Fixes #702, E3SM-Project/E3SM#5001

[bit-for-bit]

@amametjanov
Copy link
Contributor Author

amametjanov commented Jan 14, 2022

This avoids errors

1: 0: "/autofs/nccs-svm1_home1/azamat/repos/E3SM-summit/components/elm/src/external_models/fates/parteh/PRTAllometricCNPMod.F90", 
line 1: 1525-109 Error encountered while attempting to deallocate a data object.  The program will stop.

for E3SM tests

  • SMS_Ld20_D.f45_f45.IELMFATES.summit_ibm.elm-fates_eca
  • SMS_Ld20_D.f45_f45.IELMFATES.summit_ibm.elm-fates_rd

and

1: 7: "/autofs/nccs-svm1_home1/azamat/repos/E3SM-summit/components/elm/src/external_models/fates/parteh/PRTAllometricCarbonMod.F90", 
line 1: 1525-109 Error encountered while attempting to deallocate a data object.  The program will stop.

for

  • ERS_Ld20_D.f45_f45.IELMFATES.summit_ibm.elm-fates

at time-step

 Beginning timestep   : 0001-01-04_00:30:00
 FATES dynamics start
FATES Dynamics:    1-01-04

during calls to fuse_cohorts and terminate_cohorts.
The error message when deallocates return non-zero status is

1525-207 An invalid data object has been specified for the DEALLOCATE statement.

With the changes in this PR, all three runs succeed.
Previously noted memory growth is still there: e.g. on 1 Summit node, 8-MPI run shows 3% (=1836.88/1777.94) growth

 tStamp_write: model date =   00010102       0 wall clock = 2022-01-12 20:40:07 avg dt =    63.30 dt =    63.30
 memory_write: model date =   00010102       0 memory =    1777.94 MB (highwater)       1328.88 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC ROF WAV IAC ESP)
...
 tStamp_write: model date =   00010120       0 wall clock = 2022-01-12 21:02:10 avg dt =    72.97 dt =    74.18
 memory_write: model date =   00010120       0 memory =    1836.88 MB (highwater)       1378.25 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC ROF WAV IAC ESP)

and on 1 Cori-KNL node, 64-MPI ran still has 11% (=4279.31/3835.31) growth same as in E3SM-Project/E3SM#4709

 tStamp_write: model date =   00010102       0 wall clock = 2022-01-12 19:28:04 avg dt =     7.06 dt =     7.06
 memory_write: model date =   00010102       0 memory =    3835.31 MB (highwater)        334.82 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC ROF WAV IAC ESP)
...
 tStamp_write: model date =   00010120       0 wall clock = 2022-01-12 19:30:40 avg dt =     8.60 dt =     8.86
 memory_write: model date =   00010120       0 memory =    4279.31 MB (highwater)        345.40 MB (usage)  (pe=    0 comps= cpl ATM LND ICE OCN GLC ROF WAV IAC ESP)

@glemieux glemieux added the e3sm An issue is related to e3sm host land model or a particular PR has a corresponding e3sm-side PR label Jan 20, 2022
@sarats
Copy link

sarats commented Jan 24, 2023

@rgknox It's been more than a year since this fix is awaiting review/merge. Can you clarify if anything else is needed to proceed?

cc @rljacob

@rljacob
Copy link

rljacob commented Jan 24, 2023

Yeah what is taking so long? If you don't have access to Summit for testing, get access. All E3SM developers should be on the LCFs because our code must run there.

@sarats
Copy link

sarats commented Jan 24, 2023

Az or others who already have access can test on OLCF machines as it takes a while for you to get access. Meanwhile, you probably can sign off if these changes look okay from the model dev perspective.

@rgknox
Copy link
Contributor

rgknox commented Jan 24, 2023

The issue we had, was that the changes did not address underlaying problems. If any of those deallocations are not successfully completed, that should result in catastrophic failure. It just shouldn't happen and the model should not be allowed to proceeed because it will generate a memory leak. As far as we know, this is only a problem for certain machines, ones that we do not necessarily have access to. The plan was get access to the compilers and figure out why these deallocations are not happening, however we only have so many resources and this has not made it to the top of the stack. We were not aware that this was a priority. Is this creating problems on your side? We could have a discussion about priorities and resources if that is true.

@sarats
Copy link

sarats commented Jan 25, 2023

Just noting the fact that this issue presents itself with two different architectures (Power, x86_64) and different compilers (IBM, Cray) which have relatively strict adherence to Fortran standards potentially indicates a subtle bug which is masked by default behavior of other compilers like GNU that you are testing with.

Sometimes, it may take a while to narrow down the root cause. If this fix is harmless, then it might be good to incorporate to mitigate impact of this issue while the root cause investigation goes on.

Another issue that I wanted to check with you is the use of uninitialized variables. Could you confirm if you explicitly check for those in your development workflow? For example, Land uses NaN initializations that cause various issues and we are trying to get rid of those.

@rgknox
Copy link
Contributor

rgknox commented Jan 25, 2023

As an incremental solution, can we modify the warning messages on the iostat checks to warnings + graceful failures (ie endruns)?
FATES is constantly allocating and deallocating the derived types that are being identified in this thread, it happens continually during run time. If we don't deallocate them, the memory profile really will grow. These aren't just one-and-done type deallocations.

Would that address the original intent of the PR @amametjanov @rljacob @sarats ?

@rgknox
Copy link
Contributor

rgknox commented Jan 25, 2023

Regarding the full fix... One suspicion Greg and I have debated a little bit, was that this has to do with how we define the derived type, or possibly how we define the derived type the deallocation is nested on. We have these chains of derived types: site%patch%cohort. Anyway, the cohort derived type is the one that is having trouble being deallocated under ibm/cray. It is defined as a pointer, and we have a pointer to the structure defined locally in the subroutine. But, in the routine where the deallocation occurs, the patch is defined locally as a target. It could be that some compilers won't deallocate an object nested on a target? See lines 723 and 708 in the diff:

https://github.com/NGEET/fates/pull/824/files#diff-c9ef820edf9fa01f23c5d61e6aec25fa126e07b0937379ec099cf9e14146727bR723
https://github.com/NGEET/fates/pull/824/files#diff-c9ef820edf9fa01f23c5d61e6aec25fa126e07b0937379ec099cf9e14146727bR708

Calling these as targets is legacy. Maybe its the problem, maybe not, I'm speculating, but one test that would help narrow down what is going on would be to identify if any successful deallocations are happening. That would be the first test I would conduct. I would test to see if iostat=0 ever succeeds. If it does it at least once, its not a problem with how the types are defined.

@glemieux
Copy link
Contributor

@rljacob @sarats I have an account on OLCF and was previously given access to summit by being associated with a project, but it looks like I'm not associated with a project now. Who should be my point of contact for getting the project ID?

@sarats
Copy link

sarats commented Jan 25, 2023

Through the myOLCF portal, you should join the 'cli115' project to test on Summit. Mark Taylor is the PI.

There are cli133 and cli133_crusher projects are for testing on Crusher but come under ECP. So, Mark has to sign off on those.

@glemieux
Copy link
Contributor

Thanks @sarats. I'll submit and application to join and give Mark a heads up.

@sarats
Copy link

sarats commented Jan 25, 2023

@rgknox I suspect (optimistically) that you may have found the problem. Perhaps, we can construct a simple standalone example to test this out?

@amametjanov amametjanov force-pushed the azamat/fix-ibm-dealloc-errors branch from 5721c20 to edebec9 Compare January 26, 2023 03:59
…ks. Removed target designations where unnecessary.
@glemieux
Copy link
Contributor

glemieux commented Feb 11, 2023

The issue did indeed end up being due to use of the target attribute, specifically at the cohort and parteh object level. These are updated with commit 068ac1d. I was able to get a short ERS f45 test to pass on Summit and I have a longer one year smoke test going as well.

I still need to clean up this PR a bit and I think it would be good to incorporate some of the changes that Ryan made with his test branches (including the merge up to API25). Note that this PR still needs full regression testing and the land developer testing of this PR won't be available until Shijie's E3SM-Project/E3SM#5429 is integrated on the E3SM side to bring it up to date with API 25 as well.

For reference, here is the IBM fortran documentation wording regarding deallocation of targets:

An allocatable object with the TARGET attribute cannot be deallocated through an associated pointer. Deallocation of such an object causes the association status of any associated pointer to become undefined. An allocatable object that has an undefined allocation status cannot be subsequently referenced, defined, allocated, or deallocated.

Note the Fortran 2003 standard does not seem to specifically define this, so my assumption is that this an IBM interpretation of the standard.

@amametjanov
Copy link
Contributor Author

That commit works for crayclang compiler on OLCF Crusher too.

$ ./cs.status.notarget | grep -i "overall"
  ERS_Ld20.f45_f45.IELMFATES.crusher_crayclang.elm-fates (Overall: PASS) details:
  SMS_Ld20.f45_f45.IELMFATES.crusher_crayclang.elm-fates_eca (Overall: PASS) details:
  SMS_Ld20.f45_f45.IELMFATES.crusher_crayclang.elm-fates_rd (Overall: PASS) details:

There is still a memory leak reported for amdclang compiler, but that's a different issue:

$ ./cs.status.notarget-amd | grep -i "fail"
  SMS_Ld20.f45_f45.IELMFATES.crusher_amdclang.elm-fates_eca (Overall: FAIL) details:
    FAIL SMS_Ld20.f45_f45.IELMFATES.crusher_amdclang.elm-fates_eca MEMLEAK memleak detected, memory went from 35446.920000 to 47180.110000 in 18 days
  SMS_Ld20.f45_f45.IELMFATES.crusher_amdclang.elm-fates_rd (Overall: FAIL) details:
    FAIL SMS_Ld20.f45_f45.IELMFATES.crusher_amdclang.elm-fates_rd MEMLEAK memleak detected, memory went from 35410.400000 to 47209.590000 in 18 days

@glemieux
Copy link
Contributor

Testing on Summit and Cheyenne underway

@glemieux glemieux requested a review from rgknox February 14, 2023 23:27
@glemieux
Copy link
Contributor

All expected tests pass b4b on Cheyenne against baseline fates-sci.1.63.1_api.25.1.0-ctsm5.1.dev117.
Folder location: /glade/u/home/glemieux/scratch/ctsm-tests/tests_pr824

Summit tests TBD

Copy link
Contributor

@rgknox rgknox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me

@glemieux
Copy link
Contributor

The fates tests in the elm land developer tests all passed RUN.

Folder location on summit: /ccs/home/glemieux/scratch/e3sm-tests/deallocate-pr824.fates.summit.ibm.Cb645be3aa2-F5e55e441

@glemieux glemieux merged commit b8b905c into NGEET:main Feb 15, 2023
@rljacob
Copy link

rljacob commented Feb 15, 2023

@glemieux Thanks for all your work on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug - software engineering e3sm An issue is related to e3sm host land model or a particular PR has a corresponding e3sm-side PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pointer deallocate with cray compiler
5 participants