memory leak in MOM6 #764

Closed · junwang-noaa opened this issue Aug 23, 2021 · 9 comments
Labels: bug (Something isn't working)

@junwang-noaa (Collaborator)

Description

The high-resolution C384 coupled test failed due to a memory leak. The log files show that MOM6 has a memory leak:

20210819 210631.993 INFO PET312 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 2112852 kB
...
20210819 211322.476 INFO PET312 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 2112852 kB
20210819 211405.081 INFO PET312 Leaving MOM update_ocean_model: - MemInfo: VmPeak: 2611544 kB
...

To Reproduce:


1. Check out the ufs-weather-model develop branch and turn on the memory profile check in tests/parm/nems.configure.cpld.IN by applying:
@@ -29,7 +29,7 @@ OCN_petlist_bounds:             @[ocn_petlist_bounds]
 OCN_attributes::
   Verbosity = 0
   DumpFields = false
-  ProfileMemory = false
+  ProfileMemory = true
 2. Run the coupled test cpld_bmark_wave_v16_p7b.
 3. Check the MOM PET files for memory info (see the sketch below).
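
For step 3, here is a minimal Python sketch (not part of the model or the regression-test scripts; the PET log name is a placeholder) that scans a single ESMF PET log for the "Leaving MOM update_ocean_model ... VmPeak" lines shown above and flags any growth:

```python
# Minimal sketch: flag VmPeak growth in one ESMF PET log.
# The file name is a placeholder; the line format follows the log excerpt above.
import re

pattern = re.compile(r"Leaving MOM update_ocean_model: - MemInfo: VmPeak:\s*(\d+) kB")

prev = None
with open("PET312.ESMF_LogFile") as log:      # hypothetical PET log file name
    for line in log:
        m = pattern.search(line)
        if not m:
            continue
        vmpeak = int(m.group(1))              # VmPeak in kB
        if prev is not None and vmpeak > prev:
            print(f"VmPeak grew: {prev} kB -> {vmpeak} kB")
        prev = vmpeak
```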

Additional context

Some related discussion in issue #746:

> I checked Jessica's run directory just to confirm that the memory increase is reduced: it shows a ~2% memory increase just after 14 days, then memory stays unchanged, just like the previous run without CA. MOM6 memory increases from 3660532 kB to 4217332 kB; the increases only happen when time steps are multiples of 12 (12, 24, 36, 60, 228...).

@junwang-noaa I tested the latest 3 commits of MOM6 in ufs; all of them have the memory leak issue. Below is from Marshall Ward:

> I have started doing more aggressive memory checking and recently fixed many of the leaks, but we know of a few that are not yet fixed.
>
> Nearly all of the leaks are because we do not properly call the MOM_end_*() functions during finalization, so they do not normally affect the model during the run.
>
> We are planning to enable valgrind testing once we've fixed all the known leaks, but this is on hold until we finish up some other projects.
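
As a loose illustration of the distinction drawn above (state that is only released by an explicit *_end() call leaks at shutdown, but does not grow during time stepping), here is a hypothetical Python sketch; the names are invented and this is not MOM6 code:

```python
# Hypothetical analogy only; these names are invented and this is not MOM6 code.
_module_state = {}                            # stands in for allocated module-level arrays

def example_init(n):
    _module_state["workspace"] = [0.0] * n    # set up once at initialization

def example_step():
    _module_state["workspace"][0] += 1.0      # reuses the same storage; no growth per step

def example_end():
    _module_state.clear()                     # the MOM_end_*-style teardown; if it is never
                                              # called, the workspace is simply held until
                                              # the process exits (a finalization-time leak)
```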

@DeniseWorthen (Collaborator) commented Aug 30, 2021

See the associated Discussion #779.

@arunchawla-NOAA

@junwang-noaa @DeniseWorthen @JessicaMeixner-NOAA @jiandewang is there an update on this? Is there development happening on this at the GFDL end?

@jiandewang (Collaborator)

Marshall is aware of this and it's on his to-do list. In discussion #799 it can be seen clearly that the minor memory leak is directly related to using the FMS module to write model output, which Marshall believes is the main cause.

@DeniseWorthen (Collaborator)

@jiandewang Have you heard anything recently about this issue? Earlier, Marshall mentioned that it is mostly an issue w/ the MOM_end_*() functions. Later, you mentioned FMS and model output as the main culprit. I'm a little confused about which of these is suspected as the cause of the memory leak.

@marshallward

It's been a long time since I looked at this. I did not find any significant loss of memory in MOM6, other than memory which was not deallocated at cleanup (i.e. MOM_end_*). Realistically, this is not going to have much impact on any simulations.

There were some memory holes coming from FMS, but these were in the FMS1 I/O and may have been fixed in the FMS2 I/O. Even then, I think we're talking something like O(10M) per rank, which is not huge.

However, this was from our benchmark tests, and may not reflect production UFS runs. There could be some untested components in MOM6 which have poor memory usage.

@DeniseWorthen (Collaborator)

@marshallward Thanks for the update. We're trying to clean up/close old issues, which is what led me to ask about the status.

@jiandewang Looking at discussion #779, maybe you could repeat the test you did previously and report the current status? At that point we can decide whether to close the issue.

@jiandewang (Collaborator)

> @marshallward Thanks for the update. We're trying to clean up/close old issues, which is what led me to ask about the status.
>
> @jiandewang Looking at discussion #779, maybe you could repeat the test you did previously and report the current status? At that point we can decide whether to close the issue.

@DeniseWorthen Sure, I will repeat our previous test (sorry for the delayed response; I was out of town last week).

@jiandewang (Collaborator)

@DeniseWorthen I checked out the ufs-weather-model code (Jan 23, 2023 commit, hash 70de7ef) and repeated what I did before by running C96 coupled with 1x1 ocean and ice for 10 days. Overall, I don't see a memory leak issue. This figure shows VmPeak values from two randomly selected PEs:

[Figure: base-percent-line — VmPeak for two randomly selected PEs]

This one is for all PEs (each PE's VmPeak as a percentage of its 2nd-step value; see the sketch below):

[Figure: base-percent — VmPeak as a percentage of each PE's 2nd-step value]
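
A minimal sketch (assuming the default PET*.ESMF_LogFile naming and the "Leaving MOM update_ocean_model ... VmPeak" line format quoted earlier in this issue) of how such a per-PE percentage could be computed:

```python
# Sketch: per-PET VmPeak as a percentage of its 2nd-step value.
# File pattern and log-line format are assumptions based on the excerpt above.
import glob
import re

PAT = re.compile(r"Leaving MOM update_ocean_model: - MemInfo: VmPeak:\s*(\d+) kB")

def vmpeak_series(path):
    with open(path) as f:
        return [int(m.group(1)) for line in f if (m := PAT.search(line))]

for log in sorted(glob.glob("PET*.ESMF_LogFile")):   # assumed ESMF default log naming
    series = vmpeak_series(log)
    if len(series) < 2:
        continue
    base = series[1]                                  # 2nd-step value as the baseline
    pct = [100.0 * v / base for v in series]
    print(f"{log}: final VmPeak = {pct[-1]:.1f}% of 2nd-step value")
```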

I think we can close this issue.

@junwang-noaa (Collaborator, Author)

@jiandewang Thanks for the results. I will close the issue.
