UFS P7c memory issue #746
Comments
@jiandewang in order to investigate this, we (@DeniseWorthen and @climbfuji) need a fully self-contained run directory that we can work with. That means an experiment directory with all input files, configuration files, and the job submission script. Can you provide this on hera, please? Thanks. |
Run dir (contains all input and configuration files): /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/DATAROOT/R_20120101/2012010100/gfs/fcst.125814
Run log: /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/COMROOT/R_20120101/logs/2012010100/gfs.forecast.highres.log.0
This was run through the workflow, so there is no job_card (as in rt.sh) in the run dir. |
I will not be able to work on this unless I get a job submission script. I believe rocoto can dump it out using some verbose flag. @JessicaMeixner-NOAA knows. |
So I printed out the profile memory from the p7b runs, and the memory usage is less than in the runs from the workflow, so my thought was that maybe it's an environment variable we just need to use in the workflow. I'm planning on setting up a run directory and then using a job_card from rt.sh (appropriately changed) to see if that will run. Either way I'll get a run directory w/job_card at the end of it. |
I do know that you can get that job submission script dumped out but I haven't done that in forever, I'll see if I can dig out those instructions. |
Thanks, Jessica. I was hoping to be able to use Forge DDT and MAP to see what is going on. A self-contained run directory will be very helpful for this. |
Check the following section in the log file, compare it to the p7b rt run, and
update HERA.env to increase stack sizes if needed, or add or remove certain
env variables.
0 + . /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/UFS-P7c/env/HERA.env fcst
00 + '[' 1 -ne 1 ']'
00 + step=fcst
00 + export npe_node_max=40
00 + npe_node_max=40
00 + export 'launcher=srun --export=ALL'
00 + launcher='srun --export=ALL'
00 + export OMP_STACKSIZE=2048000
00 + OMP_STACKSIZE=2048000
00 + export NTHSTACK=1024000000
00 + NTHSTACK=1024000000
00 + ulimit -s unlimited
00 + ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1540672
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 94208000
open files (-n) 131072
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1540672
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
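The comparison suggested above (workflow environment versus the p7b rt.sh environment) could be done mechanically. The following is only a sketch: the helper names `parse_exports` and `diff_exports` are hypothetical, and it assumes both logs contain `set -x` style trace lines like the ones shown here. The p7b values are not reproduced in this thread, so only the workflow side is filled in.

```python
import re

# Matches shell trace lines such as
#   00 + export OMP_STACKSIZE=2048000
#   00 + export 'launcher=srun --export=ALL'
EXPORT_RE = re.compile(r"\+ export '?([A-Za-z_][A-Za-z0-9_]*)=([^']*)'?\s*$")


def parse_exports(trace_lines):
    """Return {variable: value} for every `export` found in a shell trace."""
    exports = {}
    for line in trace_lines:
        match = EXPORT_RE.search(line)
        if match:
            exports[match.group(1)] = match.group(2).strip()
    return exports


def diff_exports(workflow, regression):
    """Return {variable: (workflow_value, regression_value)} where they differ."""
    names = sorted(set(workflow) | set(regression))
    return {
        name: (workflow.get(name), regression.get(name))
        for name in names
        if workflow.get(name) != regression.get(name)
    }


# Trace lines copied from the workflow log above.
workflow_trace = [
    "00 + export npe_node_max=40",
    "00 + export 'launcher=srun --export=ALL'",
    "00 + export OMP_STACKSIZE=2048000",
    "00 + export NTHSTACK=1024000000",
]
workflow_env = parse_exports(workflow_trace)
print(workflow_env)
```

Building a second dictionary the same way from the rt.sh log and passing both to `diff_exports` would list every variable that is set differently or only on one side.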
|
@yangfanglin I agree it's likely something in the workflow's HERA.env file that needs to be updated, in a log file for p7b output I found (/scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_73915/cpld_bmark_wave_v16_p7b_35d_2013040100/err) :
but the OMP_STACKSIZE seems larger in the workflow, so? I'm working on setting up the canned case now. Hopefully will have it soon. |
I've created a canned case on hera here: My hope is that you can copy this directory to yours and then just "sbatch job_card" but it hasn't been tested yet, so not 100% sure this works yet. The job_card is from rt.sh -- which is what Rahul suggested earlier and would be testing along the same lines as Fanglin was suggesting with it perhaps being an environment variable issue. I'll update the issue after my test goes through. |
The canned case is running for me now (the first time I submitted I had a module load error, but resubmission worked so?). Now we'll have to wait a couple of hours to see if the different environmental variables mean we don't get the same memory errors. |
Great progress! I'll wait for the outcome of your experiment before spending time on this. |
See the output folder /scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try02: On day 18 in the err file we have:
So even with the environment variables used from rt.sh we still seem to be running into a memory problem. This log file does not have the explicit "ran out of memory" but I'm assuming that's the SIGTERM issue here. I missed the setting for turning the PET logs on with the esmf profile memory information so there will be a Try03 folder with that info soon. |
Okay, so I went back and looked at all the log files from runs that @jiandewang made ( /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/COMROOT/R_201*/logs/201*/gfs.forecast.highres.log) and only one of those failed because of Out of Memory. The memory usage in the run I made with memory profiles turned on (/scratch2/NCEPDEV/climate/Jessica.Meixner/p7memissue/Try03) does not seem to be any higher than normal. I have seen memory errors fail as the SIGSEGV before, but I guess I'm wondering if we have a memory error or something else? |
the numbers in /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/EXPROOT/R_20120101/config.fv3 do not add up. npe_fv3 cannot be 288 if layout_x_gfs=12 and layout_y_gfs=16. The setting WRTTASK_PER_GROUP_GFS=88 is also odd. You may want to increase WRITE_GROUP_GFS as well. |
@yangfanglin this is probably an issue of the old versus CROW configuration, the values used in the forecast directory seem fine to me (/scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/wrk-P7C/DATAROOT/R_20120101/2012010100/gfs/fcst.125814): in input.nml: in model_configure: And 12*16*6=1152 (which is the # in mediator pet list in nems.configure) and +88 = 1240 (which matches the atm pet list). The 88 might be an odd number but it means that the write group is filling out an entire node and not sharing with another component -- this is the configuration I got to run (after having memory problems w/the write group) for p6. |
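The task arithmetic in the comment above can be checked with a few lines. This is a minimal sketch; the function name `atm_pet_counts` and its parameter names are illustrative and are not taken from the workflow scripts. It assumes six cubed-sphere tiles (the factor of 6 in the comment) and one write group of 88 tasks.

```python
def atm_pet_counts(layout_x, layout_y, write_groups, write_tasks_per_group, ntiles=6):
    """Return (forecast_tasks, total_atm_tasks) for a cubed-sphere layout."""
    # Each of the ntiles tiles is decomposed into layout_x * layout_y tasks.
    forecast_tasks = layout_x * layout_y * ntiles
    # Write-component tasks are added on top of the forecast tasks.
    total_tasks = forecast_tasks + write_groups * write_tasks_per_group
    return forecast_tasks, total_tasks


forecast, total = atm_pet_counts(12, 16, write_groups=1, write_tasks_per_group=88)
print(forecast, total)  # 1152 1240
```

With these inputs the forecast tasks come to 1152 and the full atm PET list to 1240, which is consistent with the numbers quoted from nems.configure.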
@yangfanglin since we only write output every 6 hours, having 1 write group has always been sufficient in terms of writing efficiency, is there some reason to have multiple write groups for memory? |
@JessicaMeixner-NOAA the error in the log file depends on which node the system detects as having the issue, so the messages will not be the same. We are lucky that one of the log files contains the "out of memory" info. The fact that all the jobs were killed by the system is a clear indication that there is some memory issue. |
I think we can double the threads to check if it is a memory issue, right? |
@bingfu-NOAA right now we are using 2 threads and the model died at day 18. Using 4 threads will slow down the system and we will not be able to finish a 35-day run in 8 hr. In fact, in one of my tests I used 225s for fv3 and the model died at day 13. |
The test that made the 4thread slow down was because I also used a different layout for atm model trying to not use double the nodes. I can try one test with just increasing the thread count (which shouldn't in theory slow it down) just to see if it's really memory or not. It'll probably take a while to get through the queue, but will report back when I have results. |
Okay, it does not appear that the 4thread slow down was just because I used a smaller atm layout, even using the same atm layout, it's much slower. I don't think we'll make it to the 18 days we reached with 2 threads. |
Are all the components using the same number of threads? Otherwise it won't
help to increase threads for one component. Also, do the PET log files
show that memory is increasing during the integration? If yes, which
component is it?
|
Yes, all the components are using the same number of threads, and the simulation slows down which I would not expect. Yes, the PET log files show that memory is increasing during the integration. You can find that for example here: The 4 thread run directory can be seen here: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/thr4/DATAROOT/testthr4/2013040100/gfs/fcst.25077 with log file here: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/thr4/COMROOT/testthr4/logs/2013040100/gfs.forecast.highres.log which only got to day 12 before being killed because the 8 hour wall clock is over. |
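To put the 4-thread result in perspective, the wall-clock time for a full 35-day run can be extrapolated from the 12 days completed in 8 hours. This is a rough sketch that assumes constant throughput over the whole run, which may not hold; the function name is hypothetical.

```python
def projected_wallclock_hours(days_completed, hours_used, target_days):
    """Extrapolate the wall-clock hours needed for target_days at the observed rate."""
    days_per_hour = days_completed / hours_used
    return target_days / days_per_hour


hours = projected_wallclock_hours(days_completed=12, hours_used=8, target_days=35)
print(f"{hours:.1f} h")  # 23.3 h
```

At that rate the 35-day run would need roughly three times the 8-hour wall-clock limit.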
I was able to run a successful 35 day run (the same as the canned case on hera, but through the workflow) on Orion. I did try to just update to the most recent version of ufs-weather-model on hera, and confirmed that also is dying with SIGTERM errors. |
I ran a test where I set FHMAX=840 (my way of turning off I/O for the atm model) and the model still failed at day 18 (the first run died with a failed node also on day 18). Based on suggestions from the coupling tag-up, the next steps I will try will be to: All other suggestions are welcome. I'll report on results as I get them. |
As expected, running with 1 thread we only got through 6 days of simulation: The run without waves is still running, Rundir: /scratch2/NCEPDEV/climate/Jessica.Meixner/p7update/nowave/DATAROOT/nowave02/2013040100/gfs/fcst.154732 Running with different atm physics settings (most of the jobs are still in the queue): A job running with debug is in the queue. |
The run with do_ca=false succeeded in running 35 days; all my other tests so far have failed. In the log files with do_ca=true, there are lots of statements such as:
However, if you look at the log file for "domain decomposition" this is only written once for different "MOM" and "Cubic" variables. I'm trying to see if I can add memory profile statements to see if this is an issue or not but could this maybe be only done once for ca @lisa-bengtsson? Any other ideas of where we might have memory leaks with do_ca=true? |
Sorry, I have not seen that before, did the debug run indicate anything? It is great if you could add memory profile statements, the halo exchange is in update_ca.F90 in the routine evolve_ca_sgs, that could be a start perhaps? |
The routine is called update_cells_sgs inside update_ca.F90. |
I checked Jessica's run directory to confirm that the memory increase
is reduced: it shows a ~2% memory increase just after 14 days, then memory
stays unchanged, just like the previous run without CA. MOM6 memory
increases from 3660532 kB to 4217332 kB; the increases only happen when
the time step is a multiple of 12 (12, 24, 36, 60, 228...).
Lisa, would you please make PRs so that we can get the code updates
committed? Thanks.
…On Thu, Aug 19, 2021 at 9:39 PM jiandewang ***@***.***> wrote:
@lisa-bengtsson <https://github.com/lisa-bengtsson> Good news - the job I
ran completed the 35 days! Hopefully the same for @jiandewang
<https://github.com/jiandewang>'s run
20120101 also finished 35day run
|
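The MOM6 figures quoted from the PET log in the comment above can be turned into a relative growth, and the listed time steps can be checked against the multiple-of-12 pattern. This is a small sketch with hypothetical helper names; the step list is the partial one given in the comment, not a full extraction from the log.

```python
def relative_growth(start_kb, end_kb):
    """Fractional memory growth between two readings in kB."""
    return (end_kb - start_kb) / start_kb


def steps_off_interval(growth_steps, interval):
    """Return the growth steps that are NOT multiples of the interval."""
    return [step for step in growth_steps if step % interval != 0]


mom6_growth = relative_growth(3660532, 4217332)
print(f"{mom6_growth:.1%}")  # 15.2%

# Steps quoted in the comment; the original list is truncated with "...".
print(steps_off_interval([12, 24, 36, 60, 228], 12))  # []
```

An empty list from `steps_off_interval` means every quoted step fits the pattern.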
What a relief, thank you for testing! @junwang-noaa since it is just a single change that doesn't change any baseline, could it be merged with an existing PR? |
Currently we are trying to commit the P7 related issues. We have the FMS PR
that does not change results, but we are waiting for the FMS library to be
available on the supported platforms. The fv3 dycore update PR is on hold
as it changes results.
|
@junwang-noaa : can you tell me which PET file you looked at for MOM ? |
It's this one:
/scratch2/NCEPDEV/climate/Jessica.Meixner/p7ca/test01/DATA/caupdate01/2013040100/gfs/fcst.107497/PET1240.ESMF_LogFile
|
I created two PRs: |
Jun,
A bug fix is needed for sfsub.F90. The fix is to add an "if" in a do
loop. I think George Gayno is creating a "ccpp-physics" issue on this.
(This fix may change results if there are problem points.)
The change should be around line # 2021:
      do i=1,len
        if (nint(slmskl(i)) /= 1) then
          if (sicanl(i) >= min_ice(i)) then
            slianl(i) = 2.0_kind_io8
          else
            slianl(i) = zero
            sicanl(i) = zero
          endif
        endif
      enddo
Moorthi
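For readers who want to experiment with the logic of the proposed loop outside the model, it can be mirrored in Python. This is only a sketch of the quoted Fortran, not the ccpp-physics code itself: the function name is hypothetical, the interpretation of a mask value of 1 as land is an assumption based on the usual 0 = ocean, 1 = land, 2 = ice convention, and the sample values used below are made up for illustration.

```python
def apply_ice_mask_fix(slmskl, sicanl, slianl, min_ice):
    """Apply the quoted loop to copies of the inputs and return (sicanl, slianl)."""
    sicanl = list(sicanl)
    slianl = list(slianl)
    for i in range(len(slmskl)):
        # Fortran nint() and Python round() agree for mask values 0, 1 and 2.
        if round(slmskl[i]) != 1:  # assumed: 1 marks a land point, left untouched
            if sicanl[i] >= min_ice[i]:
                slianl[i] = 2.0  # enough ice: flag the point as ice
            else:
                slianl[i] = 0.0  # too little ice: flag as open water
                sicanl[i] = 0.0  # and zero out the ice fraction
    return sicanl, slianl


# Illustrative inputs only: one icy ocean point, one land point, one nearly ice-free point.
new_sicanl, new_slianl = apply_ice_mask_fix(
    slmskl=[0.0, 1.0, 0.0],
    sicanl=[0.5, 0.3, 0.05],
    slianl=[0.0, 1.0, 0.0],
    min_ice=[0.15, 0.15, 0.15],
)
print(new_sicanl, new_slianl)
```

The added outer `if` is what keeps the second (land) point from being modified.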
|
Thanks, Moorthi. What is this fix for?
|
I am not sure I understand the question. As stated below, it is simply a
bug fix.
Moorthi
|
I am wondering if it will fix the restart issue in P7c. |
Thanks, Lisa.
|
No, this has nothing to do with restart.
|
While the main p7c memory issue is at least solved enough to run 35 days, since I ran the debug test (without waves) in trying to help debug this issue, I thought I'd post the results here all the same. Log file:
|
@JessicaMeixner-NOAA would you please create a separate issue for the P7c debug error so that we can better track each problem? Thanks. |
@lisa-bengtsson At this morning's code manager meeting, we decided to combine your stochastic physics PR#44 with Denise's CICE memory profile PR#756 (coming out this morning; neither changes results) and get the PR committed today. @SMoorthi-emc Since your fix may change results, we need to do some testing to see if a new baseline is required. Would you please create a CCPP PR? |
@junwang-noaa great, thanks. Please let me know if I can do anything else in regards to this PR. |
Jun,
I will let George do it as he identified the issue.
If George does not want to, then I will do it.
Moorthi
|
@lisa-bengtsson Once the RT passes, Phil needs to review/commit the changes in stochastic physics repo, then we can commit the ufs-weather-model PR. |
I just opened an issue: NCAR/ccpp-physics#719 |
@junwang-noaa I tested the latest 3 commits of MOM6 in ufs; all of them have the memory leak issue. Below is from Marshall Ward: **I have started doing more aggressive memory checking, and recently fixed many of them, but we know of a few that are not yet fixed. Nearly all of the leaks are because we do not properly call the MOM_end_*() functions during the finalization, so do not normally affect the model during the run. We are planning to enable valgrind testing once we've fixed all the known leaks, but this is on hold until we finish up some other projects.**
|
Jiande, thanks for the information. I will create a separate issue on ufs
to track the mom6 memory leak, I will copy the related information to that
issue.
|
Description
All UFS P7c runs (using the workflow) failed at day 18 (using 300s for fv3) or day 13 (using 225s for fv3), most likely due to a memory leak.
To Reproduce:
git clone https://github.com/NOAA-EMC/global-workflow
cd global-workflow
git checkout feature/coupled-crow
git submodule update --init --recursive
sh checkout.sh -c
sh build_all.sh -c
sh link_fv3gfs.sh emc hera coupled
and then use the "prototype7" case file.
Output
output logs
One sample run log is saved at /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/UFS-P7c/LOG/gfs.forecast.highres.log.0; the error information is around line 297663.
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=21542673.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h34m17: task 473: Out Of Memory
srun: launch/slurm: step_signal: Terminating StepId=21542673.0
slurmstepd: error: *** STEP 21542673.0 ON h33m12 CANCELLED AT 2021-08-11T23:57:15
PET file can be found at /scratch2/NCEPDEV/climate/Jiande.Wang/z-crow-flow/UFS-P7c/LOG/PET
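Because the forecast log is long (the error is near line 297663), a short scan for the out-of-memory messages can save time. This is a minimal sketch: the regular expressions are written against the exact message wording shown above, and the function name `find_oom_events` is hypothetical. Other Slurm versions may word these messages differently.

```python
import re

# Patterns follow the wording of the log lines quoted above.
OOM_EVENT_RE = re.compile(r"Detected (\d+) oom-kill event\(s\) in StepId=(\S+)")
OOM_TASK_RE = re.compile(r"srun: error: (\S+): task (\d+): Out Of Memory")


def find_oom_events(log_lines):
    """Return a list of dicts describing each OOM-related line found."""
    events = []
    for line in log_lines:
        match = OOM_EVENT_RE.search(line)
        if match:
            events.append(
                {"kind": "oom-kill", "count": int(match.group(1)), "step": match.group(2)}
            )
            continue
        match = OOM_TASK_RE.search(line)
        if match:
            events.append(
                {"kind": "task", "node": match.group(1), "task": int(match.group(2))}
            )
    return events


# Lines copied from the sample run log above.
sample_log = [
    "slurmstepd: error: Detected 1 oom-kill event(s) in StepId=21542673.0 cgroup. "
    "Some of your processes may have been killed by the cgroup out-of-memory handler.",
    "srun: error: h34m17: task 473: Out Of Memory",
    "srun: launch/slurm: step_signal: Terminating StepId=21542673.0",
]
for event in find_oom_events(sample_log):
    print(event)
```

To scan a real log, the lines could be read with `open(path)` and passed to `find_oom_events` in the same way; the reported task number can then be matched to the corresponding PET log file.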