
Edison: coupled simulation died when creating restart files #854

Closed · tangq opened this issue Apr 21, 2016 · 173 comments

@tangq (Contributor, author) commented Apr 21, 2016

In the last couple of weeks, about 70% of my coupled simulations on Edison have died in the middle of creating restart files (some components wrote their restart files successfully). The failure seems to occur randomly, and I only see it for coupled simulations, not for atmosphere-only ones. The model also doesn't fail at the same place when the identical simulation is rerun, which suggests the cause might be I/O related.

Two examples (the script for reproducing the first example is attached):
run_acme.alpha_20160419.A_WCYCL2000.edison.csh.txt

$casedir: /scratch2/scratchdirs/tang30/ACME_simulations/20160419.A_WCYCL2000.ne30_oEC.edison.alpha4_00/case_scripts
$rundir: /scratch2/scratchdirs/tang30/ACME_simulations/20160419.A_WCYCL2000.ne30_oEC.edison.alpha4_00/run

$casedir: /scratch2/scratchdirs/tang30/ACME_simulations/20160401.A_WCYCL2000.ne30_oEC.edison.alpha4_00/case_scripts
$rundir: /scratch2/scratchdirs/tang30/ACME_simulations/20160401.A_WCYCL2000.ne30_oEC.edison.alpha4_00/run

@golaz changed the title from "couplded simulation died when creating restart files" to "Edison: coupled simulation died when creating restart files" on Apr 21, 2016
@golaz (Contributor) commented Apr 21, 2016

Edison, the gift that keeps on giving...

@mt5555 (Contributor) commented Apr 21, 2016

From your run script, it doesn't look like you are setting PIO_NUMIOTASKS, which may mean you are using the defaults. The defaults usually request many more I/O tasks than is helpful, which can make the run less stable. So one thing to check is that the number of I/O tasks is reasonable, say 32 or 64.

@jayeshkrishna (Contributor):

From the cpl log I see that the number of PIO I/O tasks is 40 (env_run.xml sets the stride to 24).

@jayeshkrishna (Contributor):

I have seen similar errors when I have left OCN threaded (which I assume is not your case).

@jayeshkrishna (Contributor):

Does the same case succeed (run without failure) in the remaining 30% of cases? Looking at your runs above, not all of them have the same configuration. So my questions are: has the same case/configuration run successfully for you, and has the success rate improved since last week?

@mt5555 (Contributor) commented Apr 21, 2016

Since this is I/O related, I would try a much smaller PIO_NUMIOTASKS, around 8 or so. For 1/4 degree, I think 128 is a good number; at 1 degree the files are 16x smaller, suggesting 8 might be reasonable.

@tangq (Contributor, author) commented Apr 21, 2016

@jayeshkrishna, sometimes runs from the same configuration are successful. I have only done one test this week (the first case above), so I don't have a large enough sample to say whether the problem has improved, but that first try failed.

@mt5555, I'll try PIO_NUMIOTASKS=8 for the same case.

@tangq (Contributor, author) commented Apr 21, 2016

@mt5555, I don't see PIO_NUMIOTASKS in env_run.xml. Do you mean PIO_NUMTASKS?

@mt5555 (Contributor) commented Apr 21, 2016

Yes, sorry about that.

In the ACME v0 days, if you set PIO_NUMTASKS without also setting PIO_STRIDE, the PIO_NUMTASKS value would get reset back to the default. So you'll have to check the log files to make sure the setting is really taking effect.
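For example, something along these lines should show what PIO actually ended up using (a minimal sketch; the exact wording of the PIO summary lines in the coupler log varies between versions, so adjust the pattern as needed):

# hedged sketch: search the coupler log for the PIO settings it reports
grep -i pio $rundir/cpl.log.* | grep -iE 'stride|iotasks|numtasks'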

@tangq (Contributor, author) commented Apr 21, 2016

So which settings do I need to change? Just PIO_NUMTASKS=8, or PIO_STRIDE as well? I am not familiar with these settings and usually don't change them.

@jayeshkrishna (Contributor):

Please change both (note that the stride is preserved even if the number of I/O tasks is wrong for a component).

@worleyph (Contributor):

@jayeshkrishna, can @tangq just set PIO_STRIDE to -1 and let PIO_NUMTASKS determine the stride? (Is that how -1 works in this case?)

@mt5555 (Contributor) commented Apr 21, 2016

Hopefully this is fixed now, but in ACME v0 that would only work if all components used the same number of MPI tasks. If a component used fewer than the total available MPI tasks, a bug in the logic would force that component back to the system defaults for stride and numtasks.

@jayeshkrishna (Contributor):

Setting only PIO_NUMTASKS should work (it is supposed to), and if it doesn't I can fix it. As far as I can recall, PIO resets the settings if the PIO_NUMTASKS/PIO_STRIDE combination is not valid for a component.

@worleyph (Contributor):

@jayeshkrishna, my concern is that the default layout sets the stride to 24, not to -1. Does @tangq need to reset the stride to -1?

@jayeshkrishna (Contributor):

@tangq, please set PIO_STRIDE=-1 and PIO_NUMTASKS=8 and check the logs to make sure it is working as you expect. If not, I will fix it for you (I have always set both the tasks and the stride in my experiments, just out of old habit). This should work (no PIO reset to default settings) as long as all components have at least 8 MPI tasks.
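For reference, something like the following from the case directory should do it (a minimal sketch; the exact xmlchange flags vary across ACME/CESM script versions):

cd $casedir
# hedged sketch: older case scripts use the -file/-id/-val form of xmlchange
./xmlchange -file env_run.xml -id PIO_NUMTASKS -val 8
./xmlchange -file env_run.xml -id PIO_STRIDE -val -1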

@mt5555 (Contributor) commented Apr 21, 2016

@jayeshkrishna: hopefully this is fixed now, but for the record, in the CESM 1.2 days, with PIO_NUMTASKS=8 and PIO_STRIDE=-1, the stride would be computed for every component from the total number of MPI tasks rather than the number of MPI tasks used by that component, producing bad values when a component ran on a subset of the MPI tasks.

@jayeshkrishna (Contributor):

That behaviour is definitely a bug, and I can try some cases later today to try to reproduce it. I haven't changed this logic, so if the bug existed in CESM 1.2 it might still be there in the current code. @tangq: you might be better off setting both the stride and numtasks, as I do for my runs, since Mark encountered a bug after setting only the numtasks with an older version of the code.

@worleyph (Contributor):

@jayeshkrishna, should @tangq set both stride and numtasks for each component separately, or will the global settings be sufficient?

@jayeshkrishna (Contributor):

The global settings should work (provided all the components run with the same number of tasks; otherwise the stride is preserved).

@tangq (Contributor, author) commented Apr 21, 2016

I used the same case:
$casedir: /scratch2/scratchdirs/tang30/ACME_simulations/20160419.A_WCYCL2000.ne30_oEC.edison.alpha4_00/case_scripts
$rundir: /scratch2/scratchdirs/tang30/ACME_simulations/20160419.A_WCYCL2000.ne30_oEC.edison.alpha4_00/run

and changed both PIO_NUMTASKS=8 and PIO_STRIDE=8 for a 5-day test run writing restart files every day. The same problem occurred on the 4th day, but the restart files were created successfully on the 2nd and 3rd days.

@jayeshkrishna (Contributor):

I am looking into this; it could be a bug in PIO or OCN (the last time I saw a similar issue, OCN was multithreaded).

@tangq (Contributor, author) commented Apr 21, 2016

Thanks @jayeshkrishna. I also noticed that the model can die while different components are creating their restart files. In the latest test only one restart file was created, but in the second case at the top of this page more than one restart file was created. This may imply that more than one component triggers the problem.

@jayeshkrishna (Contributor) commented Apr 22, 2016

@tangq: I am currently trying to recreate this issue. Do you have a list of commands to set up the case (I looked at the attached script, but it is too long and I would like to set it up manually)? I followed the steps below to run my case, and it ran successfully (did not crash):

  • Running on master (not the tag you are working on - I will try that too) - v1.0.0-alpha.4-67-gd02c96d
  • ./create_newcase -compset A_WCYCL2000 -res ne30_oEC -mach edison
  • Change the following values in env_run.xml

@worleyph (Contributor):

@jayeshkrishna, will you be using scratch2? It looks like @tangq is using scratch2 for the run directory. You'd hope this would not matter, but it is a variable worth examining.

@ndkeen (Contributor) commented Jun 8, 2016

accident

@ndkeen closed this as completed on Jun 8, 2016
@ndkeen reopened this on Jun 8, 2016
@ndkeen (Contributor) commented Jun 8, 2016

grrr

@ndkeen (Contributor) commented Jun 9, 2016

Starting with douglasjacobsen/mpas/remove-reference-time, I tried the coupled case @golaz documented above. It ran fine for 5 days.

@golaz (Contributor) commented Jun 9, 2016

@ndkeen: I uploaded the script I use to check out the code, compile, and configure the simulation to the Confluence page:

https://acme-climate.atlassian.net/wiki/display/SIM/20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01

The script should help you get started running alpha6_01.

@ndkeen (Contributor) commented Jun 9, 2016

Is this the same run that is causing issues? I see on that link that you have run this before and that the job ran successfully. Any modifications other than paths?

Will this include Doug's fix?

@golaz (Contributor) commented Jun 10, 2016

@ndkeen: yes, this is the run that experiences sporadic restart issues. After the model hangs and is killed, I can restart it by rolling back to the previous set of complete restart files, which is why I have been able to get to 80 years so far. By default, my script does not include Doug's fixes for year 34; I added those fixes manually in my case_scripts directory.

@ndkeen (Contributor) commented Jun 10, 2016

I ran the script @golaz provided. It is a 375-node job for 36 hours. I'm not sure how to make the modifications to ensure that the run does not hit the same problem Doug fixed (douglasjacobsen/mpas/remove-reference-time). Originally we were looking for sporadic, large slowdowns, possibly in I/O. Now that Doug has explained the problem and made the fix, I think I should first check whether it fixes all of your woes, and/or look around for more issues along those lines, before going much further.

@ndkeen (Contributor) commented Jun 10, 2016

One thing that has been bothering me is why we aren't seeing a stack trace when the job is killed. That would be invaluable for finding the issue, since we could see where the code was at the time of the SIGTERM (or whatever signal is issued when the job is cancelled). I see that we are using the -traceback flag even in non-debug builds (which is good!). I'm testing small examples and I do get stack traces when a job runs out of time. Since it seems difficult for me to reproduce the problem, I'm trying to think of other ways I can help.

One thing I'm looking at is the --signal flag to srun, which allows a signal to be sent to the job earlier. For example, we could use it to tell the job to start shutting down 60 seconds before the walltime is up, which might give it more time to "finish".

srun -n 3 --signal=SIGTERM@40 ./a.out
My understanding is that this sends the process a SIGTERM 40 seconds before end time.
But this may not be our problem.
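If we want the batch shell itself to catch the warning signal and log something, a variant like the following might work (a hedged, untested sketch; the Slurm --signal syntax and the "B:" behaviour may differ by version):

# hedged sketch: with "B:" the signal goes to the batch shell, so it can be trapped there
#SBATCH --signal=B:TERM@60
trap 'echo "walltime warning caught at $(date)"' TERM
srun -u -n 3 ./a.out &    # background + wait so the trap can fire before srun returns
wait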

@douglasjacobsen: How can we ensure that there are not more examples of the infinite loop you discovered? Where exactly was it? If I wanted to devise a simple example, I could try to mimic what the code might have been doing and see if a job cancel gives me a stack trace.

@douglasjacobsen (Member):

@ndkeen: The infinite loop is here: https://github.com/ACME-Climate/MPAS/blob/ocean/develop/src/framework/mpas_timekeeping.F#L1741

There is the potential for this loop to still be a problem, but only if a single run of ACME with MPAS goes longer than 34 years in a single submit.

@ndkeen (Contributor) commented Jun 14, 2016

The long-running job @golaz suggested finally made it through the queue and, sure enough, it "stalled" after 364 days. The job seemed to be standing still after multiple hours, so I killed it.

The last file it wrote to was
20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01long.cam.rs.0002-01-01-00000.nc

@mt5555 suggested this behavior might be random -- and that if I ran the same run again, it might not stall in the same place/way. I can try repeating. I can try building debug. I can try starting it again from a restart (if instructions are given) if that helps.

@ndkeen (Contributor) commented Jun 14, 2016

I cancelled the job. This may be a clue or spurious information, but while the job was still running, the size of the cam restart file (named in the comment above) was 67497680 bytes and the timestamp was June 13, 5:20 PM (from ls -l).

After cancelling the job, the timestamp is identical; however, the size changed to 6873328648 bytes.

This may suggest that the code is doing something related to writing that NetCDF file when it hangs.
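Next time it hangs, one simple way to check whether the writer is completely stalled or still slowly making progress would be to poll the file (a minimal sketch using plain ls):

# hedged sketch: watch the suspect restart file's size/timestamp while the job looks hung
while true; do
  ls -l --full-time 20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01long.cam.rs.0002-01-01-00000.nc
  sleep 60
done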

@ndkeen (Contributor) commented Jun 14, 2016

I submitted it again to see if it stops in the same way. It's already running, since I decreased the requested time (it stalls in less than 6 hours). NERSC consulting suggested I run with "srun -u" to get unbuffered output, and I am trying that experiment as well.

@mt5555 (Contributor) commented Jun 14, 2016

I agree -- it looks like the code was hung during I/O. It could be some kind of race condition or bug in our code, or just random Lustre problems -- the worst kind of thing to debug. Based on my (outdated) experience from the v0 simulations, lowering the number of MPI tasks participating in the file writing might make it more robust.

If it continues to hang during restart writes, as opposed to other I/O, you may be able to trigger it more quickly by editing env_run.xml and setting the restart frequency to monthly.
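Something like this from the case directory should do it (a minimal sketch; the REST_OPTION/REST_N names and the exact xmlchange flags may vary by version):

# hedged sketch: request monthly restart files
./xmlchange -file env_run.xml -id REST_OPTION -val nmonths
./xmlchange -file env_run.xml -id REST_N -val 1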

@worleyph (Contributor):

@ndkeen, can you (or someone) look inside the file being written during the hang? I vaguely remember a performance issue in the recent past where the immediate source of the problem was the model writing out NaNs and the like.

@ndkeen (Contributor) commented Jun 14, 2016

Yeah, sure -- I was even going to ask that -- how does one look into a NetCDF file? ncdump?

@worleyph (Contributor):

Note that I did not volunteer -- someone else needs to speak up. (ncdump? Something better for large files? Can we even read a file that the model was still writing to when it aborted?)

@mt5555 (Contributor) commented Jun 14, 2016

ncdump probably won't work if the job was killed while writing the file (the write doesn't complete, so the metadata is not updated).

@ndkeen (Contributor) commented Jun 14, 2016

I did use ncdump and can look at the text output. I don't see any NaNs.

ncdump 20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01long.cam.rs.0002-01-01-00000.nc > o
grep -i nan o
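A slightly more careful check along the same lines (a minimal sketch; ncdump prints floating-point NaNs as "NaNf"/"NaN" in CDL, so a case-insensitive grep should catch both):

# hedged sketch: check the header first (fails fast if the metadata is incomplete), then scan the values
ncdump -h 20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01long.cam.rs.0002-01-01-00000.nc
ncdump 20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01long.cam.rs.0002-01-01-00000.nc | grep -ci nan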

@ndkeen (Contributor) commented Jun 16, 2016

I have 4 new results. One is a simple repeat of my earlier attempt where the run stalled after 364 days. This time it stalled in a different location (after 211 days), and the last files it touched are different as well (i.e. not the cam restart). Here are the files with the same timestamp -- after the job was killed, some other logs were written 5 hours later.

-rw-rw-r-- 1 ndk ndk      440712 Jun 14 15:32 rof.log.160614-140505
-rw-rw-r-- 1 ndk ndk    26002488 Jun 14 15:32 20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01longb.mosart.h0.0001-07.nc
-rw-rw-r-- 1 ndk ndk     1049600 Jun 14 15:32 am.mpas-o.globalStats.0001-08-01.nc
-rw-rw-r-- 1 ndk ndk     2514208 Jun 14 15:32 am.mpas-cice.regionalStatistics.0001.nc
-rw-rw-r-- 1 ndk ndk     1194896 Jun 14 15:32 am.mpas-o.meridionalHeatTransport.0001-08-01.nc
-rw-rw-r-- 1 ndk ndk     1303168 Jun 14 15:32 am.mpas-o.layerVolumeWeightedAverage.0001-08-01.nc
-rw-rw-r-- 1 ndk ndk    68495936 Jun 14 15:32 20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01longb.clm2.h0.0001-07.nc
-rw-rw-r-- 1 ndk ndk     1053688 Jun 14 15:32 am.mpas-o.surfaceAreaWeightedAverages.0001-08-01.nc
-rw-rw-r-- 1 ndk ndk   302402528 Jun 14 15:32 am.mpas-cice.timeSeriesStatsMonthly.0001-08-01.nc
drwxrwxr-x 3 ndk ndk       12288 Jun 14 15:32 ./
-rw-rw-r-- 1 ndk ndk   495107528 Jun 14 15:32 am.mpas-o.mixedLayerDepths.0001-08-01.nc
-rw-rw-r-- 1 ndk ndk  3475878868 Jun 14 15:32 20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01longb.cam.h0.0001-07.nc

So that's troubling.

I then ran another run using srun -u, which gives unbuffered output; the idea is that it might give us more information in the log files when the job is killed or cancelled. It seems to stall in the same place as the first run (after 364 days) and there is slightly different log output, but I don't see anything helpful so far.

I then built a debug version and ran for ~10 hours with nothing interesting. The run is incredibly slow. Days are taking 2 hours (!) to run instead of the usual 23 seconds.

THEN I tried the latest Intel compiler (v17) and once again built a debug version. This time, the run stopped fairly quickly with an actual array-out-of-bounds error. I emailed @mt5555 about it since it happens in HOMME. I'm also adding some prints to find out more and see if it is real.

I'm also experimenting with Intel's Inspector, which can help find memory issues.
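Roughly along these lines (a hedged sketch; I haven't settled on the right options yet, and Inspector's CLI flags and MPI usage differ by release):

# hedged sketch: run the memory-error analysis (mi2) under srun on a reduced task count;
# per-rank result directories may be needed to avoid collisions
srun -n 16 inspxe-cl -collect mi2 -result-dir insp_results -- ./cesm.exe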

@ndkeen (Contributor) commented Jun 17, 2016

The error I found in the comment above (array out of bounds in HOMME) had actually already been found and fixed in another GitHub issue a few days ago. I applied the fix (adding an initialization) and tried the debug run again. It got past that point and stopped here:

5400:  pionfput_mod.F90         128           1           1          64           1
5400:  0001-01-01_00:00:00
6012: forrtl: error (65): floating invalid
6012: Image              PC                Routine            Line        Source
6012: cesm.exe           000000000E2C6061  Unknown               Unknown  Unknown
6012: cesm.exe           000000000E2C3EC7  Unknown               Unknown  Unknown
6012: cesm.exe           000000000E272E44  Unknown               Unknown  Unknown
6012: cesm.exe           000000000E272C56  Unknown               Unknown  Unknown
6012: cesm.exe           000000000E1F8CA4  Unknown               Unknown  Unknown
6012: cesm.exe           000000000E204A2D  Unknown               Unknown  Unknown
6012: cesm.exe           000000000DB81820  Unknown               Unknown  Unknown
6012: cesm.exe           000000000B86BA32  ocn_mixed_layer_d         416  mpas_ocn_mixed_layer_depths.f90
6012: cesm.exe           000000000B6ACB8D  ocn_analysis_driv        1118  mpas_ocn_analysis_driver.f90
6012: cesm.exe           000000000B6A5453  ocn_analysis_driv         681  mpas_ocn_analysis_driver.f90
6012: cesm.exe           0000000009A273EF  ocn_comp_mct_mp_o         830  ocn_comp_mct.f90
6012: cesm.exe           000000000045AF97  component_mod_mp_        1049  component_mod.F90
6012: cesm.exe           000000000042CA68  cesm_comp_mod_mp_        3249  cesm_comp_mod.F90
6012: cesm.exe           000000000043713B  MAIN__                    107  cesm_driver.F90
6012: cesm.exe           0000000000400A0E  Unknown               Unknown  Unknown
6012: cesm.exe           000000000E2CCF51  Unknown               Unknown  Unknown
6012: cesm.exe           00000000004008F5  Unknown               Unknown  Unknown
srun: error: nid02755: task 6012: Aborted

I also started another non-debug run; it's been running for 9 hours (I only set it to run for 10 hours) and is at day 1411! So it's clearly well beyond where it was before.

@mt5555 (Contributor) commented Jun 21, 2016

@ndkeen: I saw somewhere that you suggested the HOMME out-of-bounds bug might have been behind some of the intermittent crashes/hangs discussed here?

@ndkeen (Contributor) commented Jun 21, 2016

@mt5555: yes. In fact, I mentioned that in my comment above.

Since that comment, I submitted another, longer run, but it has not started yet. I am fairly confident that this bug fix (which is currently on next, #918) will allow the simulation to continue. Is this run expected to go for 10 years?

Also, note that I finally looked into the above stack trace and we think we have a fix:
#920

I'm trying this now. I made the change, recompiled with the same level of debug, and am trying to run it for a few hours to verify that it gets past this point.

@ndkeen (Contributor) commented Jul 1, 2016

A long run completed successfully. It ran for ~30 wall-clock hours and simulated 14 years. For the most part, the run was quick. Ignoring days on month boundaries (where restart files are written), I count 23 days (out of over 5000) that were more than 2x slower than expected, and only 5 that were over 5x slower.

Run directory here:
/scratch2/scratchdirs/ndk/ACME_simulations/20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01long

I think this shows that we have fixed the original problem. There were 4 issues uncovered:

  1. A bug that was triggered after 34 years. Fixed here: douglasjacobsen/mpas/remove-reference-time
  2. A missing initialization that likely caused memory corruption. Fixed here: https://github.com/ACME-Climate/ACME/pull/918
  3. A floating-invalid bug, caught in debug mode, that may or may not affect optimized runs. Discussed here: https://github.com/ACME-Climate/ACME/issues/920
  4. Ongoing mysterious slowdowns on Edison: https://github.com/ACME-Climate/ACME/issues/820

We are still trying to debug (4), but that is a different GitHub issue (assuming the slowdowns are the same).

[attached: timing plot]

@ndkeen (Contributor) commented Jul 5, 2016

I ran the simulation yet again: 14 years. As far as I know, it was an identical run -- just another test to debug performance problems. I can make another plot, but it looks like the daily timings are even better than in the plot above -- more consistent, with fewer blips.

I think we can close this issue for now -- any thoughts? Maybe leave the other GitHub issue about slowdowns on Edison open.

edison02% ls /scratch2/scratchdirs/ndk/ACME_simulations/20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01long
./  ../  build/  case_scripts/  run/
edison02% ls /scratch2/scratchdirs/ndk/ACME_simulations/20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01longb
./  ../  build/  case_scripts/  run/

@mt5555 closed this as completed on Jul 20, 2016
@mt5555 (Contributor) commented Jul 20, 2016

Closing per @ndkeen's recommendation above.

For random slowness problems on Edison, see #820
