Edison: coupled simulation died when creating restart files #854
Edison, the gift that keeps on giving...
From your run script, it doesn't look like you are setting PIO_NUMIOTASKS, which might mean you are using the defaults. The defaults usually try to use many more I/O tasks than is helpful, which can make the run less stable.
From the cpl log I see that the number of PIO I/O tasks is 40 (env_run.xml sets stride to 24).
I have seen similar errors when I have left OCN threaded (which I assume is not your case).
Does the same case succeed (run without failures) in the rest (30%) of the cases? Looking at your runs above, it looks like not all runs have the same configuration. So my question is: has the same case/configuration run successfully for you? Has the success rate improved since last week?
Since this is I/O related, I would try a much smaller PIO_NUMIOTASKS, around 8 or so.
@jayeshkrishna, sometimes runs from the same configuration are successful. I only did one test (the first case above), so I don't have a sample size large enough to say whether this problem improved this week. But this first try failed. @mt5555, I'll try PIO_NUMIOTASKS=8 for the same case.
@mt5555, I don't see PIO_NUMIOTASKS in env_run.xml. Do you mean PIO_NUMTASKS?
Yes, sorry about that. In the ACME v0 days, if you set PIO_NUMTASKS without also setting PIO_STRIDE, the PIO_NUMTASKS value would get reset back to the defaults. So you'll have to check the log files to make sure it is really working.
So, what settings do I need to change? PIO_NUMTASKS=8 only? Or do I need to change PIO_STRIDE as well? I am not familiar with those and usually don't change them.
Please change both. (Note that the stride is preserved even if the number of I/O tasks is wrong for a component.)
@jayeshkrishna, can @tangq just set PIO_STRIDE to -1 and let PIO_NUMTASKS determine the stride? (Is that how -1 works in this case?)
Hopefully this is fixed, but in ACME v0 that would only work if all components used the same number of MPI tasks. If a component used fewer than the total available MPI tasks, a bug in the logic would force the component to reset back to the system defaults for stride and numtasks.
Setting only PIO_NUMTASKS should work (it is supposed to), and if it doesn't I can fix it. AFAICR, PIO resets the settings if the PIO_NUMTASKS + PIO_STRIDE combination is not valid for a component.
@jayeshkrishna, my concern is that the default layout is setting stride to 24, not to -1. Does @tangq need to reset the stride to -1?
@tangq, please set PIO_STRIDE=-1 and PIO_NUMTASKS=8 and check the logs to make sure that it is working as you expect. If not, I will fix it for you (I have always set both the tasks and the stride when I run my experiments, just old habit). This should work (no PIO reset to default settings) as long as all components have at least 8 MPI tasks.
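A minimal sketch of how those settings might be applied from the case directory, assuming the xmlchange workflow used by the scripts of that era; the exact flag syntax and the log search strings are assumptions, not taken from this thread:

```
# Sketch: set the PIO I/O decomposition in env_run.xml (syntax may differ by scripts version)
cd $casedir
./xmlchange -file env_run.xml -id PIO_STRIDE   -val -1
./xmlchange -file env_run.xml -id PIO_NUMTASKS -val 8

# After a short run, confirm in the coupler log that PIO really used 8 I/O tasks
# (the grep patterns below are guesses; check the actual wording in cpl.log)
grep -i "pio" $rundir/cpl.log.* | grep -i -e iotask -e stride
```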
@jayeshkrishna: hopefully this is fixed now, but for the record, in the CESM1.2 days, with PIO_NUMTASKS=8 and PIO_STRIDE=-1, PIO_STRIDE would be computed for all components based on the total number of MPI tasks, not the number of MPI tasks used by the component, causing bad values if a component was running on a subset of the MPI tasks.
This behaviour is definitely a bug and I can try some cases later today to try to reproduce it.
@jayeshkrishna, should @tangq set both stride and numtasks for each component separately, or will the global settings be sufficient?
The global settings should work (if he has all the components running with the same number of tasks; otherwise the stride would be preserved).
I used the same case and changed both PIO_NUMTASKS=8 and PIO_STRIDE=8 for a 5-day test run writing restart files every day. The same problem occurred on the 4th day, but the restart files were created successfully on the 2nd and 3rd days.
I am looking into this; this could be a bug in PIO or OCN (the last time I saw a similar issue, OCN was multithreaded).
Thanks @jayeshkrishna. I also noticed that the model can die while different modules are creating restart files. In the latest test, only one restart file was created, but in the second case at the top of this page, more than one restart file was created. This may imply that more than one module causes the problem.
@tangq: I am currently trying to recreate this issue. Do you have a list of commands to set up the case (I looked at the attached script but it is too long and I would like to set it up manually)? I followed the steps below to run my case (and this case ran successfully - did not crash):
@jayeshkrishna, will you be using scratch2? It looks like @tangq is using scratch2 for the run directory. You'd hope that this would not matter, but it is a variable that should be examined.
accident |
grrr |
Starting with douglasjacobsen/mpas/remove-reference-time, I tried the coupled case @golaz documented above. It ran fine for 5 days.
@ndkeen: I uploaded my script to check out the code, compile, and configure the simulation on the confluence page: https://acme-climate.atlassian.net/wiki/display/SIM/20160520.A_WCYCL1850.ne30_oEC.edison.alpha6_01 The script should help you get started running alpha6_01.
Is this the same run that is causing issues? I see on that link that you have run this before and that the job ran successfully. Any modifications other than paths? Will this include Doug's fix?
@ndkeen: yes, this is the run that experiences sporadic restart issues. After the model hangs and is killed, I can restart it by rolling back to the previous set of complete restart files, which is why I was able to get to 80 years so far. By default, my script does not include Doug's fixes for year 34; I added those fixes manually in my case_scripts directory.
I ran the script @golaz provided. It is a 375-node job for 36 hours. I'm not sure how to make the mods to ensure that the run does not run into the same problem Doug fixed (douglasjacobsen/mpas/remove-reference-time). Originally, we were looking for sporadic/large slowdowns, possibly in I/O. Now that Doug has explained the problem and made the fix, I think I should first make sure this fix doesn't already cure all your woes, and/or look around for more problems along those lines, before going much further.
One thing that has been bothering me is why we aren't seeing a stack trace when the job is killed. This would be invaluable for finding the issue if we could see where the flow was at the time of the SIGTERM (or whatever signal is issued when the job is cancelled). I see that we are using the -traceback flag even in non-debug builds (which is good!). I'm testing small examples and I'm getting stack traces when the job runs out of time. As it seems difficult for me to reproduce the problem, I'm trying to think of ways I can still help. One thing I'm looking at is the
@douglasjacobsen: How can we ensure that there are not more examples of the infinite loop you discovered? Where exactly was it? If I wanted to devise a simple example, I could try to mimic what the code might have been doing and see if a job cancel gives me a stack trace.
@ndkeen: The infinite loop is here: https://github.com/ACME-Climate/MPAS/blob/ocean/develop/src/framework/mpas_timekeeping.F#L1741 There is the potential for this loop to still be a problem, but only if a single run of ACME with MPAS goes longer than 34 years in a single submit.
The long-running job @golaz suggested finally made it through the queue and, sure enough, it "stalled" after 364 days. The job seemed to be standing still after multiple hours, so I killed it. The last file it wrote to was the cam restart file. @mt5555 suggested this behavior might be random -- and that if I ran the same run again, it might not stall in the same place/way. I can try repeating. I can try building debug. I can try starting it again from a restart (if instructions are given) if that helps.
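In case it helps, a rough sketch of restarting from the last complete restart set, assuming the usual CESM/ACME rpointer + CONTINUE_RUN mechanism; the file and script names here are illustrative, not taken from this case:

```
# 1. In the run directory, make sure every rpointer.* file points at the last
#    COMPLETE set of restart files (roll them back by hand if the newest set is partial)
cd $rundir
cat rpointer.*

# 2. In the case directory, mark the run as a continuation and resubmit
cd $casedir
./xmlchange -file env_run.xml -id CONTINUE_RUN -val TRUE
./$CASE.submit    # or whatever the submit script is called in this scripts version
```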
I cancelled the job. Now this may be a clue or spurious info, but while the job was still running, the cam restart file (pasted in the above comment) had one size; after cancelling the job, the date was identical, but the size had changed. This may suggest that the code was doing something related to writing that netcdf file when it hung.
I submitted it again to see if it stops in the same way. It's already running, as I decreased the requested time since it stalls in less than 6 hours. NERSC consulting suggested I run with "srun -u" to get unbuffered output, and I am trying that experiment as well.
I agree - it looks like the code was hung during I/O. It could be some kind of race condition/bug in our code, or just random Lustre problems - the worst kind of thing to debug. Based on my (outdated) experience from the v0 simulations, lowering the number of MPI tasks participating in the file writing might make it more robust. If it continues to hang during restart, as opposed to other I/O, you may be able to trigger it more quickly by editing env_run.xml and setting the restart frequency to monthly.
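For reference, a sketch of that change, assuming the standard REST_OPTION / REST_N variables in env_run.xml and the same xmlchange caveats as above:

```
# Sketch: write restart files every month instead of the default frequency
cd $casedir
./xmlchange -file env_run.xml -id REST_OPTION -val nmonths
./xmlchange -file env_run.xml -id REST_N      -val 1
```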
@ndkeen, can you (or someone) look in the file being written during the hang? ... I vaguely remember a performance issue in the recent past where the immediate source of the problem was the model writing out NaNs and the like.
Yea sure -- I was even going to ask that -- how does one look into a netcdf file? ncdump?
Note that I did not volunteer - someone else needs to speak up. (ncdump? Something better for large files? Can we even read a file that the model was still writing to when it aborted?)
ncdump probably won't work if the job is killed while writing the file (so the write doesn't complete and the metadata is not updated).
I did use ncdump and can look at the text file. I don't see any NaNs.
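A sketch of the kind of check done here; the file name is a placeholder, and the restart may not be readable at all if the write was interrupted:

```
# Dump the header first to see whether the file is readable/complete
ncdump -h $rundir/<cam-restart-file>.nc | head

# Full dump to text, then scan for NaNs (the text file can be very large for a full restart)
ncdump $rundir/<cam-restart-file>.nc > restart_dump.txt
grep -ci "nan" restart_dump.txt
```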
I have 4 new results. One is a simple repeat of my above attempt where the run stalled after 364 days. This time it stalled at a different location (after 211 days), with the last files it touched being different as well (i.e., not the cam restart). Here are the files with the same timestamp -- after the job was killed, some other logs were written 5 hours later.
So that's troubling. And then I ran another run using the srun -u option mentioned above. I then built a debug version and ran for ~10 hours with nothing interesting; the run is incredibly slow - days are taking 2 hours (!) to run instead of the usual 23 seconds. THEN I tried the latest Intel compiler (v17) and once again built a debug version. This time, the run stopped fairly quickly with an actual array out of bounds. I emailed @mt5555 about it as it happens in HOMME. I'm also adding some prints to find out more and see if it is real, and I'm experimenting with Intel's Inspector, which can help find memory issues.
The error that I found in the above comment (array out of bounds in HOMME) had actually already been found and fixed in another github issue a few days ago. I applied this fix (adding an initialization) and tried the debug run again. It went beyond this point and stopped here:
I also started another non-debug run; it's been running for 9 hours (I only set it to run for 10 hours) and it is at day 1411! So it's clearly well beyond where it was before.
@ndkeen: I saw somewhere you suggested the HOMME out-of-bounds bug might have been behind some of the intermittent crashes/hangs discussed here?
@mt5555: yes. In fact, I mentioned that in my above comment. Since that comment, I submitted another longer run, but it has not started yet. I am fairly confident that this bug fix (which is currently on next, #918) will allow the simulation to continue. Is this run expected to go for 10 years? Also, note that I finally looked into the above stack trace and we think we have a fix; I'm trying it now. I made the change, re-compiled with the same level of debug, and am trying to run it for a few hours to verify it gets past this point.
A long run completed successfully. It ran for ~30 wall-clock hours and simulated 14 years. For the most part, the run was quick. Ignoring days on month boundaries (where restart files are written), I count 23 days (out of over 5000) that were more than 2x slower, and only 5 that were over 5x slower than expected. Run directory here: I think this shows that we have fixed the original problem. There were 4 issues uncovered:
We are still trying to debug (4), but that is a different github issue (assuming the slowdowns are the same).
I ran the simulation yet again: 14 years. As far as I know, it was an identical run -- just another test to debug performance problems. I can make another plot, but it looks like the daily timings are even better than the plot above -- more consistent and fewer 'blips'. I think we can close this issue for now? Any thoughts?
In the last couple of weeks, about 70% of my coupled simulations on edison have died in the middle of creating restart files (some components created their restart files successfully). This failure seems to occur randomly. I only see it for coupled simulations, not for atmosphere-only ones. Also, the model doesn't fail at the same place when rerunning the identical simulation, suggesting the cause of this issue might be I/O related.
Two examples (the script for reproducing the first example is attached):
Attached script: run_acme.alpha_20160419.A_WCYCL2000.edison.csh.txt
Example 1:
$casedir: /scratch2/scratchdirs/tang30/ACME_simulations/20160419.A_WCYCL2000.ne30_oEC.edison.alpha4_00/case_scripts
$rundir: /scratch2/scratchdirs/tang30/ACME_simulations/20160419.A_WCYCL2000.ne30_oEC.edison.alpha4_00/run
Example 2:
$casedir: /scratch2/scratchdirs/tang30/ACME_simulations/20160401.A_WCYCL2000.ne30_oEC.edison.alpha4_00/case_scripts
$rundir: /scratch2/scratchdirs/tang30/ACME_simulations/20160401.A_WCYCL2000.ne30_oEC.edison.alpha4_00/run