Feature request: signal handling when job is ending #1336
Comments
As far as I'm aware, catching signals in Fortran is still non-standard, though many compilers (like GNU: https://gcc.gnu.org/onlinedocs/gfortran/SIGNAL.html) support it. Maybe we can investigate whether this would be feasible to do in C using the POSIX standard.
It may also be the case that memory allocation in signal handlers is forbidden (https://stackoverflow.com/questions/33619071/signal-handling-and-check-pointing-for-mpif90/33647381), so we may need to do sneaky things like setting a global variable and returning from the handler, then relying on code that periodically checks the global variable to determine whether, e.g., restart files should be written.
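For illustration, the "set a global variable and return" pattern could look roughly like the following in C with POSIX `sigaction`. This is only a sketch: the names `checkpoint_requested`, `install_checkpoint_handler`, and `take_checkpoint_request` are hypothetical and not part of MPAS, and the choice of signal is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Flag written by the handler and polled by the model's time loop.
 * volatile sig_atomic_t is the object type the C standard allows a
 * handler to write safely. (Hypothetical name, not an MPAS symbol.) */
static volatile sig_atomic_t checkpoint_requested = 0;

/* The handler only sets the flag: no allocation, no I/O, no MPI calls. */
static void on_signal(int signo)
{
    (void)signo;
    checkpoint_requested = 1;
}

/* Install the handler for the given signal (e.g. SIGUSR1).
 * Returns 0 on success, -1 on failure, as sigaction() does. */
int install_checkpoint_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; /* restart interrupted system calls */
    return sigaction(signo, &sa, NULL);
}

/* Called periodically from normal code (e.g. once per time step).
 * Returns 1 if a request arrived since the last call, then clears it.
 * A signal landing between the test and the clear is merged into the
 * current request, which is harmless for this purpose. */
int take_checkpoint_request(void)
{
    if (checkpoint_requested) {
        checkpoint_requested = 0;
        return 1;
    }
    return 0;
}
```

The time-stepping code would call `take_checkpoint_request()` at a safe point and, if it returns 1, write the restart files from ordinary (non-handler) context. Functions like these could be reached from Fortran through `bind(C)` interfaces, though that wiring is not shown here.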
@mgduda, thanks for the feedback on this. It was unclear to me whether this is standard Fortran or not. The PDF I linked to seemed to imply it is now standard but that some compilers still include their previously non-standard way of handling it. (But it didn't come out and say that, so I still wasn't sure.) In any case, I suspect this feature may not end up being super useful: we generally want to protect against unexpected termination, so if we are already writing restarts at a regular frequency, the ability to force a restart at wallclock end might not add a whole lot of extra value. Getting timer information on a run that times out might be useful, but maybe not worth the hassle here. I mostly wanted to jot this idea down for posterity, so thanks for adding to it.
@mgduda, reading the link you added more carefully, perhaps a more useful way to get this behavior would be to include timing functionality that triggers writing a restart and/or model termination after a specified wallclock duration that is configurable at runtime (i.e. in the namelist you set You would have to modify that time with each job to keep it consistent with your job submission script, of course, but in my original proposal you already needed to include a special line in the submission script with the appropriate time and an arbitrarily chosen signal code, so it isn't that much worse. Another advantage of this approach is that I don't think it would require any framework changes: any core could implement it on its own now.
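As a rough sketch of the wallclock-limit idea, the check itself can be very small. The version below is in C using POSIX `clock_gettime`; the type `wallclock_budget` and the function names are hypothetical, and the limit is assumed to have already been read from some runtime configuration.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Start time plus the configured limit, in seconds.
 * (Hypothetical type; the limit would come from runtime configuration.) */
typedef struct {
    struct timespec start;
    double limit_seconds;
} wallclock_budget;

/* Record the start time and the limit. Returns 0 on success, -1 on error. */
int wallclock_budget_start(wallclock_budget *b, double limit_seconds)
{
    b->limit_seconds = limit_seconds;
    /* CLOCK_MONOTONIC is unaffected by system clock adjustments. */
    return clock_gettime(CLOCK_MONOTONIC, &b->start);
}

/* Seconds elapsed since wallclock_budget_start() was called. */
double wallclock_elapsed(const wallclock_budget *b)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (double)(now.tv_sec - b->start.tv_sec)
         + (double)(now.tv_nsec - b->start.tv_nsec) * 1e-9;
}

/* Returns 1 once the elapsed time has reached the limit, else 0. */
int wallclock_budget_exceeded(const wallclock_budget *b)
{
    return wallclock_elapsed(b) >= b->limit_seconds;
}
```

A time loop would call `wallclock_budget_exceeded()` once per step and, when it returns 1, write a restart and stop. In a parallel run the ranks would presumably need to agree on the decision, for example by having one rank evaluate the check and share the result, since their clocks will not cross the limit on exactly the same step.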
Indeed, we had this capability in POP for a while, but it was problematic because it was non-standard. |
@matthewhoffman The second option that you proposed, the ability to force-write a restart stream after some specified elapsed wallclock time, would be rather easy to implement in a portable way, I think. The Fortran
It is possible to have a queue system send a signal when the job is near its end time, e.g.:
https://slurm.schedmd.com/sbatch.html
Within Fortran, it is possible to catch a signal, e.g.:
https://www.sharcnet.ca/help/images/4/42/Fortran_Signal_Handling.pdf
Combining these, it would be possible for MPAS to catch a signal if the job is ending and then do things like write a restart and terminate cleanly.
Without thinking about it more carefully, I'm not sure that any Framework changes would be necessary for this to be implemented in a core.