PhBaseWorkChain: add handler for scheduler out of walltime #754
I am not so sure about this
c2103dc to 4d5d8cb
@sphuber Could you have a look at this PR?
max_seconds_new = max_seconds * factor

self.ctx.restart_calc = node
self.ctx.inputs.parameters.setdefault('INPUTPH', {})['recover'] = False
I am not too familiar with `ph.x`, so I am not sure if this is the correct way to do it, but it could very well be. I am just saying that I cannot really sign off on this, as I have no idea. I would advise that we ask someone who is knowledgeable about `ph.x` to validate this strategy.
Thanks for the changes @unkcpz. Everything looks ok now. If we can get a confirmation on the use of
Thanks @sphuber! I am asking Samuel about this; I think he has a lot of experience running ph calculations. Let's wait for a while, and I will keep this PR updated.
Hi @sphuber, I confirm with @sponce24 that the
Thanks a lot @unkcpz
…iidateam#754)

Certain scheduler plugins can detect an out-of-walltime error, in which case the `ERROR_SCHEDULER_OUT_OF_WALLTIME` exit code will already have been set on the node when the actual output parser is called. The `PhParser` is updated to check for this exit code and, after having parsed as much as possible from the output, keeps the same exit code by not returning any other, more specific exit code.

The `PhBaseWorkChain` adds a new handler for this exit code and will perform a full restart by setting `recover = False`. It needs to be a full restart because, with an out-of-walltime error from the scheduler, the files on disk are almost certainly corrupt, as the scheduler will have killed the job while it was writing to disk.

Co-authored-by: Sebastiaan Huber <mail@sphuber.net>
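The parser behaviour described in the commit message can be sketched in plain Python, without any AiiDA imports. This is only an illustration of the decision "keep the scheduler's exit code if it is already set"; the function name `choose_exit_status` and its arguments are hypothetical and are not part of `PhParser`. The value 120 is the exit status named in this PR's description.

```python
from typing import Optional

# Exit status mentioned in the PR description for the scheduler out-of-walltime case.
ERROR_SCHEDULER_OUT_OF_WALLTIME = 120


def choose_exit_status(existing_status: Optional[int], parser_status: Optional[int]) -> Optional[int]:
    """Decide which exit status to report once parsing has finished.

    `existing_status` is whatever was already set on the node before the
    parser ran (hypothetical argument); `parser_status` is the more specific
    status the parser would otherwise return.
    """
    if existing_status == ERROR_SCHEDULER_OUT_OF_WALLTIME:
        # The scheduler already flagged the job as killed: keep that status
        # instead of overriding it with a more specific parser error.
        return existing_status
    return parser_status
```

Under these assumptions, a node that already carries status 120 keeps it regardless of what the parser finds, while any other node gets the parser's own status.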
Fixes #753

Add a handler for `exit_status == 120`, which is set when the scheduler wall time is reached before the calculation shuts down cleanly. The new calculation is restarted from scratch by setting `['INPUTPH']['recover'] = False`.
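A minimal sketch of what this handler does to the inputs, using plain dictionaries rather than the work chain context. The helper name `prepare_full_restart` is hypothetical, and writing the scaled value back into `max_seconds` is an assumption: the diff fragment quoted in the review only shows `max_seconds_new = max_seconds * factor` and the `recover = False` assignment. The `factor` value used below is likewise illustrative.

```python
import copy


def prepare_full_restart(parameters: dict, max_seconds: float, factor: float) -> dict:
    """Return a copy of `parameters` set up for a restart from scratch.

    Hypothetical helper: mirrors the two steps visible in the diff fragment.
    """
    new_parameters = copy.deepcopy(parameters)  # leave the caller's inputs untouched
    inputph = new_parameters.setdefault('INPUTPH', {})
    # Full restart: files on disk may be corrupt after the scheduler killed the job.
    inputph['recover'] = False
    # Assumption: the reduced time limit is stored back under `max_seconds`.
    inputph['max_seconds'] = max_seconds * factor
    return new_parameters


params = {'INPUTPH': {'recover': True, 'tr2_ph': 1e-16}}
restarted = prepare_full_restart(params, max_seconds=3600, factor=0.5)
```

The `setdefault('INPUTPH', {})` call follows the pattern in the diff, so the helper also works when the namelist is missing from the inputs.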