PhBaseWorkChain: add handler for scheduler out-of-walltime #754

unkcpz · 2021-11-19T10:39:45Z

fixes #753

Add a handler for `exit_status == 120`, which is set when the scheduler wall time is reached before the calculation can shut down cleanly.
The new calculation is restarted from scratch by setting `['INPUTPH']['recover'] = False`.

unkcpz · 2021-11-19T10:44:27Z

I am not sure whether setting `recover = False` will correctly restart from scratch. @sphuber You may want to have a close look at this PR. The `convergence_not_reached` handler says it restarts from scratch, but it sets `restart_calc` and thus `recover = True`, which to me is not a restart from scratch for a PH calculation.

unkcpz · 2021-11-22T15:33:17Z

@sphuber Could you have a look at this PR?

aiida_quantumespresso/parsers/ph.py

aiida_quantumespresso/workflows/ph/base.py

sphuber · 2021-11-23T08:57:46Z

aiida_quantumespresso/workflows/ph/base.py

+ max_seconds_new = max_seconds * factor
+
+ self.ctx.restart_calc = node
+ self.ctx.inputs.parameters.setdefault('INPUTPH', {})['recover'] = False


I am not too familiar with ph.x, so I am not sure if this is the correct way to do it, but it could very well be. I cannot really sign off on this as I have no idea. I would advise that we ask someone who is knowledgeable about ph.x to validate this strategy.

sphuber · 2021-11-23T08:59:31Z

> @sphuber Could you have a look at this PR?

Thanks @unkcpz . Mostly some minor comments. Regarding the restart strategy: as mentioned in a comment, I think we should ask someone that uses ph.x a lot, as I don't know.

sphuber · 2021-11-23T10:25:10Z

Thanks for the changes @unkcpz . Everything looks ok now. If we can get a confirmation on the use of `recover = False` I will merge this.

unkcpz · 2021-11-23T10:28:04Z

Thanks @sphuber ! I am asking Samuel about this; I think he has a lot of experience running ph calculations. Let's wait a while, and I will keep this PR updated.

unkcpz · 2021-11-23T17:21:42Z

Hi @sphuber , I confirmed with @sponce24 that the parent_folder should always be set. In our case, self.ctx.restart_calc is always set to the previous calculation node to get its remote_folder, either from a finished pw run or from a previous failed ph run.
The recover setting then controls whether the irreducible representations already saved are recalculated.
So for a 'pure' restart from scratch, set self.ctx.restart_calc and set recover=.false..
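The two steps above can be sketched in isolation. This is only an illustration under assumptions, not the code in `base.py`: `prepare_restart` is a hypothetical helper, and a `SimpleNamespace` with a plain dict stands in for the work chain context and its inputs.

```python
from types import SimpleNamespace


def prepare_restart(ctx, failed_node, recover):
    """Hypothetical helper: point the next run at the previous calculation and set ``recover``."""
    # Always keep the previous calculation so its remote folder can be used as parent folder.
    ctx.restart_calc = failed_node
    # Create the INPUTPH namelist if it is missing, then set the recover flag explicitly.
    ctx.inputs.parameters.setdefault('INPUTPH', {})['recover'] = recover
    return ctx


# Stand-in for the work chain context (assumption: the real context is richer than this).
ctx = SimpleNamespace(inputs=SimpleNamespace(parameters={}))

# Restart from scratch: restart_calc is set, but recover is False.
prepare_restart(ctx, failed_node='previous-ph-calculation', recover=False)
print(ctx.inputs.parameters)
```

Calling the same helper with `recover=True` would correspond to continuing from the saved data rather than restarting from scratch.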

sphuber · 2021-11-24T09:21:00Z

Thanks a lot @unkcpz

…iidateam#754)

Certain scheduler plugins can detect an out-of-walltime error, in which case the `ERROR_SCHEDULER_OUT_OF_WALLTIME` exit code will already have been set on the node when the actual output parser is called. The `PhParser` is updated to check for this exit code and, after having parsed as much as possible from the output, keeps the same exit code by not returning any other, more specific exit code.

The `PhBaseWorkChain` adds a new handler for this exit code and will perform a full restart by setting `recover = False`. It needs to be a full restart because with an OOW error from the scheduler, the state of the files on disk is almost certainly corrupt, as the scheduler will have killed the job while it was writing to disk.

Co-authored-by: Sebastiaan Huber <mail@sphuber.net>
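The parser behaviour described in this commit message can be sketched roughly as follows. This is a standalone illustration, not the `PhParser` code: `FakeNode`, `parse`, the `SPECIFIC_ERROR` value, and the `JOB DONE` completion check are all assumptions made for the example; only the value 120 is taken from the PR description.

```python
from dataclasses import dataclass
from typing import Optional

ERROR_SCHEDULER_OUT_OF_WALLTIME = 120  # exit status quoted in the PR description
SPECIFIC_ERROR = 999  # hypothetical parser-specific exit status, for illustration only


@dataclass
class FakeNode:
    """Hypothetical stand-in for a calculation node carrying an exit status."""
    exit_status: Optional[int] = None


def parse(node, stdout):
    """Parse what is available, then decide which exit status to report."""
    # Parse as much as possible, even from an incomplete output.
    parsed = {'lines_parsed': len(stdout.splitlines())}
    # Assumed completion check: treat a missing marker as a parser-specific error.
    specific = None if 'JOB DONE' in stdout else SPECIFIC_ERROR
    # If the scheduler already flagged out-of-walltime, keep that exit status.
    if node.exit_status == ERROR_SCHEDULER_OUT_OF_WALLTIME:
        return parsed, node.exit_status
    return parsed, specific


killed = FakeNode(exit_status=ERROR_SCHEDULER_OUT_OF_WALLTIME)
print(parse(killed, 'partial output\n'))
```

The point of the ordering is that the partial results are still collected, while the scheduler's exit status is not overwritten by a more specific one.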

PhBaseWorkChain: add handler for scheduler out-of-walltime

4d5d8cb

unkcpz force-pushed the scheduler-OOW branch from c2103dc to 4d5d8cb on November 19, 2021 10:46

sphuber requested changes Nov 23, 2021

unkcpz added 2 commits November 23, 2021 10:49

update review 1

5fa5e05

convergence handler: not achieved -> not reached

318aba5

unkcpz requested a review from sphuber November 23, 2021 17:22

Merge branch 'develop' into scheduler-OOW

80db4b1

sphuber approved these changes Nov 24, 2021

sphuber merged commit 56bb2df into aiidateam:develop Nov 24, 2021

unkcpz deleted the scheduler-OOW branch November 24, 2021 09:34
