PhBaseWorkChain: add handler for scheduler out of walltime #754

Merged: 4 commits, Nov 24, 2021

unkcpz commented Nov 19, 2021

fixes #753

Add a handler for exit_status==120, which is set when the scheduler wall time is reached before the calculation has shut down cleanly.
The new calculation is restarted from scratch by setting ['INPUTPH']['recover']=False.
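
A minimal sketch of the idea described above, written with plain dictionaries so it runs without AiiDA. The function name `prepare_restart_from_scratch` and the constant name are hypothetical and only serve the illustration; the exit status 120 and the `['INPUTPH']['recover']` key come from the description. The actual handler in this PR also sets `self.ctx.restart_calc`, which is not modelled here.

```python
from copy import deepcopy

# Exit status quoted in the PR description for the out-of-walltime case.
EXIT_STATUS_OUT_OF_WALLTIME = 120


def prepare_restart_from_scratch(parameters, exit_status):
    """Return (handled, parameters) for the next attempt.

    Hypothetical helper: it only mimics the parameter change made by the handler.
    """
    if exit_status != EXIT_STATUS_OUT_OF_WALLTIME:
        # Not our case: leave the inputs untouched.
        return False, parameters

    new_parameters = deepcopy(parameters)
    # Same dictionary update as in the PR: create INPUTPH if missing, disable recover.
    new_parameters.setdefault('INPUTPH', {})['recover'] = False
    return True, new_parameters
```

The copy keeps the inputs of the failed attempt unchanged, so they can still be inspected after the restart has been prepared.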

unkcpz commented Nov 19, 2021

I am not sure that setting recover=False will correctly restart from scratch. @sphuber, you may want to take a close look at this PR. The convergence_not_reached handler says it restarts from scratch, but it sets restart_calc and thus recover=True, which to me is not a restart from scratch for a PH calculation.

unkcpz commented Nov 22, 2021

@sphuber Could you have a look at this PR?

aiida_quantumespresso/parsers/ph.py (outdated, resolved)
aiida_quantumespresso/workflows/ph/base.py (outdated, resolved)
aiida_quantumespresso/workflows/ph/base.py (resolved)
aiida_quantumespresso/workflows/ph/base.py (resolved)
max_seconds_new = max_seconds * factor

self.ctx.restart_calc = node
self.ctx.inputs.parameters.setdefault('INPUTPH', {})['recover'] = False

I am not too familiar with ph.x so I am not sure if this is the correct way to do it, but it could very well be. Just saying that I cannot really sign off on this as I have no idea. I would advise that we ask someone that is knowledgeable on ph.x to validate this strategy.

sphuber commented Nov 23, 2021

> @sphuber Could you have a look at this PR?

Thanks @unkcpz . Mostly some minor comments. Regarding the restart strategy: as mentioned in a comment, I think we should ask someone that uses ph.x a lot, as I don't know.

sphuber commented Nov 23, 2021

Thanks for the changes @unkcpz . Everything looks ok now. If we can get a confirmation on the use of recover = True I will merge this.

unkcpz commented Nov 23, 2021

Thanks @sphuber! I am asking Samuel about this, as I think he has a lot of experience running ph calculations. Let's wait a while, and I will keep this PR updated.

unkcpz commented Nov 23, 2021

Hi @sphuber, I confirmed with @sponce24 that the parent_folder should always be set. In our case self.ctx.restart_calc is always set to the previous calculation node to get its remote_folder, either from a finished pw run or from a previous failed ph run.
Setting recover=.false. then controls whether the irreducible representations that were already saved are recalculated.
So for a 'pure' restart from scratch, set self.ctx.restart_calc and set recover=.false..
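
To make the two restart modes concrete, here is a small sketch under stated assumptions: `build_restart` and `to_fortran` are hypothetical helpers, the folder path is a placeholder string rather than a real `remote_folder` node, and the namelist rendering is deliberately simplified. It only illustrates that the parent folder is passed in both modes, while `recover` is what distinguishes a restart from scratch (`.false.`) from a recovery run (`.true.`).

```python
def to_fortran(value):
    """Render a Python value in Fortran namelist syntax (simplified)."""
    if isinstance(value, bool):
        return '.true.' if value else '.false.'
    if isinstance(value, str):
        return f"'{value}'"
    return str(value)


def build_restart(previous_remote_folder, inputph, from_scratch):
    """Return the parent folder and a rendered &INPUTPH namelist.

    Hypothetical helper: the parent folder is always kept, only `recover` changes.
    """
    namelist = dict(inputph)
    namelist['recover'] = not from_scratch
    lines = ['&INPUTPH']
    lines += [f'  {key} = {to_fortran(val)}' for key, val in sorted(namelist.items())]
    lines.append('/')
    return previous_remote_folder, '\n'.join(lines)
```

Calling `build_restart('/scratch/previous_run', {'tr2_ph': 1e-16}, from_scratch=True)` returns the same folder together with a namelist that contains `recover = .false.`.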

@unkcpz unkcpz requested a review from sphuber November 23, 2021 17:22
@sphuber sphuber merged commit 56bb2df into aiidateam:develop Nov 24, 2021
sphuber commented Nov 24, 2021

Thanks a lot @unkcpz

@unkcpz unkcpz deleted the scheduler-OOW branch November 24, 2021 09:34
bastonero pushed a commit to bastonero/aiida-quantumespresso that referenced this pull request Dec 20, 2021
…iidateam#754)

Certain scheduler plugins can detect an out-of-walltime error in which
case the `ERROR_SCHEDULER_OUT_OF_WALLTIME` exit code will already have
been set on the node when the actual output parser is called. The
`PhParser` is updated to check for this exit code, and after having
parsed as much as possible from the output, the same exit code is kept
by not returning any other more specific exit code.

The `PhBaseWorkChain` adds a new handler for this exit code and will
perform a full restart by setting `recover = False`. It needs to be a
full restart because with an OOW error from the scheduler, the state of
the files on disk is almost certainly corrupt as the scheduler will
have killed the job when it was writing to disk.

Co-authored-by: Sebastiaan Huber <mail@sphuber.net>
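
The parser behaviour described in the commit message can be sketched as follows. This is an illustration under assumptions, not the `PhParser` code: exit codes are modelled as plain strings, `parse_output` is a hypothetical function, and the check for a `JOB DONE` marker and the `ERROR_OUTPUT_STDOUT_INCOMPLETE` fallback are stand-ins for whatever more specific exit code the parser would otherwise return.

```python
# Plain-string stand-ins for exit codes; not real AiiDA exit code objects.
ERROR_SCHEDULER_OUT_OF_WALLTIME = 'ERROR_SCHEDULER_OUT_OF_WALLTIME'
ERROR_OUTPUT_STDOUT_INCOMPLETE = 'ERROR_OUTPUT_STDOUT_INCOMPLETE'


def parse_output(stdout, exit_code_from_scheduler=None):
    """Parse what is available, then decide which exit code to report."""
    # Parse as much as possible, even if the job was killed midway.
    parsed = {'lines_read': len(stdout.splitlines())}

    # Exit code the parser would choose on its own (assumed completion marker).
    job_done = 'JOB DONE' in stdout
    parser_exit_code = None if job_done else ERROR_OUTPUT_STDOUT_INCOMPLETE

    # If the scheduler already flagged an out-of-walltime error, keep that code
    # instead of replacing it with a more specific parser exit code.
    if exit_code_from_scheduler == ERROR_SCHEDULER_OUT_OF_WALLTIME:
        return parsed, exit_code_from_scheduler
    return parsed, parser_exit_code
```

Keeping the scheduler's exit code is what lets a work chain handler react to the out-of-walltime case specifically, rather than to a generic incomplete-output error.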
Development

Successfully merging this pull request may close these issues.

PhBaseWorkChain: not recover from ran out of walltime
2 participants