Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PwBaseWorkChain: revisit restarting from calculations #691

Open
mbercx opened this issue May 19, 2021 · 9 comments
Open

PwBaseWorkChain: revisit restarting from calculations #691

mbercx opened this issue May 19, 2021 · 9 comments

Comments

@mbercx
Copy link
Member

mbercx commented May 19, 2021

Currently the PwBaseWorkChain keeps track of a parent calculation to restart from in the restart_calc variable in the context. For example, when the user provides a parent_folder input to restart the calculation from, its creator is assigned to the restart_calc context variable here:

if 'parent_folder' in self.ctx.inputs:
self.ctx.restart_calc = self.ctx.inputs.parent_folder.creator

This context variable is later used to set the CONTROL.restart_mode input tag of the pw.x calculation:

if self.ctx.restart_calc:
self.ctx.inputs.parameters['CONTROL']['restart_mode'] = 'restart'
self.ctx.inputs.parent_folder = self.ctx.restart_calc.outputs.remote_folder
else:
self.ctx.inputs.parameters['CONTROL']['restart_mode'] = 'from_scratch'
self.ctx.inputs.pop('parent_folder', None)

However, there is a problem in case the parent_folder has no creator, for example when the user has stashed some calculation outputs like the charge density from which she/he later wants to restart.

I think we should adapt the code so the creator is no longer necessary.

The restart_mode input tag

EDIT: I've edited the message below because I had made a mistake during my tests on this issue.

The explanation on the restart_mode is a little confusing, and gives the impression that it can only be used to restart from an interrupted run. I've done a bit of testing on a simple SCF calculation on Si to see what happens when you set this to 'restart', compared to using startingpot or startingwfc:

  • restart_mode = 'restart': You find the following lines in the output file:

       The initial density is read from file :
       ./out/aiida.save/charge-density
    
       Starting wfcs from file
    

    The calculation finishes in 1 SCF step, and the total CPU time is 6.09s.

  • startingpot = 'file': You find the following lines in the output file:

       The initial density is read from file :
       ./out/aiida.save/charge-density
    
       Starting wfcs are    8 randomized atomic wfcs
    

    The calculation finishes in 1 SCF step, and the total CPU time is 11.99s.

  • startingwfc = 'file': You find the following lines in the output file:

       Initial potential from superposition of free atoms
    
       starting charge    7.99888, renormalised to    8.00000
       Starting wfcs from file
    
       total cpu time spent up to now is        0.5 secs
    
       Self-consistent Calculation
    

    The calculation finished in 7 SCF steps, and the total CPU time is 17.99s.

  • startingpot = 'file' AND startingwfc = 'file': Same as restart_mode = 'restart'

So, it seems that:

  • setting restart_mode to 'restart' does restart both interrupted and completed runs properly. In this case both the wave functions and charge density are read from file, which gives a similar result as setting both startingpot and startingwfcto'file'`.

  • For startingpot = 'file', the calculation only requires 1 SCF step, but it takes longer because the initial wave function coefficients are random. Which is to be expected.

  • For startingwfc = 'file', it seems that the initial potential (or charge density) is still taken from the superposition of atomic charge densities. Because of this, the calculation still needs 7 SCF steps to converge and this restart method takes the longest. I suppose the question here is why the potential is not constructed from the wave functions before starting the SCF.

TL;DR

In the end, it seems that using restart_mode = 'restart' is indeed the way to go for restarting pw.x calculations for both interrupted and completed runs.

@mbercx
Copy link
Member Author

mbercx commented May 19, 2021

Thanks to @MackeEric for first drawing my attention to this! Also pinging @ramirezfranciscof and @qiaojunfeng in case they have comments.

@mbercx
Copy link
Member Author

mbercx commented May 19, 2021

So, I was a little rushed in running these tests this morning, and messed up the one for restart_mode = 'restart'. I've updated the OP to correct this.

@MackeEric: didn't you mention during one of our meetings that QE doesn't properly restart from the charge density / wave functions when restart_mode = 'restart'? Based my little test runs for Si, it seems QE v6.7 does properly restart also completed runs with this input by loading both the charge density and the wave functions. In fact, it seems almost identical to setting both startingpot = 'file' and startingwfc = 'file'.

@mbercx
Copy link
Member Author

mbercx commented May 19, 2021

A caveat to using restart_mode = 'restart' is in case e.g. the number of bands has been changed. Then trying to load the wave functions will fail:

     The initial density is read from file :
     ./out/aiida.save/charge-density


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     Error in routine pw_restart - read_collected_wfc (1):
     The number of bands for this run is    12, but only     8 bands were read from file
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In this case, it is interesting to use startingpot = 'file', since only 1 SCF will be needed. We should probably use this setting in the sanity_check_insufficient_bands error handler, which claims to restart at the moment:

@process_handler(exit_codes=ExitCode(0))
def sanity_check_insufficient_bands(self, calculation):
"""Perform a sanity check on the band occupations of a successfully converged calculation.
Verify that the occupation of the last band is below a certain threshold, unless `occupations` was explicitly
set to `fixed` in the input parameters. If this is violated, the calculation used too few bands and cannot be
trusted. The number of bands is increased and the calculation is restarted, starting from the last.
"""

But as far as I can see it doesn't 😅 , which is what is also reported:

self.report(f'Action taken: increased number of bands to {nbnd_new} and restarting from scratch')

So we probably should still distinguish between different restart modes to deal with cases like this.

@MackeEric
Copy link

@MackeEric: didn't you mention during one of our meetings that QE doesn't properly restart from the charge density / wave functions when restart_mode = 'restart'? Based my little test runs for Si, it seems QE v6.7 does properly restart also completed runs with this input by loading both the charge density and the wave functions. In fact, it seems almost identical to setting both startingpot = 'file' and startingwfc = 'file'.

So in my experience, I have seen a somehow inconsistent behaviour, at least with QE v6.6. In a test scenario I designed on my own, I also found that startingpot = 'file' and startingwfc = 'file' are basically equivalent to setting restart_mode = 'restart'.

However, I have frequently seen cases where QE crashed when I was restarting from a calulation where I had set occupations = 'smearing' and then used occupations = 'fixed' for the restarted run (even with nbnd and tot_magnetization set to the same values as outputted by the smearing run!). In principle, this should not cause any trouble but in reality it often does, yielding PW errors such as too many bands are not converged directly after the first iteration.

On the other hand, the very same restarts worked perfectly fine if I only used startingpot = 'file' while leaving restart_mode = 'restart'.

@mbercx
Copy link
Member Author

mbercx commented May 20, 2021

Thanks @MackeEric!

I also found that startingpot = 'file' and startingwfc = 'file' are basically equivalent to setting restart_mode = 'restart'.

One difference I realised while doing some more testing is that restart_mode = 'restart' also reads the structure from the previous calculation:

     Atomic positions and unit cell read from directory:
     ./out/aiida.save/
     Atomic positions from file used, from input discarded

So, when e.g. restarting from a previous relax calculation, using restart_mode = 'restart' just needs one SCF and ionic step, whereas setting startingpot = 'file' and startingwfc = 'file' will use of course use the initial structure.

However, I have frequently seen cases where QE crashed when I was restarting from a calulation where I had set occupations = 'smearing' and then used occupations = 'fixed' for the restarted run

Right, you weren't just restarting the same run, but adapting inputs. I can imagine using the wave functions from the 'smearing' run for a 'fixed' run causes troubles, since suddenly there will be states with partial occupations. I imagine similar errors will occur when using startingwfc = 'file' in this case. (Coindidentally, I did just try this for my simple Si SCF case and there I didn't run into any issues.)

So, using restart_mode = 'restart' is fine in case the run has completed, provided you don't change any calculation settings that cause trouble when loading the wave functions of the previous run.

@bastonero
Copy link
Collaborator

Hi everyone,

This days I did find @MackeEric very same problem ("too many bands are not converged"), while trying to "restart" a pw calculation in the hubbard workflow. When you want to start from a density of an other pw calculation, but not use the wfcs (as in, eg, the hubbard workflows), it is then necessary to be able to start the calculation using restart_mode='from_scratch' and startingpot='file', since the "smeared" wfcs could be too much different.

Could a solution be having a restart_folder and a parent_folder?
The former is specifically used for restarting, while the latter is used to just copy the out folder (where one may want also to copy only specific part of it as an option for particular purposes? ).

Anyway, I do believe that a solution is needed, especially when I think about the hubbard workflow.

@mbercx
Copy link
Member Author

mbercx commented Jun 2, 2021

Thanks for the comment @bastonero! I indeed think a possible path forward could be keeping track of a restart_folder instead of a restart_calc and updating the logic where needed to use startingpot = file instead of restart_mode = 'restart'. I just haven't looked into this in much detail, so I'm not sure if it leads to issues somewhere.

I have a meeting on Friday at 2pm with @timrov on discussing these restart issues. Maybe you also want to join?

@bastonero
Copy link
Collaborator

I have a meeting on Friday at 2pm with @timrov on discussing these restart issues. Maybe you also want to join?
Yes, I think it would be great, thanks!

@mbercx
Copy link
Member Author

mbercx commented Jun 4, 2021

Notes from meeting with @ramirezfranciscof, @timrov, @MackeEric and @bastonero:

  • We should consider only copying some files when restarting from a calculation. This should be fixable at the plugin level, and could help with reducing restart times when e.g. the wave functions are very big.
  • Question: When you try restart_mode = 'restart' and some of the outputs (e.g. wfc) are missing, does it crash the calculation?
  • Question: Is the indeed output structure is read from xml (is this the restart file the documentation mentions?) by default even if restart_mode = 'from_scratch'? See ion_positions input tag.
  • TODO: write document on all this for other QE users, and as a guideline for the discussion on making possible changes to QE (see below). Also check recent QE school for material on the topic.

Issues to raise on QE GitLab:

  • The charge density (potential) should be constructed from the wave functions in case startingwfc = 'file'.
  • Is there a problem with restarting with restart_mode = 'restart' outside of two interrupt cases suggested in the documentation? For example, if you restart from a calculation that has hit the maximum number of ionic iterations, is this ok?
  • Change the way restarting works in QE:
    1. restart_mode = 'restart' should not fail if something is missing. I.e. it should just check what outputs are there and use them to restart. (check question above)
    2. The individual settings for restarting (startingpot, startingwfc and ion_positions) should override the outputs that restart_mode uses, not the other way around.
    3. For consistency, it might actually be nice to rename ion_positions to e.g. startingion or something like that. Now it's not super clear that this is regarding restarting, and it would make it more in line with startingpot and startingwfc.

Pinging @giovannipizzi for comments, considering his considerable QE experience. 😁

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants