Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PwBaseWorkChain: fix bug in validate_resources validator #683

Merged
merged 1 commit into from
May 7, 2021

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented May 5, 2021

Fixes #682

The validate_resources outline step would check that the
metadata.options namespace included at least the input
resources.num_machines and max_wallclock_seconds as long as the
automatic parallelization feature was not enabled. This was added when
the workchain was first developed, because the scheduler resources were
not validated on the base CalcJob in aiida-core.

However, since aiida-core==1.0.0 the scheduler resources are validated
on the CalcJob base class. This means that a PwBaseWorkChain will
now not even be instantiated if the resources for the PwCalculation
are insufficient, so it is better to leave the validation up to the base
class.

What is worse, the logic in validate_resources is actually incorrect
as it assumed that the resources always need to include num_machines.
However, this is not the case as it is actually Scheduler
implementation specific, and for example for the SgeScheduler this
field is even prohibited and will raise when specified. This made that
the PwBaseWorkChain was fundamentally incompatible with those
schedulers before this fix.

The PhBaseWorkChain suffered from the same bug and has also been
fixed.

@sphuber sphuber requested a review from mbercx May 5, 2021 14:23
@mbercx
Copy link
Member

mbercx commented May 5, 2021

Thanks @sphuber, will review later, but just wanted to mention that I recognise the issue with the failing test:

self = frozendict({'q2r': {'code': <Code: Remote code 'test.quantumespresso.q2r' on localhost-test, pk: 170, uuid: 96e8b36c-9...ored)>, 'metadata': {'options': {'resources': {'num_machines': 1}, 'max_wallclock_seconds': 1800, 'withmpi': False}}}})
args = ('_initialised', True), kwargs = {}

    def _immutable(self, *args, **kwargs):
        r"""
        Function for not implemented method since the object is immutable
        """
    
>       raise AttributeError(f"'{self.__class__.__name__}' object is read-only")
E       AttributeError: 'AttributesFrozendict' object is read-only

This issue was also raised by @asle85 on slack. She noticed that her frozendict version had been upgraded to 2.0.2, while it was 1.2 before. Not sure what is causing the upgrade of this dependency, but reverting back to 1.2 fixed the problem for her. Will do a bit more digging.

@sphuber
Copy link
Contributor Author

sphuber commented May 5, 2021

The problem is that for Python 3.6, plumpy==0.15.1 gets installed as that is the last version to support that Python version. That plumpy version still uses the frozendict library to implement its FrozenDict class, which is mostly used as subclass for the AttributesFrozenDict which in turns is used to wrap the inputs of a process in before returning it in self.inputs. This is to make sure they cannot be mutated during the process running code, such as in a workchain, which would be bad. Anyway, in frozendict==1.2 the __setattr__ method was callable and this is actually called in plumpy here:

https://github.com/aiidateam/plumpy/blob/7be3b8efaef7350ee32aae4be966db32f95156ec/plumpy/utils.py#L110

However, in frozendict==2.0 they added this line:

https://github.com/Marco-Sulla/python-frozendict/blob/e2b59b190ee77c3d7bdb1f2cd772c0b531694a4c/frozendict/core.py#L160

So now that self._initialized = True call in the constructor of AttributesFrozenDict causes the exception. Really the only solution is to release a v0.15.2 for plumpy that restricts the version of frozendict<2.0. Otherwise all new installations with Python 3.6 will be broken meaning all versions aiida-core<1.6 will be broken. I am not sure if we want to have that. I will open an issue on plumpy.

@sphuber
Copy link
Contributor Author

sphuber commented May 5, 2021

All clear now capt'n! @mbercx

Copy link
Member

@mbercx mbercx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sphuber! Gave the PR a pass and left some comments.

Now that we are touching this piece of the code, I'm wondering if in this PR we should perhaps also fix the fact that in case the user sets max_seconds in the input parameters, this is silently ignored by both PwBaseWorkChain and PhBaseWorkChain, since:

def set_max_seconds(self, max_wallclock_seconds):
"""Set the `max_seconds` to a fraction of `max_wallclock_seconds` option to prevent out-of-walltime problems.
:param max_wallclock_seconds: the maximum wallclock time that will be set in the scheduler settings.
"""
max_seconds_factor = self.defaults.delta_factor_max_seconds
max_seconds = max_wallclock_seconds * max_seconds_factor
self.ctx.inputs.parameters['CONTROL']['max_seconds'] = max_seconds

Wouldn't it be better to only set max_seconds based on delta_factor_max_seconds in case the user hasn't specified max_seconds in the input parameters, i.e. only call the set_max_seconds method if 'max_seconds' not in self.ctx.inputs.parameters['CONTROL']?

aiida_quantumespresso/workflows/pw/base.py Outdated Show resolved Hide resolved
aiida_quantumespresso/workflows/ph/base.py Outdated Show resolved Hide resolved
aiida_quantumespresso/workflows/pw/base.py Show resolved Hide resolved
aiida_quantumespresso/workflows/pw/base.py Outdated Show resolved Hide resolved
aiida_quantumespresso/workflows/pw/base.py Show resolved Hide resolved
@mbercx
Copy link
Member

mbercx commented May 6, 2021

While browsing through the issues, I also found #528. Does this PR fix that one as well?

@sphuber
Copy link
Contributor Author

sphuber commented May 6, 2021

While browsing through the issues, I also found #528. Does this PR fix that one as well?

The original issue should have been fixed indirectly by newer aiida-core versions that include resource validation in the CalcJob spec. I think the direct problem is that the computer configured in the tutorial does not define a default number MPI procs and the script didn't specify explicitly, hence the error. The comment by the other user would actually be fixed by this PR and is the issue described in #682 so I think we can close both issues.

@sphuber sphuber force-pushed the fix/682/pw-base-resource-validation branch from b669ed2 to ec6b189 Compare May 6, 2021 20:37
@sphuber
Copy link
Contributor Author

sphuber commented May 6, 2021

@mbercx thanks for the review. Have addressed everything except the question of max_seconds always being overridden. Would be happy to add it here but what do we need additional validation? What if it is equal or larger than the max_wallclock_seconds? Do we just set whatever the user asks?

@sphuber sphuber requested a review from mbercx May 6, 2021 20:38
@mbercx
Copy link
Member

mbercx commented May 6, 2021

@mbercx thanks for the review. Have addressed everything except the question of max_seconds always being overridden. Would be happy to add it here but what do we need additional validation? What if it is equal or larger than the max_wallclock_seconds? Do we just set whatever the user asks?

You could add extra validation, but perhaps this isn't necessary. If the user specifies both max_wallclock_seconds and max_seconds and sets max_seconds > max_wallclock_seconds, isn't that sort of on them? 😅

It's strange that the max_wallclock_seconds isn't a required input though... What happens if you try to run without it?

@sphuber
Copy link
Contributor Author

sphuber commented May 6, 2021

You could add extra validation, but perhaps this isn't necessary.

Of course I could, but that was exactly the question: how do we want the validation to behave. Maybe we indeed just say that we only set max_seconds if not defined explicitly and if it is, then it is up to the user to make sure it makes sense. Setting it too low or too high should lead to the calculation prematurely exiting or the value even being ignored. Either way, not really our problem. Fine I will add the simplest version here then :)

It's strange that the max_wallclock_seconds isn't a required input though... What happens if you try to run without it?

It probably isn't required on the CalcJob because it would most likely be again scheduler dependent. I doubt that it would work for a standard SLURM submission, as that would almost certainly require a max walltime, but for the DirectScheduler I am not so sure. I actually think that for that scheduler it doesn't even work if you do specify it and the job runs forever.

@sphuber sphuber force-pushed the fix/682/pw-base-resource-validation branch from ec6b189 to 9b8144b Compare May 6, 2021 21:39
@sphuber
Copy link
Contributor Author

sphuber commented May 6, 2021

@mbercx All done Sire!

Copy link
Member

@mbercx mbercx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good thing at least one of us test runs your implementations. 😏

aiida_quantumespresso/workflows/ph/base.py Outdated Show resolved Hide resolved
aiida_quantumespresso/workflows/pw/base.py Outdated Show resolved Hide resolved
@sphuber
Copy link
Contributor Author

sphuber commented May 7, 2021

It's a good thing at least one of us test runs your implementations. smirk

That's the thing, you are so diligent at double-checking my shitty code, that it is making me even more sloppy. So really this is all your fault

@mbercx
Copy link
Member

mbercx commented May 7, 2021

That's the thing, you are so diligent at double-checking my shitty code, that it is making me even more sloppy. So really this is all your fault

Haha, that comment turned my remark around on me so fast it gave me whiplash. 😅

@sphuber sphuber force-pushed the fix/682/pw-base-resource-validation branch from 9b8144b to 0059c8c Compare May 7, 2021 07:32
@sphuber
Copy link
Contributor Author

sphuber commented May 7, 2021

Haha, that comment turned my remark around on me so fast it gave me whiplash. sweat_smile

She's all fixed up boss! With tests and all

Copy link
Member

@mbercx mbercx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do love me some extra testing, thanks @sphuber! Just some typos to fix and a question if we can't simplify the fixtures. Merge at your discretion!


def test_set_max_seconds(generate_workchain_ph):
"""Test that `max_seconds` get sets in the parameters based on `max_wallclock_seconds` unless already set."""
inputs = generate_workchain_ph(return_inputs=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not generate the inputs with the generate_inputs_ph fixture instead of adding the return_inputs input to the generate_workchain_ph fixture and then calling it twice, once to get the inputs and once to actually get the work chain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem described for the comment on pw is less pronounced for ph but I want to keep it consistent and the input definition for the PhBasWorkChain may become more complicated in the future


def test_set_max_seconds(generate_workchain_pw):
"""Test that `max_seconds` get sets in the parameters based on `max_wallclock_seconds` unless already set."""
inputs = generate_workchain_pw(return_inputs=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for the ph.x test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the generate_inputs_pw does not return all the minimally required inputs needed for the workchain. If you look at generate_workchain_pw it needs to add the inputs to a namespace, and pop an input. I don't want to have to add these lines in every test.



def test_set_max_seconds(generate_workchain_ph):
"""Test that `max_seconds` get sets in the parameters based on `max_wallclock_seconds` unless already set."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Test that `max_seconds` get sets in the parameters based on `max_wallclock_seconds` unless already set."""
"""Test that `max_seconds` gets set in the parameters based on `max_wallclock_seconds` unless already set."""



def test_set_max_seconds(generate_workchain_pw):
"""Test that `max_seconds` get sets in the parameters based on `max_wallclock_seconds` unless already set."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Test that `max_seconds` get sets in the parameters based on `max_wallclock_seconds` unless already set."""
"""Test that `max_seconds` gets set in the parameters based on `max_wallclock_seconds` unless already set."""

The `validate_resources` outline step would check that the
`metadata.options` namespace included at least the input
`resources.num_machines` and `max_wallclock_seconds` as long as the
automatic parallelization feature was not enabled. This was added when
the workchain was first developed, because the scheduler resources were
not validated on the base `CalcJob` in `aiida-core`.

However, since `aiida-core==1.0.0` the scheduler resources are validated
on the `CalcJob` base class. This means that a `PwBaseWorkChain` will
now not even be instantiated if the resources for the `PwCalculation`
are insufficient, so it is better to leave the validation up to the base
class.

What is worse, the logic in `validate_resources` is actually incorrect
as it assumed that the resources always need to include `num_machines`.
However, this is not the case as it is actually `Scheduler`
implementation specific, and for example for the `SgeScheduler` this
field is even prohibited and will raise when specified. This made that
the `PwBaseWorkChain` was fundamentally incompatible with those
schedulers before this fix.

The `PhBaseWorkChain` suffered from the same bug and has also been
fixed.
@sphuber sphuber force-pushed the fix/682/pw-base-resource-validation branch from 0059c8c to e1b6489 Compare May 7, 2021 15:16
@sphuber sphuber merged commit 043f469 into develop May 7, 2021
@sphuber sphuber deleted the fix/682/pw-base-resource-validation branch May 7, 2021 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PwBaseWorkChain validation of resources is incorrect
2 participants