Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

👌 IMPROVE: Allow numpy arrays to be serialized on process checkpoints #4730

Merged
merged 6 commits into from
Jun 16, 2021

Conversation

greschd
Copy link
Member

@greschd greschd commented Feb 9, 2021

NOTE: This PR requires extra careful review, both because it is security relevant, and because I am not very familiar with the import code.

To allow e.g. numpy arrays to be serialized to a process checkpoint, the AiiDALoader is based on yaml.UnsafeLoader instead of yaml.FullLoader. Since this could pose a security risk when sharing databases with maliciously crafted checkpoints, the checkpoints are removed upon importing an archive.

Fixes #3709.

Questions to consider:

  • Is the AiiDALoader only used on checkpoints, or also somewhere else?
  • Does the node_type always start with process. for nodes which have checkpoints?
  • Anything else you can think of 😄

@greschd
Copy link
Member Author

greschd commented Feb 9, 2021

@sphuber do you know if we can remove the version constraint on pyyaml with this change?

@codecov
Copy link

codecov bot commented Feb 9, 2021

Codecov Report

Merging #4730 (c6ceb78) into develop (28ef3a5) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4730      +/-   ##
===========================================
+ Coverage    80.11%   80.11%   +0.01%     
===========================================
  Files          515      515              
  Lines        36666    36673       +7     
===========================================
+ Hits         29372    29378       +6     
- Misses        7294     7295       +1     
Flag Coverage Δ
django 74.56% <92.31%> (-0.02%) ⬇️
sqlalchemy 73.50% <92.31%> (-<0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/engine/persistence.py 82.09% <100.00%> (ø)
aiida/engine/processes/process.py 91.73% <100.00%> (ø)
aiida/manage/manager.py 81.47% <100.00%> (ø)
aiida/orm/utils/serialize.py 100.00% <100.00%> (ø)
...ida/tools/importexport/dbimport/backends/common.py 97.34% <100.00%> (+0.20%) ⬆️
...ida/tools/importexport/dbimport/backends/django.py 92.86% <100.00%> (+0.03%) ⬆️
aiida/tools/importexport/dbimport/backends/sqla.py 93.64% <100.00%> (+0.03%) ⬆️
aiida/transports/plugins/local.py 81.41% <0.00%> (-0.25%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 28ef3a5...c6ceb78. Read the comment docs.

@greschd
Copy link
Member Author

greschd commented Feb 9, 2021

Not sure what's up with those missing tests.

@chrisjsewell
Copy link
Member

Is the AiiDALoader only used on checkpoints, or also somewhere else?

the deserializer method and hence AiiDALoader is only used in two places:

aiida/engine/persistence.py:
  117  
  118          checkpoint = calculation.checkpoint
  119  
  120          if checkpoint is None:
  121              raise PersistenceError(f'Calculation<{calculation.pk}> does not have a saved checkpoint')
  122  
  123          try:
  124:             bundle = serialize.deserialize(checkpoint)
  125          except Exception:
  126              raise PersistenceError(f'Failed to load the checkpoint for process<{pid}>: {traceback.format_exc()}')
  127  
  128          return bundle
  129  

aiida/engine/processes/process.py:
  599      def decode_input_args(self, encoded: str) -> Dict[str, Any]:  # pylint: disable=no-self-use
  600          """
  601          Decode saved input arguments as they came from the saved instance state Bundle
  602  
  603          :param encoded: encoded (serialized) inputs
  604          :return: The decoded input args
  605          """
  606:         return serialize.deserialize(encoded)

The deserializer method docstring should be changed, given at the moment it claims to be safe 😬

def deserialize(serialized):
    """Deserialize a yaml dump that represents a serialized data structure.

    .. note:: no need to use `yaml.safe_load` here because the `Loader` will ensure that loading is safe.

    :param serialized: a yaml serialized string representation
    :return: the deserialized data structure
    """
    return yaml.load(serialized, Loader=AiiDALoader)

Perhaps we should literally change the method name to deserialize_unsafe?

@greschd
Copy link
Member Author

greschd commented Feb 9, 2021

Perhaps we should literally change the method name to deserialize_unsafe?

I'd be on board with that, @sphuber what's your opinion?

@sphuber
Copy link
Contributor

sphuber commented Feb 9, 2021

Perhaps we should literally change the method name to deserialize_unsafe?

I'd be on board with that, @sphuber what's your opinion?

Would be fine by me. It was so far intended only for internal use and so changing it would be fine, although users may have been using the function as well. Maybe we keep the old deserialize with the safe loader and deprecate it, with a warning saying it is unsafe?

Does the node_type always start with process. for nodes which have checkpoints?

In principle yes, but experience has taught that there are always cases that we didn't think of. Would the performance suffer enormously if we just apply the strip to all nodes? Just always popping the key checkpoints? Then again, there could be a data plugin out there that decided to store an attribute called checkpoint. Don't think this is actually the case, but still maybe best not risk it and better assume that the process nodes start with process..

@chrisjsewell
Copy link
Member

Maybe we keep the old deserialize with the safe loader and deprecate it, with a warning saying it is unsafe?

meh, thats a lot of code to leave lying around. I'd be inclined to say if people are using a function this deep in the API then that's their fault 😜

@greschd
Copy link
Member Author

greschd commented Feb 11, 2021

Then again, there could be a data plugin out there that decided to store an attribute called checkpoint.

Right, that was my reason for not doing it.

Don't think this is actually the case, but still maybe best not risk it and better assume that the process nodes start with process..

I think the question is this: Can a "reasonable" (i.e., not malicious) plugin create processes with a different node_type, that could then later be exploited?

We could also think of renaming the checkpoints attribute key, to something like _aiida_checkpoints, and then strip it from all nodes. Would that need a migration, or are checkpoints anyway useless between AiiDA versions?

@greschd
Copy link
Member Author

greschd commented Feb 11, 2021

I've done the renaming to deserialize_untrusted now. Indeed it's a bit of a pain to keep the old version around, because the yaml.add_constructor calls etc. need to be duplicated. Doable of course, but maybe not worth it.

@sphuber
Copy link
Contributor

sphuber commented Feb 11, 2021

I've done the renaming to deserialize_untrusted now. Indeed it's a bit of a pain to keep the old version around, because the yaml.add_constructor calls etc. need to be duplicated. Doable of course, but maybe not worth it.

If you guys agree that it can be safely dropped, that is fine by me. I have just become more wary of this since I used to be overly trigger happy with breaking things before v1.0. In this case I agree that it can be argued that it is fine to simply change.

@sphuber
Copy link
Contributor

sphuber commented Feb 11, 2021

We could also think of renaming the checkpoints attribute key, to something like _aiida_checkpoints, and then strip it from all nodes. Would that need a migration, or are checkpoints anyway useless between AiiDA versions?

I guess that technically we state that one should fully finish all active processes before upgrading (as well as make a full backup) but I doubt that many users actually do this, especially between minor or patch versions. Not performing a migration when changing the attribute name will break active processes as the code will look to deserialize the process from an attribute that has the wrong name. The migration would actually be really simple, we have already other examples of simply changing an attribute name, so the effort should be relatively limited, so it would definitely still be an option I think

@greschd
Copy link
Member Author

greschd commented Feb 11, 2021

I've made a quick search in the plugins I've got lying around and didn't find a use of deserialize; I think @ltalirz has a snazzy way of doing this for all plugins on Github? Of course that's also no guarantee, but at least a good indicator.

Overall I agree on being careful with deprecations; I guess we can just keep it around for one minor version?

Agreed on the migration part -- so do you think it'd be better to rename and drop the key for all nodes? Are we using some framework to generate migrations, or should I just copy/paste an existing one?

@ltalirz
Copy link
Member

ltalirz commented Feb 11, 2021

I've made a quick search in the plugins I've got lying around and didn't find a use of deserialize; I think @ltalirz has a snazzy way of doing this for all plugins on Github? Of course that's also no guarantee, but at least a good indicator.

I don't see any use of it

@chrisjsewell
Copy link
Member

I have just become more wary of this since I used to be overly trigger happy with breaking things before v1.0. In this case I agree that it can be argued that it is fine to simply change.

I agree to be diligent on deprecations (I've had to do enough in #4712 etc 😓) but on the other hand, we do clearly state our public API, and so should not be scared to change things that are not part of it.
Otherwise, we should start using _ prefixed functions/methods more, so that they are even more clearly highlighted as for internal use only.

@sphuber
Copy link
Contributor

sphuber commented Feb 11, 2021

Thinking about more about the need of a migration of the checkpoint attribute. The reason I didn't prefix it with _aiida is because it is an attribute of a ProcessNode and since those cannot be subclassed through plugins, all attributes are "internal" AiiDA ones. So I think it is fine to leave as is and just do as you proposed and only target nodes in the archives with a node type starting with process.. There is then a risk we miss nodes with the checkpoint attribute if somehow they have a faulty node_type, but then again, these won't be valid process nodes and won't be imported as such.

@greschd
Copy link
Member Author

greschd commented Feb 11, 2021

Great, it seems the re-naming is not needed then.

Should I still add the deprecation for one minor version, or anything else that needs changing?

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment, and a question: shouldn't we also not relax the requirement on pyyaml? The whole point was that it is currently pinned to pyyaml~=5.1.2 and with this change I think we can start to use 5.2 and maybe upwards of that. We should likewise also fix this in plumpy because that currently has the same limitation. So maybe we should first fix this in plumpy, make a new release and then update the requirements for both plumpy and pyyaml in this PR before merging it.

Regarding the deprecations. Either we do it properly or we don't, I don't see the point of doing a half-assed deprecation. So I think it is fine to simply rename with deprecation as it currently does.

:param fields: the database fields for the entity
"""
if fields.get('node_type', '').startswith('process.'):
fields = copy.copy(fields)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the copy needed here? And if it is needed, shouldn't we use deepcopy because we are manipulating a key inside a nested dictionary so the fields['attribute'] will still have a reference to the original object, wouldn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the copy needed here?

TBH, I'm not sure why the copy is needed -- that is simply copy/pasted from the _sanitize_extras above. I don't have enough context on the whole import procedure to know if it's necessary.

the fields['attribute'] will still have a reference to the original object, wouldn't it

Right, good catch. Maybe a complete deepcopy is a bit overkill, but we could also copy.copy the fields['attributes'].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're now reconstructing the attributes dict, this should be resolved - but feel free to double-check @sphuber.

@greschd
Copy link
Member Author

greschd commented Feb 12, 2021

Just one comment, and a question: shouldn't we also not relax the requirement on pyyaml? The whole point was that it is currently pinned to pyyaml~=5.1.2 and with this change I think we can start to use 5.2 and maybe upwards of that. We should likewise also fix this in plumpy because that currently has the same limitation. So maybe we should first fix this in plumpy, make a new release and then update the requirements for both plumpy and pyyaml in this PR before merging it.

Even at the 5.1.2 version, this change now allows for loading e.g. numpy arrays. But yeah, I think the order you suggested makes sense. What should the new version specifier be? ~=5.1, ~=5.2, ~=5.4?

@sphuber
Copy link
Contributor

sphuber commented Feb 12, 2021

What should the new version specifier be? ~=5.1, ~=5.2, ~=5.4?

I would suggest ~=5.0 as our standard policy is to support all minor versions of a major release. Unless there is a very specific reason why we cannot or should not allow 5.0 and 5.1 I would allow them.

@greschd
Copy link
Member Author

greschd commented Mar 17, 2021

AFAICT updating to a newer pyyaml version is a related but not absolutely necessary change.

@sphuber
Copy link
Contributor

sphuber commented Mar 17, 2021

I see what you mean, technically the changes here are compatible with the latest release of plumpy, however, you won't actually be able to install pyyaml~=5.2 since there will be a dependency conflict. And even if you were still able to run with it, I am pretty sure plumpy would except, at least the tests will. Maybe we won't hit that code path in AiiDA usage. I realize that upgrading pyyaml is not necessary for the changes in this PR, but that was at least part of the goal. Having it pinned at 5.1 is not very good since the package is so generic and probably used by many other parts in the environment.

All that being said, I guess we can decouple them. We could merge this and then adjust plumpy in a separate PR. I was just wondering if that needed to be in conjunction, but I think it is ok to do it decoupled.

EDIT: it looks confirmed that plumpy currently really is not compatible with pyyaml>=5.2: https://github.com/aiidateam/plumpy/runs/2131741837?check_suite_focus=true The tests actually seem to hang.

@greschd
Copy link
Member Author

greschd commented Mar 17, 2021

All that being said, I guess we can decouple them. We could merge this and then adjust plumpy in a separate PR. I was just wondering if that needed to be in conjunction, but I think it is ok to do it decoupled.

The only reason I can think of why we'd want to move them together is if there's something in aiida-core that isn't compatible with 5.2, that we haven't checked. So if we relax the requirement now, it could lead to broken installs once the 5.2 - compatible plumpy comes out.

@ramirezfranciscof ramirezfranciscof added the pr/blocked PR is blocked by another PR that should be merged first label Apr 7, 2021
@ramirezfranciscof
Copy link
Member

Added the blocked tag because this would need an update in PlumPy.

@greschd
Copy link
Member Author

greschd commented Apr 7, 2021

@ramirezfranciscof This is only "sort of" blocked by plumpy - we could merge it and fix #3709 without change in plumpy, but it would then probably be better to leave the pyyaml dependency at pyyaml~=5.1.2.

@ltalirz
Copy link
Member

ltalirz commented May 5, 2021

@sphuber since you already reviewed this, can you give a final check + the green light for this PR?

Seems to me that the consensus is that this can be merged (while we will still need to wait for fixes in plumpy for actually being able to upgrade pyyaml).

@ltalirz
Copy link
Member

ltalirz commented May 5, 2021

@greschd would you mind rebasing this on develop?

@ltalirz
Copy link
Member

ltalirz commented May 5, 2021

Finally, a question for @greschd (sorry if that was already posed):
Are there any precautions one needs to take with active processes here?
I.e. is it possible that there will be some serialized data lying around that I can no longer deserialize after the update?

@greschd
Copy link
Member Author

greschd commented May 5, 2021

Are there any precautions one needs to take with active processes here?
I.e. is it possible that there will be some serialized data lying around that I can no longer deserialize after the update?

Hmm, good question. In principle I think this shouldn't be an issue because we are switching from the more restrictive FullLoader to the more liberal UnsafeLoader. That means things that FullLoader will load should also load with UnsafeLoader.
That being said, I'm not sure if this reasoning is watertight, or there might be some edge cases where the two loaders otherwise behave differently.

Dominik Gresch and others added 5 commits May 6, 2021 09:13
To allow e.g. numpy arrays to be serialized to a process checkpoint,
the `AiiDALoader` is based on `yaml.UnsafeLoader` instead of
`yaml.FullLoader`. Since this could pose a security risk when sharing
databases with maliciously crafted checkpoints, the checkpoints are
removed upon importing an archive.

Fixes aiidateam#3709.
@greschd
Copy link
Member Author

greschd commented May 6, 2021

@greschd would you mind rebasing this on develop?

done

Copy link
Member

@chrisjsewell chrisjsewell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right lets get this merged cheers! and I'll look at updating plumpy today

@chrisjsewell chrisjsewell changed the title Base AiiDALoader on UnsafeLoader, strip checkpoints on import. 👌 IMPROVE: Allow numpy arrays to be serialized on process checkpoints Jun 16, 2021
@chrisjsewell chrisjsewell merged commit 1bc9dbe into aiidateam:develop Jun 16, 2021
@greschd greschd deleted the fix_yaml_load branch June 16, 2021 08:45
sphuber pushed a commit that referenced this pull request Aug 10, 2021
…#4730)

To allow objects such as numpy arrays to be serialized to a process checkpoint,
the `AiiDALoader` now inherits from `yaml.UnsafeLoader` instead of `yaml.FullLoader`.

Note, this change represents a potential security risk,
whereby maliciously crafted code could be added to the serialized data
and then loaded upon importing an archive.
To mitigate this risk, the function `deserialize` has been renamed to `deserialize_unsafe`,
and node checkpoint attributes are removed before importing an archive.
This code is not part of the public API,
and so we assume no specific deprecations are required.

This change has also allowed for a relaxation of the `pyaml` pinning (to 5.2),
although it should be noted that this upgrade will not be realised
until a similar relaxation is implemented in plumpy.

Cherry-pick: 1bc9dbe
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/blocked PR is blocked by another PR that should be merged first
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error when loading checkpoint
5 participants