👌 IMPROVE: Allow numpy arrays to be serialized on process checkpoints #4730

greschd · 2021-02-09T13:02:24Z

NOTE: This PR requires extra careful review, both because it is security relevant, and because I am not very familiar with the import code.

To allow e.g. numpy arrays to be serialized to a process checkpoint, the AiiDALoader is based on yaml.UnsafeLoader instead of yaml.FullLoader. Since this could pose a security risk when sharing databases with maliciously crafted checkpoints, the checkpoints are removed upon importing an archive.

Fixes #3709.

Questions to consider:

Is the AiiDALoader only used on checkpoints, or also somewhere else?
Does the node_type always start with process. for nodes which have checkpoints?
Anything else you can think of 😄

greschd · 2021-02-09T13:03:53Z

@sphuber do you know if we can remove the version constraint on pyyaml with this change?

aiida/tools/importexport/dbimport/backends/common.py

codecov · 2021-02-09T17:24:31Z

Codecov Report

Merging #4730 (c6ceb78) into develop (28ef3a5) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop    #4730      +/-   ##
===========================================
+ Coverage    80.11%   80.11%   +0.01%     
===========================================
  Files          515      515              
  Lines        36666    36673       +7     
===========================================
+ Hits         29372    29378       +6     
- Misses        7294     7295       +1

Flag	Coverage Δ
django	`74.56% <92.31%> (-0.02%)`	⬇️
sqlalchemy	`73.50% <92.31%> (-<0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
aiida/engine/persistence.py	`82.09% <100.00%> (ø)`
aiida/engine/processes/process.py	`91.73% <100.00%> (ø)`
aiida/manage/manager.py	`81.47% <100.00%> (ø)`
aiida/orm/utils/serialize.py	`100.00% <100.00%> (ø)`
...ida/tools/importexport/dbimport/backends/common.py	`97.34% <100.00%> (+0.20%)`	⬆️
...ida/tools/importexport/dbimport/backends/django.py	`92.86% <100.00%> (+0.03%)`	⬆️
aiida/tools/importexport/dbimport/backends/sqla.py	`93.64% <100.00%> (+0.03%)`	⬆️
aiida/transports/plugins/local.py	`81.41% <0.00%> (-0.25%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 28ef3a5...c6ceb78. Read the comment docs.

greschd · 2021-02-09T18:12:12Z

Not sure what's up with those missing tests.

chrisjsewell · 2021-02-09T18:20:19Z

Is the AiiDALoader only used on checkpoints, or also somewhere else?

the deserializer method and hence AiiDALoader is only used in two places:

aiida/engine/persistence.py:
  117  
  118          checkpoint = calculation.checkpoint
  119  
  120          if checkpoint is None:
  121              raise PersistenceError(f'Calculation<{calculation.pk}> does not have a saved checkpoint')
  122  
  123          try:
  124:             bundle = serialize.deserialize(checkpoint)
  125          except Exception:
  126              raise PersistenceError(f'Failed to load the checkpoint for process<{pid}>: {traceback.format_exc()}')
  127  
  128          return bundle
  129  

aiida/engine/processes/process.py:
  599      def decode_input_args(self, encoded: str) -> Dict[str, Any]:  # pylint: disable=no-self-use
  600          """
  601          Decode saved input arguments as they came from the saved instance state Bundle
  602  
  603          :param encoded: encoded (serialized) inputs
  604          :return: The decoded input args
  605          """
  606:         return serialize.deserialize(encoded)

The deserializer method docstring should be changed, given at the moment it claims to be safe 😬

def deserialize(serialized):
    """Deserialize a yaml dump that represents a serialized data structure.

    .. note:: no need to use `yaml.safe_load` here because the `Loader` will ensure that loading is safe.

    :param serialized: a yaml serialized string representation
    :return: the deserialized data structure
    """
    return yaml.load(serialized, Loader=AiiDALoader)

Perhaps we should literally change the method name to deserialize_unsafe?

greschd · 2021-02-09T19:01:49Z

Perhaps we should literally change the method name to deserialize_unsafe?

I'd be on board with that, @sphuber what's your opinion?

sphuber · 2021-02-09T21:53:55Z

Perhaps we should literally change the method name to deserialize_unsafe?

I'd be on board with that, @sphuber what's your opinion?

Would be fine by me. It was so far intended only for internal use and so changing it would be fine, although users may have been using the function as well. Maybe we keep the old deserialize with the safe loader and deprecate it, with a warning saying it is unsafe?

Does the node_type always start with process. for nodes which have checkpoints?

In principle yes, but experience has taught that there are always cases that we didn't think of. Would the performance suffer enormously if we just apply the strip to all nodes? Just always popping the key checkpoints? Then again, there could be a data plugin out there that decided to store an attribute called checkpoint. Don't think this is actually the case, but still maybe best not risk it and better assume that the process nodes start with process..

chrisjsewell · 2021-02-10T12:28:25Z

Maybe we keep the old deserialize with the safe loader and deprecate it, with a warning saying it is unsafe?

meh, thats a lot of code to leave lying around. I'd be inclined to say if people are using a function this deep in the API then that's their fault 😜

greschd · 2021-02-11T12:02:10Z

Then again, there could be a data plugin out there that decided to store an attribute called checkpoint.

Right, that was my reason for not doing it.

Don't think this is actually the case, but still maybe best not risk it and better assume that the process nodes start with process..

I think the question is this: Can a "reasonable" (i.e., not malicious) plugin create processes with a different node_type, that could then later be exploited?

We could also think of renaming the checkpoints attribute key, to something like _aiida_checkpoints, and then strip it from all nodes. Would that need a migration, or are checkpoints anyway useless between AiiDA versions?

greschd · 2021-02-11T12:13:15Z

I've done the renaming to deserialize_untrusted now. Indeed it's a bit of a pain to keep the old version around, because the yaml.add_constructor calls etc. need to be duplicated. Doable of course, but maybe not worth it.

sphuber · 2021-02-11T13:45:37Z

I've done the renaming to deserialize_untrusted now. Indeed it's a bit of a pain to keep the old version around, because the yaml.add_constructor calls etc. need to be duplicated. Doable of course, but maybe not worth it.

If you guys agree that it can be safely dropped, that is fine by me. I have just become more wary of this since I used to be overly trigger happy with breaking things before v1.0. In this case I agree that it can be argued that it is fine to simply change.

sphuber · 2021-02-11T13:48:02Z

We could also think of renaming the checkpoints attribute key, to something like _aiida_checkpoints, and then strip it from all nodes. Would that need a migration, or are checkpoints anyway useless between AiiDA versions?

I guess that technically we state that one should fully finish all active processes before upgrading (as well as make a full backup) but I doubt that many users actually do this, especially between minor or patch versions. Not performing a migration when changing the attribute name will break active processes as the code will look to deserialize the process from an attribute that has the wrong name. The migration would actually be really simple, we have already other examples of simply changing an attribute name, so the effort should be relatively limited, so it would definitely still be an option I think

greschd · 2021-02-11T13:58:46Z

I've made a quick search in the plugins I've got lying around and didn't find a use of deserialize; I think @ltalirz has a snazzy way of doing this for all plugins on Github? Of course that's also no guarantee, but at least a good indicator.

Overall I agree on being careful with deprecations; I guess we can just keep it around for one minor version?

Agreed on the migration part -- so do you think it'd be better to rename and drop the key for all nodes? Are we using some framework to generate migrations, or should I just copy/paste an existing one?

ltalirz · 2021-02-11T15:01:15Z

I've made a quick search in the plugins I've got lying around and didn't find a use of deserialize; I think @ltalirz has a snazzy way of doing this for all plugins on Github? Of course that's also no guarantee, but at least a good indicator.

I don't see any use of it

chrisjsewell · 2021-02-11T15:28:20Z

I have just become more wary of this since I used to be overly trigger happy with breaking things before v1.0. In this case I agree that it can be argued that it is fine to simply change.

I agree to be diligent on deprecations (I've had to do enough in #4712 etc 😓) but on the other hand, we do clearly state our public API, and so should not be scared to change things that are not part of it.
Otherwise, we should start using _ prefixed functions/methods more, so that they are even more clearly highlighted as for internal use only.

sphuber · 2021-02-11T16:57:24Z

Thinking about more about the need of a migration of the checkpoint attribute. The reason I didn't prefix it with _aiida is because it is an attribute of a ProcessNode and since those cannot be subclassed through plugins, all attributes are "internal" AiiDA ones. So I think it is fine to leave as is and just do as you proposed and only target nodes in the archives with a node type starting with process.. There is then a risk we miss nodes with the checkpoint attribute if somehow they have a faulty node_type, but then again, these won't be valid process nodes and won't be imported as such.

greschd · 2021-02-11T19:09:19Z

Great, it seems the re-naming is not needed then.

Should I still add the deprecation for one minor version, or anything else that needs changing?

sphuber

Just one comment, and a question: shouldn't we also not relax the requirement on pyyaml? The whole point was that it is currently pinned to pyyaml~=5.1.2 and with this change I think we can start to use 5.2 and maybe upwards of that. We should likewise also fix this in plumpy because that currently has the same limitation. So maybe we should first fix this in plumpy, make a new release and then update the requirements for both plumpy and pyyaml in this PR before merging it.

Regarding the deprecations. Either we do it properly or we don't, I don't see the point of doing a half-assed deprecation. So I think it is fine to simply rename with deprecation as it currently does.

sphuber · 2021-02-12T08:30:21Z

aiida/tools/importexport/dbimport/backends/common.py

+    :param fields: the database fields for the entity
+    """
+    if fields.get('node_type', '').startswith('process.'):
+        fields = copy.copy(fields)


Why is the copy needed here? And if it is needed, shouldn't we use deepcopy because we are manipulating a key inside a nested dictionary so the fields['attribute'] will still have a reference to the original object, wouldn't it?

Why is the copy needed here?

TBH, I'm not sure why the copy is needed -- that is simply copy/pasted from the _sanitize_extras above. I don't have enough context on the whole import procedure to know if it's necessary.

the fields['attribute'] will still have a reference to the original object, wouldn't it

Right, good catch. Maybe a complete deepcopy is a bit overkill, but we could also copy.copy the fields['attributes'].

Since we're now reconstructing the attributes dict, this should be resolved - but feel free to double-check @sphuber.

greschd · 2021-02-12T11:01:42Z

Just one comment, and a question: shouldn't we also not relax the requirement on pyyaml? The whole point was that it is currently pinned to pyyaml~=5.1.2 and with this change I think we can start to use 5.2 and maybe upwards of that. We should likewise also fix this in plumpy because that currently has the same limitation. So maybe we should first fix this in plumpy, make a new release and then update the requirements for both plumpy and pyyaml in this PR before merging it.

Even at the 5.1.2 version, this change now allows for loading e.g. numpy arrays. But yeah, I think the order you suggested makes sense. What should the new version specifier be? ~=5.1, ~=5.2, ~=5.4?

sphuber · 2021-02-12T11:05:07Z

What should the new version specifier be? ~=5.1, ~=5.2, ~=5.4?

I would suggest ~=5.0 as our standard policy is to support all minor versions of a major release. Unless there is a very specific reason why we cannot or should not allow 5.0 and 5.1 I would allow them.

greschd · 2021-03-17T11:22:50Z

AFAICT updating to a newer pyyaml version is a related but not absolutely necessary change.

sphuber · 2021-03-17T14:54:27Z

I see what you mean, technically the changes here are compatible with the latest release of plumpy, however, you won't actually be able to install pyyaml~=5.2 since there will be a dependency conflict. And even if you were still able to run with it, I am pretty sure plumpy would except, at least the tests will. Maybe we won't hit that code path in AiiDA usage. I realize that upgrading pyyaml is not necessary for the changes in this PR, but that was at least part of the goal. Having it pinned at 5.1 is not very good since the package is so generic and probably used by many other parts in the environment.

All that being said, I guess we can decouple them. We could merge this and then adjust plumpy in a separate PR. I was just wondering if that needed to be in conjunction, but I think it is ok to do it decoupled.

EDIT: it looks confirmed that plumpy currently really is not compatible with pyyaml>=5.2: https://github.com/aiidateam/plumpy/runs/2131741837?check_suite_focus=true The tests actually seem to hang.

greschd · 2021-03-17T16:13:36Z

All that being said, I guess we can decouple them. We could merge this and then adjust plumpy in a separate PR. I was just wondering if that needed to be in conjunction, but I think it is ok to do it decoupled.

The only reason I can think of why we'd want to move them together is if there's something in aiida-core that isn't compatible with 5.2, that we haven't checked. So if we relax the requirement now, it could lead to broken installs once the 5.2 - compatible plumpy comes out.

ramirezfranciscof · 2021-04-07T09:21:22Z

Added the blocked tag because this would need an update in PlumPy.

greschd · 2021-04-07T09:34:48Z

@ramirezfranciscof This is only "sort of" blocked by plumpy - we could merge it and fix #3709 without change in plumpy, but it would then probably be better to leave the pyyaml dependency at pyyaml~=5.1.2.

ltalirz · 2021-05-05T16:04:32Z

@sphuber since you already reviewed this, can you give a final check + the green light for this PR?

Seems to me that the consensus is that this can be merged (while we will still need to wait for fixes in plumpy for actually being able to upgrade pyyaml).

ltalirz · 2021-05-05T16:05:04Z

@greschd would you mind rebasing this on develop?

ltalirz · 2021-05-05T16:07:19Z

Finally, a question for @greschd (sorry if that was already posed):
Are there any precautions one needs to take with active processes here?
I.e. is it possible that there will be some serialized data lying around that I can no longer deserialize after the update?

greschd · 2021-05-05T22:34:25Z

Are there any precautions one needs to take with active processes here?
I.e. is it possible that there will be some serialized data lying around that I can no longer deserialize after the update?

Hmm, good question. In principle I think this shouldn't be an issue because we are switching from the more restrictive FullLoader to the more liberal UnsafeLoader. That means things that FullLoader will load should also load with UnsafeLoader.
That being said, I'm not sure if this reasoning is watertight, or there might be some edge cases where the two loaders otherwise behave differently.

To allow e.g. numpy arrays to be serialized to a process checkpoint, the `AiiDALoader` is based on `yaml.UnsafeLoader` instead of `yaml.FullLoader`. Since this could pose a security risk when sharing databases with maliciously crafted checkpoints, the checkpoints are removed upon importing an archive. Fixes aiidateam#3709.

greschd · 2021-05-06T09:28:26Z

@greschd would you mind rebasing this on develop?

done

chrisjsewell

right lets get this merged cheers! and I'll look at updating plumpy today

…#4730) To allow objects such as numpy arrays to be serialized to a process checkpoint, the `AiiDALoader` now inherits from `yaml.UnsafeLoader` instead of `yaml.FullLoader`. Note, this change represents a potential security risk, whereby maliciously crafted code could be added to the serialized data and then loaded upon importing an archive. To mitigate this risk, the function `deserialize` has been renamed to `deserialize_unsafe`, and node checkpoint attributes are removed before importing an archive. This code is not part of the public API, and so we assume no specific deprecations are required. This change has also allowed for a relaxation of the `pyaml` pinning (to 5.2), although it should be noted that this upgrade will not be realised until a similar relaxation is implemented in plumpy. Cherry-pick: 1bc9dbe

greschd force-pushed the fix_yaml_load branch from ece8c86 to 55d786f Compare February 9, 2021 13:02

greschd requested review from ltalirz and sphuber February 9, 2021 13:04

greschd force-pushed the fix_yaml_load branch from 55d786f to d9d8d99 Compare February 9, 2021 15:25

chrisjsewell suggested changes Feb 9, 2021

View reviewed changes

aiida/tools/importexport/dbimport/backends/common.py Outdated Show resolved Hide resolved

greschd force-pushed the fix_yaml_load branch from d9d8d99 to 9d8ba5e Compare February 9, 2021 16:43

greschd requested a review from chrisjsewell February 9, 2021 16:47

greschd force-pushed the fix_yaml_load branch from 9d8ba5e to b4fbaea Compare February 9, 2021 17:03

greschd force-pushed the fix_yaml_load branch from b4fbaea to 8937a5e Compare February 11, 2021 11:54

greschd force-pushed the fix_yaml_load branch from 8937a5e to ed0437d Compare February 11, 2021 12:03

sphuber reviewed Feb 12, 2021

View reviewed changes

greschd force-pushed the fix_yaml_load branch from db4dc63 to 2e9f98e Compare February 12, 2021 11:07

sphuber mentioned this pull request Mar 28, 2021

build(deps): bump pyyaml from 5.1.2 to 5.4 in /requirements #4833

Closed

ltalirz mentioned this pull request Mar 29, 2021

⬆️ UPDATE: psgu v0.2 #4834

Merged

ramirezfranciscof added the pr/blocked PR is blocked by another PR that should be merged first label Apr 7, 2021

greschd mentioned this pull request May 5, 2021

Error when loading checkpoint #3709

Closed

greschd force-pushed the fix_yaml_load branch from 57cafe9 to b3c480a Compare May 5, 2021 22:26

Dominik Gresch and others added 5 commits May 6, 2021 09:13

Rename the deserialize function to deserialize_unsafe.

60297b3

Fix copy behavior in _strip_checkpoints.

9b050bd

Relax requirement on pyyaml version.

b52ff00

Remove 'silent' keyword in call to 'export'.

be54520

greschd force-pushed the fix_yaml_load branch from 7ebdec8 to be54520 Compare May 6, 2021 07:13

Merge branch 'develop' into fix_yaml_load

c6ceb78

chrisjsewell approved these changes Jun 16, 2021

View reviewed changes

chrisjsewell changed the title ~~Base AiiDALoader on UnsafeLoader, strip checkpoints on import.~~ 👌 IMPROVE: Allow numpy arrays to be serialized on process checkpoints Jun 16, 2021

chrisjsewell merged commit 1bc9dbe into aiidateam:develop Jun 16, 2021

greschd deleted the fix_yaml_load branch June 16, 2021 08:45

ltalirz mentioned this pull request Aug 4, 2021

update pyyaml dependency aiidateam/plumpy#220

Closed

sphuber mentioned this pull request Sep 10, 2021

PDOSWorkChain excepts due to daemon restart aiidateam/aiida-quantumespresso#724

Closed

chrisjsewell mentioned this pull request Mar 12, 2022

🔀 MERGE: Release v2.0.0b1 #5426

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

👌 IMPROVE: Allow numpy arrays to be serialized on process checkpoints #4730

👌 IMPROVE: Allow numpy arrays to be serialized on process checkpoints #4730

greschd commented Feb 9, 2021

greschd commented Feb 9, 2021

codecov bot commented Feb 9, 2021 •

edited

Loading

greschd commented Feb 9, 2021

chrisjsewell commented Feb 9, 2021

greschd commented Feb 9, 2021

sphuber commented Feb 9, 2021

chrisjsewell commented Feb 10, 2021

greschd commented Feb 11, 2021

greschd commented Feb 11, 2021

sphuber commented Feb 11, 2021

sphuber commented Feb 11, 2021

greschd commented Feb 11, 2021 •

edited

Loading

ltalirz commented Feb 11, 2021 •

edited

Loading

chrisjsewell commented Feb 11, 2021

sphuber commented Feb 11, 2021

greschd commented Feb 11, 2021

sphuber left a comment

sphuber Feb 12, 2021

greschd Feb 12, 2021

greschd Mar 5, 2021

greschd commented Feb 12, 2021

sphuber commented Feb 12, 2021

greschd commented Mar 17, 2021

sphuber commented Mar 17, 2021 •

edited

Loading

greschd commented Mar 17, 2021 •

edited

Loading

ramirezfranciscof commented Apr 7, 2021

greschd commented Apr 7, 2021

ltalirz commented May 5, 2021

ltalirz commented May 5, 2021

ltalirz commented May 5, 2021 •

edited

Loading

greschd commented May 5, 2021 •

edited

Loading

greschd commented May 6, 2021

chrisjsewell left a comment

👌 IMPROVE: Allow numpy arrays to be serialized on process checkpoints #4730

👌 IMPROVE: Allow numpy arrays to be serialized on process checkpoints #4730

Conversation

greschd commented Feb 9, 2021

greschd commented Feb 9, 2021

codecov bot commented Feb 9, 2021 • edited Loading

Codecov Report

greschd commented Feb 9, 2021

chrisjsewell commented Feb 9, 2021

greschd commented Feb 9, 2021

sphuber commented Feb 9, 2021

chrisjsewell commented Feb 10, 2021

greschd commented Feb 11, 2021

greschd commented Feb 11, 2021

sphuber commented Feb 11, 2021

sphuber commented Feb 11, 2021

greschd commented Feb 11, 2021 • edited Loading

ltalirz commented Feb 11, 2021 • edited Loading

chrisjsewell commented Feb 11, 2021

sphuber commented Feb 11, 2021

greschd commented Feb 11, 2021

sphuber left a comment

Choose a reason for hiding this comment

sphuber Feb 12, 2021

Choose a reason for hiding this comment

greschd Feb 12, 2021

Choose a reason for hiding this comment

greschd Mar 5, 2021

Choose a reason for hiding this comment

greschd commented Feb 12, 2021

sphuber commented Feb 12, 2021

greschd commented Mar 17, 2021

sphuber commented Mar 17, 2021 • edited Loading

greschd commented Mar 17, 2021 • edited Loading

ramirezfranciscof commented Apr 7, 2021

greschd commented Apr 7, 2021

ltalirz commented May 5, 2021

ltalirz commented May 5, 2021

ltalirz commented May 5, 2021 • edited Loading

greschd commented May 5, 2021 • edited Loading

greschd commented May 6, 2021

chrisjsewell left a comment

Choose a reason for hiding this comment

codecov bot commented Feb 9, 2021 •

edited

Loading

greschd commented Feb 11, 2021 •

edited

Loading

ltalirz commented Feb 11, 2021 •

edited

Loading

sphuber commented Mar 17, 2021 •

edited

Loading

greschd commented Mar 17, 2021 •

edited

Loading

ltalirz commented May 5, 2021 •

edited

Loading

greschd commented May 5, 2021 •

edited

Loading