👌 IMPROVE: Allow numpy arrays to be serialized on process checkpoints #4730
@sphuber do you know if we can remove the version constraint on
Codecov Report
@@ Coverage Diff @@
## develop #4730 +/- ##
===========================================
+ Coverage 80.11% 80.11% +0.01%
===========================================
Files 515 515
Lines 36666 36673 +7
===========================================
+ Hits 29372 29378 +6
- Misses 7294 7295 +1
Not sure what's up with those missing tests.
the aiida/engine/persistence.py:

    checkpoint = calculation.checkpoint

    if checkpoint is None:
        raise PersistenceError(f'Calculation<{calculation.pk}> does not have a saved checkpoint')

    try:
        bundle = serialize.deserialize(checkpoint)
    except Exception:
        raise PersistenceError(f'Failed to load the checkpoint for process<{pid}>: {traceback.format_exc()}')

    return bundle

aiida/engine/processes/process.py:

    def decode_input_args(self, encoded: str) -> Dict[str, Any]:  # pylint: disable=no-self-use
        """
        Decode saved input arguments as they came from the saved instance state Bundle

        :param encoded: encoded (serialized) inputs
        :return: The decoded input args
        """
        return serialize.deserialize(encoded)

The

    def deserialize(serialized):
        """Deserialize a yaml dump that represents a serialized data structure.

        .. note:: no need to use `yaml.safe_load` here because the `Loader` will ensure that loading is safe.

        :param serialized: a yaml serialized string representation
        :return: the deserialized data structure
        """
        return yaml.load(serialized, Loader=AiiDALoader)

Perhaps we should literally change the method name to
I'd be on board with that, @sphuber what's your opinion?
Would be fine by me. It was so far intended only for internal use and so changing it would be fine, although users may have been using the function as well. Maybe we keep the old
In principle yes, but experience has taught that there are always cases that we didn't think of. Would the performance suffer enormously if we just apply the strip to all nodes? Just always popping the key checkpoints? Then again, there could be a data plugin out there that decided to store an attribute called |
Meh, that's a lot of code to leave lying around. I'd be inclined to say if people are using a function this deep in the API then that's their fault 😜
Right, that was my reason for not doing it.
I think the question is this: Can a "reasonable" (i.e., not malicious) plugin create processes with a different We could also think of renaming the |
I've done the renaming to
If you guys agree that it can be safely dropped, that is fine by me. I have just become more wary of this since I used to be overly trigger happy with breaking things before v1.0. In this case I agree that it can be argued that it is fine to simply change.
I guess that technically we state that one should fully finish all active processes before upgrading (as well as make a full backup) but I doubt that many users actually do this, especially between minor or patch versions. Not performing a migration when changing the attribute name will break active processes as the code will look to deserialize the process from an attribute that has the wrong name. The migration would actually be really simple, we have already other examples of simply changing an attribute name, so the effort should be relatively limited, so it would definitely still be an option I think |
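To make the "simply changing an attribute name" idea concrete, here is a minimal sketch of what such a migration does to the stored attributes. It is not AiiDA's migration code: the function `rename_attribute`, the in-memory node dictionaries, and the target name `renamed_checkpoints` are all hypothetical, and a real migration would operate on the database rather than on a Python list.

```python
from typing import Any, Dict, Iterable


def rename_attribute(nodes: Iterable[Dict[str, Any]], old_key: str, new_key: str) -> int:
    """Rename ``old_key`` to ``new_key`` in each node's attributes.

    Hypothetical helper: returns the number of nodes that were changed.
    An existing value under ``new_key`` would be overwritten.
    """
    changed = 0
    for node in nodes:
        attributes = node.get('attributes', {})
        if old_key in attributes:
            attributes[new_key] = attributes.pop(old_key)
            changed += 1
    return changed


# Illustrative data only: one process node with a checkpoint, one data node without.
nodes = [
    {'node_type': 'process.calculation.', 'attributes': {'checkpoints': 'serialized-state'}},
    {'node_type': 'data.int.', 'attributes': {'value': 1}},
]
count = rename_attribute(nodes, 'checkpoints', 'renamed_checkpoints')
```

Without a step like this, a process that was still active during the upgrade would be looked up under the new name and its checkpoint would not be found.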
I've made a quick search in the plugins I've got lying around and didn't find a use of Overall I agree on being careful with deprecations; I guess we can just keep it around for one minor version? Agreed on the migration part -- so do you think it'd be better to rename and drop the key for all nodes? Are we using some framework to generate migrations, or should I just copy/paste an existing one? |
I agree to be diligent on deprecations (I've had to do enough in #4712 etc 😓) but on the other hand, we do clearly state our public API, and so should not be scared to change things that are not part of it.
Thinking more about the need of a migration of the
Great, it seems the renaming is not needed then. Should I still add the deprecation for one minor version, or anything else that needs changing?
Just one comment, and a question: shouldn't we also relax the requirement on `pyyaml`? The whole point was that it is currently pinned to `pyyaml~=5.1.2` and with this change I think we can start to use 5.2 and maybe upwards of that. We should likewise also fix this in plumpy because that currently has the same limitation. So maybe we should first fix this in plumpy, make a new release and then update the requirements for both plumpy and pyyaml in this PR before merging it.
Regarding the deprecations. Either we do it properly or we don't, I don't see the point of doing a half-assed deprecation. So I think it is fine to simply rename with deprecation as it currently does.
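For reference, a rename with a deprecated alias usually looks like the sketch below. This is an illustration under assumptions, not the code in this PR: `json.loads` stands in for the real YAML loading so the snippet is self-contained, and the built-in `DeprecationWarning` is used where a project would likely use its own warning class.

```python
import json
import warnings


def deserialize_unsafe(serialized: str):
    """New name. Stand-in body: the real function loads YAML with the AiiDALoader."""
    return json.loads(serialized)


def deserialize(serialized: str):
    """Deprecated alias kept so that existing callers keep working."""
    warnings.warn(
        '`deserialize` is deprecated, use `deserialize_unsafe` instead',
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    return deserialize_unsafe(serialized)


# Calling the old name still works, but emits a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = deserialize('{"a": 1}')
```

The cost is small, but it does leave an extra function to maintain until the alias is removed.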
    :param fields: the database fields for the entity
    """
    if fields.get('node_type', '').startswith('process.'):
        fields = copy.copy(fields)
Why is the copy needed here? And if it is needed, shouldn't we use `deepcopy` because we are manipulating a key inside a nested dictionary, so the `fields['attribute']` will still have a reference to the original object, wouldn't it?
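The aliasing concern can be checked with the standard library alone. The sketch below uses made-up field values; only the `copy` behaviour is the point.

```python
import copy

fields = {'node_type': 'process.calculation.', 'attributes': {'checkpoints': 'state', 'label': 'x'}}

# A shallow copy duplicates only the outer dict; the nested dict is shared.
shallow = copy.copy(fields)
shallow['attributes'].pop('checkpoints')
original_mutated = 'checkpoints' not in fields['attributes']

fields_again = {'node_type': 'process.calculation.', 'attributes': {'checkpoints': 'state', 'label': 'x'}}

# A deep copy duplicates the nested dict too, so the original is left alone.
deep = copy.deepcopy(fields_again)
deep['attributes'].pop('checkpoints')
original_intact = 'checkpoints' in fields_again['attributes']
```

Here `original_mutated` and `original_intact` both end up `True`: popping through the shallow copy changes the caller's data, popping through the deep copy does not.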
Why is the copy needed here?
TBH, I'm not sure why the copy is needed -- that is simply copy/pasted from the `_sanitize_extras` above. I don't have enough context on the whole import procedure to know if it's necessary.
the fields['attribute'] will still have a reference to the original object, wouldn't it
Right, good catch. Maybe a complete `deepcopy` is a bit overkill, but we could also `copy.copy` the `fields['attributes']`.
Since we're now reconstructing the `attributes` dict, this should be resolved - but feel free to double-check @sphuber.
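A minimal sketch of the "reconstruct the attributes dict" approach follows. The function name `strip_checkpoints` is hypothetical, and the key name `checkpoints` and the `process.` prefix are taken from the discussion above rather than from the merged code.

```python
from typing import Any, Dict


def strip_checkpoints(fields: Dict[str, Any]) -> Dict[str, Any]:
    """Return fields without the checkpoint attribute, leaving the input untouched.

    Hypothetical helper: only nodes whose type starts with 'process.' are altered.
    """
    if not fields.get('node_type', '').startswith('process.'):
        return fields

    # Build a new inner dict instead of popping from the shared one.
    attributes = {
        key: value
        for key, value in (fields.get('attributes') or {}).items()
        if key != 'checkpoints'
    }
    # New outer dict as well, so neither level aliases the caller's data.
    return {**fields, 'attributes': attributes}


process_fields = {'node_type': 'process.calculation.', 'attributes': {'checkpoints': 'state', 'label': 'x'}}
data_fields = {'node_type': 'data.int.', 'attributes': {'value': 1}}

sanitized = strip_checkpoints(process_fields)
untouched = strip_checkpoints(data_fields)
```

Because both levels are rebuilt, no `copy.copy` or `deepcopy` call is needed, and the values themselves are never duplicated.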
Even at the 5.1.2 version, this change now allows for loading e.g. numpy arrays. But yeah, I think the order you suggested makes sense. What should the new version specifier be?
I would suggest
AFAICT updating to a newer |
I see what you mean, technically the changes here are compatible with the latest release of All that being said, I guess we can decouple them. We could merge this and then adjust plumpy in a separate PR. I was just wondering if that needed to be in conjunction, but I think it is ok to do it decoupled. EDIT: it looks confirmed that |
The only reason I can think of why we'd want to move them together is if there's something in |
Added the blocked tag because this would need an update in PlumPy.
@ramirezfranciscof This is only "sort of" blocked by plumpy - we could merge it and fix #3709 without change in plumpy, but it would then probably be better to leave the |
@sphuber since you already reviewed this, can you give a final check + the green light for this PR? Seems to me that the consensus is that this can be merged (while we will still need to wait for fixes in plumpy for actually being able to upgrade pyyaml).
@greschd would you mind rebasing this on develop?
Finally, a question for @greschd (sorry if that was already posed): |
Hmm, good question. In principle I think this shouldn't be an issue because we are switching from the more restrictive |
To allow e.g. numpy arrays to be serialized to a process checkpoint, the `AiiDALoader` is based on `yaml.UnsafeLoader` instead of `yaml.FullLoader`. Since this could pose a security risk when sharing databases with maliciously crafted checkpoints, the checkpoints are removed upon importing an archive. Fixes aiidateam#3709.
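The difference between a restrictive and an unrestricted loader can be seen with a small PyYAML sketch. Two assumptions to note: `yaml.SafeLoader` is used as the restrictive side because `FullLoader`'s handling of Python object tags has differed between PyYAML releases, and an `OrderedDict` stands in for a numpy array so that only PyYAML is required.

```python
import collections

import yaml

# The default dumper writes an OrderedDict with a `!!python/object/apply` tag.
payload = yaml.dump(collections.OrderedDict(a=1))

# A restrictive loader refuses to construct arbitrary Python objects.
try:
    yaml.load(payload, Loader=yaml.SafeLoader)
    safe_rejected = False
except yaml.YAMLError:
    safe_rejected = True

# The unsafe loader calls the constructor named in the tag.
restored = yaml.load(payload, Loader=yaml.UnsafeLoader)
```

The same mechanism that rebuilds the object here can call any importable callable named in the payload, which is why checkpoints from an untrusted archive should not be loaded this way.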
done
Right, let's get this merged, cheers! I'll look at updating plumpy today.
`AiiDALoader` on `UnsafeLoader`, strip checkpoints on import.…#4730)
To allow objects such as numpy arrays to be serialized to a process checkpoint, the `AiiDALoader` now inherits from `yaml.UnsafeLoader` instead of `yaml.FullLoader`. Note, this change represents a potential security risk, whereby maliciously crafted code could be added to the serialized data and then loaded upon importing an archive. To mitigate this risk, the function `deserialize` has been renamed to `deserialize_unsafe`, and node checkpoint attributes are removed before importing an archive.
This code is not part of the public API, and so we assume no specific deprecations are required. This change has also allowed for a relaxation of the `pyyaml` pinning (to 5.2), although it should be noted that this upgrade will not be realised until a similar relaxation is implemented in plumpy.
Cherry-pick: 1bc9dbe
NOTE: This PR requires extra careful review, both because it is security relevant, and because I am not very familiar with the import code.
To allow e.g. numpy arrays to be serialized to a process checkpoint, the `AiiDALoader` is based on `yaml.UnsafeLoader` instead of `yaml.FullLoader`. Since this could pose a security risk when sharing databases with maliciously crafted checkpoints, the checkpoints are removed upon importing an archive.
Fixes #3709.
Questions to consider:
Is the `AiiDALoader` only used on checkpoints, or also somewhere else?
Does the `node_type` always start with `process.` for nodes which have checkpoints?