Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node: add the is_valid_cache setter property #5207

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Oct 29, 2021

Fixes #5189

This property allows to mark an individual node as invalid for use as a
caching source. It works by setting the _aiida_valid_cache extra to
either True or False.

Previously, the unofficial way of ensuring a node was no longer used as
a cache source, was to remove the _aiida_hash extra, containing the
hash of the node. This would achieve the desired effect, since without a
hash, the node would never be found when looking for identical nodes to
cache from. However, this not an ideal approach because the hash may be
useful for other purposes. In addition, if the node were to be rehashed
at some point, which could happen by accident if all nodes of a
particular class got rehashed, it would all of a sudden be used in
caching again.

The solution is to have a way to persistently mark node instances as
invalid for caching. Note that the is_valid_cache property of the
Node before was intended as a hook to allow changing the caching
behavior of subclasses of Node as a whole, and not on the instance
level. This is now changed as the property has a setter that operates on
the instance level. Subclasses should take care to respect this base
class implementation if it makes sense. Since Node can only be
subclassed by aiida-core and not in plugins, we can easily enforce
this system.

@sphuber
Copy link
Contributor Author

sphuber commented Oct 29, 2021

@ramirezfranciscof @unkcpz @chrisjsewell @greschd your comments please :) This is a very rough first sketch of what I think we need. Once we agree on the mechanism, I can work on more tests, documentation etc.

@codecov
Copy link

codecov bot commented Oct 29, 2021

Codecov Report

Merging #5207 (6ff7897) into develop (5f259bd) will increase coverage by 0.01%.
The diff coverage is 87.50%.

❗ Current head 6ff7897 differs from pull request most recent head f8a6753. Consider uploading reports for the commit f8a6753 to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #5207      +/-   ##
===========================================
+ Coverage    81.22%   81.22%   +0.01%     
===========================================
  Files          532      532              
  Lines        37337    37344       +7     
===========================================
+ Hits         30323    30329       +6     
- Misses        7014     7015       +1     
Flag Coverage Δ
django 76.08% <87.50%> (+0.01%) ⬆️
sqlalchemy 75.17% <87.50%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/common/hashing.py 94.02% <ø> (-0.05%) ⬇️
aiida/orm/nodes/process/process.py 85.72% <66.67%> (-0.29%) ⬇️
aiida/orm/nodes/node.py 96.41% <92.31%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f259bd...f8a6753. Read the comment docs.

Copy link
Member

@unkcpz unkcpz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sphuber , I have no objection to this. Just one minor request of seting it as property object.


return (node for node in builder.all(flat=True) if node.is_valid_cache)

def set_is_valid_cache(self, valid: bool) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that possible to have it like this, make it a property?

Suggested change
def set_is_valid_cache(self, valid: bool) -> None:
@is_valid_cache.setter
def is_valid_cache(self, valid: bool) -> None:

Copy link
Member

@greschd greschd Oct 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that, would make it easier for Node subclasses to override the behavior. Maybe we should document that as an "official" way to influence caching?
Of course, they would mostly implement the getter, not the setter.

That is then also where the RemoteData implementation should likely go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It certainly is possible and probably the right thing to do. Only reason I already didn't do it was because originally the method was called something completely different. Then I saw that is_valid_cache exists and that is essentially the getter of what I was implementing, however, that was currently being documented as a property to override by classes (not instances). So then it felt a bit weird to make the setter really only operate on instances.

But then again, that is what the property is now doing: it is now an instance level concept, and so your suggestion makes sense. Will change it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should document that as an "official" way to influence caching?

If people agree that this is indeed the way to go to officially control caching on a per-node basis, then I would definitely add this to the docs. I first wanted to get a sense if people agree before writing this though. So I take @greschd is on board. Then I will just wait a bit longer for the others to have the chance to respond.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@chrisjsewell
Copy link
Member

Given only ProcessNode are cachable, does it make sense to have this on the base Node class? (I'd rather not add even more unnecessary attributes to data classes)
Or, if a data node is set as not valid, should that mean any process nodes that have it as incoming should also be invalid for cacheing?

@sphuber
Copy link
Contributor Author

sphuber commented Nov 1, 2021

Given only ProcessNode are cachable, does it make sense to have this on the base Node class? (I'd rather not add even more unnecessary attributes to data classes)

Not sure. I see your objection to the node namespace being busy, but the entire caching logic is currently implemented on the node level and the fact that Data is currently not valid for caching, is an implementation detail. This could have been different and so I think these methods belong on the base class. On top of that, the is_valid_cache property already exists and is defined on the Node class. I cannot really move this now and only having the setter on the subclass would be really weird.

Or, if a data node is set as not valid, should that mean any process nodes that have it as incoming should also be invalid for cacheing?

That is not really meant with Data not being cachable. What we mean there is that when a Data node is stored, we do not look for nodes with the same hash in the database to take that instead. In our model and definition of caching that concept doesn't make any sense. For CalcJobNodes it makes sense because when looking for comparable nodes, you can take the outputs without having to compute them, but this doesn't apply to a data node, where you are just simply storing a node. It would be a waste of query time to look for comparable nodes and then do nothing with it.

@sphuber sphuber force-pushed the fix/5189/node-invalidate-for-cache branch from 37994b5 to 5c33bfc Compare November 2, 2021 21:32
@sphuber sphuber changed the title Node: add the set_is_valid_cache method Node: add the is_valid_cache setter property Nov 3, 2021
@sphuber sphuber requested review from greschd and unkcpz November 3, 2021 10:04
Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just a minor detail that might affect performance, but I agree with the concept.

I also agree that this should be the documented way to enable/disable caching; and the docs should also suggest how to extend the behaviour in subclasses (and when it's recommended to do it, or not to do it). I guess calling the super, and then using and if one wants to possibly disable caching e.g. in remote data? Or is there a better way?

Two suggestions to be opened as issues, if you agree:

  • have a verdi data remote check-valid-cache NODE1 NODE2 ... that (opening a single connection) checks all nodes in some way - e.g. if they are empty - and sets the valid cache to False. (However, I'm thinking if this is something general enough to be done in AiiDA... maybe we should move it to aiida-quantumespresso check-remotedata-valid-cache or something like this, and once implemented suggest this as a reference to the other plugins, since the plugin will know what to check)
  • In the long term (not for now), if we realise that the performance of checking these hashing/caching extras is not good, we might want to move these to a table column that is indexed, since this will be much more efficient in searching for nodes with the same hash (since extras are not indexed, so it's always a table scan).

node_hash = self._get_hash()

if not node_hash or not self._cachable:
return iter(())

builder = QueryBuilder(backend=self.backend)
builder.append(self.__class__, filters={'extras._aiida_hash': node_hash}, project='*', subclassing=False)
nodes_identical = (n[0] for n in builder.iterall())
builder.append(self.__class__, filters={f'extras.{self._HASH_EXTRA_KEY}': node_hash}, subclassing=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just delegate the query builder to already filter by node.is_valid_cache, rather than doing it in python below via the getter? I imagine this can only be more efficient (one needs to be careful to the default value if not set, of course)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be possible if the extra is the only thing that determines whether the node is a valid caching source. However, the method can and actually is already overridden by the ProcessNode class. It also considers the exit code that is stored, which if it has the invalidates_cache set to True on the associated ProcessSpec then is_valid_cache also returns False.

unkcpz
unkcpz previously approved these changes Nov 3, 2021
Copy link
Member

@unkcpz unkcpz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me! Thanks.

@sphuber sphuber force-pushed the fix/5189/node-invalidate-for-cache branch from 5c33bfc to e90ca9e Compare November 3, 2021 10:50
@sphuber
Copy link
Contributor Author

sphuber commented Nov 3, 2021

The current design of adding a setter for the is_valid_cache property was discussed in the AiiDA meeting of 2021-11-03 and was approved by those present. I have also updated the documentation to provide details on this mechanism. @greschd @unkcpz this is ready for final review.

@sphuber
Copy link
Contributor Author

sphuber commented Nov 3, 2021

I also agree that this should be the documented way to enable/disable caching; and the docs should also suggest how to extend the behaviour in subclasses (and when it's recommended to do it, or not to do it). I guess calling the super, and then using and if one wants to possibly disable caching e.g. in remote data? Or is there a better way?

This is really not relevant for data classes, because a data node is never considered as a caching source, since caching is disabled for data nodes. Note that the value returned by is_invalid_cache of a Data node is not (and should not be) considered when it is an input to a ProcessNode that is being stored with caching enabled. It is just the other equivalent ProcessNodes whose is_valid_cache output is considered to see whether they are valid caching sources.

have a verdi data remote check-valid-cache NODE1 NODE2 ... that (opening a single connection) checks all nodes in some way - e.g. if they are empty - and sets the valid cache to False. (However, I'm thinking if this is something general enough to be done in AiiDA... maybe we should move it to aiida-quantumespresso check-remotedata-valid-cache or something like this, and once implemented suggest this as a reference to the other plugins, since the plugin will know what to check)

I think for ensuring that CalcJobNode's, whose remote_folder output is cleaned, are not used as cache sources is not to only set a flag on the remote_folder and have the caching mechanism consider outputs as well in the hash, but rather that cleaning the remote_folder calls remote_folder.creator.is_valid_cache = False and so indirectly invalidates the associated CalcJobNode to never be considered as cache source.

In the long term (not for now), if we realise that the performance of checking these hashing/caching extras is not good, we might want to move these to a table column that is indexed, since this will be much more efficient in searching for nodes with the same hash (since extras are not indexed, so it's always a table scan).

Fully agree. This ties in to the discussion of having a concept of "internal attributes" that control various AiiDA mechanisms but are not really the same as normal node attributes. Having them in extras is too fragile as they can be accidentally changed or removed by users. When we have this new private extras, we should think which ones should be indexed and how.

@sphuber sphuber requested a review from giovannipizzi November 3, 2021 10:59
@sphuber sphuber force-pushed the fix/5189/node-invalidate-for-cache branch from e90ca9e to 6ff7897 Compare November 3, 2021 13:44
This property allows to mark an individual node as invalid for use as a
caching source. It works by setting the `_aiida_valid_cache` extra to
either `True` or `False`.

Previously, the unofficial way of ensuring a node was no longer used as
a cache source, was to remove the `_aiida_hash` extra, containing the
hash of the node. This would achieve the desired effect, since without a
hash, the node would never be found when looking for identical nodes to
cache from. However, this not an ideal approach because the hash may be
useful for other purposes. In addition, if the node were to be rehashed
at some point, which could happen by accident if all nodes of a
particular class got rehashed, it would all of a sudden be used in
caching again.

The solution is to have a way to persistently mark node instances as
invalid for caching. Note that the `is_valid_cache` property of the
`Node` before was intended as a hook to allow changing the caching
behavior of subclasses of `Node` as a whole, and not on the instance
level. This is now changed as the property has a setter that operates on
the instance level. Subclasses should take care to respect this base
class implementation if it makes sense. Since `Node` can only be
subclassed by `aiida-core` and not in plugins, we can easily enforce
this system.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add API to remove a process node from being used in caching
5 participants