-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ NEW: Add orm.Entity.fields
interface for QueryBuilder
#5088
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #5088 +/- ##
===========================================
- Coverage 82.12% 81.48% -0.63%
===========================================
Files 533 531 -2
Lines 38510 37348 -1162
===========================================
- Hits 31624 30431 -1193
- Misses 6886 6917 +31
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
another thought would be in some way to tie the fields to the backend, e.g. |
thanks @chrisjsewell! One question that comes to mind is how much tangible benefit the separation between frontend and backend will really still bring once we've dropped one of the two backends. But I guess that's still a bit further down the line and more controversial, so by adding this mapping explicitly and making it available for programmatic use I think we are making a step in the right direction. Thinking about how to reuse this in e.g. the REST API, and your open question:
If I see correctly, you are using python's type hints to write the "schema" here, which I guess might be limiting when you want to represent more elaborate constraints than data types, such as specific field names in a dictionary that are optional; allowed value ranges (which could also be very interesting to reuse in the verdi CLI) as well as documentation of the fields. In the aiida-restapi package we use pydantic for this - do you think something like this would make sense here as well? |
Another, another thought would be, perhaps you could also extend this to specifying repository attributes, e.g. @has_fields
class NewData(orm.Data):
@property
@field_method("path/to/file", in_repo=True)
def y(self) -> str:
return self.get_object_content("path/to/file") |
thanks @ltalirz
Well I guess here you could just add additional key-words to @field_method("attributes.x", validate=dict(min=0, max=100))
def x(self) -> Optional[int]:
return self.get_attribute("x", None)
I feel the drawback of pydantic is that maybe it does not work well with sub-classing; if we do not use it directly (e.g. for |
thanks, I also recently came across I guess what I'm saying is that when making this change it would be great if you could include with it one small code example of how e.g. the REST API (or the verdi cli) could take advantage of this interface, removing the duplicated schema information that currently resides there (explicitly in the case of the new aiida-restapi; implicitly in old REST API & the verdi cli). Just to make sure that we're achieving this goal - or if we are not, to be aware of what's still missing and how we can get there. |
Yeh I use it all over the place, its great, e.g. https://github.com/executablebooks/MyST-Parser/blob/master/myst_parser/main.py#L31
Yep fair. I don't really want to start rewriting the old REST API though lol, do you have any CLI commands in mind where this could be useful? |
b403194
to
bf90802
Compare
(in 74f39f2 I have removed the need for using |
aiida/orm/nodes/process/process.py
Outdated
@@ -134,6 +144,23 @@ def process_class(self) -> Type['Process']: | |||
|
|||
return process_class | |||
|
|||
@property # type: ignore | |||
@field_method() | |||
def process_type(self) -> Optional[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think process_type
should be removed from the base Node
class, since obviously it is just a process specific field and an implementation detail that it is also contained in the database table for all nodes.
(I added it here so that it only shows up in fields
of ProcessNode
subclasses, not all nodes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(same for computer
, that is only relevant to code and process nodes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, computer should stay for all nodes I think, some other nodes use it (specifically: RemoteData). Default = None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
computer
should definitely stay on the Node
level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted this change as no longer necessary
Thanks @chrisjsewell, in general I think this was a very important missing feature. Just a few quick questions:
|
Yes, fully backward compatible
certainly to prioritise this syntax, then, once people have had a good chance to play around with it, I would indeed consider deprecating
I think with #5093 there is definitely a way forward with this |
b326479
to
e20114e
Compare
Alright now, after #5093, I know what the heck is going on with the QueryBuilder lol, it was pretty easy to add this (a4ee061). So now you can do e.g.: In [1]: QueryBuilder().append(Dict, project=[Dict.fields.pk, Dict.fields.dict["a"]]).dict()
Out[1]:
[{'Dict_1': {'pk': 15, 'dict.a': 1}},
{'Dict_1': {'pk': 7, 'dict.a': 1}},
{'Dict_1': {'pk': 5, 'dict.a': 1}},
{'Dict_1': {'pk': 4, 'dict.a': None}},
{'Dict_1': {'pk': 3, 'dict.a': None}}] 🎉 (again, this is fully back compatible, i.e. if you projected just the |
ok, presuming the tests pass, this is now good to go IMO |
Obviously it should also be documented but, so as not to make this PR any more complex, I will do that in a separate PR once this API is agreed on and merged |
Thanks @chrisjsewell ! While I'm OK to keep the API flexible, I'm wondering if it's more confusing or more clear to make this mapping automatically - at some point, users need to know that properties are stored either as files or as attributes (or extras), and use those (I don't see this as a schema detail, that indeed should be visible, but as an "ontological" aspect of how we represent node content in AiiDA). And in practice only attributes (and extras) are efficiently queryable, not content in the files. Also (this is for my clarity), |
yes
yes; it is an alias and also set to In [1]: str(Dict.fields.dict)
Out[1]: 'OrmField(attributes.*) -> dict'
In [2]: str(Dict.fields.extras)
Out[2]: 'OrmField(extras.*) -> dict'
So this is actually a broader point that I was planning to make in response to #4976 (and I plant to open a PR for programmatically/dynamically setting Firstly, I would say there are broadly three kinds of users of the code (that can overlap):
Here I am talking about (3). IMO they should never need to know about the concept of
Take When you tab complete on it, you get 92 options! In [1]: print(sorted([a for a in Int().__dir__() if not a.startswith('_')]))
['Collection', 'add_comment', 'add_incoming', 'attributes', 'attributes_items', 'attributes_keys', 'backend', 'backend_entity', 'check_mutability', 'class_node_type', 'clear_attributes', 'clear_extras', 'clear_hash', 'clone', 'computer', 'convert', 'creator', 'ctime', 'delete_attribute', 'delete_attribute_many', 'delete_extra', 'delete_extra_many', 'delete_object', 'description', 'erase', 'export', 'extras', 'extras_items', 'extras_keys', 'from_backend_entity', 'get', 'get_all_same_nodes', 'get_attribute', 'get_attribute_many', 'get_cache_source', 'get_comment', 'get_comments', 'get_description', 'get_export_formats', 'get_extra', 'get_extra_many', 'get_hash', 'get_incoming', 'get_object_content', 'get_outgoing', 'get_stored_link_triples', 'has_cached_links', 'id', 'importfile', 'importstring', 'init_from_backend', 'initialize', 'is_created_from_cache', 'is_stored', 'is_valid_cache', 'label', 'list_object_names', 'list_objects', 'logger', 'mtime', 'new', 'node_type', 'objects', 'open', 'pk', 'process_type', 'put_object_from_file', 'put_object_from_filelike', 'put_object_from_tree', 'rehash', 'remove_comment', 'repository_metadata', 'repository_serialize', 'reset_attributes', 'reset_extras', 'set_attribute', 'set_attribute_many', 'set_extra', 'set_extra_many', 'set_source', 'source', 'store', 'store_all', 'update_comment', 'user', 'uuid', 'validate_incoming', 'validate_outgoing', 'validate_storability', 'value', 'verify_are_parents_stored', 'walk'] Firstly, this is not user-friendly but, more conceptually, a user should never be directly setting/getting attributes or objects (i.e. files) on an (w.r.t to tab completion, the naming of methods is also not ideal, e.g. if I want to deal with extras, then really I just want to do You could roughly reduce this to at least 53: In [1]: print(sorted([a for a in Int().__dir__() if not a.startswith('_')]))
['add_comment', 'add_incoming', 'class_node_type', 'clear_extras', 'clear_hash', 'clone', 'convert', 'ctime', 'delete_extra', 'delete_extra_many', 'description', 'erase', 'export', 'extras', 'extras_items', 'extras_keys', 'get_all_same_nodes', 'get_cache_source', 'get_comment', 'get_comments', 'get_description', 'get_export_formats', 'get_extra', 'get_extra_many', 'get_hash', 'get_incoming', 'get_outgoing', 'get_stored_link_triples', 'has_cached_links', 'is_created_from_cache', 'is_stored', 'label', 'mtime', 'new', 'node_type', 'pk', 'rehash', 'remove_comment', 'reset_extras', 'set_extra', 'set_extra_many', 'set_source', 'source', 'store', 'store_all', 'update_comment', 'user', 'uuid', 'validate_incoming', 'validate_outgoing', 'validate_storability', 'value', 'verify_are_parents_stored'] (and this could be dynamically reduced even more when the node is stored) Analogously, for querying, a user shouldn't need to "know"/care the In[1]: QueryBuilder().append(Int, project="attribute.value").dict()
Out[1]:
[{'Int_1': {'attribute.value': 1}}] they just want to do: In[1]: QueryBuilder().append(Int, project=Int.fields.value).dict()
Out[1]:
[{'Int_1': {'value': 1}}] and then this is "magnified" on other types, that have a lot more thing stored as keys of In [1]: Code.fields
Out[1]:
{'append_text': 'OrmField(attributes.append_text) -> Union[str, NoneType]',
'ctime': 'OrmField(ctime) -> datetime',
'description': 'OrmField(description) -> str',
'extras': 'OrmField(extras.*) -> dict',
'input_plugin': 'OrmField(attributes.input_plugin) -> Union[str, NoneType]',
'is_local': 'OrmField(attributes.is_local) -> Union[bool, NoneType]',
'label': 'OrmField(label) -> str',
'local_executable': 'OrmField(attributes.local_executable) -> Union[str, '
'NoneType]',
'mtime': 'OrmField(mtime) -> datetime',
'node_type': 'OrmField(node_type) -> str',
'pk': 'OrmField(id) -> int',
'prepend_text': 'OrmField(attributes.prepend_text) -> Union[str, NoneType]',
'remote_exec_path': 'OrmField(attributes.remote_exec_path) -> Union[str, '
'NoneType]',
'repository_metadata': 'OrmField(repository_metadata) -> Dict',
'source': 'OrmField(attributes.source) -> Union[dict, NoneType]',
'user_id': 'OrmField(user_id) -> int -> user',
'uuid': 'OrmField(uuid) -> str'}
In [2]: CalcJobNode.fields
Out[2]:
{'ctime': 'OrmField(ctime) -> datetime',
'description': 'OrmField(description) -> str',
'exception': 'OrmField(attributes.exception) -> Union[str, NoneType]',
'exit_message': 'OrmField(attributes.exit_message) -> Union[str, NoneType]',
'exit_status': 'OrmField(attributes.exit_status) -> Union[int, NoneType]',
'extras': 'OrmField(extras.*) -> dict',
'job_state': 'OrmField(attributes.state) -> Union[str, NoneType]',
'label': 'OrmField(label) -> str',
'mtime': 'OrmField(mtime) -> datetime',
'node_type': 'OrmField(node_type) -> str',
'paused': 'OrmField(attributes.paused) -> bool',
'pk': 'OrmField(id) -> int',
'process_label': 'OrmField(attributes.process_label) -> Union[str, NoneType]',
'process_state': 'OrmField(attributes.process_state) -> Union[str, NoneType]',
'process_status': 'OrmField(attributes.process_status) -> Union[str, '
'NoneType]',
'process_type': 'OrmField(process_type) -> Union[str, NoneType]',
'repository_metadata': 'OrmField(repository_metadata) -> Dict',
'scheduler_state': 'OrmField(attributes.scheduler_state) -> str',
'sealed': 'OrmField(attributes.sealed) -> bool',
'user_id': 'OrmField(user_id) -> int -> user',
'uuid': 'OrmField(uuid) -> str'} A final note here; the fact that we use an
So coming back to your original question: no I don't meant |
3ed1875
to
d768a62
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @chrisjsewell . I think the concept proposed here is very useful and badly needed indeed. I went through the changes and most of it looks fine, with some minor comments here and there, but I also still have some more generic questions about the design and the implications.
With the current design, we are effectively coupling the front-end directly to implementation details in the backend. By having to specify the database column in the front-end, the separation that we had will be gone. I am not saying that this would necessarily be a deal-breaker, since anyway we might be getting rid of the need of a separation if we're getting rid of Django, but I wonder if it is good and necessary in principle. What we really need in the front-end for the user is what you summarized very succinctly elsewhere:
keys that relate to "queryable" fields, that are read-only once the node is stored
keys that relate to "queryable" fields, that are read/write always (i.e. extras)
keys that relate to "non-queryable" fields
I would maybe add "queryable" but "private" fields that AiiDA stores in the database (hash etc.). This is an effective implementation-free description of what is stored in the repository and what in the database. Could we not move the concept of OrmFields
to the backend layer and have the front-end orm.Entity
automatically detect the defined fields on the BackendEntity
that it contains. This would allow us to later extend the concept of fields in the front-end Entity
with non-queryable fields, which could be data stored in the repository.
A second question is about consistency. There are two methods of defining a field to an ORM class: by explicitly declaring it in __fields__
or by decorating a class method or property. This is mostly a problem of historicity and having to keep backwards compatibility, but if we were to design the ORM from scratch, wouldn't we ideally just define the fields explicitly through __fields__
and have those generate properties dynamically. This would save a lot of code typing and would guarantee a uniform interface. My question is now that if you agree with this, and if so, if we should maybe already try to take this approach and drop the decorator. We can then also deprecate the methods that happen to not follow the naming convention of the automatically and dynamically named getters. There can of course still be custom methods if they need to perform additional logic to just retrieving a database value and returning it.
I think this approach would also prevent inconsistencies in to what fields are marked as fields, and which aren't. Reading through the code, I think there are quite a few entities that do not have certain fields defined, even though they probably should. Computer.user
is one of them for example. Of course we could always thoroughly go through the code before merging to make sure everything is there, but I was wondering if the other approach may prevent the potential for missing fields.
@@ -77,6 +77,8 @@ class QueryDictType(TypedDict): | |||
# mapping: tag -> [] -> field -> 'func' -> 'max' | 'min' | 'count' | |||
# 'cast' -> 'b' | 'd' | 'f' | 'i' | 'j' | 't' | |||
project: Dict[str, List[Dict[str, Dict[str, Any]]]] | |||
# mapping: tag -> field -> return key for iterdict method | |||
project_map: Dict[str, Dict[str, str]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You define this here but don't actually use it I believe. Yet in other files (e.g., aiida.orm.querybuilder
), you could use it but then just literally write Dict[str, Dict[str, str]]
again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chrisjsewell what's the plan here? I see a thumbs up but not an implementation.
@field_method(f'attributes.{SCHEDULER_STATE_KEY}', key='scheduler_state', dtype=str) | ||
def get_scheduler_state(self) -> Optional['JobState']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice here that the dtype
does not necessarily have to correspond to the return type of the method that it decorates. Makes sense since the method can operate on the raw database value before returning it as is the case here. I was just wondering if there could be a potential problem here. I guess in the end it boils down to the choice of defining an entities fields through its methods. I guess this is mostly just a convenience thing, because you also have the possibility to define the OrmFields
directly and manually in the fields
attribute. I am wondering if there is a downside to having both, especially since the decoration of method can have this discrepancy in types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeh, I like the idea of having the field “close” to the method, as it feels like it will make things like refactoring easier, but indeed the special case discrepancies are not ideal. I guess you will still have this problem if, as you mention, we want to look to auto generate the methods from the fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #5088 (comment)
SCHEDULER_STATE_KEY = 'scheduler_state' | ||
CALC_JOB_STATE_KEY = 'state' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make all class constants of CalcJobNode
module constants, as you did for ProcessNode
, just for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I realised this change was not necessary, so have reverted it (for ProcessNode
and CalcJobNode
)
Thanks for the feedback @sphuber
I certainly agree we want to push good abstractions 👍. Obviously the query builder was actually abstracted in the first place, in the sense you request/receive directly backend columns. I’ll have a think if there is a better abstraction, but a key thing here is that I would like this API to be available to be available to plugin developers, to define their own fields on added data nodes, which I think would not be possible if moved to the backend
yeh interesting, possibly. With the current implementation, I was sort of thinking along the lines of how sqlalchemy works; having field attributes, which auto-generate the table attribute. What you suggest would be, I guess an inversion of this. The only issue I foresee is with static analysis and instance attribute auto completion, since then all these would be dynamically generated |
Ok, in be43aec I have removed the I have also renamed |
48f157d
to
9631a91
Compare
9631a91
to
02bed54
Compare
@sphuber this is all rebased and "simplified" (see #5088 (comment)) if you want to have time to have another look Outstanding:
|
Reviewing this PR. Tests passing where it is currently. I'm now incrementally moving (rebasing) it towards v2.5. I've made it as far as #6141. A few minor adjustments along the way. Will discuss these at a later time. I'm now running into an issue with #6134. Specifically, the new |
Okay. Rebased on There has been a good deal of development merged since Chris has last given this attention (I presume). Though my rebasing covered some, I suspect I may have missed a thing or two (or more). One thing that comes to mind is if any core data types have been added that require A few other things that I don't quite understand:
|
Thanks @edan-bainglass . You are not yet up-to-date with |
Huh 🤔 I was sure I included the move to src (between v2.5.0 and main). Maybe I didn't push? I'll check in a bit. Out replenishing stock 🛒 Back from 🛒 Oh right. I'm doing all of this on my own fork (of Chris's fork). There I am caught up to So, with that out of the way, what is the policy? Should I PR to Chris's fork, or just push my rebasing directly to his repo? Also, there are still comments on the PR that I am awaiting response. @chrisjsewell, if you can please address these, that'd be super 🙏 |
There is no real existing policy. Depends on what @chrisjsewell prefers. But if he is no longer interested or doesn't have the time, which I can understand, maybe we can continue with your fork. We close this PR and you open a new one and continue there. That might be easiest |
Sure. If @chrisjsewell has no objections, we can proceed as you suggest. |
The
QueryBuilder
currently has a conceptual problem, in that it essentially exposes the backend database to the user. A classic example of this is that, on the front end, we useNode.pk
, but when using theQueryBuilder
, one has to use the "backend" nameid
(always confusing for new users) (#2577, #2196).Additionally, there is currently no "programmatic" way to infer what can be retrieved from the database for each ORM entity, which is something very desirable for the REST API (aiidateam/aiida-restapi#1).
In this PR, I propose a new API (mainly in
aiida/orm/fields.py
), for:"decorating" ORM properties as database fields,Adding a__qb_fields__
class attribute containingQbField
instances (and theQbAttrField
subclass)Entity.fields
class attribute, thenQueryBuilder
instances.The API is backward compatible and looks like this (also see
tests/orm/test_fields.py
):On any ORM entity, it will inherit parent fields, then if you want to add extra fields, you can add a
__fields__
class attribute or directly decorate methods:This class will now have a
fields
class attribute, which gives you the mapping of ORM properties to database fields:(you can see the fields for all aiida nodes in
tests/orm/test_fields
)Within the
QueryBuilder
you can then directly use these fields for theproject
,filters
andorder_by
values e.g.then run the query as normal.
If you run
qb.dict()
/qb.iterdict()
, it will correctly give you the front end projection keys, e.g.Probably the main limitation at present, is that the output by(a4ee061)qb.dict()
/qb.iterdict()
will still return the backend keys, e.g.id
notpk
Some other things that come to mind:
For e.g.(useDict.fields.dict
we would ideally like to be able to do e.g.Dict.fields.dict.key1
subscriptable=True
when defining field)Some fields are missing for joined ids e.g.(addedNode.user_id
foreign_key
field kwarg)Composite fields, e.g. in(can now useStructureData
you havepbc
->[attributes.pbc1, attributes.pbc2, attributes.pbc3]
in the database__qb_fields__
)Field post-conversions (after having been returned from the database ORM), e.g. list to numpy array(possible, but not adding to this PR)You have to use sometyep: ignore[misc]
on the decorators currently, due to Decorated property not supported python/mypy#1362Obviously, also need to add tests(addedtests/orm/test_fields.py
)Definitely love to have some feedback on this cc @sphuber, @giovannipizzi, @ltalirz, @mbercx, @ramirezfranciscof