-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Derived properties vs. primary properties #410
Comments
This issue has not been addressed by #376. However, I wonder whether the granularity is right. Suppose a database is serving both fractional and Cartesian coordinates. Some entries have only fractional, and some have only Cartesian coordinates in the underlying data, thus the database has to perform on-the-fly conversions to present both types of coordinates. Thus the database cannot say that all Cartesian/fractional coordinates are primary/derived. However, such distinction would be useful, but maybe in per-entry level? |
In connection to ranged properties #452, I start to think we have a general issue related to not having any systematic way to provide per entry property metadata. In that PR it is about what the ranges are, etc., here it is metadata about whether the property is derived. We've been touching upon the issue of providing accuracy information, etc. Shouldn't we just agree on some sensible suffix? - e.g., The suffix could also just be We could also go in the opposite direction and place each separate category of metadata as a separate top-level field: |
I think |
Yes, ":" would require us to extend the filter syntax - but it is a quite minor extension. Using "_" means we cannot split metadata up into categories (or, at least not in more than a few categories), as every suffix restricts the namespace of property names for data. But there seems to be very little preventing us to PR:ing a standard Are there any trivially agreeable metadata fields we can start with to just get the segment about the suffix merged? |
I disagree,
Right. Maybe we can think of some other means/separator? Can we use
In #426 (comment) @eimrek suggested specifying calculation methods for derived properties, so that could be a candidate as well. |
As far as I know, the "." is used to address subfields of dictionaries. At least, I have started to use it like that for the trajectories. Perhaps it would be better to separate this discussion from the original issue about derived vs primary properties. |
This is true. However, since we probably will not place property data and meta in a single JSON:API-sh dictionary (as you say, this would be a major change), I thought we could simulate something similar to that by having: "cartesian_site_positions": [
// ...
],
"cartesian_site_positions.meta": {
// ...
}, This would not need any changes to the grammar and would still support filters like |
Good point about the conflict with ":" for zip comparisons, lets stay away from that. I'm reluctant to overload "." to mean both data dictionary subfield and metadata suffix. I get how one can see ".meta" as a generalized subfield, but especially for a properties that are themselves data dictionaries, won't this still get confusing? And, it means no (not even database-specific) property can ever have a data field named "meta". On the other hand, @merkys suggestion is the only one I can see that would allow any property to either come with metadata or not, and give the feeling that data and metadata are "bundled in the same dictionary". Otherwise, wouldn't referencing our most straightforward properties, e.g. Also, in practice, I see barely any difference between ".meta" and "_meta". The response would still look exactly as in @merkys' example, only with Both "_" and "." are attractive in that they require no change to the grammar. On the other hand, a specific character for metadata access has the benefit that the grammar can be changed to specifically separate metadata access on the parser level, which in the long run may be better for implementations. There are also many other characters we could consider that do not have any special meaning in OPTIMADE yet, e.g., "%", "#", "$", etc. But, maybe the best option is '~', since it is the only URL-safe character we have left, and thus would still work well when experimenting in command line/browser address bar, etc. So, e.g.:
|
Indeed, this might be confusing. On the other hand, dictionary-valued properties would be allowed to store their metadata inside dictionaries and have the same filters as non-dictionary-valued properties. Alternatively we may invert the placement of metadata and designate a special dictionary property {
"cartesian_site_positions": [
// ...
],
"nsites": 123,
"meta": {
"cartesian_site_positions": {
"is_derived": true, "derived_from": "fractional_site_positions" // ...
},
"nsites": {
"are_occupancies_accounted_for": false
}
},
} This would not break grammar and would only require to reserve |
Interesting idea. This fits our current design very well - current implementations would essentially not need to change at all to support this. In practice, it will just look and work like any other data member. It also makes it easy to say "include all metadata for this entry" with currently existing features ( On the other hand, metadata wouldn't be "sorted" alongside the data it belongs to. Also, in the shorter term, there is no facility in place today for clients to selectively chose which properties' metadata to include. Queries would take the form To continue the general discussion; we started to discuss this feature for "very meta" metadata such as "is_derived". However, I have a feeling that if we get this in place, it will pull us towards a generally more object-oriented-like design with a lower barrier to what we consider metadata. For example, the following would be a nice syntax (and, I think, in line with what @JPBergsma wants):
whereas these syntaxes will just be seen by users as having to add irrelevant cruft:
But, as we've discussed above,
and a response syntax of:
But, I concede that adding "~" to the language in this way is a larger change. So, after all this discussion, I feel a bit silly coming back to that we could do the
with a response syntax of:
Just like the |
To me, a suffix like “_meta” or “_” seems to be the best solution.
The “_” is a bit of a special character, and we may want to assign a special meaning to a terminal underscore in the future, just as we do now with a leading underscore. There may also be databases that already have properties with an underscore at the end. |
To get back to the original topic of this discussion: To me, the difference between Cartesian/fractional is more of a unit issue, they both describe the same value but in a different way. And no extra error/uncertainty is introduced when you convert from one to the other. Therefore, I would not consider either one as derived from the other. So I do not think this is a good example. I would rather use that flag for a property like the compressibility if it was derived from the radial distribution function, instead of calculated directly by measuring the change in density upon changing the pressure. |
@JPBergsma I suppose you have a point in that assigning a special meaning to a But, if both of you in this thread agree that this kind of mechanism is a good solution to the metadata problem - maybe let us bring the question about the precise details and syntax is to the next web meet?
I think in the discussion that provoked the issue posted here, we were primarily considering derived quantities possible to get at floating point accuracy. However, even in this case the distinction between "original" and "derived" is not meaningless. A database may want to be able to faithfully represent the source data in its tables and distinguish that from quantities that have undergone numerical processing. For example, COD is built from CIF files with, in most cases, experimental data. When I download a COD structure via OPTIMADE, it may for some uses be useful to know what quantities comes directly from that source file.
Here we IMO get into the territory of the "full experimental/computational provenance of the data"-framework mentioned in #455 (in the context of "that would be useful but this is not that"). I do not envision such a framework to fit into even a per-entry metadata property, but would be happy to look at any proposal. (My own old proposal on this goes back to #74; now closed, but probably should be reopened since it also covers this discussion). |
Yes, we can discuss it further on the next web meeting. |
PR #455 is about marking structure origin, i.e., experimental vs. DFT-derived structures. The current issue is about marking properties (or their values) as primary/derived. For example, if DFT code outputs orthogonal coordinates, they would be considered primary. If someone would convert them to fractional, such property would be marked as derived. |
@sauliusg made a remark that it would be good to know whether a property comes directly out of the underlying data source, or if it has been processed (probably to fulfill the requirements of the standard, e.g., providing Cartesian coordinates from the reduced ones).
A suggestion is to add this info to the PR for property definitions #376. There is some discussion whether it is sufficient to say this is a derived property = FALSE/TRUE, or if it is strongly desirable for a database to indicate from what primary properties a derived property has been derived from.
The text was updated successfully, but these errors were encountered: