Distinguish experimental structures from theoretical #406
Comments
Could we get away with defining a new enum value in |
Sounds good to me. Should it be |
I think we will need to have both What you are proposing does stretch the meaning of the Perhaps this is also a good moment to think about how we want to include more detailed information about how the structure was generated. Especially information that would be interesting for |
It is definitely a property of a data element (one element of the array, as
opposed to the overall set of records). I agree that it is not something to
add on to some existing "structure features" string. It's more important
than that. How about a new key called "nature" within data:
data[i].nature: {"experimental"|"theoretical"}
|
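As a rough illustration of the `data[i].nature` idea above, the sketch below filters a list of structure entries by a per-entry `nature` key. The entry layout, the IDs, and the helper name `filter_by_nature` are assumptions made for this example only; the thread had not settled on a name or a location for the property.

```python
from typing import Any

# Hypothetical allowed values, taken from the suggestion above.
ALLOWED_NATURES = {"experimental", "theoretical"}


def filter_by_nature(data: list[dict[str, Any]], nature: str) -> list[dict[str, Any]]:
    """Return the entries whose (hypothetical) `nature` key equals `nature`."""
    if nature not in ALLOWED_NATURES:
        raise ValueError(f"nature must be one of {sorted(ALLOWED_NATURES)}")
    # Entries that do not report `nature` at all never match.
    return [entry for entry in data if entry.get("nature") == nature]


# Made-up entries, one element of `data` per structure.
data = [
    {"id": "example-1", "nature": "experimental"},
    {"id": "example-2", "nature": "theoretical"},
    {"id": "example-3"},  # nature not reported
]

experimental_ids = [entry["id"] for entry in filter_by_nature(data, "experimental")]
```

Note that in this sketch an entry without the key is excluded from both categories, which is one of the design questions the thread goes on to discuss.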
Reading @JPBergsma and @BobHanson responses I am now leaning towards separate property. It could actually provide more information about the origin of a structure. In the COD, we have a CIF data item |
This sounds great to me. But can you have theoretical PD? Are there two concepts here? data[i].nature: {"experimental"|"theoretical"} |
Right. Then these should be separate properties. |
How should we name such a property? Some suggestions:
Personally, |
Yes, I also would not know a good name for this distinction. From the suggestions above I found |
experimental_method?
…On Wed, Jun 1, 2022 at 6:14 PM Johan Bergsma ***@***.***> wrote:
Personally, nature does not sound immediately clear to me, origin might
also be quite ambiguous.
Yes, I also would not know a good name for this distinction. From the
suggestions above I found determination_method the clearest. But perhaps
we can also name it simply experimental_or_theoretical .
--
Robert M. Hanson
Professor of Chemistry
St. Olaf College
Northfield, MN
http://www.stolaf.edu/people/hansonr
If nature does not answer first what we want,
it is better to take what answer we get.
-- Josiah Willard Gibbs, Lecture XXX, Monday, February 5, 1900
*We stand on the homelands of the Wahpekute Band of the Dakota Nation. We
honor with gratitude the people who have stewarded the land throughout the
generations and their ongoing contributions to this region. We acknowledge
the ongoing injustices that we have committed against the Dakota Nation,
and we wish to interrupt this legacy, beginning with acts of healing and
honest storytelling about this place.*
|
@JPBergsma: @BobHanson: |
Ah, right. This was in reference to experimental_method: {single crystal diffraction | powder diffraction|...} brainstorming... |
cf. computational experiments vs. experimental modeling |
not voting for "computational experiment". I understand the desire to
consider computational approaches "experiments" but I think this is not
well understood.
…On Fri, Jun 3, 2022 at 2:54 PM Evgeny Blokhin ***@***.***> wrote:
cf. *computational experiments* vs. *experimental modeling*
|
Yes, you are right, that is not convenient. How about |
Indeed, there is a whole spectrum of methods ranging from purely experimental (can we actually get coordinates without any theoretical assumptions?) to purely theoretical. We probably would need a separate ontology just to identify where a structure sits in that spectrum. |
But for our purposes I suggest not reinventing the wheel or overcomplicating.
Go with the ICSD conception here. Keep it simple. Maybe allow for some
ambiguous third category but don't insist that every conceivable possibility
is covered.
…On Mon, Jun 6, 2022, 2:03 PM Andrius Merkys ***@***.***> wrote:
Indeed, there is a whole spectrum of methods ranging from purely
experimental (can we actually get coordinates without any theoretical
assumptions?) to purely theoretical. We probably would need a separate
ontology just to identify where a structure sits in that spectrum.
|
@BobHanson Is there a link for the said ICSD conception? |
I understand the desire to add something ASAP to help distinguish experimental and theoretical structural data. However, I'd suggest being careful not to over-design this interface, since it is debatable whether this info even belongs in this endpoint. Going forward, we won't be able to stuff all possible experimental and theoretical details related to a structure into the Hence, I suggest this to just be a simple boolean field: The alternative, False, just means that the structure has been obtained some other way, e.g., hypothetical structures through substitutions (perhaps including DFT relaxations, etc., but not necessarily), structure prediction algorithms, just random initialization, etc. No guarantees that these structures "make sense". (Or is there a very strong desire to also distinguish theoretical structures that the database strongly believes are at, or very close to, the convex hull of stability? This, I believe, is the ICSD criterion for inclusion.) |
Unfortunately it starts to get complicated here. Imagine we took an experimental structure and relaxed it fully with DFT, ending up with a different cell, symmetry, atomic positions, etc. Is the structure still |
It was my intent to mostly avoid this complexity by a single stringent definition separating everything into "directly from experiment" vs. other. My definition above was meant to say that your example is not an experimental structure. |
I think my suggestion was for actual vs hypothetical which maybe makes this slightly clearer (though shifts the vagueness elsewhere, e.g. whether a DFT database that simply took experimental structures and calculated band gaps without relaxing should report itself as hypothetical or actual). The two relevant axes for filtering seem to me to be whether something has actually been made, and whether the structure is simply the result of minimising or sampling of a Hamiltonian |
Sorry -- that ICSD paper reference:
https://journals.iucr.org/j/issues/2019/05/00/in5024/index.html and
supporting information
Noting that there is a discussion of this in matsci.org
https://matsci.org/t/how-is-the-theoretical-tag-determined/3527
So perhaps the boolean "theoretical" is appropriate (matching ICSD). But
this post does point out the same issue -- that it is not always possible
to distinguish. I think one would just have to trust repositories to do
their best job here. AFLOW could distinguish (perhaps?) between their ICSD
entries (which are presumably NOT theoretical) from their calculations. @
***@***.*** (Cormac)
I do feel strongly that there MUST be some sort of flag regarding this.
Serving up purely calculated structures is not the same as delivering x-ray
crystallographic results. This is a widespread, growing issue throughout
the data world. My recommendation: keep it simple.
Bob
|
Having read the discussion, I tend to agree with those of you favoring a single boolean flag. The question now is where to draw the line. However, neither the ICSD paper nor the related discussion on matsci.org provides clear criteria (thanks @BobHanson for the links, though). @vaitkus, has the IUCr perhaps published any criteria? I am a bit skeptical regarding the |
@merkys, as far as I know, the IUCr does not have any such criteria. However, the ICSD paper lists three types of subclasses of theoretical structures:
Based on this, I would say that according to them anything that is not purely experimental is classified as theoretical. |
I think there might be difficulties in drawing the line between refinement with statistical potentials, forcefields and DFT. |
I don't think anyone proposed to make them mandatory for theoretical entries? Just that if you have data or metadata related to the calculation itself for, say, a calculation that started from one structure and resulted in a couple of output structures, that data would better belong under the |
No. I mistakenly assumed this was the suggested solution for telling experimental structures from theoretical.
Agree. |
Q1: Are theoretical and experimental the correct two options?
I suggest yes:
There is a paper from ICSD: *Recent developments in the Inorganic Crystal
Structure Database: theoretical crystal structure data and related features*
http://scripts.iucr.org/cgi-bin/paper?in5024, where, for example, we see:
In order to be included in the ICSD, a *theoretical structure* has to be
fully characterized, the atomic coordinates determined and the composition
fully specified, similarly to *experimental structures*.
*Table 1*
Comparison of databases containing *experimental and/or theoretical crystal
structures*
(14 uses of "experimental structure")
(26 uses of "theoretical structure")
So, I argue, these are the terms to use.
As for
xxx_yyy = { experimental | theoretical }
I suggest NOT using "structure_type" as that actually means something
different.
Maybe "determination_type"
"experimentally determined structure" Google 20,000 hits.
admittedly,
"theoretically determined crystal structure" has only 3 hits. So maybe that
is a bit of a problem.
Next idea?
|
Revisiting this ahead of today's meeting... we are currently very open to (and in fact should be encouraging!) huge databases of ML structures (not even necessarily ab initio/MM refined) suddenly swamping out all of our multi-provider queries. I still think the lowest friction way of adding this would be with specific
This would render our default structure with Potential nasty side effects:
I really think that this is the biggest added-value change we could make to OPTIMADE right now, and it's something we need to do if we want to scale out to new databases beyond our existing community |
Thanks for reviving this issue. Admittedly, |
I agree that we should rush this feature. I'd rather not place it in We can do a list the same way as So, here is a concrete proposal: Field name:
If the field is missing or equal to an empty string or Database-specific strings using a database provider prefix (e.g., However, writing the above up, I realize that some kind of info on whether the above classification refers to structures existing at NTP or other conditions would also be useful here. If the reason you want this is because you want to filter on "reasonable structures" to use for your AI model, I'm not sure you want to include someone's database of 1M structures relaxed to the convex hull but at one billion bar and 10000 K. |
I agree with @rartino's proposal, but would change the following:
into the following:
Some structures in the COD have so little accompanying metadata that no one can really tell to which class such structures belong. I think "do not assume anything" is a reasonable default value. |
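The "do not assume anything" default can be sketched as a three-state lookup: a known value, or nothing at all. This is only an illustration under stated assumptions: the field name `structure_origin` is a placeholder (the actual proposed name is not given here), and the example values are made up.

```python
from typing import Any, Optional

# Placeholder name: the field name proposed in the thread is not given here.
ORIGIN_FIELD = "structure_origin"


def reported_origin(entry: dict[str, Any]) -> Optional[str]:
    """Return the reported origin, or None when nothing should be assumed."""
    value = entry.get(ORIGIN_FIELD)
    # A missing key, an explicit null and an empty string all mean "unknown".
    if value is None or value == "":
        return None
    return value


def matches_origin(entry: dict[str, Any], wanted: str) -> bool:
    """An entry with unknown origin matches no category, in either direction."""
    origin = reported_origin(entry)
    return origin is not None and origin == wanted
```

With this reading, a sparsely annotated entry is neither counted as experimental nor as theoretical; a client that wants such entries has to ask for them explicitly.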
I think we should see this point as part of the larger issue, that we have not standardized anything yet about the methods used to generate the structure. So I think we should think about how we want to include such information in general. (We do not need to discuss which properties should be included. That could be addressed in a later issue.) |
The question here, I think, is: should So, how about we add |
Thanks @rartino @sauliusg @merkys and @JPBergsma for the extended discussion after today's meeting. I will try my best to summarize the issues with adding this and the options we discussed.
I hope that adequately summarizes our discussion. I am personally leaning towards 4 and would be happy to help set this up. |
Having read @ml-evs's proposal, option 1 still sounds the most elegant to me. I get that a new specification version (or an RC) has to be released and implemented to support such According to the specification, implementations can serve preview (or RC) versions of OPTIMADE:
So once |
After thinking a bit more on this since that discussion: I sympathize with what @ml-evs wants to do by introducing this feature without having older servers return errors, and going forward the situation with forwards-compatibility needs to be improved. However, for this one change, I still come down on the side that we should be able to accept the breakage:
Now, for the general issue: going forward, I think the most direct solution would be to just demote the error for unrecognized properties without prefixes into a mandatory warning. How about the following changes to section 3.9?: 3.9 Handling unknown property names When an implementation receives a request with a query filter that refers to an unknown property name
|
I think I have been talked off the proverbial ledge here; I must admit that writing out options 1-4 above did seem like overkill. My only concern now with adjusting the handling of unknown property names directly to avoid breakages is that results returned from an unknown property filter will be "wrong" and will require all clients to strictly check warnings (somehow this feels like more of a breaking change now than queries not working across versions 😅). I was fine with this in the specific case of adding a field with a default value that can be sensibly applied backwards (although I still concede this is poor API design), but I think we should exercise some caution... |
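To make the client-side burden concrete, here is a minimal sketch of the strict warning check described above. It assumes that warnings arrive as a list of objects with a `detail` string under `meta.warnings` in the JSON response; the helper name `results_if_no_warnings` is hypothetical, and discarding all results on any warning is a deliberately strict simplification rather than recommended behaviour.

```python
from typing import Any


def results_if_no_warnings(response: dict[str, Any]) -> tuple[list[Any], list[str]]:
    """Return (data, warning details); data is dropped if any warning is present."""
    # Assumed layout: response["meta"]["warnings"] is a list of warning objects.
    warnings = response.get("meta", {}).get("warnings", [])
    details = [warning.get("detail", "") for warning in warnings]
    if details:
        # The server may have ignored part of the filter, so the data
        # cannot be trusted to match the query that was actually sent.
        return [], details
    return response.get("data", []), details
```

Every client in a multi-provider query would need a check of this kind, which is the hidden cost of turning the error into a warning.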
We may use crystal structure prediction outputs as a possible use case, for example this paper by Reilly et al.. Table 2 of the paper gives a variety of methods used. Structure aggregators like TCOD will have to fit them into the categories proposed for OPTIMADE. |
Indeed, CSP was my main initial motivation too. Hopefully we can revisit #455 in 2024... |
As suggested by @BobHanson, there should be standard means to distinguish between experimental and theoretical structures. This could be a property with boolean/enum values. I would suggest "MUST" level of support (maybe even for queries), as I believe this bit of information should always be available.