Chemical identity information for non-QM packages #35
Comments
I am very receptive to optionally including this information, and we already have some proposed improvements related to bonding information. It would be great to understand what would be most helpful for the MD side, from my perspective at least. |
An alternative to formal charges would be partial charges. Isomeric SMILES might be harder for you to generate in general (since it requires more detailed chemical perception), though perhaps if you provide them for the parent molecule (input) this could work. However, after discussing it with some others, we think the easiest and best option for you may be just the bond order matrix (e.g. Wiberg bond orders). A relatively general solution might be to give both the Wiberg bond order matrix and the isomeric SMILES for the parent molecule, since after applying chemical perception these should agree. Other properties, such as dipole moments, can be very useful to us for informational purposes, though the above is probably the most critical. |
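The agreement check between a bond order matrix and the parent SMILES can be pictured roughly as follows. This is only a minimal sketch: the 0.5 cutoff, the rounding rule, and the water bond order values are illustrative assumptions, not values from this thread or from any QC program.

```python
def perceive_bonds(bond_orders, cutoff=0.5):
    """Return (i, j, rounded_order) tuples for atom pairs at or above the cutoff."""
    bonds = []
    n = len(bond_orders)
    for i in range(n):
        for j in range(i + 1, n):
            order = bond_orders[i][j]
            if order >= cutoff:
                # Round to an integer bond order, never below a single bond.
                bonds.append((i, j, max(1, round(order))))
    return bonds

# Made-up, Wiberg-like values for water with atom order O, H, H.
wiberg = [
    [0.00, 0.93, 0.93],
    [0.93, 0.00, 0.01],
    [0.93, 0.01, 0.00],
]

# Bonds implied by the SMILES "[H]O[H]" when O is atom 0 (assumed ordering).
expected_from_smiles = [(0, 1, 1), (0, 2, 1)]

print(perceive_bonds(wiberg) == expected_from_smiles)
```

A real comparison would use a cheminformatics toolkit for the SMILES side; the point here is only that the two descriptions can be reduced to the same bond list and compared.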
I think we can expand the topology sections to (optionally) have a place for bond orders and a SMILES string. This would be different from Wiberg bond orders, which would likely fall in the "results" section of the schema, as they require a QM computation. For dipole moments, are you thinking of some sort of localized approach or just the general dipole of the molecule? The former would again need to be a result specification, while the latter should already be included. What other kind of chemical perception were you thinking of? |
@davidlmobley and @jchodera do we just want an optional SMILES field as a string, or would you possibly want the ability to add more chemical perception than just that? |
Sorry about that, @dgasmith - I missed your last question. The things I'm most concerned about are the bond order matrix and the SMILES string (or if SMILES strings are problematic to generate, I can come up with alternatives). Beyond that, I can see some things being useful for convenience (dipole moment for example) but they are not crucial for me. I don't see the need to add more chemical perception than that:
I just meant that I'm trying to be understanding that you guys might not want to have to do sophisticated chemical perception. If SMILES strings are tricky, then formal charges or partial charges (together with the bond order matrix) could serve. In terms of dipole moment, just the general dipole moment of the molecule. |
Ok, good to know. I think we can add a SMILES section to the molecule class as an optional field no problem. If I think about a charge model (such as RESP) and bond-orders those will have to be outputs of QC programs rather than attached to the molecule specification itself. I think we can straightforwardly add a spec for charge models and bond-orders if the QC program supports it. |
I'm thinking there'll have to be domains w/i the overall molecule spec. The QCprog provides the results it can generate w/o subjectivity. An EFPprog may come along and provide its fragments in a separate domain and interact with the QC portion to the extent of fixing its Cartesian coordinates and using all atoms as input. A RESPprog can come along and add a charge set. And anything that can generate SMILES can read the QC portion and whatever other portions (a program shouldn't amend the molecule JSON unless it can understand it completely) and add its own domain. So, I think SMILES is great, just not directly in the QC molecule domain, where it's (1) not an input or output and (2) any implementations are likely non-expert. |
An optional SMILES string corresponding to the original chemical species the calculation was performed for (if applicable) would be exceptionally helpful in searching large sets of calculation results on many molecules for calculations of interest. Presumably, this won't apply to all calculations (for example, a transition state will not necessarily correspond to a single chemical species), but it may still be useful attached metadata for many calculations involving small organic molecules. Specifically, a canonical isomeric SMILES string would be optimal. Despite the term "canonical", not all programs produce the same canonical string, so one computed with a specific program (e.g., RDKit) may be ideal. Note that the SMILES string is only useful in identifying whether a specific calculation may be for a molecular species of interest, but will not help identify which atoms correspond to which parts of the molecule (which is often important for tasks like forcefield parameterization). Some additional topology information mapping atoms in the molecular topology to atoms in the calculation would still be necessary. |
Hmm, we may be on slightly different pages. I'm not entirely sure we could support something that would require the following "workflow":
I was thinking more that we would have an option to add a SMILES string before the computation that could just ride through the QC computation. The database tech that all of this is associated with can definitely handle the above workflow however. @cryos Any thoughts here? |
This would be totally adequate! I was just intending to note that
|
Right now we have it so that unknown fields are passed through. Perhaps a better way of thinking of the SMILES field is as a "registered" pass-through field. For 2) I don't think that's a problem as long as we do not need too many more of these. Would a simple list of integers that index the rest of the molecule spec work? |
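The pass-through behaviour described in this comment can be sketched as below. The key names (`symbols`, `geometry`, `smiles`, `return_energy`), the made-up geometry values, and the `run_qc` stand-in are all assumptions for illustration and are not taken from the schema.

```python
import copy

# Hypothetical set of keys the QC program understands and may act on.
UNDERSTOOD_KEYS = {"symbols", "geometry"}

def run_qc(molecule):
    """Stand-in for a QC program: copies its input and adds a placeholder result."""
    output = copy.deepcopy(molecule)  # unknown fields ride through untouched
    output["return_energy"] = None    # placeholder; no real computation happens here
    return output

def passed_through(molecule_in, molecule_out):
    """Keys the program did not understand but preserved unchanged."""
    return {
        key for key in molecule_in
        if key not in UNDERSTOOD_KEYS and molecule_out.get(key) == molecule_in[key]
    }

mol = {
    "symbols": ["O", "H", "H"],
    "geometry": [0.0, 0.0, 0.0, 0.0, 0.0, 1.8, 1.7, 0.0, -0.5],  # made-up values
    "smiles": "O",
}
result = run_qc(mol)
print(passed_through(mol, result))
```

Registering a field such as `smiles` would not change this behaviour; it would only give the key an agreed name and meaning so that downstream tools know where to look.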
Unfortunately, no. There's no unique way to render a SMILES string into a molecular topology, so a list of atom indices would not be sufficient for identifying which atoms correspond to which parts of the molecule. This is essentially why we need some portable way of describing a chemical topology with indexed atoms. In SMARTS strings, it's possible to tag atoms with integers to uniquely identify matched atoms. @davidlmobley : Are SMILES also valid SMARTS? If so, can we have an explicit-hydrogen SMARTS that uniquely tags each atom in the molecule? If so, a single string would be all we need to both create the molecular topology and tag all the atoms. |
SMILES are valid SMARTS, I believe, but I'm not sure how you'd tag the atoms. What exactly do you have in mind? Or maybe I'm missing something obvious. @bannanc? |
Suppose we have ethanol, and would like to specify which atoms belong to which chemically distinct parts of the molecule. We could specify a corresponding SMARTS string that matches each chemically distinct atom in the molecule, tagging it with a unique index:
This way, we only need to carry through a string that allows us to identify which atoms in the quantum chemical calculation correspond to which atoms in the molecule. |
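To make the tagging idea concrete, here is a small sketch that reads the integer tags back out of an atom-mapped string. The ethanol string below is a hypothetical example written for this sketch, not the string from the comment above, and the regular expression is a simplification that only handles plain bracketed element symbols; a real workflow would use a cheminformatics toolkit instead.

```python
import re

# Hypothetical explicit-hydrogen, atom-mapped string for ethanol.
tagged = "[H:4][C:1]([H:5])([H:6])[C:2]([H:7])([H:8])[O:3][H:9]"

def atom_map(tagged_string):
    """Return (element, map_index) pairs in the order atoms appear in the string."""
    pattern = r"\[([A-Z][a-z]?)[^\]:]*:(\d+)\]"
    return [(el, int(idx)) for el, idx in re.findall(pattern, tagged_string)]

def is_complete_map(pairs):
    """True if the map indices are exactly 1..N with no repeats."""
    indices = sorted(idx for _, idx in pairs)
    return indices == list(range(1, len(pairs) + 1))

pairs = atom_map(tagged)

# Elements listed in the order implied by the tags (map index 1 first),
# which is assumed here to be the atom order of the QC calculation.
qc_order = [el for el, _ in sorted(pairs, key=lambda pair: pair[1])]

print(qc_order, is_complete_map(pairs))
```

The tag on each atom is what ties an atom in the string to a row of the QC coordinates, which a bare list of integers cannot do on its own.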
I'm trying to think through the logistics of using a SMARTS string for this purpose. The typical idea behind SMARTS is that they describe a substructure of a molecule. SMILES are valid SMARTS in that you can use a SMILES string to perform a substructure search. However, the reverse isn't true: a SMARTS is not a valid SMILES. That is, when parsing a SMARTS, toolkits expect it to describe a substructure query and treat that differently from a molecule. Assuming all atoms are specified explicitly (including hydrogens and bonds), I think this is a reasonable solution to needing the molecule identity and the mapping to the coordinate information; it just might be more complex than you realize to get that information extracted correctly. |
I'd suggest we include both a SMILES string and the corresponding tagged SMARTS string that matches the atoms from the SMILES-generated molecule (in whatever toolkit you use) to the ordered atoms in the quantum chemical calculation. |
I think that makes sense; I wasn't sure if you were suggesting replacing the SMILES with a SMARTS. |
(Apologies to the QC folks for the degree of back-and-forth needed to come to a consensus!) OK, to summarize our thinking so far:
Use cases
Many calculations of interest will likely start with a specific small organic molecule in mind. Atomic coordinates are generated, and quantum chemical properties computed. It would be useful for many applications that make use of this data for forcefield parameterization, machine learning, or the study of molecular properties to be able to easily identify (1) whether the calculation was performed for a molecule of potential interest, and (2) which atoms in the calculation correspond to which chemically distinct parts of the original molecule.
The proposal
While it is generally easy to go from a small molecule identity to atomic coordinates of a plausible conformation for that molecule, it is very difficult to go the other way. As such, it would be useful to optionally associate information sufficient for (1) and (2) above with the calculation's metadata. We propose to add two optional string fields to the metadata for the calculation:
These two pass-through string fields should be sufficient to enable a huge amount of QC-derived use of datasets stored in the QC JSON spec.
Example
A calculation for ethanol might contain the following fields:
|
Totally agree with this -- that would be tremendously useful and fix a huge number of problems we have as people who want to put things in to QM packages and then use the output for non-QM things. Thanks, @jchodera ! |
Just saw this now - agreed that it would generally be useful to have SMILES/SMARTS as a pass-through to add identifiers and atom-maps from a connection-table view of the world. In principle, QM programs can estimate this using bond order calculations, but I see the use case as a submitting script / workflow / GUI embedding the identifier for reading later. |
From @ghutchis it was thought we might have an "identifier" section for molecules, which expands the number of tags that we can associate with a given molecule. Other tags can be added, but these would be officially encoded.
@jchodera Would this work for you if we insert your definitions for the SMILES/SMARTS patterns? Ping @davidlmobley @bannanc |
I think some of these should be optional.
|
I think there is value in having them, but agree that it would be preferable to make them optional. |
I intended these as optional examples of identifiers - that is, there are likely a range of identifiers (SMILES, SMARTS, InChI, etc.). Certainly some programs (e.g., Open Babel) would write some of these. My point was that most QC programs allow some sort of title or comment field, which was included in the schema - but those are just one example of an identifier. Most QC programs also write a formula (in some format) in the output file. |
Agreed, all of these would be optional. Things like name and comment would be up to users/programs and are mostly free-for-all fields, as they are ill-defined. For SMILES/SMARTS/InChI/formula: are these deterministic (or can they be), and do all programs produce the same result? If not, we may need to attach provenance fields to them. |
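One way to picture optional identifiers with attached provenance is sketched below. The key names, the nested `value`/`provenance` layout, and the `example-toolkit` creator are hypothetical choices for illustration, not the layout proposed in this thread.

```python
import json

# Hypothetical registered identifier keys; none of them are required.
OPTIONAL_IDENTIFIERS = {"smiles", "smarts", "inchi", "inchikey", "formula", "name", "comment"}

def check_identifiers(identifiers):
    """Return a list of problems; an empty or missing block is valid."""
    problems = []
    for key, entry in (identifiers or {}).items():
        if key not in OPTIONAL_IDENTIFIERS:
            continue  # unregistered tags are allowed and simply passed through
        if not isinstance(entry.get("value"), str):
            problems.append(f"{key}: value must be a string")
        if "provenance" not in entry:
            problems.append(f"{key}: no provenance recorded")
    return problems

identifiers = {
    "smiles": {
        "value": "CCO",
        "provenance": {"creator": "example-toolkit", "version": "0.0"},
    },
    "formula": {"value": "C2H6O"},
}

print(json.dumps(check_identifiers(identifiers), indent=2))
```

Recording which program and version produced a string lets a consumer decide whether two non-canonical strings are comparable, without forcing every producer to use the same toolkit.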
InChI is a standard. Formula can be standardized. The others are standards, but not canonical/unique. OTOH, I think this thread was indicating that the SMILES or SMARTS should match the atom order in the file. |
I agree with an earlier comment: there needs to be consistency (or no ambiguity) between the SMILES, SMARTS, InChI, and InChIKey, as the rules are not the same in each case. |
It would be great if we could specify recommended, consistent standards if these fields are included, since this would maximize the potential that searching the database on these keys will return as many useful entries as possible. |
We can certainly recommend programs and algorithms to generate these quantities, but requiring them might be difficult. Can someone write up a recommended way of computing these quantities to get the ball rolling? |
@dgasmith - computing them given what? I usually use the OpenEye tools so would tend to rely on that; can I assume the user would have those? If not we need to rope in someone with expertise. What tools can we assume? If not OpenEye, what about RDKit? |
My view is that these are all optional. InChI and InChIKey are standardized - it doesn't matter the toolkit, they should be the same regardless. As for SMILES and/or SMARTS, if they're supposed to match the atom order, I would think that OpenEye, RDKit, and Open Babel should all give the same SMILES, but might differ slightly on aromaticity. |
InChI and InChIKey match if bonding information is preserved, but can still be an issue where the QM package strips all that and it needs to be perceived, especially when things move around and you throw in different approaches to perceive it. Agreed on them all being optional. |
I certainly think they should be optional, but what do people think about making at least one of these choices (canonical, isomeric, explicit-hydrogen SMILES or InChI) recommended (if appropriate to the calculation) so that researchers datamining QC databases can hope to make maximal use of the information? This would not be required, but simply encouraged to facilitate data re-use. |
Can someone volunteer to try adding this to the schema? You would want to extend the Molecule definition found here. |
Those of us in the MD community would very much like to be able to take output from QM packages and take it directly into MD engines and chemistry toolkits we use. However, these typically require what I'll call the "chemical identity" of the molecule, as (without QM) we can't infer this simply from the elements/number of electrons.
To that end, I'd like to see how receptive people would be to including in the schema the necessary information, such as formal charges (on atoms), connectivity, and bond orders. Presumably this wouldn't be particularly helpful to people staying in the QM world, but for us it would save a whole bunch of intermediate steps and/or a need to know what molecule is contained in the JSON before we can do anything with it.
Alternatively, providing an isomeric SMILES of the molecule or similar could also work. Basically, we just need some way of knowing what molecule (and charge state) it contains without having to "do chemistry" on the file to determine that.
If people are receptive to this idea I can open up a PR to add this to the requirements.md.
To be slightly more specific, I am also proposing broadening the concept of topology to also include bonding information and/or chemical identity (beyond just the coordinates and elements for the molecule).
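As a rough picture of what a broadened topology could carry, the sketch below stores formal charges and a bond list next to the element symbols and runs a few consistency checks. The key names (`symbols`, `formal_charges`, `connectivity`, `molecular_charge`) and the checks themselves are assumptions for illustration, not part of any agreed schema.

```python
def check_topology(mol):
    """Return a list of inconsistencies found in a hypothetical topology block."""
    problems = []
    n_atoms = len(mol["symbols"])
    if len(mol["formal_charges"]) != n_atoms:
        problems.append("formal_charges length does not match symbols")
    if sum(mol["formal_charges"]) != mol["molecular_charge"]:
        problems.append("formal charges do not sum to molecular_charge")
    for i, j, order in mol["connectivity"]:
        # Each bond is (atom index, atom index, bond order), zero-indexed.
        if not (0 <= i < n_atoms and 0 <= j < n_atoms) or i == j:
            problems.append(f"bad bond indices ({i}, {j})")
        if order <= 0:
            problems.append(f"non-positive bond order for ({i}, {j})")
    return problems

# Ammonium cation: the +1 formal charge sits on nitrogen.
ammonium = {
    "symbols": ["N", "H", "H", "H", "H"],
    "formal_charges": [1, 0, 0, 0, 0],
    "connectivity": [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1)],
    "molecular_charge": 1,
}

print(check_topology(ammonium))
```

With this information present, an MD toolkit would not need to infer charge state or bonding from elements and coordinates alone.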