Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Requirements: Chemoinformatics #9

Open
avirshup opened this issue Sep 16, 2016 · 12 comments
Open

Requirements: Chemoinformatics #9

avirshup opened this issue Sep 16, 2016 · 12 comments

Comments

@avirshup
Copy link
Contributor

avirshup commented Sep 16, 2016

Lightly edited discussion, migrated from slack:

@avirshup:

[...] we're trying to limit the scope to 3D geometry-based computational chemistry. So we'd like this file format to support MD, docking, quantum chemistry, etc.
But we don't want to replicate PDB or mmCIF files (they store a lot of experimental crystallographic data that's not computationally relevant); nor do we want to get too far into chemoinformatics (which really requires a robust graph theory foundation, like OEChem has)

@davidlmobley:

[...] can you expand a little on what you mean? I have a little bit of paranoia about what happens if one does NOT think of molecules as connected graphs (as is done in PDB files) because this tends to lead to an underemphasis on connectivity, bond information, etc. that tends to cause downstream problems in small molecule simulations. One way I could interpret what you're saying is that you're focused on that same type of info (where bonds are basically incidental and don't get much attention), though another way I could interpret is that you're not that focused on enabling cheminformatics but you're still going to pay considerable attention to bonds, etc.

@jchodera:

[...] I think it is fine to rely on toolkits like openeye and RDKit for those capabilities, but essential that the "topology" data structure have enough information to initialize molecular objects in these toolkits. That's why we need a minimum amount of information about elements, connectivity (if known), and bond orders (and possibly net formal charge?) in order to instantiate molecule representations in those toolkits.

@avirshup:

[...] I 100% agree that you absolutely want to store bond information and have a complete, connected chemical graph, unlike PDB files. As a design principle for MDT, however, I'm trying to make sure we don't end up replicating RDKit or OpenBabel
[...] And @john.chodera, yes, definitely: we should be storing enough information to build an RDKit or OEChem object
[...] From experience with chemoinformatics, though, we want to avoid having to encode anything like hybridization assignment, or Kekulé structures, or SSSR information, or (especially) implicit hydrogens; results for all of these can vary HUGELY between different informatics programs

@jchodera:

I think you want the option of having extra information like that added without breaking things (eg if we want to include a specific representation of an OpenEye molecule) but we want to just store the requisite minimum in general, which requires a broadly compatible notion of bond order.
Other tool providers may want to handle the JSON representation of their own internal data structures.

@avirshup

Yeah, definitely agree that someone should be able to store implicit hydrogen or SSSR information in the file if they want. But I would argue that, because implementations and interpretations vary so greatly, you would not want it to be part of a base standard

@avirshup avirshup changed the title Chemoinformatics requirements REQUIREMENTS: Chemoinformatics Sep 16, 2016
@avirshup avirshup changed the title REQUIREMENTS: Chemoinformatics Requirements: Chemoinformatics Sep 16, 2016
@avirshup
Copy link
Contributor Author

This may be as simple as needing to store:

  • bond orders
  • atomic numbers
  • formal charges

And having a convention about implicit hydrogens (my preference: no implicit hydrogens allowed; missing hydrogens are fine)

@egonw
Copy link

egonw commented Sep 17, 2016

Strong suggestion: use explicit hydrogens or explicit atom types which define explicit hydrogen counts. For bond orders, make a clear choice, e.g. only integer (formal) bond orders. Also, don't expect any tool to know what you mean with "aromatic" (they don't and you will loose interoperability when you start using this term).

@avirshup
Copy link
Contributor Author

@egonw - I basically 100% agree with all of this.

To do due diligence, though: besides aromaticity (which we can deal with via Kekulization), has anyone run across any other use cases for fractional bond orders?

@avirshup
Copy link
Contributor Author

ParmEd, for instance, uses 1.5 for aromatic and 1.25 for amide. Not sure how this works in actual use cases though:

In [5]: parmed.Bond?
Init signature: parmed.Bond(self, atom1, atom2, type=None, order=1.0)
Docstring:     
[...]
order : float, optional
    The bond order of this bond. Bonds are classified as follows:
        1.0 -- single bond
        2.0 -- double bond
        3.0 -- triple bond
        1.5 -- aromatic bond
        1.25 -- amide bond
    Default is 1.0

@davidlmobley
Copy link
Contributor

I'll get back to the bigger threads here soon, but the bond order thing piqued my interest and connects with something we've been working on, @avirshup . In our new Open Forcefield initiative SMIRFF forcefield format, we just implement support for partial bond orders such as Wiberg bond orders. These are floats; our interest in them is that they allow the potential for interpolation between single/aromatic/double depending on conjugation, etc. (It's worth noting that the AMBER parameters for, say, a carbon-carbon aromatic bond in benzene fall linearly in between the parameters for a single and a double bond so interpolating based on a partial bond order of 1.5 give you essentially exactly the "right" parameters for that specific case.

There's a lot of basic science to be done on this, but the idea of allowing for this is that it allows the surrounding chemistry (which impacts the partial bond order) to potentially impact the bond or torsional parameters that are assigned in a sensible way without having explicitly specify parameters for every bond order which might be of interest -- you just hit the "normal" ones and let the fractional bond order do the rest of the work via interpolation.

@davidlmobley
Copy link
Contributor

So, um, I'd advocate for supporting fractional bond orders, honestly -- if for no other reason than that we have major plans involving fractional bond orders. :)

@avirshup
Copy link
Contributor Author

avirshup commented Sep 23, 2016

@davidlmobley - that sounds really cool!

Could this all be addressed with explicitly alchemical data structures? These would be objects that break all sorts of normal chemical rules - allowing for non-integer atomic numbers, non-integer bond orders, atom types that are linear combinations of other atom types, etc.

EDIT: sorry, got hung up on the "interpolation" thing, when this is actually much more interesting than that - see below

@davidlmobley
Copy link
Contributor

@avirshup -- this is actually a separate thing from free energy calculations entirely. The point is that, in "real chemistry" due to conjugation, etc., bond orders (i.e. Wiberg bond orders) often end up being non-integer due to electrons getting pulled here and there. That is, bonds don't end up being strictly single or double or whatever.

This is related to the issue where nitrogen with three connections can be tetrahedral or planar depending on what's connected to it, or even in between tetrahedral and planar with the right environment. Existing forcefields basically just have to bin everything into "these atom types are tetrahedral" and "these are planar". The though we're operating on here is that building a bit more chemistry -- i.e. via things like partial bond order -- can allow us to catch the "in between cases" just via interpolation.

Relating to alchemical calculations -- I agree there could be interest in defining atom types that are explicitly alchemical (in between other types, etc.), though that's really more an issue of force field parameters than of anything else.

However, that's not what I was talking about here -- here I was talking about tracking information that we will need for molecules in order to be able to assign our force field parameters to the molecules themselves. The one we're thinking about the most at the moment is the partial bond order... :)

@davidlmobley
Copy link
Contributor

I have to run off, but in case I still didn't explain clearly enough, you might just check out this comment as it illustrates a specific example for the case of carbon-carbon bonds like I mentioned above:

openforcefield/smarty#134 (comment)

I can also point you to an IPython notebook to play with this if you get interested.

@avirshup
Copy link
Contributor Author

@davidlmobley: Gotcha. When you've got time, yeah, I'd love to see a notebook! And, scientifically, it sounds seems like it could be an interesting intersection of chemical graph theory, force fields, and quantum chemistry - if you have a recommended reference, would definitely be interested in reading up on it.

@davidlmobley
Copy link
Contributor

@avirshup -- here's a notebook, though you'd need an OpenEye license (and the October betas of OpenEye tools) installed to be able to actually fiddle with it: https://github.com/open-forcefield-group/smarty/blob/master/examples/partial_bondorder/test_partialbondorder.ipynb

I don't have a reference yet. We're writing this up starting around now, but it's basically something we came up with Christopher Bayly over the summer that I think has been stuck in his head for a decade or two and is finally coming out... :)

@davidlmobley
Copy link
Contributor

This may be as simple as needing to store:

  • bond orders

and connectivity! (Though it's not clear to me how you would denote bond orders if you didn't have this)

  • atomic numbers
  • formal charges
  • And having a convention about implicit hydrogens (my preference: no implicit hydrogens allowed; missing hydrogens are fine)

I think I agree with this, @avirshup , except that extensions probably are presently needed and it also needs to be easily extendable (via something like SD tags used in SDF files) to include any other type of info someone would want to carry along. Obvious candidates for right now would be partial charges and partial bond orders; perhaps partial charges should be provided always but be empty if no charges are available (as in .mol2 files)

You probably also want to allow people to carry around atom types and atom names if they want to...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants