When reading hypervalent nitrogen from SMILES, valence should be rounded up to 5 with Hs #3
Comments
That is a very hard one to know how to handle properly. Does this molecule exist? If so, does it have a hydrogen in real life? It would not be hard to change this behaviour. If you look at atom.cc near line 976 you will see where I specifically prevent 4-connected nitrogens. Now, if we wanted to allow that, it would be easy to change result = 0 to result = 1.
The background to this is that I am currently comparing the SMILES readers of several toolkits, mainly to identify ambiguous areas of the spec and corner cases. However, the SMILES valence model for nitrogen is not an area of ambiguity - it's round up to 3, or round up to 5. If there was a hydrogen there (for example, I drew it in with ChemDraw), and I wrote out that SMILES with program A, then I expect the hydrogen to be there when it is read in by program B. If readers and writers treat areas of the spec as optional, then any hope of using SMILES as an interchange format is lost.

Regarding the aromaticity type, does that setting affect reading aromatic SMILES or just writing? I'll certainly do it when writing.
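To make the "round up to 3, or round up to 5" rule concrete, here is a minimal sketch of the implicit-hydrogen count for unbracketed atoms. The valence table follows the organic subset in the OpenSMILES specification; the function name `implicit_hydrogens` is only illustrative and is not taken from any toolkit.

```python
# Default valences for unbracketed (organic subset) atoms, per OpenSMILES.
DEFAULT_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(element, bond_order_sum):
    """Round the bond order sum up to the next default valence.

    Hypothetical helper: returns the number of hydrogens needed to reach
    the lowest default valence that is >= bond_order_sum, or 0 if the sum
    already exceeds every default valence.
    """
    for valence in DEFAULT_VALENCES[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0


# First nitrogen of CN1=NC=CN1: single bond to the methyl carbon, double
# bond to the next nitrogen, single ring-closure bond -> bond order sum 4.
print(implicit_hydrogens("N", 4))  # 1
```

Under this model the first nitrogen of CN1=NC=CN1 is rounded up to valence 5 and gets one hydrogen, while a nitrogen with a bond order sum of 3 gets none.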
Agree that programmes should respect their input. The problem with the molecule above, CN1=NC=CN1, is that whether the N has a hydrogen is not specified, so it is up to the implementation to make its own determination. We have chosen to make that guess "no hydrogen". But you can easily change that by altering the line in atom.cc. I should make that behaviour settable via a setter function. And if everyone else is doing that, then it would be a good idea.

Note that if the input is C[NH]1=NC=CN1, then that will be respected, because you have said that this atom has an H atom. It is only when you ask the software to make a choice that problems arise. Easily changeable. We went back and forth on this one several times; it is a tough call.

Reading aromatic SMILES is affected by these settings, because the reader will be asking whether or not an atom is capable of receiving a hydrogen atom, so yes, there will be interplay with reading aromatic SMILES. The Molecule object will not aromatise that particular molecule - with an H atom, the nitrogen would be tetrahedral.

We do not recommend using aromatic SMILES. The reason is that different implementations have different definitions of aromaticity. If you pass an "aromatic ring" generated by tool A to tool B, which does not consider that kind of ring to be aromatic, then you have problems. This is avoided if you mostly use Kekulé forms, and then ask each tool to compute whatever it thinks aromaticity should be. We have run into all kinds of problems over the years with this. Keep it simple.

That said, reading complex aromatic forms is a weakness. When we test against PubChem, there are quite a few molecules where we cannot discern the aromatic forms. Mostly these are horrible-looking things that are not relevant to drug discovery, so we ignore them. Definite limitations there...

We have made a lot of changes to the code recently and I am hoping to get a new version out by year end. It fixes many limitations and bugs, and improves speed. Very happy to answer questions, fix bugs, etc. Curious to hear how the project progresses too.
Well, let's agree to disagree on the nitrogen. :-) Regarding reading aromatic SMILES, I have heard that advice before and even given it. I'm not so sure now that it's important. If reading is implemented as Daylight did (and described in my talk https://baoilleach.blogspot.co.uk/2017/08/my-acs-talk-on-kekulization-and.html) then it doesn't matter what definition of aromaticity was used. The problems arise when the reader tests for aromaticity.
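For illustration only, here is a minimal sketch of how a reader can assign Kekulé bonds without ever testing for aromaticity. It is not taken from Daylight or from the linked talk; it assumes the reader has already picked out the lower-case atoms that still need exactly one double bond, and then looks for a perfect matching among them by backtracking. The function name `kekulize` and the input layout are hypothetical.

```python
def kekulize(neighbors):
    """Pair up atoms that each need one double bond.

    neighbors maps an atom index to the set of adjacent atom indices,
    restricted to atoms that still need a double bond. Returns a set of
    frozenset({a, b}) double bonds, or None if no assignment exists.
    """
    atoms = sorted(neighbors)
    matched = {}

    def solve():
        # Find the first atom that has no double bond yet.
        for a in atoms:
            if a not in matched:
                break
        else:
            return True  # every atom is paired
        # Try each free neighbour in turn, undoing the choice on failure.
        for b in sorted(neighbors[a]):
            if b not in matched:
                matched[a], matched[b] = b, a
                if solve():
                    return True
                del matched[a], matched[b]
        return False

    if not solve():
        return None
    return {frozenset((a, b)) for a, b in matched.items()}


# c1ccccc1: all six carbons need a double bond.
benzene = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
print(len(kekulize(benzene)))  # 3

# c1cc[nH]c1: the [nH] (atom 3) needs no double bond, so it is left out.
pyrrole = {0: {1, 4}, 1: {0, 2}, 2: {1}, 4: {0}}
print(kekulize(pyrrole) == {frozenset((0, 4)), frozenset((1, 2))})  # True
```

Nothing in the search asks whether a ring is aromatic; the input either admits a Kekulé structure or it does not, in which case the sketch returns None.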
A considerable number of entries in the third of the three tables still had one of two problems: either the simultaneous presence of active ingredients was not split as in the other entries, or the name still contained spaces not yet replaced by underscores. Both issues are addressed by this commit.
update from branch by IanAWatson into local branch by nbehrnd
For the following SMILES string, according to the SMILES valence model, there should be one hydrogen on the first nitrogen, but the toolkit reports 0.
CN1=NC=CN1
I am using code like the following: