When reading hypervalent nitrogen from SMILES, valence should be rounded up to 5 with Hs #3

Open
baoilleach opened this issue Sep 12, 2017 · 4 comments

@baoilleach

For the following SMILES string, according to the SMILES valence model, there should be one hydrogen on the first nitrogen, but the toolkit reports 0.

CN1=NC=CN1

I am using code like the following:

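  // Parse the SMILES string passed as a command-line argument.
  IWString smi(argv[2]);
  Molecule m;
  m.build_from_smiles(smi);

  // Print the implicit hydrogen count of each atom.
  const int N = m.natoms();
  for (int i = 0; i < N; ++i) {
    printf(" %d", m.implicit_hydrogens(i));
  }
  printf("\n");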
@IanAWatson
Owner

That is a very hard one to know how to handle properly. Does this molecule exist? If so, does it have a Hydrogen in real life? It would not be hard to change this behaviour. If you look at atom.cc near line 976 you will see where I specifically prevent 4 connected Nitrogens. Now, if we wanted to allow that, it would be easy to change result = 0 to result = 1.
But I am not sure this would be optimal in all cases. This might be a case where the behaviour could be optional, so people could choose what they would like to see.
Reading the aromatic form, C[nH]1ncc[nH]1, would require more significant changes in aromatic.cc. Generally you will be better off using Kekule forms in files, and just compute aromaticity in programs.
You probably want set_global_aromaticity_type(Daylight) in your main program to get definitions of aromaticity that will be closest to what people expect. The default definition is very restrictive.

@baoilleach
Author

The background to this is that I am currently comparing the SMILES readers of several toolkits, mainly to identify ambiguous areas of the spec and corner cases. However, the SMILES valence model for nitrogen is not an area of ambiguity - it's round up to 3, or round up to 5. If there was a hydrogen there (for example I drew it in with ChemDraw), and I wrote out that SMILES with program A, then I expect the hydrogen to be there when it is read in by program B. If readers and writers treat areas of the spec as optional, then any hope for using SMILES as an interchange format is lost.
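To make the "round up to 3, or round up to 5" rule concrete, here is a minimal standalone sketch. It does not use the toolkit; the function name `nitrogen_implicit_hydrogens` is hypothetical and the input is simply the sum of bond orders to the neighbouring atoms.

```cpp
#include <cassert>

// SMILES organic-subset rule for an unbracketed nitrogen: the default
// valences are 3 and 5. Hydrogens fill up to the smallest default valence
// that is >= the sum of bond orders. Above 5, no hydrogens are added.
// (Hypothetical helper for illustration; not part of the toolkit.)
int nitrogen_implicit_hydrogens(int bond_order_sum) {
    const int default_valences[] = {3, 5};
    for (int v : default_valences) {
        if (bond_order_sum <= v) return v - bond_order_sum;
    }
    return 0;
}

// The three nitrogens of CN1=NC=CN1, in SMILES order.
void check_issue_example() {
    assert(nitrogen_implicit_hydrogens(4) == 1);  // N1: C-, =N, ring closure
    assert(nitrogen_implicit_hydrogens(3) == 0);  // =N-: double + single
    assert(nitrogen_implicit_hydrogens(2) == 1);  // last N1: two single bonds
}
```

Under this rule the first nitrogen has a bond order sum of 4, which rounds up to 5 and gives one hydrogen.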

Regarding the aromaticity type, does that setting affect reading aromatic SMILES or just writing? I'll certainly do it when writing.

@IanAWatson
Owner

Agree that programmes should respect their input. The problem with the molecule above, CN1=NC=CN1, is that whether the N has a Hydrogen is not specified; so it is up to the implementation to make its own determination. We have chosen to make that guess No Hydrogen. But you can easily change that by altering the line in atom.cc. I should make that behaviour settable via a setter function. And if everyone else is doing that, then it would be a good idea.

Note that if the input is C[NH]1=NC=CN1, then that will be respected - because you have said that this atom has an H atom. It is only when you ask the software to make a choice that problems arise. Easily changeable.
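The difference between the two inputs, and the idea of making the behaviour settable, can be sketched roughly as follows. This is an illustration only, not the code in atom.cc: `NitrogenSpec`, `hydrogens_on_nitrogen` and the `round_up_to_five` flag are all hypothetical names.

```cpp
#include <cassert>

// Hypothetical description of one nitrogen as read from a SMILES string.
struct NitrogenSpec {
    bool in_brackets;      // true for [NH], false for a bare N
    int bracket_h_count;   // H count written inside the brackets, if any
    int bond_order_sum;    // sum of bond orders to non-hydrogen neighbours
};

// A bracket atom states its own hydrogen count, so that is always respected.
// For a bare N the software has to choose; the flag selects between
// "no hydrogen above valence 3" and "round up to 5".
int hydrogens_on_nitrogen(const NitrogenSpec& n, bool round_up_to_five) {
    if (n.in_brackets) return n.bracket_h_count;
    if (n.bond_order_sum <= 3) return 3 - n.bond_order_sum;
    if (round_up_to_five && n.bond_order_sum <= 5) return 5 - n.bond_order_sum;
    return 0;
}

void check_both_inputs() {
    const NitrogenSpec bare{false, 0, 4};       // first N of CN1=NC=CN1
    const NitrogenSpec bracketed{true, 1, 4};   // [NH] of C[NH]1=NC=CN1
    assert(hydrogens_on_nitrogen(bare, false) == 0);      // behaviour reported in this issue
    assert(hydrogens_on_nitrogen(bare, true) == 1);       // behaviour requested in this issue
    assert(hydrogens_on_nitrogen(bracketed, false) == 1); // explicit H is respected
}
```

With a switch like this, the explicit `[NH]` case is unaffected and only the bare `N` case changes.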

We went back and forth on this one several times; it is a tough call.

Reading aromatic SMILES is affected by these settings, because the reader asks whether an atom is capable of receiving a Hydrogen atom, so yes, there will be interplay with reading aromatic SMILES. The Molecule object will not aromatise that particular molecule: with an H atom, the Nitrogen would be tetrahedral.

We do not recommend using aromatic smiles. The reason is that different implementations have different definitions of aromaticity. If you pass an "aromatic ring" generated by tool A to tool B, which does not consider that kind of ring to be aromatic, then you have problems. This is avoided if you mostly use Kekule forms, and then ask each tool to compute whatever it thinks aromaticity should be. We have run into all kinds of problems over the years with this. Keep it simple.

That said, reading complex aromatic forms is a weakness. When we test against PubChem, there are quite a few molecules where we cannot discern the aromatic forms. Mostly these are horrible looking things that are not relevant to drug discovery, so we ignore them. Definite limitations there...

We have made a lot of changes to the code recently and I am hoping to get a new version out by year end. It fixes many limitations and bugs, and improves speed.

Very happy to answer questions, fix bugs, etc. Curious to hear how the project progresses too.

@baoilleach
Author

Well, let's agree to disagree on the nitrogen. :-)

Regarding reading aromatic SMILES, I have heard that advice before and even given it. I'm not so sure now that it's important. If reading is implemented as Daylight did (and described in my talk https://baoilleach.blogspot.co.uk/2017/08/my-acs-talk-on-kekulization-and.html) then it doesn't matter what definition of aromaticity was used. The problems arise when the reader tests for aromaticity.
