When reading hypervalent nitrogen from SMILES, valence should be rounded up to 5 with Hs #3

baoilleach · 2017-09-12T08:42:29Z

For the following SMILES string, according to the SMILES valence model, there should be one hydrogen on the first nitrogen, but the toolkit reports 0.

CN1=NC=CN1

I am using code like the following:

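  IWString smi(argv[2]);
  Molecule m;
  // Stop early if the SMILES cannot be parsed.
  if (!m.build_from_smiles(smi)) {
    fprintf(stderr, "Cannot parse SMILES '%s'\n", argv[2]);
    return 1;
  }

  // Print the implicit hydrogen count of each atom, in atom order.
  const int N = m.natoms();
  for (int i = 0; i < N; ++i) {
    printf(" %d", m.implicit_hydrogens(i));
  }
  printf("\n");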


IanAWatson · 2017-09-13T02:58:07Z

That is a very hard one to know how to handle properly. Does this molecule exist? If so, does it have a Hydrogen in real life? It would not be hard to change this behaviour. If you look at atom.cc near line 976 you will see where I specifically prevent 4 connected Nitrogens. Now, if we wanted to allow that, it would be easy to change result = 0 to result = 1.
But I am not sure this would be optimal in all cases. This might be a case where the behaviour could be optional, so people could choose what they would like to see.
Reading the aromatic form, C[nH]1ncc[nH]1, would require more significant changes in aromatic.cc. Generally you will be better off using Kekule forms in files, and just compute aromaticity in programs.
You probably want set_global_aromaticity_type(Daylight) in your main program to get definitions of aromaticity that will be closest to what people expect. The default definition is very restrictive.

baoilleach · 2017-09-13T08:30:26Z

The background to this is that I am currently comparing the SMILES readers of several toolkits, mainly to identify ambiguous areas of the spec and corner cases. However, the SMILES valence model for nitrogen is not an area of ambiguity - it's round up to 3, or round up to 5. If there was a hydrogen there (for example I drew it in with ChemDraw), and I wrote out that SMILES with program A, then I expect the hydrogen to be there when it is read in by program B. If readers and writers treat areas of the spec as optional, then any hope for using SMILES as an interchange format is lost.
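For illustration, the "round up to 3, or round up to 5" rule can be sketched in a few lines. This is only a minimal sketch of the rule as described here, assuming a neutral, aliphatic nitrogen written without brackets; the function name `ImplicitHydrogensOnNitrogen` is hypothetical and is not part of any toolkit's API.

```cpp
#include <cassert>

// Normal valences of neutral nitrogen under the SMILES valence model,
// in increasing order.
const int kNitrogenValences[] = {3, 5};

// Hypothetical helper (not a toolkit function): implicit hydrogen count for a
// neutral, non-bracket, aliphatic nitrogen, given the sum of its explicit
// bond orders. The sum is rounded up to the nearest normal valence and the
// difference is filled with hydrogens.
int ImplicitHydrogensOnNitrogen(int bond_order_sum) {
  assert(bond_order_sum >= 0);
  for (int valence : kNitrogenValences) {
    if (bond_order_sum <= valence) {
      return valence - bond_order_sum;
    }
  }
  return 0;  // Already above the highest normal valence: no hydrogens added.
}
```

The first nitrogen in CN1=NC=CN1 has one single bond to carbon, one double bond, and one single ring-closure bond, so its bond order sum is 4 and `ImplicitHydrogensOnNitrogen(4)` returns 1.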

Regarding the aromaticity type, does that setting affect reading aromatic SMILES or just writing? I'll certainly do it when writing.

IanAWatson · 2017-09-14T02:55:17Z

Agree that programmes should respect their input. The problem with the molecule above, CN1=NC=CN1, is that whether the N has a Hydrogen is not specified; so it is up to the implementation to make its own determination. We have chosen to make that guess No Hydrogen. But you can easily change that by altering the line in atom.cc. I should make that behaviour settable via a setter function. And if everyone else is doing that, then it would be a good idea.

Note that if the input is C[NH]1=NC=CN1, then that will be respected - because you have said that this atom has an H atom. It is only when you ask the software to make a choice that problems arise. Easily changeable.
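The distinction drawn here, together with the idea of making the choice settable, could look roughly like the sketch below. It is an assumption-laden illustration, not the code in atom.cc: `HydrogenPolicy`, `kNoBracketHydrogens` and `NitrogenHydrogenCount` are all hypothetical names, and only neutral nitrogen is considered.

```cpp
// Hypothetical sentinel: the atom was written without brackets, so the input
// says nothing explicit about its hydrogen count.
const int kNoBracketHydrogens = -1;

// Hypothetical option mirroring the "settable behaviour" idea in this thread.
struct HydrogenPolicy {
  // false: never add H to a nitrogen whose bond order sum exceeds 3.
  // true:  round such a nitrogen up to valence 5.
  bool allow_pentavalent_nitrogen = false;
};

// Hydrogen count for a neutral nitrogen.
// bracket_hydrogens is the count given in a bracket atom such as [NH],
// or kNoBracketHydrogens when the atom was written as a plain N.
int NitrogenHydrogenCount(int bond_order_sum, int bracket_hydrogens,
                          const HydrogenPolicy& policy) {
  if (bracket_hydrogens != kNoBracketHydrogens) {
    return bracket_hydrogens;  // The input is explicit, so respect it.
  }
  if (bond_order_sum <= 3) {
    return 3 - bond_order_sum;  // Ordinary trivalent nitrogen.
  }
  if (policy.allow_pentavalent_nitrogen && bond_order_sum <= 5) {
    return 5 - bond_order_sum;  // Round up to 5 when the option is on.
  }
  return 0;
}
```

With the default policy, the plain N in CN1=NC=CN1 (bond order sum 4) gets 0 hydrogens; with `allow_pentavalent_nitrogen` set it gets 1; and the bracket atom in C[NH]1=NC=CN1 gets 1 regardless of the policy.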

We went back and forth on this one several times; it is a tough call.

Reading aromatic smiles is affected by these settings, because the reader asks whether an atom is capable of receiving a Hydrogen atom, so yes, there will be interplay with reading aromatic smiles. The Molecule object will not aromatise that particular molecule: with an H atom, the Nitrogen would be tetrahedral.

We do not recommend using aromatic smiles. The reason is that different implementations have different definitions of aromaticity. If you pass an "aromatic ring" generated by tool A to tool B, which does not consider that kind of ring to be aromatic, then you have problems. This is avoided if you mostly use Kekule forms, and then ask each tool to compute whatever it thinks aromaticity should be. We have run into all kinds of problems over the years with this. Keep it simple.

That said, reading complex aromatic forms is a weakness. When we test against Pubchem, there are quite a few molecules where we cannot discern the aromatic forms. Mostly these are horrible looking things that are not relevant to drug discovery, so we ignore them. Definite limitations there...

We have made a lot of changes to the code recently and I am hoping to get a new version out by year end. It fixes many limitations and bugs, and improves speed.

Very happy to answer questions, fix bugs, etc. Curious to hear how the project progresses too.

baoilleach · 2017-09-14T06:08:03Z

Well, let's agree to disagree on the nitrogen. :-)

Regarding reading aromatic SMILES, I have heard that advice before and even given it. I'm not so sure now that it's important. If reading is implemented as Daylight did (and described in my talk https://baoilleach.blogspot.co.uk/2017/08/my-acs-talk-on-kekulization-and.html) then it doesn't matter what definition of aromaticity was used. The problems arise when the reader tests for aromaticity.

There was still a considerable number of entries in the third of the three tables where either the simultaneous presence of active ingredients was not split as in the others, or the name still contained spaces not yet replaced by underscores. Both issues are addressed by this commit.

update from branch by IanAWatson into local branch by nbehrnd

baoilleach changed the title ~~When reading hypervalent nitrgoen from SMILES, valence should be rounded up to 5 with Hs~~ When reading hypervalent nitrogen from SMILES, valence should be rounded up to 5 with Hs Sep 12, 2017

IanAWatson pushed a commit that referenced this issue May 25, 2020

Merge pull request #3 from IanAWatson/master

019d071

