Extended SMILES saved from Ketcher might be invalid for RDKit #1865

Closed
xuzuodong opened this issue Nov 23, 2022 · 7 comments · Fixed by #2498

@xuzuodong

xuzuodong commented Nov 23, 2022

Steps to Reproduce

  1. Click "Open..." button
  2. Click "PASTE FROM CLIPBOARD"
  3. Input SMILES c1cccc(-c2ccc(Nc3cccc4c(=O)[nH]ccc34)nc2)c1
  4. Save as Extended SMILES; the result is C1C=C(C2C=NC(Nc3c4c(c(ncc4)=O)ccc3)=CC=2)C=CC=1
  5. Go to RDKit.js official website, in their online code demo, input and run:
var smiles = "C1C=C(C2C=NC(Nc3c4c(c(ncc4)=O)ccc3)=CC=2)C=CC=1"; // generated by Ketcher in step 4
var mol = RDKitModule.get_mol(smiles);
console.log(mol.is_valid())

Actual behavior

  1. On the RDKit.js website, mol.is_valid() returns false.
  2. Drawing the molecule image with RDKit also fails.

Expected behavior
SMILES generated by Ketcher should all be valid for RDKit.

Ketcher version
2.6.2

@AlexanderSavelyev
Contributor

Yes. Aromatic bonds were not converted correctly for the atom with the double-bonded oxygen, and aromaticity is kept on the atoms, which should not be the case (it should be converted to ":" bonds). The suggestion is to de-aromatize such structures.

@paulsmirnov
Member

@AlexanderSavelyev - I have a report from users that led me to this issue. Could you confirm that it is the same?

Load N#Cc1cn[nH]c1N, it will be saved as N#Cc1c(N)nnc1. This SMILES string is recognized by Biovia Draw and ChemDraw, but not RDKit 2020.09.5 (Python):

>>> m = Chem.MolFromSmiles('N#Cc1c(N)nnc1')
[19:51:56] Can't kekulize mol.  Unkekulized atoms: 2 3 5 6 7

@paulsmirnov
Member

A colleague helped me with the reasoning in terms of chemistry :)

It seems related to tautomers and aromaticity. The exported SMILES is ambiguous, and RDKit does not make extra assumptions in order to generate a valid structure, while it seems that Biovia Draw and ChemDraw do.
In our example, either nitrogen in the ring could have an H attached. N#Cc1cn[nH]c1N specifies its location, while N#Cc1c(N)nnc1 does not. When Ketcher processes the aromaticity of the ring, the location of the H is lost. RDKit does not restore the H, leading to an invalid structure. Similarly, in this GitHub issue, the original SMILES specifies a tautomeric [nH], but the information is lost when Ketcher processes the aromaticity of the molecule.
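
To make the ambiguity concrete, here is a toy perfect-matching check: every aromatic atom that still needs a ring double bond must be paired with exactly one neighbour that also needs one. This is a simplified sketch for illustration only, not the algorithm RDKit or Ketcher actually uses. The function name `can_kekulize` is hypothetical, and the atom numbers are taken from the RDKit log above for N#Cc1c(N)nnc1.

```python
def can_kekulize(aromatic_bonds, needs_double):
    """Toy check: can every atom in needs_double receive exactly one double bond?

    aromatic_bonds: iterable of (atom, atom) index pairs.
    needs_double:   atoms that must take part in one ring double bond
                    (aromatic c, and aromatic n written without H).
    """
    adjacency = {atom: set() for atom in needs_double}
    for a, b in aromatic_bonds:
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)

    def match(remaining):
        # Brute-force perfect matching; fine for ring-sized inputs.
        if not remaining:
            return True
        atom = min(remaining)
        return any(
            match(remaining - {atom, partner})
            for partner in adjacency[atom] & remaining
        )

    return match(frozenset(needs_double))


# Five-membered ring of N#Cc1c(N)nnc1, atoms numbered as in the RDKit log.
ring_bonds = [(2, 3), (3, 5), (5, 6), (6, 7), (7, 2)]

# No [nH]: all five ring atoms need a double bond, so no pairing exists.
print(can_kekulize(ring_bonds, {2, 3, 5, 6, 7}))  # False

# An H on either nitrogen removes that atom from the pairing problem.
print(can_kekulize(ring_bonds, {2, 3, 6, 7}))  # True (H on atom 5)
print(can_kekulize(ring_bonds, {2, 3, 5, 7}))  # True (H on atom 6)
```

Both H placements pass in this model, so a reader of the H-less string would have to guess which tautomer was meant. RDKit does not guess and rejects the string.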

@paulsmirnov
Member

BTW, there is no need to use Extended SMILES; plain Daylight SMILES is enough (perhaps it is a good idea to correct the issue title).

With the OP's input:

  • load: c1cccc(-c2ccc(Nc3cccc4c(=O)[nH]ccc34)nc2)c1
  • save: c1cc(-c2cnc(Nc3c4c(c(ncc4)=O)ccc3)cc2)ccc1
  • ketcher warnings: Structure contains query properties of atoms and bonds that are not supported in the SMILES. Query properties will not be reflected in the file saved.
  • rdkit log: Can't kekulize mol. Unkekulized atoms: 8 9 10 12 13 14 16 17 18
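
The lost hydrogen can be spotted without any chemistry toolkit by comparing the bracket atoms of the loaded and saved strings. The snippet below is a rough sketch: the regular expression covers only the common bracket-atom layout (isotope, symbol, chirality, H count), and the helper name `bracket_hydrogens` is made up for this example.

```python
import re

# Bracket atom prefix: optional isotope, element symbol, optional chirality,
# optional hydrogen count. A simplification of the full SMILES grammar.
_BRACKET_ATOM = re.compile(
    r"\[(?:\d+)?(?P<symbol>[A-Z][a-z]?|se|as|[bcnops]|\*)(?:@{1,2})?(?P<h>H\d?)?"
)


def bracket_hydrogens(smiles):
    """Return (symbol, h_count) for each bracket atom that declares an H count."""
    result = []
    for match in _BRACKET_ATOM.finditer(smiles):
        h = match.group("h")
        if h:
            # A bare "H" means one hydrogen; "H2", "H3", ... carry the digit.
            result.append((match.group("symbol"), int(h[1:] or 1)))
    return result


loaded = "c1cccc(-c2ccc(Nc3cccc4c(=O)[nH]ccc34)nc2)c1"
saved = "c1cc(-c2cnc(Nc3c4c(c(ncc4)=O)ccc3)cc2)ccc1"

print(bracket_hydrogens(loaded))  # [('n', 1)]
print(bracket_hydrogens(saved))   # []
```

The loaded string declares one hydrogen on an aromatic nitrogen, while the saved string declares none, which matches the kekulization failure in the RDKit log.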

@AlexanderSavelyev
Contributor

Need to switch to Indigo for SMILES generation.

@even1024
Collaborator

even1024 commented Mar 16, 2023

The bug appears because the interchange KET format does not support an explicitly specified implicit hydrogen count, which bracketed SMILES atoms can carry as a virtual hydrogen counter. Typically this is not an issue, but there are special cases where the standard valence model fails to determine the number of suppressed hydrogens. For instance, in the example above, the N atom is connected to an aromatic ring, so automatic hydrogen counting is not possible. To avoid the ambiguity, [nH] explicitly sets the number of implicit hydrogens to 1 for the nitrogen atom. To fix the issue on Ketcher's side:

1) add implicitHCount field to the atom entity of the ket-format json schema:

    "ImplicitHCount": {
      "type": "integer",
      "enum": [0, 1, 2, 3, 4, 5]
    },

2) As Ketcher has its own parser/generator for MOL V2000, the corresponding conversion of the virtual hydrogen counter ImplicitHCount to the "ChemAxon style" Data S-Group should be implemented. I.e., if a MOL V2000 file has a data group as below:

M STY 1 1 DAT
M SLB 1 1 1
M SAL 1 1 18
M SDT 1 MRV_IMPLICIT_H
M SDD 1 0.0000 0.0000 DA ALL 1 1
M SED 1 IMPL_H1

it should be converted to the atom property implicitHCount, and when generating MOL V2000 the data S-Groups should be added based on the implicitHCount value.

Some info about MRV_IMPLICIT_H data s-group:

http://www.scfbio-iitd.res.in/software/utility/marvin_new/marvin/help/FF/Chemaxon-specific-information-in-MDL-MOL-files_19693843.html

3) In editing mode, when a heteroatom connects to an aromatic ring, it is necessary to add an ImplicitHCount property to this atom to specify the number of hydrogens on it.
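
As a rough illustration of step 2, the sketch below reads MRV_IMPLICIT_H data S-groups from MOL V2000 property lines and returns a per-atom hydrogen count. It is not Ketcher's parser. The function name `implicit_h_from_sgroups` is hypothetical, lines are split on whitespace as they appear in this thread (real V2000 files use fixed-width columns), and counts outside the 0 to 5 range of the proposed schema enum are ignored.

```python
import re

ALLOWED_H_COUNTS = range(0, 6)  # mirrors the enum [0, 1, 2, 3, 4, 5] proposed above


def implicit_h_from_sgroups(molfile_lines):
    """Map 1-based atom index -> implicit H count from MRV_IMPLICIT_H data S-groups."""
    sgroups = {}
    for line in molfile_lines:
        parts = line.split()
        if len(parts) < 4 or parts[0] != "M":
            continue
        tag = parts[1]
        if tag == "STY":
            # M STY <n>, then n pairs of (S-group id, type)
            for sg_id, sg_type in zip(parts[3::2], parts[4::2]):
                sgroups.setdefault(int(sg_id), {})["type"] = sg_type
        elif tag == "SAL":
            # M SAL <S-group id> <n>, then n atom indices
            atoms = sgroups.setdefault(int(parts[2]), {}).setdefault("atoms", [])
            atoms.extend(int(a) for a in parts[4:])
        elif tag == "SDT":
            sgroups.setdefault(int(parts[2]), {})["field"] = parts[3]
        elif tag == "SED":
            sgroups.setdefault(int(parts[2]), {})["data"] = parts[3]

    counts = {}
    for sg in sgroups.values():
        if sg.get("type") != "DAT" or sg.get("field") != "MRV_IMPLICIT_H":
            continue
        m = re.fullmatch(r"IMPL_H(\d+)", sg.get("data", ""))
        if m and int(m.group(1)) in ALLOWED_H_COUNTS:
            for atom in sg.get("atoms", []):
                counts[atom] = int(m.group(1))
    return counts


sample = """\
M STY 1 1 DAT
M SLB 1 1 1
M SAL 1 1 18
M SDT 1 MRV_IMPLICIT_H
M SDD 1 0.0000 0.0000 DA ALL 1 1
M SED 1 IMPL_H1
""".splitlines()

print(implicit_h_from_sgroups(sample))  # {18: 1}
```

The reverse direction (writing one data S-group per atom that carries an implicitHCount value) would follow the same field layout.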

KonstantinEpam23 added a commit that referenced this issue Apr 20, 2023
…2498)

* #1865 Extended SMILES saved from Ketcher might be invalid for RDKit

* #1865 fix conflicts

* #1865 remove IMPLICIT_V for molfile generation
@KonstantinEpam23
Collaborator

Support for implicit hydrogens in the MOL V2000 format will be implemented separately as part of #2500.
