rdkit_molecule_from_smiles needs an option to change the random seed for molecule embedding #445

victorl25 · 2024-06-28T14:24:49Z

rdkit_molecule_from_smiles needs to vary the random seed for molecule embedding

Describe the bug
rdkit_molecule_from_smiles occasionally fails to embed a molecule. However, the failure can be avoided by rerunning AllChem.EmbedMolecule with a different random seed. For example, consider the SMILES 'CC(C)Cn1cnc2c(Nc3nc(NCc4ccc(CN5CCCC5=O)cc4)nc(NC@@HCc4ccc(Cl)cc4Cl)n3)nc3ccccc3c21': embedding fails with the hardcoded random seed 11 but works with pretty much any other random seed.

To Reproduce
smiles = 'CC(C)Cn1cnc2c(Nc3nc(NCc4ccc(CN5CCCC5=O)cc4)nc(NC@@HCc4ccc(Cl)cc4Cl)n3)nc3ccccc3c21'
rdkit_mol = rdkit_molecule_from_smiles(smiles, minimisation_method="MMFF94")

Error is reported at the kallisto_molecule_from_rdkit_molecule step:
Jazzy ERROR: [18:22:55] The RDKit embedding has failed for the molecule: ...

Trying the following code extracted from rdkit_molecule_from_smiles generates an error with randomSeed=11 but works with randomSeed=12:

from rdkit import Chem
from rdkit.Chem import AllChem
smiles = 'CC(C)Cn1cnc2c(Nc3nc(NCc4ccc(CN5CCCC5=O)cc4)nc(NC@@HCc4ccc(Cl)cc4Cl)n3)nc3ccccc3c21'
m = Chem.MolFromSmiles(smiles)
mh = Chem.AddHs(m)
emb_code = AllChem.EmbedMolecule(mh, randomSeed=11)
if emb_code == -1:
    print(f"The RDKit embedding has failed for the molecule: {smiles}")
else:
    print("Success!")

Expected behavior
Please add a retry loop that attempts the embedding a few times with different random seeds as a simple workaround.
I suppose this is more of an RDKit issue, but it would be much harder to fix there.
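
The retry loop requested above could look roughly like the following. This is only a sketch under assumptions, not Jazzy's implementation: the helper name embed_with_seed_retries and its parameters first_seed, max_attempts, and embed are hypothetical. The only RDKit behaviour it relies on is the one shown in the snippet above, namely that AllChem.EmbedMolecule accepts randomSeed and returns -1 on failure.

```python
from typing import Callable, Optional, Tuple


def embed_with_seed_retries(
    mol,
    embed: Optional[Callable[..., int]] = None,
    first_seed: int = 11,
    max_attempts: int = 10,
) -> Tuple[int, Optional[int]]:
    """Try to embed `mol`, moving to the next random seed after each failure.

    Returns (conformer_id, seed_used), or (-1, None) if every seed failed.
    `embed` is a hypothetical hook that defaults to AllChem.EmbedMolecule.
    """
    if embed is None:
        # Deferred import so the helper can also be exercised with a stub.
        from rdkit.Chem import AllChem

        embed = AllChem.EmbedMolecule

    for seed in range(first_seed, first_seed + max_attempts):
        conf_id = embed(mol, randomSeed=seed)
        if conf_id != -1:  # EmbedMolecule signals failure with -1
            return conf_id, seed
    return -1, None
```

With the molecule from the snippet above, the call would be embed_with_seed_retries(mh); starting at seed 11 keeps the first attempt identical to the current hardcoded behaviour.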

ghiander · 2024-07-18T12:24:08Z

Hi @victorl25,

I have come up with an implementation of embedding_seed as you requested. I wanted to use your example as part of the tests in the repo, but I cannot reproduce the behaviour you describe:

[12:18:43] SMILES Parse Error: syntax error while parsing: CC(C)Cn1cnc2c(Nc3nc(NCc4ccc(CN5CCCC5=O)cc4)nc(NC@@HCc4ccc(Cl)cc4Cl)n3)nc3ccccc3c21

The SMILES that you provided seems to be invalid, rather than producing a molecule that fails the embedding. Would you be able to review it and provide such an example?

ghiander · 2024-07-18T16:41:22Z

Fixed in https://github.com/AstraZeneca/jazzy/releases/tag/v0.0.14

victorl25 · 2024-07-23T15:24:40Z

Thanks for fixing! I just noticed that the smiles string was mangled by GitHub or somewhere else. The correct smiles string is 'CC(C)Cn1cnc2c(Nc3nc(NCc4ccc(CN5CCCC5=O)cc4)nc(NC@@HCc4ccc(Cl)cc4Cl)n3)nc3ccccc3c21'.

victorl25 · 2024-07-23T15:27:00Z

Oops, it got mangled again, see it in the attached file.
smiles.txt

ghiander · 2024-07-23T16:21:04Z

I'll implement that as a test. Thanks

ghiander · 2024-07-23T16:21:19Z

@victorl25 let me know if the library works now

victorl25 · 2024-07-25T02:09:08Z

@ghiander it does work, I appreciate you putting it out there for broader use! However, just so you know, something strange is going on with AllChem.EmbedMolecule. In a multi-threaded scenario it can fail several times (I've seen up to 5) before succeeding. The same SMILES string gets embedded just fine without multi-threading or in debug mode, go figure. Here is what I had to do:

    embedding_seed = 11
    for i in range(0,10):
        rdkit_mol = rdkit_molecule_from_smiles(smiles, minimisation_method="MMFF94", embedding_seed=embedding_seed)
        if rdkit_mol is None:
            embedding_seed += 1
            continue
        break

ghiander · 2024-07-25T06:13:17Z

Hey @victorl25, thanks for sharing this trick. I'd be interested to understand why this only happens when multithreading, though. Would it be worth opening an issue on the RDKit GitHub? I am fairly sure it's not due to Jazzy, as RDKit is used to preprocess the input SMILES.

victorl25 · 2024-07-25T21:21:14Z

I experimented a bit more and found that setting maxIterations to a high number like 100,000 is sufficient to avoid these embedding failures. I get iteration counts in the 20-30K range for some random seeds with the same molecule. So RDKit already has a solution.

ghiander · 2024-07-26T07:12:59Z

Great. If you set maxIterations to 0, there will be no limit, which is the default in RDKit. I'm closing this issue for now, then.

victorl25 · 2024-07-26T13:21:18Z

Setting maxIterations to 0 doesn't work, the default limit seems to be 1110.

ghiander · 2024-07-26T13:23:52Z

Hi @victorl25, what do you mean by that? The current default is zero: https://github.com/AstraZeneca/jazzy/blob/master/src/jazzy/core.py#L62

victorl25 · 2024-07-27T02:41:38Z

Try running the attached script. For me it fails with maxIterations=0 and succeeds with maxIterations > 15000. INITIAL_COORDS gives you the number of iterations it took to succeed or give up.
test.py.txt
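
For readers without the attachment, a check of this kind can be sketched as follows. This is not the attached script, only an illustration under assumptions: the function name embed_with_iteration_cap and its default arguments are hypothetical. It relies on RDKit's ETKDGv3 parameter object with its randomSeed, maxIterations, and trackFailures attributes, and on GetFailureCounts(), which are available in recent RDKit releases.

```python
def embed_with_iteration_cap(mol, max_iterations=100_000, seed=11):
    """Embed `mol` with an explicit iteration cap and report failure counts.

    Returns (conformer_id, initial_coords_failures); conformer_id is -1
    when the embedding gave up.
    """
    # Deferred import: RDKit is only needed when the function is called.
    from rdkit.Chem import rdDistGeom

    params = rdDistGeom.ETKDGv3()
    params.randomSeed = seed
    params.maxIterations = max_iterations  # 0 leaves the choice to RDKit
    params.trackFailures = True  # record why individual attempts failed

    conf_id = rdDistGeom.EmbedMolecule(mol, params)

    # Failure counts are indexed by the EmbedFailureCauses enum.
    counts = params.GetFailureCounts()
    cause = rdDistGeom.EmbedFailureCauses.INITIAL_COORDS
    return conf_id, counts[int(cause)]
```

Calling it once with max_iterations=0 and once with a large cap on the same molecule and seed shows whether the cap, rather than the seed, decides between success and failure.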

ghiander · 2024-07-29T10:49:51Z

Hi Victor, thanks for clarifying that. Then it's convenient to be able to control maxIterations. I would have thought that the max was higher than 1110.

ghiander mentioned this issue Jul 16, 2024

Introduce parameters to control speed of embedding #454

Closed

ghiander closed this as completed Jul 18, 2024

ghiander mentioned this issue Jul 23, 2024

Introduce test for random seed that makes compound fail #469

Closed

ghiander reopened this Jul 23, 2024

ghiander closed this as completed Jul 26, 2024
