Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rdkit_molecule_from_smile needs an option to change the random seed for molecule embedding #445

Closed
victorl25 opened this issue Jun 28, 2024 · 14 comments

Comments

@victorl25
Copy link

rdkit_molecule_from_smile needs to vary the random seed for molecule embedding

Describe the bug
rdkit_molecule_from_smile occasionally fails to embed a molecule, however, the issue can be avoided by rerunning AllChem.EmbedMolecule with a different random seed. E.g. consider smiles 'CC(C)Cn1cnc2c(Nc3nc(NCc4ccc(CN5CCCC5=O)cc4)nc(NC@@HCc4ccc(Cl)cc4Cl)n3)nc3ccccc3c21' - embedding fails with the hardcoded random seed 11 but works with pretty much any other random seed.

To Reproduce
smiles = 'CC(C)Cn1cnc2c(Nc3nc(NCc4ccc(CN5CCCC5=O)cc4)nc(NC@@HCc4ccc(Cl)cc4Cl)n3)nc3ccccc3c21'
rdkit_mol = rdkit_molecule_from_smiles(smiles, minimisation_method="MMFF94")

Error is reported at the kallisto_molecule_from_rdkit_molecule step:
Jazzy ERROR: [18:22:55] The RDKit embedding has failed for the molecule: ...

Trying the following code extracted from rdkit_molecule_from_smiles generates an error with randomSeed=11 but works with randomSeed=12:

from rdkit import Chem
from rdkit.Chem import AllChem
smiles = 'CC(C)Cn1cnc2c(Nc3nc(NCc4ccc(CN5CCCC5=O)cc4)nc(NC@@HCc4ccc(Cl)cc4Cl)n3)nc3ccccc3c21'
m = Chem.MolFromSmiles(smiles)
mh = Chem.AddHs(m)
emb_code = AllChem.EmbedMolecule(mh, randomSeed=11)
if emb_code == -1:
print(f"The RDKit embedding has failed for the molecule: {smiles}")
else:
print("Success!")

Expected behavior
Please add a cycle to try embedding a couple of times with a different random seed as a simple workaround.
I suppose this is more of an rdkit issue, but it would be much harder to fix it there.

@ghiander
Copy link
Collaborator

Hi @victorl25,

I have come up with an implementation of embedding_seed as you required. I wanted to use your example as part of the tests in the repo but I cannot reproduce your behaviour:

[12:18:43] SMILES Parse Error: syntax error while parsing: CC(C)Cn1cnc2c(Nc3nc(NCc4ccc(CN5CCCC5=O)cc4)nc(NC@@HCc4ccc(Cl)cc4Cl)n3)nc3ccccc3c21

The SMILES that you provided seems to be not valid rather than generating a molecule that fails the embedding. Would you be able to review and provide such example?

@ghiander
Copy link
Collaborator

@victorl25
Copy link
Author

Thanks for fixing! I just noticed that the smiles string was mangled by GitHub or somewhere else. The correct smiles string is 'CC(C)Cn1cnc2c(Nc3nc(NCc4ccc(CN5CCCC5=O)cc4)nc(NC@@HCc4ccc(Cl)cc4Cl)n3)nc3ccccc3c21'.

@victorl25
Copy link
Author

Oops, it got mangled again, see it in the attached file.
smiles.txt

@ghiander
Copy link
Collaborator

I'll implement that as a test. Thanks

@ghiander
Copy link
Collaborator

@victorl25 let me know if the library works now

@victorl25
Copy link
Author

@ghiander it does work, I appreciate you putting it out there for a broader use! However, something strange is going on with AllChem.EmbedMolecule so that you know. In a multi-threaded scenario it can fail several times (I've seen up to 5) before succeeding. The same smiles string would get embedded just fine without the multi-threading or in a debug mode, go figure. Here is what I had to do:

    embedding_seed = 11
    for i in range(0,10):
        rdkit_mol = rdkit_molecule_from_smiles(smiles, minimisation_method="MMFF94", embedding_seed=embedding_seed)
        if rdkit_mol is None:
            embedding_seed += 1
            continue
        break

@ghiander
Copy link
Collaborator

Hey @victorl25, thanks for sharing this trick. I'd be interested to understand why this only happens when multithreading though. Would it be the case to open an issue in the RDKit Github? I am fairly sure it's not due to Jazzy as RDKit is used to preprocess input SMILES.

@victorl25
Copy link
Author

I experimented a bit more and found that setting maxIterations to a high number like 100,000 is sufficient to avoid these embedding failures. I get iteration counts in 20-30K range for some random seeds and the same molecule. So RDKit already has a solution.

@ghiander
Copy link
Collaborator

Great. If you set maxIterations to 0, there will be no limit which is the default in RDKit. I'm closing this issue for now then.

@victorl25
Copy link
Author

Setting maxIterations to 0 doesn't work, the default limit seems to be 1110.

@ghiander
Copy link
Collaborator

Hi @victorl25, what do you mean by that? The current default is zero: https://github.com/AstraZeneca/jazzy/blob/master/src/jazzy/core.py#L62

@victorl25
Copy link
Author

Try running the attached script. For me it fails with maxIterations=0 and succeeds with maxIterations > 15000. INITIAL_COORDS gives you the number of iterations it took to succeed or give up.
test.py.txt

@ghiander
Copy link
Collaborator

Hi Victor, thanks for clarifying that. Then it's convenient to have the possibility to control maxIterations. I would have thought that the max was higher than 1110.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants