isomericSmiles = False #12

Open
LivC193 opened this issue Feb 17, 2021 · 3 comments

Comments

@LivC193

LivC193 commented Feb 17, 2021

From what I understand, you set isomericSmiles=False in your preprocessing (the filter_and_canonicalize function).

This means you don't take into account any isomeric information. Do you think this might be an issue, especially since isomers don't necessarily have similar chemical or physical properties?

@JoshuaMeyers
Collaborator

Hey @LivC193, thanks for your interest in our code! Yes, isomericSmiles is set to False, although the name is somewhat misleading: it only affects stereoisomers (see the RDKit docs). For example, it does not prevent distinguishing pentane from isopentane, since these have different connectivities.
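
The distinction can be checked directly. The following is a minimal sketch, assuming RDKit is installed; the helper name `flat_smiles` and the alanine pair are illustrative choices and are not taken from the GuacaMol code.

```python
from rdkit import Chem


def flat_smiles(smiles: str) -> str:
    """Canonical SMILES written with isomericSmiles=False (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol, isomericSmiles=False)


# Constitutional isomers: different connectivity, so they stay distinct.
pentane = flat_smiles("CCCCC")
isopentane = flat_smiles("CCC(C)C")
print(pentane != isopentane)  # True

# Stereoisomers: same connectivity, so they collapse to a single string.
l_alanine = flat_smiles("N[C@@H](C)C(=O)O")
d_alanine = flat_smiles("N[C@H](C)C(=O)O")
print(l_alanine == d_alanine)  # True
```

Only the second pair is affected by the flag, which is the case raised in this issue.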

However, your point is valid in the more specific case of stereochemistry. This is an area for future work. I can only speculate on the authors' reason for excluding stereochemistry:

(a) for simplification, since @ and @@ encode stereochemistry only relatively, with respect to the order in which the neighbouring atoms are written (a different canonicalization can flip the sign), and therefore do not directly encode the absolute (R/S) stereochemistry of the molecule. It is therefore not good enough to merely encode @ and @@ individually.

(b) because many scoring functions (such as Morgan fingerprints, which are used for the rediscovery and similarity benchmarks) also ignore stereochemistry by default.
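
Both points can be illustrated with a short sketch, assuming a reasonably recent RDKit that provides `rdFingerprintGenerator`. The molecules, the helper `morgan_similarity`, and the fingerprint settings (radius 2, 2048 bits) are assumptions made for illustration; they are not necessarily the settings used by the benchmarks.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# (a) The same molecule (L-alanine) written with two different atom orders.
# The tag flips between @@ and @ although the molecule is unchanged.
ala_1 = Chem.MolFromSmiles("N[C@@H](C)C(=O)O")
ala_2 = Chem.MolFromSmiles("C[C@H](N)C(=O)O")
same_molecule = Chem.MolToSmiles(ala_1) == Chem.MolToSmiles(ala_2)
labels_1 = [label for _, label in Chem.FindMolChiralCenters(ala_1)]
labels_2 = [label for _, label in Chem.FindMolChiralCenters(ala_2)]
print(same_molecule)       # True
print(labels_1, labels_2)  # ['S'] ['S']

# (b) Morgan fingerprints of two sugar diastereomers.
gal = Chem.MolFromSmiles("C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O")
glc = Chem.MolFromSmiles("C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O")


def morgan_similarity(mol_a, mol_b, include_chirality):
    """Tanimoto similarity of Morgan bit vectors (hypothetical helper)."""
    gen = rdFingerprintGenerator.GetMorganGenerator(
        radius=2, fpSize=2048, includeChirality=include_chirality
    )
    fp_a = gen.GetFingerprint(mol_a)
    fp_b = gen.GetFingerprint(mol_b)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)


sim_default = morgan_similarity(gal, glc, False)  # chirality ignored (the default)
sim_chiral = morgan_similarity(gal, glc, True)    # chirality included
print(sim_default)       # 1.0
print(sim_chiral < 1.0)  # True
```

With chirality left at its default, the fingerprint cannot tell the two diastereomers apart, so a similarity-based score would not reward the correct stereoisomer in any case.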

@LivC193
Author

LivC193 commented Feb 22, 2021

Hi @JoshuaMeyers,
I completely agree with all your points. I just wanted to check whether isomers (other than structural ones) are taken into account. If you have carbohydrate entries in your dataset, not taking stereoisomers into account might be problematic:

from rdkit import Chem

galactose_smiles = "C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O"
glucose_smiles = "C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O"
gal = Chem.MolFromSmiles(galactose_smiles)
glc = Chem.MolFromSmiles(glucose_smiles)
# Without stereo information, both sugars are written as the same SMILES
print(Chem.MolToSmiles(gal, isomericSmiles=False))
# OCC1OC(O)C(O)C(O)C1O
print(Chem.MolToSmiles(glc, isomericSmiles=False))
# OCC1OC(O)C(O)C(O)C1O
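
For comparison, here is a minimal sketch of the same two molecules written with and without stereo information, assuming RDKit is installed. The variable names are illustrative only.

```python
from rdkit import Chem

galactose_smiles = "C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O"
glucose_smiles = "C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O"
gal = Chem.MolFromSmiles(galactose_smiles)
glc = Chem.MolFromSmiles(glucose_smiles)

# Stereo dropped: the two sugars become indistinguishable.
flat_gal = Chem.MolToSmiles(gal, isomericSmiles=False)
flat_glc = Chem.MolToSmiles(glc, isomericSmiles=False)
print(flat_gal == flat_glc)  # True

# Stereo kept: the canonical SMILES differ.
iso_gal = Chem.MolToSmiles(gal, isomericSmiles=True)
iso_glc = Chem.MolToSmiles(glc, isomericSmiles=True)
print(iso_gal == iso_glc)  # False
```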

@JoshuaMeyers
Collaborator

Yes, this is true. In the GuacaMol v1 SMILES training dataset there are in fact 45 occurrences of the SMILES you give as an example (as substrings of larger molecules). Thanks for highlighting this case.
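
One way to run a check like this is a substructure search. The sketch below assumes RDKit is installed and uses a small hypothetical list (`toy_dataset`) in place of the real training file; how the figure of 45 was actually obtained is not stated here, and a literal string search would be another option.

```python
from rdkit import Chem

# Stereo-free pattern from the example above, used as a substructure query.
query = Chem.MolFromSmiles("OCC1OC(O)C(O)C(O)C1O")

# Hypothetical stand-in for a SMILES dataset (not the GuacaMol data).
toy_dataset = [
    "COC1OC(CO)C(O)C(O)C1O",         # methyl hexopyranoside
    "OCC1OC(Oc2ccccc2)C(O)C(O)C1O",  # phenyl hexopyranoside
    "CCO",                           # ethanol
    "c1ccccc1",                      # benzene
]


def count_matches(smiles_list, pattern):
    """Count the parseable SMILES that contain `pattern` as a substructure."""
    count = 0
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None and mol.HasSubstructMatch(pattern):
            count += 1
    return count


n_hits = count_matches(toy_dataset, query)
print(n_hits)  # 2
```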
