Non-canonical smiles confound string-based classifiers #15
Comments
Oh wow, that's quite the find! Yes, this would definitely need to be fixed as we're overhauling MoleculeNet for the next v2 release. I'll mark this as a bug
@cyrusmaher do you have any insight as to what the bias in the SMILES is?
Thinking about this some more, I believe we canonicalize SMILES before computing descriptors in DeepChem, which should handle this (but I'm not sure). @cyrusmaher Would it be possible to provide a brief reproducing code snippet? That would help us figure out what's happening :)
One consideration: maybe a character frequency count between the raw and canonicalized forms? Maybe there are extra parentheses or aromaticity operators added?
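As a sketch of that character-frequency idea, here is a stdlib-only comparison between a raw and a canonicalized SMILES string. The example strings are hypothetical (a Kekulé form of benzene vs. the aromatic lowercase form RDKit typically emits); no RDKit is needed for the comparison itself.

```python
from collections import Counter

def char_diff(raw, canonical):
    """Characters whose counts differ between raw and canonicalized SMILES."""
    raw_counts, canon_counts = Counter(raw), Counter(canonical)
    return {c: canon_counts[c] - raw_counts[c]
            for c in set(raw_counts) | set(canon_counts)
            if canon_counts[c] != raw_counts[c]}

# Benzene with explicit Kekule bonds vs. the aromatic lowercase form.
# A raw/canonical mismatch like this is exactly the kind of formatting
# signal a string model could latch onto.
print(char_diff("C1=CC=CC=C1", "c1ccccc1"))
```

Running this over every (raw, canonical) pair in the dataset, split by label, would show which characters carry the spurious signal.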
@rbharath The easiest way to reproduce this will be to run a string model on ClinTox with and without SMILES canonicalization. Here is an example of canonicalization:

```python
from rdkit import Chem

canonical_smi = Chem.MolToSmiles(Chem.MolFromSmiles(smi), canonical=True)
```

Without bloating this with the helper code for the string kernel, etc., here's what I ran:

@gabegrand I'm not sure precisely, but delocalization, tautomers, salts, etc. can all be handled differently in systematic ways
@rbharath It's worth considering that canonicalization would not entirely eliminate this bias, e.g. if one database is more likely to include charged species (assuming a different pH or preparation). You can see evidence for this in the different levels of "+" and "." characters between positive and negative examples in ClinTox. Edit: it appears that much of this significance is driven by SMILES that turn out to be duplicated once they're canonicalized.
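The "+" / "." observation can be checked with a few lines of stdlib Python. The toy molecules below are hypothetical stand-ins; the point is that if one class systematically contains more charged species or salts, a string model can key on those characters alone.

```python
def charged_fraction(smiles_list):
    """Fraction of SMILES containing '+' (charge) or '.' (salt/mixture separator)."""
    hits = sum(1 for s in smiles_list if "+" in s or "." in s)
    return hits / len(smiles_list)

# Hypothetical toy split illustrating a class-correlated formatting difference.
positives = ["C[N+](C)(C)C.[Cl-]", "CCO", "[Na+].[O-]C(=O)C"]
negatives = ["CCO", "c1ccccc1", "CC(=O)O"]
print(charged_fraction(positives), charged_fraction(negatives))
```

A large gap between the two fractions on the real dataset would confirm that this residual bias survives canonicalization.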
Firstly, thanks to the DeepChem & MoleculeNet contributors, it is a great library and a great benchmark! However, I think this issue really needs to be fixed before people publish papers (if they haven't already), and potentially the ClinTox dataset should be dropped altogether. I was reproducing the textcnn result on the ClinTox dataset. I was very pleased to reproduce the benchmark AUC of ~0.995! However, when I examined the underlying dataset I found severe biases, which should be fixed in the overall benchmark for ClinTox. The benchmark shows textcnn winning by around 11%, which is very unlikely to be true. I observed the following in my experiments.
I think the most surprising thing was that, using only the first and last 2 characters of the SMILES, you can still achieve a very high (apparently +7% on SOTA) test AUC of 0.956. However, would you really trust such a classifier to detect toxic molecules?! The third experiment, about canonical SMILES, agrees with the findings of @cyrusmaher The model
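For concreteness, the first-and-last-2-characters experiment amounts to a degenerate featurization like the sketch below (the example strings are hypothetical). That such crippled features score 0.956 AUC is strong evidence the labels correlate with formatting, not chemistry.

```python
def end_features(smi, k=2):
    """Degenerate featurization: keep only the first and last k characters."""
    return (smi[:k], smi[-k:])

# Hypothetical illustration: if positive and negative examples were drawn
# from sources with different formatting conventions, even these crude
# features can leak the label.
print(end_features("C[N+](C)(C)C.[Cl-]"))  # ('C[', '-]')
print(end_features("c1ccccc1"))            # ('c1', 'c1')
```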
Thanks for the detailed analysis! We're working towards a MoleculeNet 2.0 paper. We will update recommendations and benchmark analysis for Clintox as part of this release |
Thanks for your reply, that's good to know 😄
Any updates on this? Thanks!
Running a string kernel classifier on the ClinTox dataset, I can obtain an AUROC of 0.96. When I canonicalize the SMILES, my AUROC drops to 0.69. This implies that there is a bias in the SMILES format between positive and negative examples that string-based classifiers can exploit to obtain unrealistically high performance, thereby tainting downstream benchmarks.
A solution to this would be to update the dataset to include only canonicalized SMILES.
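Updating the dataset would also need to handle the duplicates noted above: distinct raw strings can collapse to the same canonical SMILES, sometimes with conflicting labels. A minimal stdlib sketch, assuming the strings have already been canonicalized (e.g. with RDKit's Chem.MolToSmiles as shown earlier in the thread):

```python
def deduplicate(records):
    """Keep the first label seen per canonical SMILES; flag label conflicts.

    'records' is a list of (canonical_smiles, label) pairs. Conflicting
    duplicates probably warrant manual review or removal.
    """
    seen, conflicts = {}, []
    for smi, label in records:
        if smi in seen:
            if seen[smi] != label:
                conflicts.append(smi)
        else:
            seen[smi] = label
    return seen, conflicts

# Hypothetical example: two raw entries that collapsed to the same
# canonical SMILES but carry opposite labels.
print(deduplicate([("CCO", 0), ("CCO", 1), ("c1ccccc1", 0)]))
```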