Thanks for raising this! A couple of weeks ago I made a change to the photoswitch dataset to canonicalise all SMILES as a preprocessing step, and I'll make sure this is implemented for the other datasets as well!
It's possible canonicalization doesn't fully eliminate the bias (e.g. if one set computes SMILES at a different pH, or is more likely to include salt forms). You can see this in the enrichment of "." and "+" characters between positive and negative examples in clintox.
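A quick way to check for this kind of enrichment is to compare how often "." (disconnected fragments / salts) and "+" (charged atoms) appear in the SMILES of each class. This is just a sketch; it assumes a CSV with a "smiles" column and a binary "CT_TOX" label column, which may differ from the actual clintox file layout:

```python
import pandas as pd

# Hypothetical column names; adjust to the actual clintox schema.
df = pd.read_csv("clintox.csv")

for char in [".", "+"]:
    # Fraction of SMILES containing the character, split by class label.
    has_char = df["smiles"].str.contains(char, regex=False)
    by_class = has_char.groupby(df["CT_TOX"]).mean()
    print(f"Fraction of SMILES containing '{char}' by class:\n{by_class}\n")
```

A large gap between the two classes would suggest the label is partially confounded with how the structures were recorded rather than with the chemistry itself.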
I wonder if this aligns with the lack of improvement we observed when augmenting the data with extra (non-canonical) SMILES. Essentially, the model could only learn from training data in canonical form, since the test data was also canonical. Even after increasing the data 5x through augmentation, we couldn't improve performance.
Just a heads up on this issue:
deepchem/moleculenet#15
I propose that string-based classifiers canonicalize SMILES prior to processing, to avoid confounded estimates of performance, confidence intervals, etc.
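For concreteness, a minimal sketch of what that preprocessing step could look like with RDKit (this round-trips each string through a Mol object, so both entries below map to the same canonical form; entries RDKit can't parse are dropped):

```python
from rdkit import Chem

def canonicalize(smiles: str):
    """Return the RDKit canonical SMILES, or None if the string doesn't parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

# Two different spellings of benzene collapse to one canonical string.
smiles_list = ["C1=CC=CC=C1", "c1ccccc1"]
print([canonicalize(s) for s in smiles_list])
```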