Thanks for raising this! A couple of weeks ago I made a change to the photoswitch dataset to canonicalise all SMILES as a preprocessing step, and I'll make sure this is implemented for the other datasets as well!
It's possible canonicalization doesn't fully eliminate the bias (e.g. if one set computes SMILES at a different pH, or is more likely to include salt forms). You can see this in the enrichment of "." and "+" characters between positive and negative examples in clintox.
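A quick way to check for this kind of enrichment is to compare how often "." (disconnected fragments / salts) and "+" (charged atoms) appear in the SMILES of each class. This is just a sketch; it assumes a CSV with a "smiles" column and a binary "CT_TOX" label column, which may differ from the actual clintox file layout:

```python
import pandas as pd

# Hypothetical column names; adjust to the actual clintox schema.
df = pd.read_csv("clintox.csv")

for char in [".", "+"]:
    # Fraction of SMILES containing the character, split by class label.
    has_char = df["smiles"].str.contains(char, regex=False)
    by_class = has_char.groupby(df["CT_TOX"]).mean()
    print(f"Fraction of SMILES containing '{char}' by class:\n{by_class}\n")
```

A large gap between the two classes would suggest the label is partially confounded with how the structures were recorded rather than with the chemistry itself.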
I wonder if this aligns with the lack of improvement we observed when augmenting the data with extra (non-canonical) SMILES. Essentially, the model could only learn from training data in canonical form, since the test data was also canonical. Even after increasing the data 5x through augmentation, we couldn't improve performance.
Just a heads up on this issue:
deepchem/moleculenet#15
I propose that string-based classifiers canonicalize SMILES prior to processing, to avoid confounded estimates of performance, confidence intervals, etc.
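For concreteness, a minimal sketch of what that preprocessing step could look like with RDKit (this round-trips each string through a Mol object, so both entries below map to the same canonical form; entries RDKit can't parse are dropped):

```python
from rdkit import Chem

def canonicalize(smiles: str):
    """Return the RDKit canonical SMILES, or None if the string doesn't parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

# Two different spellings of benzene collapse to one canonical string.
smiles_list = ["C1=CC=CC=C1", "c1ccccc1"]
print([canonicalize(s) for s in smiles_list])
```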