
Issue with Meaningfulness and Interpretability of Fragments Generated by SMILES Pair Encoding (SPE) #22

Open
madiha1ahmed opened this issue Jul 29, 2023 · 0 comments


Hello,

I recently came across the study "SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning," and I found it quite interesting. However, after implementing the SMILES Pair Encoding (SPE) approach and generating fragments of molecules using the learned vocabulary, I observed that a significant portion of the fragments do not map to valid molecules when processed with RDKit's Chem.MolFromSmiles().

Issue:

  1. The learned SPE vocabulary generates fragments that lack meaningful chemical interpretation.
  2. A large number of fragments obtained with the learned vocabulary map to "None" when processed with RDKit's Chem.MolFromSmiles() (see the reproduction sketch after this list).
  3. This raises concerns about the interpretability and accuracy of the tokenization process.
  4. A majority of the learned substructures contained in "SPE_ChEMBL.txt" map to "None" as well: precisely, 2232/3002 learned substrings are not RDKit-sanitizable.
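Reproduction sketch:
Here is a minimal sketch of how I generate and check fragments, assuming the SmilesPE package is installed and SPE_ChEMBL.txt is in the working directory; the example SMILES is an arbitrary test molecule, not one from the paper:

```python
import codecs

from rdkit import Chem
from SmilesPE.tokenizer import SPE_Tokenizer

# Load the pre-trained SPE vocabulary shipped with the repository.
spe_vocab = codecs.open('SPE_ChEMBL.txt')
spe = SPE_Tokenizer(spe_vocab)

# Arbitrary example molecule (not taken from the paper).
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
tokens = spe.tokenize(smi).split(' ')

# Try to parse each fragment back into an RDKit molecule.
for tok in tokens:
    mol = Chem.MolFromSmiles(tok)
    print(f'{tok!r:>20} -> {"valid" if mol is not None else "None"}')
```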

Expected Behavior:
Ideally, the fragments generated by SPE should correspond to chemically meaningful and interpretable substructures, and they should be convertible into valid molecules when processed with RDKit's Chem.MolFromSmiles().
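The vocabulary-level figure in point 4 can be checked directly against the shipped vocabulary file. Below is a sketch of one way to do this; it assumes each line of SPE_ChEMBL.txt is a single merge rule of two space-separated tokens whose concatenation is the learned substring:

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog('rdApp.*')  # silence per-SMILES parse warnings

invalid, total = 0, 0
with open('SPE_ChEMBL.txt') as fh:
    for line in fh:
        pair = line.split()
        if len(pair) != 2:
            continue  # skip blank or malformed lines, if any
        substring = ''.join(pair)  # merged pair = learned substring
        total += 1
        if Chem.MolFromSmiles(substring) is None:
            invalid += 1

print(f'{invalid}/{total} learned substrings are not RDKit-sanitizable')
```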

Screenshots for reference: [two screenshots attached showing the observed output]

Impact:
In my opinion, the lack of meaningful and interpretable fragments generated by SPE may limit its utility in various cheminformatics applications, including molecular generation and predictive modeling. It could also hinder the interpretability of deep learning models relying on SPE tokenization.

Additional Information:

  • Python version: 3.7.16