
Issue with Meaningfulness and Interpretability of Fragments Generated by SMILES Pair Encoding (SPE) #22

Open
madiha1ahmed opened this issue Jul 29, 2023 · 0 comments


Hello,

I recently came across the study "SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning," and I found it quite interesting. However, after implementing the SMILES Pair Encoding (SPE) approach and generating fragments of molecules using the learned vocabulary, I observed that a significant portion of the fragments do not map to valid molecules when processed with RDKit's Chem.MolFromSmiles().

Issue:

  1. The learned SPE vocabulary generates fragments that lack meaningful chemical interpretation.
  2. A large number of fragments obtained with the learned vocabulary map to "None" when processed with RDKit's Chem.MolFromSmiles() (see the reproduction sketch after this list).
  3. This raises concerns about the interpretability and accuracy of the tokenization process.
  4. A majority of the learned substructures contained in "SPE_ChEMBL.txt" map to "None" as well: precisely, 2232/3002 learned substrings are not RDKit-sanitizable.
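Reproduction sketch:
Here is a minimal sketch of how I generate and check fragments, assuming the SmilesPE package is installed and SPE_ChEMBL.txt is in the working directory; the example SMILES is an arbitrary test molecule, not one from the paper:

```python
import codecs

from rdkit import Chem
from SmilesPE.tokenizer import SPE_Tokenizer

# Load the pre-trained SPE vocabulary shipped with the repository.
spe_vocab = codecs.open('SPE_ChEMBL.txt')
spe = SPE_Tokenizer(spe_vocab)

# Arbitrary example molecule (not taken from the paper).
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
tokens = spe.tokenize(smi).split(' ')

# Try to parse each fragment back into an RDKit molecule.
for tok in tokens:
    mol = Chem.MolFromSmiles(tok)
    print(f'{tok!r:>20} -> {"valid" if mol is not None else "None"}')
```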

Expected Behavior:
Ideally, the fragments generated by SPE should correspond to chemically meaningful and interpretable substructures, and they should be convertible into valid molecules when processed with RDKit's Chem.MolFromSmiles().
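The vocabulary-level figure in point 4 can be checked directly against the shipped vocabulary file. Below is a sketch of one way to do this; it assumes each line of SPE_ChEMBL.txt is a single merge rule of two space-separated tokens whose concatenation is the learned substring:

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog('rdApp.*')  # silence per-SMILES parse warnings

invalid, total = 0, 0
with open('SPE_ChEMBL.txt') as fh:
    for line in fh:
        pair = line.split()
        if len(pair) != 2:
            continue  # skip blank or malformed lines, if any
        substring = ''.join(pair)  # merged pair = learned substring
        total += 1
        if Chem.MolFromSmiles(substring) is None:
            invalid += 1

print(f'{invalid}/{total} learned substrings are not RDKit-sanitizable')
```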

Screenshots for reference: [two screenshots attached showing the observed output]

Impact:
In my opinion, the lack of meaningful and interpretable fragments generated by SPE may limit its utility in various cheminformatics applications, including molecular generation and predictive modeling. It could also hinder the interpretability of deep learning models relying on SPE tokenization.

Additional Information:

  • Python version: 3.7.16