I recently came across the study "SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning," and I found it quite interesting. However, after implementing the SMILES Pair Encoding (SPE) approach and generating fragments of molecules using the learned vocabulary, I observed that a significant portion of the fragments do not map to valid molecules when processed with RDKit's Chem.MolFromSmiles().
Issue:
The learned SPE vocabulary generates fragments that lack meaningful chemical interpretation.
A large number of fragments obtained using the learned vocabulary map to "None" when processed with RDKit's Chem.MolFromSmiles().
This raises concerns about the interpretability and accuracy of the tokenization process.
A majority of the learned substructures contained in "SPE_ChEMBL.txt" map to "None" as well: precisely, 2232 of the 3002 learned substrings are not RDKit-sanitizable.
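For anyone who wants to reproduce the count, here is a minimal sketch of the kind of check I mean. It assumes that each line of the vocabulary file holds one whitespace-separated merge pair, and that joining the two parts gives the learned substring; that is my reading of the file, not something I have confirmed with the authors. The helper names (`rdkit_parses`, `load_substrings`, `split_by_validity`) are mine and are not part of SmilesPE or RDKit.

```python
from typing import Callable, Iterable, List, Tuple


def rdkit_parses(smiles: str) -> bool:
    """Return True if RDKit can parse and sanitize the string as a molecule."""
    # Imported lazily so the other helpers can be used without RDKit installed.
    from rdkit import Chem, RDLogger

    RDLogger.DisableLog("rdApp.*")  # silence the per-string parse errors
    return Chem.MolFromSmiles(smiles) is not None


def load_substrings(lines: Iterable[str]) -> List[str]:
    """Join the whitespace-separated parts of each non-empty line.

    Assumption: one merge pair per line, e.g. "c1 ccc" -> "c1ccc".
    """
    substrings = []
    for line in lines:
        parts = line.split()
        if parts:
            substrings.append("".join(parts))
    return substrings


def split_by_validity(
    substrings: Iterable[str],
    is_valid: Callable[[str], bool] = rdkit_parses,
) -> Tuple[List[str], List[str]]:
    """Partition substrings into (valid, invalid) according to is_valid."""
    valid: List[str] = []
    invalid: List[str] = []
    for s in substrings:
        (valid if is_valid(s) else invalid).append(s)
    return valid, invalid
```

To run it on the vocabulary, open "SPE_ChEMBL.txt", pass the file handle to `load_substrings`, and pass the result to `split_by_validity`; the lengths of the two returned lists give the valid and invalid counts. Substrings with an unclosed ring bond or unbalanced parentheses, such as `c1ccc`, are the kind that end up in the invalid list.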
Expected Behavior:
Ideally, the fragments generated by SPE should correspond to chemically meaningful and interpretable substructures, and they should be convertible into valid molecules when processed with RDKit's Chem.MolFromSmiles().
Screenshots for reference
Impact:
In my opinion, the lack of meaningful and interpretable fragments generated by SPE may limit its utility in various cheminformatics applications, including molecular generation and predictive modeling. It could also hinder the interpretability of deep learning models relying on SPE tokenization.
Additional Information:
Python version: 3.7.16