Data pipeline output is sometimes unreadable by RDKit #183
Comments
The problem is that SMILES has an unfortunate variety of number labelling (the labels are not principally for rings, although that is the typical use case): RDKit will accept all of these forms, and I think there is no canonical form.
Also, the %21 probably means that you have that many rings... On the other hand, the percentage of such SMILES is fairly low, in PubChem as well, I believe.
RDKit always parses %813 as [%81, 3] from what I can tell, so %2112 is equivalent to 12%21. The OpenSMILES specification (and the original Daylight specification, from what I can tell) disallows three-digit ring numbers, which should remove any ambiguity. Take the SMILES
Yes, it is not a huge problem in practice! It only affects certain SMILES for molecules with more than 10 rings. #184 should nevertheless fix the issue, small as it may be.
You have it backwards. Specifications do not matter. What matters is that the code can deal with number labels as they appear in the wild. The comment you have deleted in your PR actually hints at that. The code needs to
E.g.
Fair point! Both SMILES readers and writers are written with a specification in mind, though. The data pipeline uses RDKit and thus makes use of RDKit's SMILES syntax both when converting the source SMILES to a Mol object and when converting the processed Mol object to a SMILES string. Taking the example of Is there some additional reason for ring number reordering beyond proper tokenization that I am unaware of?
Ok, I just checked: the parsing is indeed regular in the sense that double-digit labels are preceded by "%" while single-digit labels must not be. I will fix this in the internal code because there are other things to clean up. Many thanks for the heads-up.
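For reference, the tokenization rule discussed in this thread ("%" followed by exactly two digits, otherwise one digit per label) can be sketched in a few lines of plain Python. This is only an illustration, not the pipeline's internal code: the names `RING_LABEL` and `split_ring_labels` are hypothetical, and the sketch assumes the input is just the run of labels that follows an atom. It does not handle bond symbols placed before a label or any toolkit-specific extensions.

```python
import re

# One ring-closure label: "%" followed by exactly two digits, or one bare digit.
# (Hypothetical helper names; not taken from the pipeline or from RDKit.)
RING_LABEL = re.compile(r"%(\d{2})|(\d)")


def split_ring_labels(run: str) -> list[int]:
    """Split a run of ring-closure labels such as "%2112" into integers."""
    labels = []
    pos = 0
    while pos < len(run):
        match = RING_LABEL.match(run, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos} in {run!r}")
        # Exactly one of the two groups is set for any successful match.
        labels.append(int(match.group(1) or match.group(2)))
        pos = match.end()
    return labels


print(split_ring_labels("%813"))   # [81, 3]
print(split_ring_labels("%2112"))  # [21, 1, 2]
```

Under this rule `%2112` can only mean labels 21, 1 and 2, which matches how RDKit is described as reading it above.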
The ring number reordering in the data pipeline occasionally breaks SMILES strings for molecules with many rings.
This SMILES...
COC1CC(=O)CC23C4=c5c6c7c8c9c%10c%11c%12c%13c%14c%15c%16c%17c%18c%19c%20c(c%21c%22c%23c(c5c5c6c6c8c%11c8c%11c%12c%12c%14c(c%15%19)c%14c%20c%22c%15c%19c%23c5c(c68)c%19c%11c%12c%14%15)C%2112)C%18C3C1(C4)C(OC)CC(=O)CC%171C(=C9C7)C%16C%10%13
...becomes this:
COC1CC(=O)CC23C4=c5c6c7c8c9c%10c%11c%12c%13c%14c%15c%16c%17c%18c%19c%20c(c%21c%22c%23c(c5c5c6c6c8c%11c8c%11c%12c%12c%14c(c%15%19)c%14c%20c%22c%15c%19c%23c5c(c68)c%19c%11c%12c%14%15)C%21%12)C%18C3C1(C4)C(OC)CC(=O)CC1%17C(=C9C7)C%16C%10%13
This occurs for 7 molecules in ChEMBL34, including the one above, when using RDKit 2024.09.4. The issue here is that %2112 is changed to %21%12 rather than 12%21.
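A minimal sketch of a reordering step that would avoid this, assuming the step only needs to sort the labels attached to a single atom: tokenize first, sort the integers, then write each label back with "%" only when it has two digits. The function names `format_ring_label` and `reorder_ring_labels` are hypothetical, and the actual pipeline code is not shown in this issue, so this illustrates the idea rather than the fix in #184. It also ignores bond symbols before labels and any effect that label order may have on stereochemistry.

```python
import re


def format_ring_label(n: int) -> str:
    """Write one ring-closure label: bare digit below 10, "%nn" otherwise."""
    if not 0 <= n <= 99:
        raise ValueError(f"ring label out of range: {n}")
    return str(n) if n < 10 else f"%{n}"


def reorder_ring_labels(run: str) -> str:
    """Sort the labels in a run such as "%2112" without merging digits."""
    tokens = re.findall(r"%\d{2}|\d", run)
    if "".join(tokens) != run:
        # Something other than ring-closure labels was in the input.
        raise ValueError(f"not a pure run of ring labels: {run!r}")
    labels = sorted(int(token.lstrip("%")) for token in tokens)
    return "".join(format_ring_label(n) for n in labels)


print(reorder_ring_labels("%2112"))  # 12%21
print(reorder_ring_labels("%171"))   # 1%17
```

Because the labels are split before sorting, the digits `1` and `2` after `%21` are never fused into `%12`.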