Data pipeline output is sometimes unreadable by RDKit #183

NicklasOsterbacka · 2025-01-17T15:52:49Z

The ring number reordering in the data pipeline occasionally breaks SMILES strings for molecules with many rings.

This SMILES...
COC1CC(=O)CC23C4=c5c6c7c8c9c%10c%11c%12c%13c%14c%15c%16c%17c%18c%19c%20c(c%21c%22c%23c(c5c5c6c6c8c%11c8c%11c%12c%12c%14c(c%15%19)c%14c%20c%22c%15c%19c%23c5c(c68)c%19c%11c%12c%14%15)C%2112)C%18C3C1(C4)C(OC)CC(=O)CC%171C(=C9C7)C%16C%10%13
...becomes this:
COC1CC(=O)CC23C4=c5c6c7c8c9c%10c%11c%12c%13c%14c%15c%16c%17c%18c%19c%20c(c%21c%22c%23c(c5c5c6c6c8c%11c8c%11c%12c%12c%14c(c%15%19)c%14c%20c%22c%15c%19c%23c5c(c68)c%19c%11c%12c%14%15)C%21%12)C%18C3C1(C4)C(OC)CC(=O)CC1%17C(=C9C7)C%16C%10%13

This occurs for 7 molecules in ChEMBL34, including the one above, when using RDKit 2024.09.4. The issue here is that %2112 is changed to %21%12 rather than 12%21.
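
A minimal sketch of the intended reordering, assuming ring-closure labels follow the OpenSMILES convention (a single digit, or `%` followed by exactly two digits) and that single-digit labels should come first. The name `reorder_ring_labels` is hypothetical, and this is not the pipeline's actual code. It also ignores bond symbols that may precede a label.

```python
import re

# One ring-closure label: "%" plus exactly two digits, or a single digit.
RING_LABEL = re.compile(r"%\d{2}|\d")


def reorder_ring_labels(run: str) -> str:
    """Move single-digit labels in front of %nn labels within one run of labels."""
    labels = RING_LABEL.findall(run)
    # sorted() is stable, so labels of the same kind keep their relative order.
    return "".join(sorted(labels, key=lambda label: label.startswith("%")))


print(reorder_ring_labels("%2112"))  # 12%21
print(reorder_ring_labels("%171"))   # 1%17
```

Reading `%2112` as `%21`, `1`, `2` keeps all three labels intact, whereas rewriting it as `%21%12` invents a label `%12` that is never closed.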


halx · 2025-01-17T17:01:34Z

The problem is that SMILES has an unfortunate variety of number labelling (the labels are not principally for rings, although that is the typical use case):
8%13 which can also be %813 or even %138 (I believe)
24%12 which can also be %2412 or %1224

RDKit will accept all these forms, and I think there is no canonical form.

halx · 2025-01-17T17:04:25Z

Also, the %21 probably means that you have that many rings... On the other hand, the percentage of such SMILES is fairly low, also in PubChem, I believe.

NicklasOsterbacka · 2025-01-17T17:26:36Z

RDKit always parses %813 as [%81, 3] from what I can tell, so %2112 is equivalent to 12%21. The OpenSMILES specification (and the original Daylight specification, from what I can tell) disallows three-digit ring numbers, which should remove any ambiguity.

Take the SMILES c12ccccc1cccc2 as an example. RDKit parses this renumbering just fine:
c%813ccccc%81cccc3
This, on the other hand, has an unclosed ring:
c%813ccccc%13cccc8
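
The two renumberings above can be checked without RDKit by counting ring-closure labels. This is a rough sketch under the assumption that labels are either a single digit or `%` plus exactly two digits. The helper name `unclosed_ring_labels` is made up for illustration, and it does not handle reuse of a label after its ring has been closed beyond simple parity.

```python
import re
from collections import Counter

# Bracket atoms such as [13C] contain digits that are not ring labels.
BRACKET_ATOM = re.compile(r"\[[^\]]*\]")
# One ring-closure label: "%" plus exactly two digits, or a single digit.
RING_LABEL = re.compile(r"%\d{2}|\d")


def unclosed_ring_labels(smiles: str) -> list[str]:
    """Return the labels that occur an odd number of times (never closed)."""
    stripped = BRACKET_ATOM.sub("", smiles)
    counts = Counter(RING_LABEL.findall(stripped))
    return sorted(label for label, n in counts.items() if n % 2)


print(unclosed_ring_labels("c12ccccc1cccc2"))      # []
print(unclosed_ring_labels("c%813ccccc%81cccc3"))  # []
print(unclosed_ring_labels("c%813ccccc%13cccc8"))  # ['%13', '%81', '3', '8']
```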

> Also, the %21 probably means that you have that many rings... On the other hand, the percentage of such SMILES is fairly low, also in PubChem, I believe.

Yes, it is not a huge problem in practice! It only affects certain SMILES for molecules with more than 10 rings. #184 should nevertheless fix the issue, small as it may be.

halx · 2025-01-18T07:58:48Z

You have it backwards. Specifications do not matter. What matters is that the code can deal with number labels as they appear in the wild. The comment you have deleted in your PR actually hints at that.

The code needs to:

1. Normalize the labels such that RDKit doesn't fall over
2. Extract tokens sensibly such that no extraneous tokens are generated

E.g. CC%139%2312%99 would need to be transformed to CC9%13%23%12%99 with tokens ["C", "9", "%13", "%23", "%12", "%99"]. The current code doesn't handle this yet, but your suggestion makes it even worse.

NicklasOsterbacka · 2025-01-18T14:14:38Z

Fair point! Both SMILES readers and writers are written with a specification in mind, though. The data pipeline uses RDKit and thus makes use of RDKit's SMILES syntax both when converting the source SMILES to a Mol object and when converting the processed Mol object to a SMILES string.

Taking the example of P123456CC1CC2CC3CC4CC5CC6 we can do the following renumbering:
P%139%2312%99CC%13CC9CC%23CC1CC2CC%99
The regular expression in my PR tokenizes this into ['C', 'P', '1', '2', '9', '%13', '%23', '%99']. Interpreting %2312 as %12%23 introduces unclosed rings, just as in the original example in this issue. (This is admittedly a bit of a contrived example. I don't think the molecule is very stable, and it would not make it through the data pipeline with a recent RDKit version due to valency constraints.)
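
As an illustration of that tokenization, here is a simplified tokenizer that only accepts two-digit `%` labels. The pattern is an assumption written for this example and is not the regular expression from the PR. It covers bracket atoms, `Br`, `Cl`, ring labels, and otherwise falls back to single characters.

```python
import re

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens, ring labels
# (%nn or one digit), then any other single character.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|\d|.")


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, left to right."""
    return TOKEN.findall(smiles)


smiles = "P%139%2312%99CC%13CC9CC%23CC1CC2CC%99"
tokens = tokenize(smiles)
print(tokens[:7])           # ['P', '%13', '9', '%23', '1', '2', '%99']
print(sorted(set(tokens)))  # ['%13', '%23', '%99', '1', '2', '9', 'C', 'P']
```

No `%12` token appears, because `%2312` is read as `%23`, `1`, `2`.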

Is there some additional reason for ring number reordering beyond proper tokenization that I am unaware of?

halx · 2025-01-26T12:43:05Z

OK, I just checked: the parsing is indeed regular in the sense that double digits are preceded by "%" while single digits must not be. I will fix this in the internal code because there are other things to clean up. Many thanks for the heads-up.

NicklasOsterbacka mentioned this issue Jan 17, 2025

Modified SMILES regex in data pipeline to disallow three-digit ring numbers. #184

Closed

halx closed this as completed Jan 31, 2025
