Skip to content

Commit

Permalink
Add test for organometallic pubchem structures;
Browse files Browse the repository at this point in the history
  • Loading branch information
fbaensch-beilstein committed Feb 13, 2025
1 parent 31b2ab3 commit f87f47d
Show file tree
Hide file tree
Showing 2 changed files with 72 additions and 0 deletions.
Binary file not shown.
72 changes: 72 additions & 0 deletions INCHI-1-TEST/tests/test_executable/test_organometallics_pubchem.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
import pytest
from pathlib import Path
from helpers import parse_inchi_from_executable_output
from sdf_pipeline.utils import read_records_from_gzipped_sdf
import tempfile


@pytest.fixture
def sdf_path():
return (
Path(__file__)
.parent.absolute()
.joinpath("data/organometallic_structures_pubchem.sdf.gz")
)


def test_organometallics_pubchem(sdf_path, run_inchi_exe):
exectuable_results = [
run_inchi_exe(mol, "-RecMet") for mol in read_records_from_gzipped_sdf(sdf_path)
]
inchis = [
parse_inchi_from_executable_output(result.output)
for result in exectuable_results
]
assert all(inchis)


@pytest.mark.skip(reason="used to get statistics and insights")
def test_organometallics_pubchem_statistics(sdf_path, run_inchi_exe):
error_ids = []
inchis = {}
inchi_ids = []
error_mols = []
errors = {}
mol_counter = 0

for mol in read_records_from_gzipped_sdf(sdf_path):
mol_counter += 1
result = run_inchi_exe(mol, "-RecMet")
inchi = parse_inchi_from_executable_output(result.output)

id = mol.split("\n")[0]
if inchi == "":
error_ids.append(id)
error_mols.append(mol)
for line in result.stderr.splitlines():
if line.startswith("Error"):
errors[id] = line
else:
inchi_ids.append(id)
inchis[id] = inchi

with open(f"{tempfile.gettempdir()}/inchis.txt", "w") as f:
for id, inchi in inchis.items():
f.write(f"{id}: {inchi}\n")

with open(f"{tempfile.gettempdir()}/inchi_ids.txt", "w") as f:
for id in inchi_ids:
f.write(f"{id}\n")

with open(f"{tempfile.gettempdir()}/error_ids.txt", "w") as f:
for id in error_ids:
f.write(f"{id}\n")

with open(f"{tempfile.gettempdir()}/error_mols.sdf", "w") as f:
for mol in error_mols:
f.write(f"{mol}$$$$\n")

with open(f"{tempfile.gettempdir()}/errors.txt", "w") as f:
for id, error in errors.items():
f.write(f"{id}: {error}\n")
print(f"{len(inchis)}/{mol_counter} InChIs converted.")

15 comments on commit f87f47d

@schatzsc
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fbaensch-beilstein Where are these structures from? Unfortunately, most of the organometal and coordination compounds in PubChem are total nonsense due to the "disconnected metal" approach.

See for example "ferrocene" which isn't really ferrocene but a combination of two cyclopentadienyl anions and an iron(II) cation.

Even worse is 'cisplatin' which is listed as a synonym for transplatin, although these are distinct compounds which for example are strongly different in their biological activity (cisplatin is a multi-billion dollar anticancer drug while transplatin is totally inactive)

@fbaensch-beilstein
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schatzsc The structures are "PubChem substances" not compounds. Compounds are PubChem wise standardized and normalized in way that they produce a standard InChI. That is why there are no metal bonds in compounds. The substances have more raw data like character. So if you want to search for organometallics with metal bonds in PubChem search for substances not compound.

@schatzsc
Copy link

@schatzsc schatzsc commented on f87f47d Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fbaensch-beilstein Interesting - I was not aware of the fact that PubChem makes this distinction between "compounds" and "substances" so far.

Still, although the "substance entry" for ferrocene at least shows the correct structure with 10 Fe-C bonds as CID 11985121 at the top of the list now, there are also in the "compounds list" many other structures that are definitely not ferrocene (= bis(η5-cyclopentadienyl)iron(II) ) but rather for example bis(η1-cyclopentadienyl)iron(II), see the last entry in the screenshot below:

Image

And the "substance entry" for "cisplatin" is definitely wrong, since it also shows transplatin:

Image

Where and how were the "InChI test/reference structures" (= molfiles) collected? Because if the InChI is validated against the incorrect structures, then of course also the output will be incorrect.

@fbaensch-beilstein
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schatzsc

@fbaensch-beilstein Interesting - I was not aware of the fact that PubChem makes this distinction between "compounds" and "substances" so far.

Here you can find more information about the differences of compounds and substances in PubChem: https://pubchem.ncbi.nlm.nih.gov/docs/compound-vs-substance

[...] many other structures that are definitely not ferrocene (= bis(η5-cyclopentadienyl)iron(II) ) but rather for example bis(η1-cyclopentadienyl)iron(II), see the last entry in the screenshot below:

If I remember correctly, this is a legacy way of representing ferrocene (please ask @gblanke02 for further information). But please feel free to report this to PubChem.

And the "substance entry" for "cisplatin" is definitely wrong, since it also shows transplatin:

Again, please feel free to report this to PubChem.

Where and how were the "InChI test/reference structures" (= molfiles) collected? Because if the InChI is validated against the incorrect structures, then of course also the output will be incorrect.

The structures are extracted out of an "old" prototype by Alex Clark. Good reads about it in his blog:

The structures were obtained from the PubChem-selection. For this purpose, the CIDs were extracted, the SIDs (several per CID possible) were collected and the substances were downloaded from PubChem via the SIDs. All with the help of a "small" python script that uses the pubchempy library.

Anyways, you are right, before we use this as a test set, we need to validate the structures. At the moment this is just a 'placeholder', to store the structures and it is only tested if an InChI can be created not the actual InChI.

@schatzsc
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fbaensch-beilstein

The structures are extracted out of an "old" prototype by Alex Clark

That's the collection with the non-standard format, right?!?

The structures were obtained from the PubChem-selection. For this purpose, the CIDs were extracted, the SIDs (several per CID possible) were collected and the substances were downloaded from PubChem via the SIDs. All with the help of a "small" python script that uses the pubchempy library.

Anyways, you are right, before we use this as a test set, we need to validate the structures. At the moment this is just a 'placeholder', to store the structures and it is only tested if an InChI can be created not the actual InChI.

Is there a "converted-to-molfile" collection of this somewhere? Or alternatively at least the conversion script?

@fbaensch-beilstein
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schatzsc

That's the collection with the non-standard format, right?!?

What do you mean with non-standard format?

Is there a "converted-to-molfile" collection of this somewhere? Or alternatively at least the conversion script?

You can download it here: https://github.com/IUPAC-InChI/InChI/blob/rwth/INCHI-1-TEST/tests/test_executable/data/organometallic_structures_pubchem.sdf.gz
It is a zipped sdf with the SID structures.

@schatzsc
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fbaensch-beilstein

That's the collection with the non-standard format, right?!?

What do you mean with non-standard format?

non-molfile

If I recall correctly, this collection was in a special format that is not routinely/easily converted to molfile

Is there a "converted-to-molfile" collection of this somewhere? Or alternatively at least the conversion script?

You can download it here: https://github.com/IUPAC-InChI/InChI/blob/rwth/INCHI-1-TEST/tests/test_executable/data/organometallic_structures_pubchem.sdf.gz It is a zipped sdf with the SID structures.

Thanks - can download and open it with ChemDraw in absence of a better viewer.

Not sure if this is from the multiple conversions (original format -> molfile -> ChemDraw import) or there in the original file, but it looks like utter nonsense - can only roughly assume that this is supposed to be homobimetallic chlorido-bridged [(h3-allyl)Pd(µ-Cl)2Pd(h3-allyl)]):

Image

@schatzsc
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As there are only 187 structures in the sdf file - want me to re-draw them to a proper representation and export as individual molfiles ?!?

@fbaensch-beilstein
Copy link
Collaborator Author

@fbaensch-beilstein fbaensch-beilstein commented on f87f47d Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schatzsc

non-molfile

If I recall correctly, this collection was in a special format that is not routinely/easily converted to molfile

Exactly, but the structures in the sdf are directly downloaded from PubChem.

Not sure if this is from the multiple conversions (original format -> molfile -> ChemDraw import) or there in the original file, but it looks like utter nonsense - can only roughly assume that this is supposed to be homobimetallic chlorido-bridged [(h3-allyl)Pd(µ-Cl)2Pd(h3-allyl)]):

Those are original files from PubChem, maveb the ChemDraw import mixed up the coordinates, but not the bonds.

As there are only 187 structures in the sdf file - want me to re-draw them to a proper representation and export as individual molfiles ?!?

It would be awesome to have at least one correct structure for each entry. Thanks a lot!
You can export them as individual molfiles or all in one sdf, whatever is more convenient for you.

@schatzsc
Copy link

@schatzsc schatzsc commented on f87f47d Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be awesome to have at least one correct structure for each entry. Thanks a lot! You can export them as individual molfiles or all in one sdf, whatever is more convenient for you.

What nomenclature for the filenames to use?

Simply call them molfile1.mol to molfile187.mol or some indicator of the structure such as Pd-allyl-dimer.mol ?

Below an example for the Pd allyl complex also mentioned above which I would consider correct within the limits of the current InChI (= metal-ligand bond disconnection).

The delocalized nature of the allyl anion cannot be represented, as rings and chains with an uneven number of atoms fail to give alternating double and single bonds, but setting the valence of the allyl carbon atoms to four fixes potential problems (as it is essentially equivalent to the "simple bonds" = lines only represent bonding interactions but are not considered in implicit hydrogen atom calculations of TUCAN). Also note that no local charges have to be assigned to any of the atoms when using this approach:

Image

@fbaensch-beilstein
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schatzsc I apologize for the late reply, I had an extended weekend.
You can keep the nomenclature as you like.
Please draw the structures as ‘correctly’ as possible, i.e. also with haptic bonds.

@schatzsc
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fbaensch-beilstein What do you mean with "haptic bonds" - multi-attachment instead of explicitly drawn single bonds?!?

@fbaensch-beilstein
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schatzsc yes, would be nice to have both versions - with multi-attachment and the single bonds version

@schatzsc
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fbaensch-beilstein Is there a place to upload the molfiles here to the repro or would you rather prefer to receive all of them together in a *.zip file?

@fbaensch-beilstein
Copy link
Collaborator Author

@fbaensch-beilstein fbaensch-beilstein commented on f87f47d Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schatzsc I think the easiest way is you send me a zip file, so I can arrange the files for our tests.

Please sign in to comment.