-
Notifications
You must be signed in to change notification settings - Fork 13
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add test for organometallic pubchem structures;
- Loading branch information
1 parent
31b2ab3
commit f87f47d
Showing
2 changed files
with
72 additions
and
0 deletions.
There are no files selected for viewing
Binary file added
BIN
+85.6 KB
INCHI-1-TEST/tests/test_executable/data/organometallic_structures_pubchem.sdf.gz
Binary file not shown.
72 changes: 72 additions & 0 deletions
72
INCHI-1-TEST/tests/test_executable/test_organometallics_pubchem.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
import pytest | ||
from pathlib import Path | ||
from helpers import parse_inchi_from_executable_output | ||
from sdf_pipeline.utils import read_records_from_gzipped_sdf | ||
import tempfile | ||
|
||
|
||
@pytest.fixture | ||
def sdf_path(): | ||
return ( | ||
Path(__file__) | ||
.parent.absolute() | ||
.joinpath("data/organometallic_structures_pubchem.sdf.gz") | ||
) | ||
|
||
|
||
def test_organometallics_pubchem(sdf_path, run_inchi_exe): | ||
exectuable_results = [ | ||
run_inchi_exe(mol, "-RecMet") for mol in read_records_from_gzipped_sdf(sdf_path) | ||
] | ||
inchis = [ | ||
parse_inchi_from_executable_output(result.output) | ||
for result in exectuable_results | ||
] | ||
assert all(inchis) | ||
|
||
|
||
@pytest.mark.skip(reason="used to get statistics and insights") | ||
def test_organometallics_pubchem_statistics(sdf_path, run_inchi_exe): | ||
error_ids = [] | ||
inchis = {} | ||
inchi_ids = [] | ||
error_mols = [] | ||
errors = {} | ||
mol_counter = 0 | ||
|
||
for mol in read_records_from_gzipped_sdf(sdf_path): | ||
mol_counter += 1 | ||
result = run_inchi_exe(mol, "-RecMet") | ||
inchi = parse_inchi_from_executable_output(result.output) | ||
|
||
id = mol.split("\n")[0] | ||
if inchi == "": | ||
error_ids.append(id) | ||
error_mols.append(mol) | ||
for line in result.stderr.splitlines(): | ||
if line.startswith("Error"): | ||
errors[id] = line | ||
else: | ||
inchi_ids.append(id) | ||
inchis[id] = inchi | ||
|
||
with open(f"{tempfile.gettempdir()}/inchis.txt", "w") as f: | ||
for id, inchi in inchis.items(): | ||
f.write(f"{id}: {inchi}\n") | ||
|
||
with open(f"{tempfile.gettempdir()}/inchi_ids.txt", "w") as f: | ||
for id in inchi_ids: | ||
f.write(f"{id}\n") | ||
|
||
with open(f"{tempfile.gettempdir()}/error_ids.txt", "w") as f: | ||
for id in error_ids: | ||
f.write(f"{id}\n") | ||
|
||
with open(f"{tempfile.gettempdir()}/error_mols.sdf", "w") as f: | ||
for mol in error_mols: | ||
f.write(f"{mol}$$$$\n") | ||
|
||
with open(f"{tempfile.gettempdir()}/errors.txt", "w") as f: | ||
for id, error in errors.items(): | ||
f.write(f"{id}: {error}\n") | ||
print(f"{len(inchis)}/{mol_counter} InChIs converted.") |
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fbaensch-beilstein Where are these structures from? Unfortunately, most of the organometal and coordination compounds in PubChem are total nonsense due to the "disconnected metal" approach.
See for example "ferrocene" which isn't really ferrocene but a combination of two cyclopentadienyl anions and an iron(II) cation.
Even worse is 'cisplatin' which is listed as a synonym for transplatin, although these are distinct compounds which for example are strongly different in their biological activity (cisplatin is a multi-billion dollar anticancer drug while transplatin is totally inactive)
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schatzsc The structures are "PubChem substances" not compounds. Compounds are PubChem wise standardized and normalized in way that they produce a standard InChI. That is why there are no metal bonds in compounds. The substances have more raw data like character. So if you want to search for organometallics with metal bonds in PubChem search for substances not compound.
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fbaensch-beilstein Interesting - I was not aware of the fact that PubChem makes this distinction between "compounds" and "substances" so far.
Still, although the "substance entry" for ferrocene at least shows the correct structure with 10 Fe-C bonds as CID 11985121 at the top of the list now, there are also in the "compounds list" many other structures that are definitely not ferrocene (= bis(η5-cyclopentadienyl)iron(II) ) but rather for example bis(η1-cyclopentadienyl)iron(II), see the last entry in the screenshot below:
And the "substance entry" for "cisplatin" is definitely wrong, since it also shows transplatin:
Where and how were the "InChI test/reference structures" (= molfiles) collected? Because if the InChI is validated against the incorrect structures, then of course also the output will be incorrect.
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schatzsc
Here you can find more information about the differences of compounds and substances in PubChem: https://pubchem.ncbi.nlm.nih.gov/docs/compound-vs-substance
If I remember correctly, this is a legacy way of representing ferrocene (please ask @gblanke02 for further information). But please feel free to report this to PubChem.
Again, please feel free to report this to PubChem.
The structures are extracted out of an "old" prototype by Alex Clark. Good reads about it in his blog:
The structures were obtained from the PubChem-selection. For this purpose, the CIDs were extracted, the SIDs (several per CID possible) were collected and the substances were downloaded from PubChem via the SIDs. All with the help of a "small" python script that uses the
pubchempy
library.Anyways, you are right, before we use this as a test set, we need to validate the structures. At the moment this is just a 'placeholder', to store the structures and it is only tested if an InChI can be created not the actual InChI.
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fbaensch-beilstein
That's the collection with the non-standard format, right?!?
Is there a "converted-to-molfile" collection of this somewhere? Or alternatively at least the conversion script?
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schatzsc
What do you mean with non-standard format?
You can download it here: https://github.com/IUPAC-InChI/InChI/blob/rwth/INCHI-1-TEST/tests/test_executable/data/organometallic_structures_pubchem.sdf.gz
It is a zipped sdf with the SID structures.
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fbaensch-beilstein
non-molfile
If I recall correctly, this collection was in a special format that is not routinely/easily converted to molfile
Thanks - can download and open it with ChemDraw in absence of a better viewer.
Not sure if this is from the multiple conversions (original format -> molfile -> ChemDraw import) or there in the original file, but it looks like utter nonsense - can only roughly assume that this is supposed to be homobimetallic chlorido-bridged [(h3-allyl)Pd(µ-Cl)2Pd(h3-allyl)]):
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As there are only 187 structures in the sdf file - want me to re-draw them to a proper representation and export as individual molfiles ?!?
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schatzsc
Exactly, but the structures in the sdf are directly downloaded from PubChem.
Those are original files from PubChem, maveb the ChemDraw import mixed up the coordinates, but not the bonds.
It would be awesome to have at least one correct structure for each entry. Thanks a lot!
You can export them as individual molfiles or all in one sdf, whatever is more convenient for you.
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What nomenclature for the filenames to use?
Simply call them
molfile1.mol
tomolfile187.mol
or some indicator of the structure such asPd-allyl-dimer.mol
?Below an example for the Pd allyl complex also mentioned above which I would consider correct within the limits of the current InChI (= metal-ligand bond disconnection).
The delocalized nature of the allyl anion cannot be represented, as rings and chains with an uneven number of atoms fail to give alternating double and single bonds, but setting the valence of the allyl carbon atoms to four fixes potential problems (as it is essentially equivalent to the "simple bonds" = lines only represent bonding interactions but are not considered in implicit hydrogen atom calculations of TUCAN). Also note that no local charges have to be assigned to any of the atoms when using this approach:
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schatzsc I apologize for the late reply, I had an extended weekend.
You can keep the nomenclature as you like.
Please draw the structures as ‘correctly’ as possible, i.e. also with haptic bonds.
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fbaensch-beilstein What do you mean with "haptic bonds" - multi-attachment instead of explicitly drawn single bonds?!?
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schatzsc yes, would be nice to have both versions - with multi-attachment and the single bonds version
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fbaensch-beilstein Is there a place to upload the molfiles here to the repro or would you rather prefer to receive all of them together in a *.zip file?
f87f47d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schatzsc I think the easiest way is you send me a zip file, so I can arrange the files for our tests.