Clarify that there are actually 301 unique compounds, not 306 #9

shntnu · 2021-01-08T10:59:32Z

These 3 compounds are duplicated (but with different broad_sample values):

pert_iname	n
BVT-948	2
dexamethasone	2
thiostrepton	2


niranjchandrasekaran · 2021-01-08T15:47:01Z

In the case of BVT-948 (44483339 and 6604934) and thiostrepton (16154490 and 101290202), it looks like they may be different molecules with the same chemical name (they have different IUPAC names or isomeric SMILES).

In the case of Dexamethasone, I am not sure how the two are different though there are two different entries on PubChem (5743 and 5702035).

shntnu · 2021-05-01T17:36:28Z

My conclusion here is that, given that they have the same molecular formula, we should treat them as similar enough to always keep them in the same data split.

E.g. below, it is not ok to have the pair BRD-A22713669:PTPN1 in training and the pair BRD-K68567222:PTPN1 in testing.

With the fix in #17, pert_iname is now a sufficient and recommended identifier for compounds when doing train-test splits. Using broad_sample (or its first 13 characters, which is pert_id) may result in bad data splits with respect to these 3 compounds.

pert_iname	pert_id	target
BVT-948	BRD-A22713669	PTPN1
BVT-948	BRD-A22713669	PTPN11
BVT-948	BRD-A22713669	PTPN2
BVT-948	BRD-K68567222	PTPN1
BVT-948	BRD-K68567222	PTPN11
BVT-948	BRD-K68567222	PTPN2
dexamethasone	BRD-A10188456	ANXA1
dexamethasone	BRD-A10188456	NOS2
dexamethasone	BRD-A10188456	NR0B1
dexamethasone	BRD-A10188456	NR3C1
dexamethasone	BRD-A10188456	NR3C2
dexamethasone	BRD-K38775274	ANXA1
dexamethasone	BRD-K38775274	NOS2
dexamethasone	BRD-K38775274	NR0B1
dexamethasone	BRD-K38775274	NR3C1
dexamethasone	BRD-K38775274	NR3C2
thiostrepton	BRD-A20697603	FOXM1
thiostrepton	BRD-K58049875	FOXM1

data.frame(pert_iname = c("dexamethasone", "BVT-948", "thiostrepton")) %>%
  inner_join(drug_target_samples) %>%
  select(pert_iname, pert_id, target) %>%
  distinct(pert_iname, pert_id, target) %>%
  arrange(pert_iname, pert_id) %>%
  knitr::kable()
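
One way to keep both pert_id values of a compound in the same split is to derive the split from pert_iname alone. The following is a minimal sketch, not the method used in this repository: `assign_split` and its `test_fraction` parameter are hypothetical names, and the hash-based assignment is just one simple deterministic choice.

```python
import hashlib


def assign_split(pert_iname, test_fraction=0.2):
    """Deterministically map a compound name to 'train' or 'test' (hypothetical helper)."""
    digest = hashlib.sha256(pert_iname.encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "test" if bucket < test_fraction else "train"


# A few (pert_iname, pert_id, target) rows copied from the table above.
rows = [
    ("BVT-948", "BRD-A22713669", "PTPN1"),
    ("BVT-948", "BRD-K68567222", "PTPN1"),
    ("dexamethasone", "BRD-A10188456", "NR3C1"),
    ("dexamethasone", "BRD-K38775274", "NR3C1"),
    ("thiostrepton", "BRD-A20697603", "FOXM1"),
    ("thiostrepton", "BRD-K58049875", "FOXM1"),
]

# The split depends only on pert_iname, so both pert_id values of a
# duplicated compound always land on the same side.
split_by_pert_id = {pert_id: assign_split(name) for name, pert_id, _ in rows}

splits_per_name = {}
for name, pert_id, _ in rows:
    splits_per_name.setdefault(name, set()).add(split_by_pert_id[pert_id])

for name, splits in splits_per_name.items():
    print(name, sorted(splits))
```

Each compound name prints with exactly one split label, which is the property the comment above asks for. Splitting on pert_id instead would give no such guarantee for these 3 compounds.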

shntnu · 2021-08-19T15:53:19Z

We need to figure this out so that others trying to create these plates know what to pick

Here's the metadata again for these compounds

broad_sample	pert_iname	pubchem_cid	target	InChIKey	smiles
BRD-A22713669-001-04-3	BVT-948	44483339	PTPN2	AJVXVYTVAAWZAP-UHFFFAOYSA-N	`CC1(C)C2C(=NC1=O)c1ccccc1C(=O)C2=O`
BRD-K68567222-001-01-2	BVT-948	6604934	PTPN2	LLPBUXODFQZPFH-UHFFFAOYSA-N	`CC1(C)C(=O)NC2=C1C(=O)C(=O)c1ccccc21`
BRD-K38775274-001-22-1	dexamethasone	5743	ANXA1	UREBDLICKHMUKA-CXSFZGCWSA-N	`C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO`
BRD-A10188456-001-04-9	dexamethasone	5702035	ANXA1	UREBDLICKHMUKA-QCYOSJOCSA-N	`C[C@@H]1CC2C3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO`
BRD-K58049875-001-03-9	thiostrepton	16154490	FOXM1	NSFFHOGKXHRQEW-AIHSUZKVSA-N	`CC[C@H](C)[C@@H]1N[C@@H]2C=Cc3c(cc(nc3[C@H]2O)C(=O)O[C@H](C)[C@@H]2NC(=O)c3csc(n3)[C@@H](NC(=O)[C@H]3CSC(=N3)\C(NC(=O)[C@@H](NC(=O)c3csc(n3)[C@]3(CCC(=N[C@@H]3c3csc2n3)c2nc(cs2)C(=O)NC(=C)C(=O)NC(=C)C(N)=O)NC(=O)[C@H](C)NC(=O)C(=C)NC(=O)[C@H](C)NC1=O)[C@@H](C)O)=C\C)[C@](C)(O)[C@@H](C)O)[C@H](C)O`
BRD-A20697603-001-07-2	thiostrepton	101290202	FOXM1	NSFFHOGKXHRQEW-DVRIZHICSA-N	`CCC(C)C1NC2C=Cc3c(cc(nc3C2O)C(=O)OC(C)C2NC(=O)c3csc(n3)C(NC(=O)C3CSC(=N3)\C(NC(=O)C(NC(=O)c3csc(n3)C3(CCC(=NC3c3csc2n3)c2nc(cs2)C(=O)NC(=C)C(=O)NC(=C)C(N)=O)NC(=O)C(C)NC(=O)C(=C)NC(=O)C(C)NC1=O)C(C)O)=C/C)C(C)(O)C(C)O)C(C)O`

@niranjchandrasekaran said

In the case of BVT-948 (44483339 and 6604934) and thiostrepton (16154490 and 101290202), it looks like they may be different molecules with the same chemical name (they have different IUPAC names or isomeric SMILES).

In the case of Dexamethasone, I am not sure how the two are different though there are two different entries on PubChem (5743 and 5702035).
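
A quick way to see how the pairs differ is to compare only the first block of each InChIKey, which hashes the connectivity part of the structure, while the second block covers layers such as stereochemistry. The sketch below uses only the InChIKeys from the metadata table above; `connectivity_block` is a hypothetical helper name.

```python
from collections import defaultdict

# (broad_sample, pert_iname, InChIKey) copied from the metadata table above.
records = [
    ("BRD-A22713669-001-04-3", "BVT-948", "AJVXVYTVAAWZAP-UHFFFAOYSA-N"),
    ("BRD-K68567222-001-01-2", "BVT-948", "LLPBUXODFQZPFH-UHFFFAOYSA-N"),
    ("BRD-K38775274-001-22-1", "dexamethasone", "UREBDLICKHMUKA-CXSFZGCWSA-N"),
    ("BRD-A10188456-001-04-9", "dexamethasone", "UREBDLICKHMUKA-QCYOSJOCSA-N"),
    ("BRD-K58049875-001-03-9", "thiostrepton", "NSFFHOGKXHRQEW-AIHSUZKVSA-N"),
    ("BRD-A20697603-001-07-2", "thiostrepton", "NSFFHOGKXHRQEW-DVRIZHICSA-N"),
]


def connectivity_block(inchikey):
    """Return the first (14-character) block of an InChIKey."""
    return inchikey.split("-")[0]


blocks_per_name = defaultdict(set)
for _, pert_iname, inchikey in records:
    blocks_per_name[pert_iname].add(connectivity_block(inchikey))

for name, blocks in blocks_per_name.items():
    verdict = "same first block" if len(blocks) == 1 else "different first block"
    print(f"{name}: {verdict} ({len(blocks)} distinct)")
```

With these keys, the dexamethasone and thiostrepton pairs share a first block and differ only in the second, whereas the two BVT-948 entries already differ in the first block, so a comparison on InChIKey prefixes alone would not merge them.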

In a separate conversation, we had some confusion about dexamethasone

https://github.com/jump-cellpainting/normalization/issues/3#issuecomment-824366920

Side note: I'm not sure how we ended up with these duplicates, but I think it was because we were using something other than pert_iname as identifiers when selecting (which is fine because it's <1%).

So the main question is: what should we recommend to people trying to recreate this plate? I'll tag @dkuhn in case he has some insights given his comment in #24

shntnu · 2023-10-31T14:59:44Z

I used https://github.com/jump-cellpainting/compound-annotator/blob/79596e411fc119b2b71970a8b698ce33658f55a6/StandardizeMolecule.py below

Prep data

wget https://raw.githubusercontent.com/jump-cellpainting/JUMP-Target/6db6edbc2360c2bb5dfd2d082459acd6055d209c/JUMP-Target-2_compound_metadata.tsv

grep -E 'thiostrepton|dexamethasone|BVT-948|pert_iname' ~/Downloads/JUMP-Target-2_compound_metadata.tsv  > ~/Downloads/JUMP-Target-2_compound_metadata-only-dups.tsv

Run JUMP standardizer

python StandardizeMolecule.py \
  run --num_cpu=8 \
  --input=/Users/shsingh/Downloads/JUMP-Target-2_compound_metadata-only-dups.tsv \
  --output=/Users/shsingh/Downloads/JUMP-Target-2_compound_metadata-only-dups-std.tsv

Output:

Tautomer enumeration stopped at maximum 1000
Tautomer enumeration stopped at maximum 1000
Tautomer enumeration stopped at maximum 1000
Tautomer enumeration stopped at maximum 1000
Tautomer enumeration stopped at maximum 1000
Tautomer enumeration stopped at maximum 1000
Tautomer enumeration stopped at maximum 1000
Tautomer enumeration stopped at maximum 1000
Tautomer enumeration stopped at maximum 1000
Tautomer enumeration stopped at maximum 1000
6/6 [04:11<00:00, 41.93s/it]
/Users/shsingh/Downloads/JUMP-Target-2_compound_metadata-only-dups-std.tsv

Original SMILES are distinct

cat /Users/shsingh/Downloads/JUMP-Target-2_compound_metadata-only-dups-std.tsv|csvcut -c SMILES_original|tail -n +2|sort|uniq -c
   1 CC1(C)C(=O)NC2=C1C(=O)C(=O)c1ccccc21
   1 CC1(C)C2C(=NC1=O)c1ccccc1C(=O)C2=O
   1 CCC(C)C1NC2C=Cc3c(cc(nc3C2O)C(=O)OC(C)C2NC(=O)c3csc(n3)C(NC(=O)C3CSC(=N3)\C(NC(=O)C(NC(=O)c3csc(n3)C3(CCC(=NC3c3csc2n3)c2nc(cs2)C(=O)NC(=C)C(=O)NC(=C)C(N)=O)NC(=O)C(C)NC(=O)C(=C)NC(=O)C(C)NC1=O)C(C)O)=C/C)C(C)(O)C(C)O)C(C)O
   1 CC[C@H](C)[C@@H]1N[C@@H]2C=Cc3c(cc(nc3[C@H]2O)C(=O)O[C@H](C)[C@@H]2NC(=O)c3csc(n3)[C@@H](NC(=O)[C@H]3CSC(=N3)\C(NC(=O)[C@@H](NC(=O)c3csc(n3)[C@]3(CCC(=N[C@@H]3c3csc2n3)c2nc(cs2)C(=O)NC(=C)C(=O)NC(=C)C(N)=O)NC(=O)[C@H](C)NC(=O)C(=C)NC(=O)[C@H](C)NC1=O)[C@@H](C)O)=C\C)[C@](C)(O)[C@@H](C)O)[C@H](C)O
   1 C[C@@H]1CC2C3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO
   1 C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO

Standardized SMILES reveal "duplicates"

cat /Users/shsingh/Downloads/JUMP-Target-2_compound_metadata-only-dups-std.tsv|csvcut -c SMILES_standardized|tail -n +2|sort|uniq -c
   2 C=C(NC(=O)C(=C)NC(=O)c1csc(C2=NC3c4csc(n4)C4NC(=O)c5csc(n5)C(C(C)(O)C(C)O)NC(=O)C5CSC(=N5)C(=CC)NC(=O)C(C(C)O)NC(=O)c5csc(n5)C3(CC2)NC(=O)C(C)NC(=O)C(=C)NC(=O)C(C)NC(=O)C(C(C)CC)Nc2ccc3c(c2O)NC(=CC3C(C)O)C(=O)OC4C)n1)C(N)=O
   2 CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21
   2 CC1CC2C3CC=C4CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)CO
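
The same `sort | uniq -c` check can be sketched in Python with the standard library. This is only an illustration: `count_column_values` is a hypothetical helper, and the toy TSV below pairs each broad_sample with a standardized SMILES by inference from the output above (the real file has more columns).

```python
import csv
from collections import Counter
from io import StringIO


def count_column_values(tsv_text, column):
    """Count how often each value occurs in one column of a TSV string."""
    reader = csv.DictReader(StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader)


# Toy excerpt; the sample-to-SMILES pairing is an assumption for illustration.
toy_tsv = "\n".join(
    [
        "broad_sample\tSMILES_standardized",
        "BRD-A22713669-001-04-3\tCC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21",
        "BRD-K68567222-001-01-2\tCC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21",
        "BRD-K38775274-001-22-1\tCC1CC2C3CC=C4CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)CO",
        "BRD-A10188456-001-04-9\tCC1CC2C3CC=C4CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)CO",
    ]
)

counts = count_column_values(toy_tsv, "SMILES_standardized")
n_samples = sum(counts.values())
n_unique = len(counts)
print(f"{n_samples} samples, {n_unique} unique standardized structures")
```

Counting distinct standardized SMILES rather than distinct broad_sample values is what turns a sample count into a count of unique compounds.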

David-Araripe · 2023-10-31T15:43:06Z

The first pair of compounds (BVT-948) appears to be a pair of tautomers (PubChem CIDs are used as molecule labels):

The second pair (dexamethasone) differs in one stereocenter (stated vs. not shown):

In the third pair (thiostrepton), one SMILES has stereochemistry and the other does not:

Standardizing compounds with the ChEMBL standardizer didn't change the structures either, even after trying to get the parent structures for each of the compounds.

To reproduce the analysis:

from io import StringIO

import pandas as pd
from chembl_structure_pipeline.standardizer import get_parent_mol, standardize_mol
from IPython.display import display
from pubchempy import get_compounds
from rdkit import Chem
from rdkit.Chem import Draw

data = """broad_sample	pert_iname	pubchem_cid	target	InChIKey	smiles
BRD-A22713669-001-04-3	BVT-948	44483339	PTPN2	AJVXVYTVAAWZAP-UHFFFAOYSA-N	CC1(C)C2C(=NC1=O)c1ccccc1C(=O)C2=O
BRD-K68567222-001-01-2	BVT-948	6604934	PTPN2	LLPBUXODFQZPFH-UHFFFAOYSA-N	CC1(C)C(=O)NC2=C1C(=O)C(=O)c1ccccc21
BRD-K38775274-001-22-1	dexamethasone	5743	ANXA1	UREBDLICKHMUKA-CXSFZGCWSA-N	C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO
BRD-A10188456-001-04-9	dexamethasone	5702035	ANXA1	UREBDLICKHMUKA-QCYOSJOCSA-N	C[C@@H]1CC2C3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO
BRD-K58049875-001-03-9	thiostrepton	16154490	FOXM1	NSFFHOGKXHRQEW-AIHSUZKVSA-N	CC[C@H](C)[C@@H]1N[C@@H]2C=Cc3c(cc(nc3[C@H]2O)C(=O)O[C@H](C)[C@@H]2NC(=O)c3csc(n3)[C@@H](NC(=O)[C@H]3CSC(=N3)\C(NC(=O)[C@@H](NC(=O)c3csc(n3)[C@]3(CCC(=N[C@@H]3c3csc2n3)c2nc(cs2)C(=O)NC(=C)C(=O)NC(=C)C(N)=O)NC(=O)[C@H](C)NC(=O)C(=C)NC(=O)[C@H](C)NC1=O)[C@@H](C)O)=C\C)[C@](C)(O)[C@@H](C)O)[C@H](C)O
BRD-A20697603-001-07-2	thiostrepton	101290202	FOXM1	NSFFHOGKXHRQEW-DVRIZHICSA-N	CCC(C)C1NC2C=Cc3c(cc(nc3C2O)C(=O)OC(C)C2NC(=O)c3csc(n3)C(NC(=O)C3CSC(=N3)\C(NC(=O)C(NC(=O)c3csc(n3)C3(CCC(=NC3c3csc2n3)c2nc(cs2)C(=O)NC(=C)C(=O)NC(=C)C(N)=O)NC(=O)C(C)NC(=O)C(=C)NC(=O)C(C)NC1=O)C(C)O)=C/C)C(C)(O)C(C)O)C(C)O"""

def get_inchi(smi):
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToInchiKey(mol)

def get_info_from_pubchem(cid):
    cpds = get_compounds(cid)
    assert len(cpds) == 1, "More than one compound found"
    return cpds[0].to_dict(
        properties=["isomeric_smiles", "inchikey", "iupac_name", "synonyms"]
    )

def standardize_smiles(smi):
    mol = Chem.MolFromSmiles(smi)
    parent_mol = get_parent_mol(mol)[0]
    mol = standardize_mol(parent_mol, sanitize=True)
    return Chem.MolToSmiles(mol, isomericSmiles=True, canonical=True)

df = pd.read_csv(StringIO(data), sep="\t").assign(
    rdInChIKey=lambda x: x["smiles"].apply(get_inchi),
    std_smiles=lambda x: x["smiles"].apply(standardize_smiles),
)
pubchem_info = df.apply(
    lambda x: get_info_from_pubchem(x["pubchem_cid"]), axis=1, result_type="expand"
)
pubchem_info = pubchem_info.rename(
    columns={c: f"pubchem_{c}" for c in pubchem_info.columns}
)
df = pd.concat([df, pubchem_info], axis=1)

assert all(
    [
        (df["rdInChIKey"] == df["InChIKey"]).all(),
        (df["InChIKey"] == df["pubchem_inchikey"]).all(),
    ]
)

for name in df.pert_iname.unique():
    subset = df.query("pert_iname == @name")
    # plot the two molecules in a molecular grid
    mols = [Chem.MolFromSmiles(smi) for smi in subset["smiles"]]
    img = Draw.MolsToGridImage(
        mols,
        subImgSize=(300, 300),
        molsPerRow=2,
        returnPNG=True,
        legends=subset["pubchem_cid"].astype(str).tolist(),
    )
    display(img)

shntnu · 2023-10-31T15:58:22Z

@David-Araripe

Thank you very much for your help! I spoke with @srijitseal and this has helped us conclude that we should indeed report this as 303, not 306 compounds.

I'll report back here closing this out

#9

shntnu · 2023-10-31T16:30:02Z

All set in #31

Thank you once again David and Srijit!

shntnu · 2023-10-31T19:42:20Z

@srijitseal @David-Araripe -- sorry to reopen, but it appears that we might have 2 more "duplicates" as noted here jump-cellpainting/datasets#80 (comment)

Is it possible for you to verify that these two additional cases are also isomers?

If so, then we should report 301, not 303 compounds here.

David-Araripe · 2023-10-31T20:52:50Z

Hi @shntnu, here's what I've observed:

ME-0328 seems to be indeed a duplicate, where one of the repeats doesn't have stereochemistry information:

quinine and quinidine are isomers:

The code for making the molecular drawings:

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Draw

df = (
    pd.read_csv(
        "https://raw.githubusercontent.com/jump-cellpainting/JUMP-Target/09f11aefd6b550cfb2d2074e38a65c3be39ddc39/JUMP-Target-2_compound_metadata.tsv",
        sep="\t",
    )
    .query("pert_iname.str.contains('ME-0328|quinidine|quinine')")
    .assign(
        pubchem_cid=lambda x: x["pubchem_cid"].astype(int),
    )
)

for names in [["ME-0328"], ["quinidine", "quinine"]]:
    subset = df.query("pert_iname.isin(@names)")
    # plot the two molecules in a molecular grid
    mols = [Chem.MolFromSmiles(smi) for smi in subset["smiles"]]
    img = Draw.MolsToGridImage(
        mols,
        subImgSize=(300, 300),
        molsPerRow=2,
        returnPNG=True,
        legends=subset["pubchem_cid"].astype(str).tolist(),
    )
    display(img)

srijitseal · 2023-10-31T21:18:25Z

Hi David, I think we still need to check for this one: thiostrepton


David-Araripe · 2023-10-31T21:31:53Z

@srijitseal This is the third compound pair from my first comment, the one where one entry doesn't have stereochemistry information (PubChem CIDs 101290202 and 16154490).

srijitseal · 2023-10-31T21:45:02Z

The SMILES also seem to be broken? I see N with 7 bonds and C=O floating around. Is it possible that it's a fault of RDKit Draw? If it's not too late in your time zone, can you get the SMILES from PubChem/ChEMBL/DrugBank and draw that, to see if it repairs the structure? If it does, then we need to think about replacing the SMILES with the proper ones. Best, Srijit


David-Araripe · 2023-10-31T22:13:04Z

Here's the molecular drawing but made with the PubChem smiles instead:

I wanted to check the difference between the SMILES that are currently registered in the smiles column and the ones from PubChem, so I compared their Morgan fingerprints (ignoring chirality). Here's the code:

from io import StringIO

import pandas as pd
from pubchempy import get_compounds
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

data = """broad_sample	pert_iname	pubchem_cid	target	InChIKey	smiles
BRD-K58049875-001-03-9	thiostrepton	16154490	FOXM1	NSFFHOGKXHRQEW-AIHSUZKVSA-N	CC[C@H](C)[C@@H]1N[C@@H]2C=Cc3c(cc(nc3[C@H]2O)C(=O)O[C@H](C)[C@@H]2NC(=O)c3csc(n3)[C@@H](NC(=O)[C@H]3CSC(=N3)\C(NC(=O)[C@@H](NC(=O)c3csc(n3)[C@]3(CCC(=N[C@@H]3c3csc2n3)c2nc(cs2)C(=O)NC(=C)C(=O)NC(=C)C(N)=O)NC(=O)[C@H](C)NC(=O)C(=C)NC(=O)[C@H](C)NC1=O)[C@@H](C)O)=C\C)[C@](C)(O)[C@@H](C)O)[C@H](C)O
BRD-A20697603-001-07-2	thiostrepton	101290202	FOXM1	NSFFHOGKXHRQEW-DVRIZHICSA-N	CCC(C)C1NC2C=Cc3c(cc(nc3C2O)C(=O)OC(C)C2NC(=O)c3csc(n3)C(NC(=O)C3CSC(=N3)\C(NC(=O)C(NC(=O)c3csc(n3)C3(CCC(=NC3c3csc2n3)c2nc(cs2)C(=O)NC(=C)C(=O)NC(=C)C(N)=O)NC(=O)C(C)NC(=O)C(=C)NC(=O)C(C)NC1=O)C(C)O)=C/C)C(C)(O)C(C)O)C(C)O"""

df = pd.read_csv(StringIO(data), sep="\t")


def get_info_from_pubchem(cid):
    cpds = get_compounds(cid)
    assert len(cpds) == 1, "More than one compound found"
    return cpds[0].to_dict(
        properties=["isomeric_smiles", "inchikey", "iupac_name", "synonyms"]
    )


def get_morgan_fp(smi):
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048, useChirality=False)
    return fp


def check_same_fp(fp1, fp2):
    return DataStructs.FingerprintSimilarity(fp1, fp2)


pubchem_info = df.apply(
    lambda x: get_info_from_pubchem(x["pubchem_cid"]), axis=1, result_type="expand"
)
pubchem_info = pubchem_info.rename(
    columns={c: f"pubchem_{c}" for c in pubchem_info.columns}
)
df = pd.concat([df, pubchem_info], axis=1)

# Compare the SMILES stored in the JUMP metadata with the ones retrieved from PubChem.
fp_JUMP_smiles = df.smiles.apply(get_morgan_fp).tolist()
fp_pubchem_smiles = df.pubchem_isomeric_smiles.apply(get_morgan_fp).tolist()
cids = df.pubchem_cid.tolist()

for cid, fp1, fp2 in zip(cids, fp_JUMP_smiles, fp_pubchem_smiles):
    print(
        f"CID: {cid}, fingerprint similarity (JUMP v.s. PubChem): {check_same_fp(fp1, fp2)}"
    )

and the output

>>>CID: 16154490, fingerprint similarity (JUMP v.s. PubChem): 1.0
>>>CID: 101290202, fingerprint similarity (JUMP v.s. PubChem): 0.46788990825688076

So there seems to be something wrong with the smiles from BRD-A20697603-001-07-2 indeed. Maybe consider using the smiles from PubChem since the CIDs are already there? These would be the new smiles:

{
    "BRD-K58049875-001-03-9": "CC[C@H](C)[C@H]1C(=O)N[C@H](C(=O)NC(=C)C(=O)N[C@H](C(=O)N[C@]23CCC(=N[C@@H]2C4=CSC(=N4)[C@H]([C@H](OC(=O)C5=NC6=C(C=C[C@H]([C@@H]6O)N1)C(=C5)[C@H](C)O)C)NC(=O)C7=CSC(=N7)[C@@H](NC(=O)[C@H]8CSC(=N8)/C(=C/C)/NC(=O)[C@@H](NC(=O)C9=CSC3=N9)[C@@H](C)O)[C@@](C)([C@@H](C)O)O)C1=NC(=CS1)C(=O)NC(=C)C(=O)NC(=C)C(=O)N)C)C",
    "BRD-A20697603-001-07-2": "CCC(C)C1C(=NC(C(=NC(=C)C(=NC(C(=NC23CCC(=NC2C4=CSC(=N4)C(C(OC(=O)C5=NC6=C(C=CC(C6O)N1)C(=C5)C(C)O)C)N=C(C7=CSC(=N7)C(N=C(C8CSC(=N8)/C(=C\\C)/N=C(C(N=C(C9=CSC3=N9)O)C(C)O)O)O)C(C)(C(C)O)O)O)C1=NC(=CS1)C(=NC(=C)C(=NC(=C)C(=N)O)O)O)O)C)O)O)C)O"
}
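
For readers unfamiliar with the similarity score used in the fingerprint comparison above: as far as I know, `DataStructs.FingerprintSimilarity` defaults to the Tanimoto coefficient, i.e. the number of bits set in both fingerprints divided by the number of bits set in either. A minimal pure-Python sketch on toy bit sets (the bit indices are made up and are not real Morgan bits):

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' bit indices."""
    union = bits_a | bits_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(bits_a & bits_b) / len(union)


# Toy fingerprints, for illustration only.
fp_same_a = {1, 5, 9, 12}
fp_same_b = {1, 5, 9, 12}
fp_other = {1, 5, 9, 40}

print(tanimoto(fp_same_a, fp_same_b))  # identical bit sets
print(tanimoto(fp_same_a, fp_other))   # 3 shared bits out of 5 distinct bits
```

A score of 1.0 therefore means the two SMILES set exactly the same bits, while a score near 0.47, as seen for CID 101290202, means that fewer than half of the bits set in either fingerprint are shared.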

shntnu · 2023-11-02T02:02:02Z

Really helpful!

At this point, we should go back to the source of the metadata https://repo-hub.broadinstitute.org/repurposing to figure this out, because that may also resolve some of the other inconsistencies.

For example, here it is for the two samples of thiostrepton from repurposing_samples_20200324.txt:

broad_id	pert_iname	qc_incompatible	purity	vendor	catalog_no	vendor_name	expected_mass	smiles	InChIKey	pubchem_cid	deprecated_broad_id
BRD-A20697603-001-07-2	thiostrepton	0	42.09	MicroSource	1505111	THIOSTREPTON	"1,663.49"	CCC(C)C1NC2C=Cc3c(cc(nc3C2O)C(=O)OC(C)C2NC(=O)c3csc(n3)C(NC(=O)C3CSC(=N3)\C(NC(=O)C(NC(=O)c3csc(n3)C3(CCC(=NC3c3csc2n3)c2nc(cs2)C(=O)NC(=C)C(=O)NC(=C)C(N)=O)NC(=O)C(C)NC(=O)C(=C)NC(=O)C(C)NC1=O)C(C)O)=C/C)C(C)(O)C(C)O)C(C)O	NSFFHOGKXHRQEW-DVRIZHICSA-N	101290202	
BRD-K58049875-001-03-9	thiostrepton	0	96.6	MedChemEx	HY-B0990	Thiostrepton	"1,663.49"	CC[C@H](C)[C@@H]1N[C@@H]2C=Cc3c(cc(nc3[C@H]2O)C(=O)O[C@H](C)[C@@H]2NC(=O)c3csc(n3)[C@@H](NC(=O)[C@H]3CSC(=N3)\C(NC(=O)[C@@H](NC(=O)c3csc(n3)[C@]3(CCC(=N[C@@H]3c3csc2n3)c2nc(cs2)C(=O)NC(=C)C(=O)NC(=C)C(N)=O)NC(=O)[C@H](C)NC(=O)C(=C)NC(=O)[C@H](C)NC1=O)[C@@H](C)O)=C\C)[C@](C)(O)[C@@H](C)O)[C@H](C)O	NSFFHOGKXHRQEW-AIHSUZKVSA-N	16154490	BRD-A89819812-001-02-8

We might as well just retrieve all these annotations over again from the source. Will report back here later, but all this info is super useful.

@David-Araripe Thank you again!

srijitseal · 2023-11-02T02:09:42Z

I think that's a great idea. I think in this particular case, the one with the higher purity is better, but you can see that they are indeed the same compound for ML purposes (same molecular weight as well). The second SMILES has data on stereoisomers, the first doesn't. I am happy to look at all 300+ molecules individually and then we can make a call on what is the number of unique compounds (unique based on chemistry/stereoisomers). Best, Srijit
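
If the higher-purity sample were used as the tie-breaker between duplicate entries, the selection could look like the sketch below. This is only one possible rule, not a decision made in this thread; `best_sample_per_compound` is a hypothetical helper, and the values are copied from the repurposing_samples_20200324.txt excerpt above.

```python
# Values copied from the two thiostrepton rows quoted above.
samples = [
    {
        "broad_id": "BRD-A20697603-001-07-2",
        "pert_iname": "thiostrepton",
        "purity": 42.09,
        "vendor": "MicroSource",
    },
    {
        "broad_id": "BRD-K58049875-001-03-9",
        "pert_iname": "thiostrepton",
        "purity": 96.6,
        "vendor": "MedChemEx",
    },
]


def best_sample_per_compound(rows):
    """Keep, for each pert_iname, the row with the highest purity."""
    best = {}
    for row in rows:
        current = best.get(row["pert_iname"])
        if current is None or row["purity"] > current["purity"]:
            best[row["pert_iname"]] = row
    return best


chosen = best_sample_per_compound(samples)
for name, row in chosen.items():
    print(name, row["broad_id"], row["purity"])
```

For thiostrepton this picks BRD-K58049875-001-03-9, which is also the entry whose SMILES carries stereochemistry.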


srijitseal · 2024-03-18T03:49:40Z

Refer to /jump-cellpainting/datasets/issues/80 for detailed analysis.

srijitseal · 2024-03-18T04:27:36Z

There are 301 unique compounds in this dataset.

We found 5 compounds with duplicate entries, hence 301 and not 306 unique compounds. After standardization, we don't see this problem anymore. In the original Target2 dataset, they seemed different because of the following:

1. BVT-948: the pair of compounds appears to be a pair of tautomers.

2. Dexamethasone: the pair differs in one stereocenter (stated vs. not shown).

3. ME-0328 does seem to be a duplicate, where one of the repeats doesn't have stereochemistry information.

4. Quinine and quinidine are isomers. We should keep both as "unique": they have different biological signals, so there is no data leak when using Cell Painting, but caution is advised when using ECFP, which would leak data because it can't differentiate stereochemistry.

5. For thiostrepton, the connectivity is the same; it seems the stereochemistry is different.

Update JUMP-Target-2_compound_metadata.tsv

niranjchandrasekaran mentioned this issue Jul 9, 2021

Structures with same name and different InChiKey in JUMP-Target 2 #24

Closed

shntnu added a commit that referenced this issue Oct 31, 2023

Clarify that there are actually 303 unique compounds, not 306

d3de489

#9

shntnu mentioned this issue Oct 31, 2023

Clarify that there are actually 303 unique compounds, not 306 #31

Merged

shntnu closed this as completed in #31 Oct 31, 2023

shntnu mentioned this issue Oct 31, 2023

Resolve inconsistencies in Target2 Compound InChIKeys jump-cellpainting/datasets#80

Closed

shntnu reopened this Oct 31, 2023

shntnu self-assigned this Dec 8, 2023

shntnu assigned srijitseal and unassigned shntnu Dec 19, 2023

shntnu mentioned this issue Apr 4, 2024

Update JUMP-Target-2_compound_metadata.tsv to include standarized SMILES and related identifiers #32

Merged

shntnu closed this as completed in #32 Apr 4, 2024

shntnu added a commit that referenced this issue Apr 4, 2024

Merge pull request #32 from /issues/9

3fdb1a9

Update JUMP-Target-2_compound_metadata.tsv

shntnu changed the title ~~Clarify that there are actually 303 unique compounds, not 306~~ Clarify that there are actually 301 unique compounds, not 306 Apr 4, 2024

shntnu mentioned this issue Apr 4, 2024

Actually clarify that there are actually 301 unique compounds, not 306 #33

Merged
