Clarify that there are actually 301 unique compounds, not 306 #9
Comments
In the case of BVT-948 (44483339 and 6604934) and Thiostrepton (16154490 and 101290202), it looks like they may be different molecules with the same chemical name (they have different IUPAC names or Isomeric SMILES). In the case of Dexamethasone, I am not sure how the two differ, though there are two different entries on PubChem (5743 and 5702035). |
My conclusion here is that, given that they have the same molecular formula, we should treat them as similar enough to always keep them in the same data split. E.g. below, it is not ok to have the pair With the fix in #17,
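library(dplyr)  # provides %>%, inner_join(), select(), distinct(), arrange()
# drug_target_samples is assumed to be loaded earlier in the session
data.frame(pert_iname = c("dexamethasone", "BVT-948", "thiostrepton")) %>%
  inner_join(drug_target_samples) %>%
  select(pert_iname, pert_id, target) %>%
  distinct(pert_iname, pert_id, target) %>%
  arrange(pert_iname, pert_id) %>%
  knitr::kable()
|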
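A minimal sketch of how the "always keep them in the same data split" rule could be enforced, assuming the split is decided per group key rather than per row. The helper name `assign_split`, the 20% test fraction, and the `pert_id` values are illustrative assumptions, not taken from the repository; `pert_iname` is used as the group key because the duplicated entries share it.

```python
import hashlib


def assign_split(group_key: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a group key to 'train' or 'test'.

    Rows that share a key always receive the same split.
    """
    digest = hashlib.sha256(group_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Hypothetical rows: the pert_id values are placeholders, not real identifiers.
samples = [
    {"pert_id": "placeholder-1", "pert_iname": "dexamethasone"},
    {"pert_id": "placeholder-2", "pert_iname": "dexamethasone"},
    {"pert_id": "placeholder-3", "pert_iname": "BVT-948"},
    {"pert_id": "placeholder-4", "pert_iname": "BVT-948"},
    {"pert_id": "placeholder-5", "pert_iname": "thiostrepton"},
    {"pert_id": "placeholder-6", "pert_iname": "thiostrepton"},
]

for sample in samples:
    # Key on the lower-cased compound name so both entries of a pair travel together.
    sample["split"] = assign_split(sample["pert_iname"].lower())

for sample in samples:
    print(sample["pert_id"], sample["pert_iname"], sample["split"])
```

Any key that is identical for both members of a pair (for example the molecular formula mentioned above) could replace the name; the point is only that the split is a function of the group, never of the individual sample.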
We need to figure this out so that others trying to create these plates know what to pick. Here's the metadata again for these compounds:
In the case of BVT-948 (44483339 and 6604934) and Thiostrepton (16154490 and 101290202), it looks like they may be different molecules with the same chemical name (they have different IUPAC names or Isomeric SMILES). In the case of Dexamethasone, I am not sure how the two differ, though there are two different entries on PubChem (5743 and 5702035).
In a separate conversation, we had some confusion about dexamethasone: https://github.com/jump-cellpainting/normalization/issues/3#issuecomment-824366920
Side note: I'm not sure how we ended up with these duplicates, but I think it was because we were using something other than
So the main question is: what should we recommend to people trying to recreate this plate? I'll tag @dkuhn in case he has some insights given his comment in #24 |
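As a rough illustration of how such name-level duplicates can be surfaced, the sketch below groups rows by `pert_iname` and reports names that map to more than one `pubchem_cid`. The column names follow the ones used elsewhere in this thread and the CIDs are the ones listed above; the in-memory TSV excerpt, the `example-compound` row, and the helper name are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Toy excerpt standing in for the metadata file. The CIDs are the ones quoted
# in this thread; "example-compound" is a made-up row with a single entry.
tsv_text = (
    "pert_iname\tpubchem_cid\n"
    "dexamethasone\t5743\n"
    "dexamethasone\t5702035\n"
    "BVT-948\t44483339\n"
    "BVT-948\t6604934\n"
    "thiostrepton\t16154490\n"
    "thiostrepton\t101290202\n"
    "example-compound\t1\n"
)


def names_with_multiple_cids(handle):
    """Return {pert_iname: sorted CIDs} for names listed with more than one CID."""
    cids_by_name = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        cids_by_name[row["pert_iname"]].add(row["pubchem_cid"])
    return {
        name: sorted(cids)
        for name, cids in cids_by_name.items()
        if len(cids) > 1
    }


duplicates = names_with_multiple_cids(io.StringIO(tsv_text))
for name, cids in sorted(duplicates.items()):
    print(name, cids)
```

This only catches duplicates that share a name; pairs with different names would need a structure-based key instead.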
Prep data
wget https://raw.githubusercontent.com/jump-cellpainting/JUMP-Target/6db6edbc2360c2bb5dfd2d082459acd6055d209c/JUMP-Target-2_compound_metadata.tsv
grep -E 'thiostrepton|dexamethasone|BVT-948|pert_iname' ~/Downloads/JUMP-Target-2_compound_metadata.tsv > ~/Downloads/JUMP-Target-2_compound_metadata-only-dups.tsv
Run JUMP standardizer
python StandardizeMolecule.py \
  run --num_cpu=8 \
  --input=/Users/shsingh/Downloads/JUMP-Target-2_compound_metadata-only-dups.tsv \
  --output=/Users/shsingh/Downloads/JUMP-Target-2_compound_metadata-only-dups-std.tsv
Output:
Original SMILES are distinct
Standardized SMILES reveal "duplicates"
|
Thank you very much for your help! I spoke with @srijitseal and this has helped us conclude that we should indeed report this as 303, not 306 compounds. I'll report back here closing this out |
All set in #31 Thank you once again David and Srijit! |
@srijitseal @David-Araripe -- sorry to reopen, but it appears that we might have 2 more "duplicates" as noted here jump-cellpainting/datasets#80 (comment) Is it possible for you to verify that these two additional cases are also isomers? If so, then we should report 301, not 303 compounds here. |
Hi @shntnu, here's what I've observed:
ME-0328 does indeed seem to be a duplicate, where one of the repeats doesn't have stereochemistry information:
quinine and quinidine are isomers:
The code for making the molecular drawings:
import pandas as pd
from IPython.display import display  # builtin in notebooks; imported explicitly here
from rdkit import Chem
from rdkit.Chem import Draw

# Load the pinned metadata file and keep only the compounds under discussion
df = (
    pd.read_csv(
        "https://raw.githubusercontent.com/jump-cellpainting/JUMP-Target/09f11aefd6b550cfb2d2074e38a65c3be39ddc39/JUMP-Target-2_compound_metadata.tsv",
        sep="\t",
    )
    .query("pert_iname.str.contains('ME-0328|quinidine|quinine')")
    .assign(
        pubchem_cid=lambda x: x["pubchem_cid"].astype(int),
    )
)
for names in [["ME-0328"], ["quinidine", "quinine"]]:
    subset = df.query("pert_iname.isin(@names)")
    # plot the molecules in a molecular grid, labelled by PubChem CID
    mols = [Chem.MolFromSmiles(smi) for smi in subset["smiles"]]
    img = Draw.MolsToGridImage(
        mols,
        subImgSize=(300, 300),
        molsPerRow=2,
        returnPNG=True,
        legends=subset["pubchem_cid"].astype(str).tolist(),
    )
    display(img)
|
Hi David,
I think we still need to check for this one:
thiostrepton
|
@srijitseal This is the third compound pair from my first comment, where one of the entries doesn't have the stereochemistry information. PubChem CIDs 101290202 and 16154490 |
The SMILES also seem to be broken? I see N with 7 bonds and C=O floating
around. Is it possible that it’s a fault of RDKit's Draw?
If it’s not too late in your time zone, can you get the SMILES from
PubChem/ChEMBL/DrugBank and draw that? See if that repairs it. If it does,
then we need to think about replacing the SMILES with the proper
ones.
Best,
Srijit
…On Tue, Oct 31, 2023 at 5:32 PM David Araripe ***@***.***> wrote:
Again seems like the same but without stereochemistry information:
[image: image]
<https://user-images.githubusercontent.com/79095854/279520428-38c7b0d4-6494-4335-b11a-74243e39e377.png>
|
Really helpful! At this point, we should go back to the source of the metadata https://repo-hub.broadinstitute.org/repurposing to figure this out, because that may also resolve some of the other inconsistencies. For example, here it is for the two samples of thiostrepton from repurposing_samples_20200324.txt:
We might as well just retrieve all these annotations over again from the source. Will report back here later, but all this info is super useful. @David-Araripe Thank you again! |
I think that's a great idea. I think in this particular case, the one with
the higher purity is better, but you can see that they are indeed the same
compound for ML purposes (same molecular weight as well). The second SMILES
has data on stereoisomers, the first doesn't.
I am happy to look at all 300+ molecules individually and then we can make
a call on what is the number of unique compounds (unique based on
chemistry/stereoisomers).
Best,
Srijit
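A small sketch of the "keep the one with the higher purity" rule, assuming one row per sample with a numeric `purity` field. The `broad_id` and `purity` values are the two thiostrepton rows from repurposing_samples_20200324.txt quoted in this thread; the helper name `pick_highest_purity` is hypothetical.

```python
# The two thiostrepton samples quoted in this thread (broad_id and purity only).
samples = [
    {"broad_id": "BRD-A20697603-001-07-2", "pert_iname": "thiostrepton", "purity": 42.09},
    {"broad_id": "BRD-K58049875-001-03-9", "pert_iname": "thiostrepton", "purity": 96.6},
]


def pick_highest_purity(rows):
    """Return {pert_iname: row}, keeping the highest-purity row for each name."""
    best = {}
    for row in rows:
        current = best.get(row["pert_iname"])
        if current is None or row["purity"] > current["purity"]:
            best[row["pert_iname"]] = row
    return best


chosen = pick_highest_purity(samples)
for name, row in chosen.items():
    print(name, row["broad_id"], row["purity"])
```

Whether purity should be the only tie-breaker is a judgment call; here it also happens to select the entry whose SMILES carries the stereochemistry.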
…On Wed, 1 Nov 2023 at 22:02, Shantanu Singh ***@***.***> wrote:
Really helpful!
At this point, we should go back to the source of the metadata
https://repo-hub.broadinstitute.org/repurposing to figure this out,
because that may also resolve some of the other inconsistencies.
For example, here it is for the two samples of thiostrepton from
repurposing_samples_20200324.txt:
broad_id pert_iname qc_incompatible purity vendor catalog_no vendor_name expected_mass smiles InChIKey pubchem_cid deprecated_broad_id
BRD-A20697603-001-07-2 thiostrepton 0 42.09 MicroSource 1505111 THIOSTREPTON "1,663.49" CCC(C)C1NC2C=Cc3c(cc(nc3C2O)C(=O)OC(C)C2NC(=O)c3csc(n3)C(NC(=O)C3CSC(=N3)\C(NC(=O)C(NC(=O)c3csc(n3)C3(CCC(=NC3c3csc2n3)c2nc(cs2)C(=O)NC(=C)C(=O)NC(=C)C(N)=O)NC(=O)C(C)NC(=O)C(=C)NC(=O)C(C)NC1=O)C(C)O)=C/C)C(C)(O)C(C)O)C(C)O NSFFHOGKXHRQEW-DVRIZHICSA-N 101290202
BRD-K58049875-001-03-9 thiostrepton 0 96.6 MedChemEx HY-B0990 Thiostrepton "1,663.49" ***@***.***(C)[C@@h]1N[C@@***@***.******@***.***(C)[C@@h]2NC(=O)c3csc(n3)[C@@***@***.***3CSC(=N3)\C(NC(=O)[C@@h](NC(=O)c3csc(n3)[C@]3(CCC(=N[C@@***@***.******@***.***(C)NC1=O)[C@@h](C)O)=C\C)[C@](C)(O)[C@@***@***.***(C)O NSFFHOGKXHRQEW-AIHSUZKVSA-N 16154490 BRD-A89819812-001-02-8
We might as well just retrieve all these annotations over again from the
source. Will report back here later, but all this info is super useful.
@David-Araripe <https://github.com/David-Araripe> Thank you again!
|
Refer to /jump-cellpainting/datasets/issues/80 for detailed analysis. |
There are 301 unique compounds in this dataset. We found 5 compounds with duplicate entries, hence 301 and not 306 unique compounds. After standardization, we don't see this problem anymore.
In the original Target2 dataset, they seemed different for the following reasons:
1.0
It seems like the pair of compounds are tautomers.
This one has a difference in one stereocenter (stated vs. not shown).
ME-0328 does indeed seem to be a duplicate, where one of the repeats doesn't have stereochemistry information.
quinine and quinidine are isomers; we should keep both as "unique". They have different biological signals, so there is no data leak when using Cell Painting, but caution is advised when using ECFP: that would leak data, as it can't differentiate stereochemistry.
For thiostrepton, the connectivity is the same; it seems the stereochemistry is different. |
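One quick way to check the "same connectivity, different stereochemistry" observation is to compare InChIKey blocks: the first 14-character block of a standard InChIKey is a hash of the connectivity layers, while the second block covers stereochemistry and other layers. The sketch below uses the two thiostrepton InChIKeys from the repurposing rows quoted earlier in this thread; the helper name is an assumption, and hash collisions are possible in principle, so treat a match as strong evidence rather than proof.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first block of an InChIKey (hash of the connectivity layers)."""
    return inchikey.split("-")[0]


# InChIKeys of the two thiostrepton entries, keyed by PubChem CID.
inchikeys = {
    101290202: "NSFFHOGKXHRQEW-DVRIZHICSA-N",
    16154490: "NSFFHOGKXHRQEW-AIHSUZKVSA-N",
}

# Same first block: both entries describe the same molecular skeleton.
same_connectivity = len({connectivity_block(k) for k in inchikeys.values()}) == 1
# Different full keys: the entries differ in a later layer such as stereochemistry.
same_full_key = len(set(inchikeys.values())) == 1

print("same connectivity:", same_connectivity)
print("identical InChIKey:", same_full_key)
```

Grouping by the connectivity block is also a conservative key for data splits when the downstream features ignore stereochemistry, since stereoisomers end up in the same group.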
Update JUMP-Target-2_compound_metadata.tsv
These 3 compounds are duplicated (but with different broad_sample values)