Issues with BioThings BindingDB object fields #717

Open
colleenXu opened this issue Sep 6, 2023 · 8 comments
@colleenXu
Collaborator

colleenXu commented Sep 6, 2023

(CC @newgene @erikyao for pending BioThings, @rjawesome as the original person who worked on the parser biothings/pending.api#70)

Andy Crouse from Translator's UI team pointed out that BTE was returning result chemical Nodes that didn't have names, with edges from BioThings BindingDB (Translator Slack link).

When investigating this, I discovered cases where the object (chemical) fields, specifically the ID and name fields, seem incorrect or problematic (see the comments). Some notes:

  • I've been assuming that the object.pubchem_sid value is a "correct" ID (when the document has this field, which is >98% of documents or 1413131 / 1438909)
  • I'm not sure how many documents are affected
  • I'm not sure if the other chemical (object) fields have similar issues
@colleenXu
Collaborator Author

colleenXu commented Sep 6, 2023

object.pubchem_cid

  • seems tricky to fix, or possibly not fixable on our side if the issues are upstream in the BindingDB data source
  • values sometimes seem incorrect
Example: https://biothings.ncats.io/bindingdb/query?q=object.pubchem_cid:4585

"object": {
    "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=35254",
    "chebi": "7735",
    "chembl": "CHEMBL715",
    "drugbank": "DB00334",
    "inchi": "InChI=1S/C17H20N4S/c1-12-11-13-16(21-9-7-20(2)8-10-21)18-14-5-3-4-6-15(14)19-17(13)22-12/h3-6,11,19H,7-10H2,1-2H3",
    "inchikey": "KVWDHTXUZHCGIO-UHFFFAOYSA-N",
    "iuphar_grac_id": "47",
    "kegg": "C07322",
    "monomer_id": 35254,
    "name": "2-methyl-4-(4-methylpiperazin-1-yl)-10H-thieno[2,3-b][1,5]benzodiazepine::CHEMBL715::OLANZAPINE::Olansek::US8802672, Olanzapine::Zyprexa::olanzapine",
    "pubchem_cid": 4585,
    "pubchem_sid": 85753333,
    "smiles": "CN1CCN(CC1)C1=Nc2ccccc2Nc2sc(C)cc12",
    "zinc": "ZINC52957434"
},

Other examples with basically the same behavior (notes here):

@colleenXu
Collaborator Author

object.inchikey

  • seems tricky to fix, or possibly not fixable on our side if the issues are upstream in the BindingDB data source
  • not clear whether these values are incorrect, outdated, or represent slightly different but valid structures (isomers? different salts or protonation states?)
Example 1

"object": {
    "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=50115522",
    "chembl": "CHEMBL341945CHEMBL106813",
    "inchi": "InChI=1S/C28H35F3N4O4/c1-19-9-16-35(37)20(2)24(19)26(36)33-17-12-27(3,13-18-33)34-14-10-22(11-15-34)25(32-38-4)21-5-7-23(8-6-21)39-28(29,30)31/h5-9,16,22H,10-15,17-18H2,1-4H3",
    "inchikey": "YQCLAYRIYWYIKH-UHFFFAOYSA-N",
    "monomer_id": 50115522,
    "name": "(2,4-Dimethyl-1-oxy-pyridin-3-yl)-{4-[methoxyimino-(4-trifluoromethoxy-phenyl)-methyl]-4'-methyl-[1,4']bipiperidinyl-1'-yl}-methanone::CHEMBL106813::CHEMBL341945",
    "pubchem_cid": 9579327,
    "pubchem_sid": 104009437,
    "smiles": "CO[N-][C+](C1CCN(CC1)C1(C)CCN(CC1)C(=O)c1c(C)cc[n+]([O-])c1C)c1ccc(OC(F)(F)F)cc1",
    "zinc": "ZINC26848826"
},

Example 2

"object": {
    "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=22869",
    "inchi": "InChI=1S/C18H19ClN4/c1-22-8-10-23(11-9-22)18-14-4-2-3-5-15(14)20-16-7-6-13(19)12-17(16)21-18/h2-7,12,21H,8-11H2,1H3",
    "inchikey": "ZUXABONWMNSFBN-UHFFFAOYSA-N",
    "iuphar_grac_id": "38",
    "monomer_id": 22869,
    "name": "6-chloro-10-(4-methylpiperazin-1-yl)-2,9-diazatricyclo[9.4.0.0^{3,8}]pentadeca-1,3(8),4,6,10,12,14-heptaene::CLOZARIL::Clozapine::Leponex::US10259786, Clozapine",
    "pubchem_cid": 2818,
    "pubchem_sid": 49846683,
    "smiles": "CN1CCN(CC1)C1=c2ccccc2=Nc2ccc(Cl)cc2N1",
    "zinc": "ZINC19796155"
},

@colleenXu
Collaborator Author

object.chembl

Sometimes the values seem to be concatenated strings of multiple IDs:

  • seems EASY to fix: split the string whenever CHEMBL appears more than once (a clue that it contains more than one CHEMBL ID), and store an array of strings where each string is a single CHEMBL ID
  • I haven't done any checking to see if the single CHEMBL IDs are valid or not
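
A minimal sketch of the split proposed above, not the actual parser code. It assumes a ChEMBL ID is always the literal prefix CHEMBL followed by digits; the function name split_chembl is hypothetical, and the sample values are taken from the documents quoted in this thread.

```python
import re

# Assumption: every ChEMBL ID looks like "CHEMBL" followed by one or more digits.
CHEMBL_ID = re.compile(r"CHEMBL\d+")


def split_chembl(value: str) -> list[str]:
    """Return every ChEMBL ID found in a possibly concatenated string."""
    return CHEMBL_ID.findall(value)


# Concatenated value from the object.inchikey example 1 document.
print(split_chembl("CHEMBL341945CHEMBL106813"))
# A single ID comes back as a one-element list, so the field type stays uniform.
print(split_chembl("CHEMBL715"))
```

Returning a list even for a single ID keeps the field consistent across documents; whether the parser should emit a bare string in that case is a design choice this sketch does not settle. As noted above, it does not check that each ID is valid.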

Examples:

@colleenXu
Collaborator Author

colleenXu commented Sep 6, 2023

object.name

  • seems easy to fix
  • These seem to be "::"-delimited strings that are sometimes very long and unwieldy. I think it'd make sense to split them on "::" and store arrays of strings instead.

For examples, click on any of the BioThings BindingDB links above, and look at the object.name value.
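
A rough sketch of that split, assuming "::" never occurs inside a single name (not verified against the full dataset). The function name split_names is hypothetical, and the sample string is a shortened tail of the clozapine object.name shown earlier.

```python
def split_names(name: str) -> list[str]:
    """Split a '::'-delimited BindingDB name string into individual names."""
    # Strip whitespace and drop empty pieces, e.g. from a trailing delimiter.
    return [part.strip() for part in name.split("::") if part.strip()]


# Shortened from the clozapine document above (the long systematic name is omitted).
names = split_names("CLOZARIL::Clozapine::Leponex::US10259786, Clozapine")
print(names)
```

The patent-style entry keeps its internal comma because only "::" is treated as a delimiter.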

@colleenXu
Collaborator Author

colleenXu commented Sep 6, 2023

And this is what I did to address Andy Crouse's original problem of "Nodes with no names from BioThings BindingDB":

(pasted from biothings/pending.api#99 (comment))

For now, I've changed the x-bte annotation for this resource NCATS-Tangerine/translator-api-registry@022e876:

(1) Use object.inchikey:

  • covers a little over 96% of the resource (1394153 / 1438909)
  • my hunch is that the INCHIKEY IDs are not completely incorrect, whereas the pubchem cids are sometimes incorrect and I don't really know why
  • by comparison, object.pubchem_cid (used before) covers a little over 98% of the resource (1413051 / 1438909)
  • other fields cover much less of the resource or aren't supported by Node Norm:
    • object.pubchem_sid: 1413131 but Node Norm doesn't seem to support this ID namespace right now
    • object.chembl: 631745. Has its own issues, see last comment
    • object.kegg (KEGG.COMPOUND): 33680
    • object.chebi: 25903
    • object.drugbank: 24177

(2) Retrieve subject.name and object.name fields for input_name/output_name behavior, if Node Norm doesn't retrieve info for the ID. Every document has those fields, but the names provided have issues (will be covered in a later post).
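
To make the coverage comparison easier to check, here is a small sketch that recomputes the fractions from the counts listed above. The counts come from this comment; the helper names coverage and rank_fields are hypothetical, and ranking by coverage alone is only an illustration, since the choice of object.inchikey above also rests on how reliable the values seem.

```python
TOTAL_DOCS = 1438909  # total documents in the resource, from the counts above

# Documents that have each field, as listed above.
FIELD_COUNTS = {
    "object.inchikey": 1394153,
    "object.pubchem_cid": 1413051,
    "object.pubchem_sid": 1413131,
    "object.chembl": 631745,
    "object.kegg": 33680,
    "object.chebi": 25903,
    "object.drugbank": 24177,
}


def coverage(count: int, total: int = TOTAL_DOCS) -> float:
    """Fraction of documents that have the field."""
    return count / total


def rank_fields(counts: dict[str, int], unsupported: frozenset = frozenset()) -> list[str]:
    """Fields sorted by coverage (highest first), skipping unsupported namespaces."""
    usable = {f: n for f, n in counts.items() if f not in unsupported}
    return sorted(usable, key=usable.get, reverse=True)


# Assumption from the comment above: Node Norm does not support PubChem SIDs.
ranked = rank_fields(FIELD_COUNTS, unsupported=frozenset({"object.pubchem_sid"}))
for field in ranked:
    print(f"{field}: {coverage(FIELD_COUNTS[field]):.2%}")
```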

@colleenXu colleenXu changed the title from "Issues with BioThings BindingDB chem ID and name fields" to "Issues with BioThings BindingDB object fields" Sep 6, 2023
@erikyao erikyao self-assigned this Sep 6, 2023
@erikyao erikyao removed their assignment Dec 6, 2023
@colleenXu
Collaborator Author

@everaldorodrigo @newgene @andrewsu

I think this would be a useful issue for @everaldorodrigo to dig into and address as much as possible.

@everaldorodrigo

  • I've been assuming that the object.pubchem_sid value is a "correct" ID (when the document has this field, which is >98% of documents or 1413131 / 1438909)
  • I'm not sure how many documents are affected
  • I'm not sure if the other chemical (object) fields have similar issues

@colleenXu, in the latest data release (May 2024) on the CI environment, 41678 of 1795611 items (about 2.3%) don't have the field object.pubchem_sid.

For the items missing object.pubchem_sid, it seems the data source (.tsv file) has no value in the PubChem SID column, which the parser uses to fill the field object.pubchem_sid.
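
One way to confirm this directly against the source file is to count rows whose PubChem SID cell is empty. The sketch below is an assumption-laden illustration, not the parser's code: the function name count_missing is hypothetical, the three-column sample is made up from values quoted in this thread, and a real run would pass an open file handle for the BindingDB .tsv instead.

```python
import csv
import io


def count_missing(tsv_stream, column: str = "PubChem SID") -> tuple[int, int]:
    """Return (rows with an empty value in `column`, total data rows)."""
    # QUOTE_NONE: treat quote characters as plain text, since names may contain them.
    reader = csv.reader(tsv_stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Look the column up by position; the real header repeats the per-chain
    # column names, but "PubChem SID" appears only once.
    idx = header.index(column)
    missing = total = 0
    for row in reader:
        total += 1
        # Short rows (fewer cells than the header) count as missing too.
        if idx >= len(row) or not row[idx].strip():
            missing += 1
    return missing, total


# Tiny illustrative sample; the second row has no PubChem CID or SID.
sample = (
    "BindingDB MonomerID\tPubChem CID\tPubChem SID\n"
    "35254\t4585\t85753333\n"
    "370273\t\t\n"
)
missing, total = count_missing(io.StringIO(sample))
print(f"{missing} of {total} rows have no PubChem SID")
```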

The partial data below is extracted from the data source: the header and one example item that has no PubChem SID value. Look for PubChem SID in the data below:

Header columns with the example row's values (tab-separated in the source; the 12 per-chain columns, from "BindingDB Target Chain Sequence" through "UniProt (TrEMBL) Alternative ID(s) of Target Chain", repeat in the header for every target chain slot):

  • BindingDB Reactant_set_id: 1187404
  • Ligand SMILES: CS(=O)(=O)N1CCC@HC@@HC1
  • Ligand InChI: InChI=1S/C20H27FN6O4S/c1-32(30,31)26-7-6-16(15(21)11-26)24-20-23-10-13-8-12(9-17(22)28)19(29)27(18(13)25-20)14-4-2-3-5-14/h8,10,14-16H,2-7,9,11H2,1H3,(H2,22,28)(H,23,24,25)/t15-,16-/m0/s1
  • Ligand InChI Key: LGDFLYMYRBSZOR-HOTGVXAUSA-N
  • BindingDB MonomerID: 370273
  • BindingDB Ligand Name: BDBM467168::US10233188, Example 160
  • Target Name: Cyclin-dependent kinase 6/G1/S-specific cyclin-D1 [L188C]
  • Target Source Organism According to Curator or DataSource: Homo sapiens
  • Ki (nM): 3.55
  • IC50 (nM), Kd (nM), EC50 (nM), kon (M-1-s-1), koff (s-1), pH, Temp (C): (empty)
  • Curation/DataSource: US Patent
  • Article DOI: (empty)
  • BindingDB Entry DOI: 10.7270/Q2KK9G2V
  • PMID: (empty)
  • PubChem AID: aid1806677
  • Patent Number: US11396512
  • Authors: Behenna, DC; Chen, P; Freeman-Cook, KD; Hoffman, RL; Jalaie, M; Nagata, A; Nair, SK; Ninkovic, S; Ornelas, MA; Palmer, CL; Rui, EY
  • Institution: Pfizer Inc.
  • Link to Ligand in BindingDB: http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=370273
  • Link to Target in BindingDB: http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=99&target=Cyclin-dependent+kinase+6%2FG1%2FS-specific+cyclin-D1+%5BL188C%5D&column=ki&startPg=0&Increment=50&submit=Search
  • Link to Ligand-Target Pair in BindingDB: http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=370273&enzyme=Cyclin-dependent+kinase+6%2FG1%2FS-specific+cyclin-D1+%5BL188C%5D&column=ki&startPg=0&Increment=50&submit=Search
  • Ligand HET ID in PDB, PDB ID(s) for Ligand-Target Complex: (empty)
  • PubChem CID: (empty)
  • PubChem SID: (empty)
  • ChEBI ID of Ligand, ChEMBL ID of Ligand, DrugBank ID of Ligand, IUPHAR_GRAC ID of Ligand, KEGG ID of Ligand, ZINC ID of Ligand: (empty)
  • Number of Protein Chains in Target (>1 implies a multichain complex): 2

Target chain 1 (columns not listed are empty):

  • BindingDB Target Chain Sequence: mehqllccevetirraypdanllndrvlramlkaeetcapsvsyfkcvqkevlpsmrkivatwmlevceeqkceeevfplamnyldrflslepvkksrlqllgatcmfvaskmketipltaeklciytdgsirpeellqmelllvnklkwnlaamtphdfiehflskmpeaeenkqiirkhaqtfvascatdvkfisnppsmvaagsvvaavqglnlrspnnflsyyrltrflsrvikcdpdclracqeqieallesslrqaqqnmdpkaaeeeeeeeeevdlactptdvrdvdi
  • UniProt (SwissProt) Recommended Name of Target Chain: G1/S-specific cyclin-D1
  • UniProt (SwissProt) Entry Name of Target Chain: CCND1_HUMAN
  • UniProt (SwissProt) Primary ID of Target Chain: P24385
  • UniProt (SwissProt) Secondary ID(s) of Target Chain: Q6LEF0

Target chain 2 (columns not listed are empty):

  • BindingDB Target Chain Sequence: MEKDGLCRADQQYECVAEIGEGAYGKVFKARDLKNGGRFVALKRVRVQTGEEGMPLSTIREVAVLRHLETFEHPNVVRLFDVCTVSRTDRETKLTLVFEHVDQDLTTYLDKVPEPGVPTETIKDMMFQLLRGLDFLHSHRVVHRDLKPQNILVTSSGQIKLADFGLARIYSFQMALTSVVVTLWYRAPEVLLQSSYATPVDLWSVGCIFAEMFRRKPLFRGSSDVDQLGKILDVIGLPGEEDWPRDVALPRQAFHSKSAQPIEKFVTDIDELGKDLLLKCLTFNPAKRISAYSALSHPYFQDLERCKENLDSHLPPSQNTSELNTA
  • PDB ID(s) of Target Chain: 1G3N,1JOW,1XO2,2EUF,2F2C,3NUP,3NUX,4AUA,4EZ5,4TTH,5L2I,5L2S,5L2T,6OQL,6OQO
  • UniProt (SwissProt) Recommended Name of Target Chain: Cyclin-dependent kinase 6
  • UniProt (SwissProt) Entry Name of Target Chain: CDK6_HUMAN
  • UniProt (SwissProt) Primary ID of Target Chain: Q00534
  • UniProt (SwissProt) Secondary ID(s) of Target Chain: A4D1G0

Do you think we should use another field for the operations instead of object.pubchem_sid?

@colleenXu
Collaborator Author

Sorry for the late response.

I think it's okay that some rows don't have a pubchem SID. I've been using the INCHIKEY instead (see the earlier post, where I explain that it covers most of the resource and seems somewhat reliable, but still has problems).
