Data source: BindingDB #70

Closed
newgene opened this issue Jun 1, 2022 · 39 comments

@newgene
Member

newgene commented Jun 1, 2022

Name: BindingDB - curated protein-chemical bindings
URL: https://www.bindingdb.org/rwd/bind/index.jsp
Download: https://www.bindingdb.org/rwd/bind/chemsearch/marvin/SDFdownload.jsp?all_download=yes
License: publicly available with no license specified

@newgene newgene added the data source Data source pending to create a new API label Jun 1, 2022
@andrewsu
Member

andrewsu commented Jun 24, 2022

  • Downloaded BindingDB_All_2022m5.tsv.zip
  • 2407382 rows, each with a binding relationship (193 columns of metadata about the relationship) [code snippet 1]
    • 2178377 rows with a UniProt ID for the target [2]
    • 1243406 rows with a UniProt ID for the target AND target is human protein [3]
      • records by data source [4]
        706654 ChEMBL
        420222 US Patent
         63957 PubChem
         42123 Curated from the literature by BindingDB
          8510 PDSP Ki
          1084 D3R
           763 CSAR
            93 Taylor Research Group, UCSD
  • Records by target (1786 unique targets; top ~30 shown below) [5]
  17347 JAK2_HUMAN
  16717 BTK_HUMAN
  15331 EGFR_HUMAN
  14108 DRD2_HUMAN
  12988 JAK1_HUMAN
  12917 CAH2_HUMAN
  12591 PK3CD_HUMAN
  12345 BACE1_HUMAN
  12233 OX2R_HUMAN
  12004 VGFR2_HUMAN
  11871 KCNH2_HUMAN
  11412 OX1R_HUMAN
  11403 CNR2_HUMAN
  10934 IRAK4_HUMAN
  10804 CAH1_HUMAN
  10574 CNR1_HUMAN
  10231 HDAC1_HUMAN
  10071 OPRM_HUMAN
   9901 PK3CA_HUMAN
   9518 JAK3_HUMAN
   9262 CDK2_HUMAN
   9072 OPRK_HUMAN
   8911 GSK3B_HUMAN
   8693 CP3A4_HUMAN
   8670 MK01_HUMAN
   8380 P2RX3_HUMAN
   8023 AA2AR_HUMAN

Conclusion: To start, let's go with the 1243406 records according to the minimal filtering described above. The goal is to create a JSON file with one record per row (compound - target association). The JSON structure should roughly follow the pattern described in #55

Code snippets

NOTE: alias gawkt='awk -F"\t" -v OFS="\t"'

[1]: wc BindingDB_All.tsv
[2]: cat BindingDB_All.tsv | gawkt '$42!=""' | wc
[3]: cat BindingDB_All.tsv | gawkt '$42!=""&&$8=="Homo sapiens"' > BindingDB_All_humanuniprot.tsv; wc BindingDB_All_humanuniprot.tsv
[4]: cat BindingDB_All_humanuniprot.tsv | gawkt '{print $17}' | sort | uniq -c | sort -k1nr
[5]: cat BindingDB_All_humanuniprot.tsv | gawkt '{print $41}' | sort | uniq -c | sort -k1nr
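
For readers who prefer Python over awk, the filtering and counting in snippets [3] to [5] can be approximated as below. This is only a sketch: the 0-based column positions are derived from the 1-based awk fields above ($8, $17, $41, $42) and are assumed to hold for this particular download, and the function names are made up for illustration.

```python
import csv
from collections import Counter

# 0-based positions derived from the awk fields $8, $17, $41 and $42 above.
# They are assumptions tied to the layout of this particular download.
ORGANISM, SOURCE, ENTRY_NAME, UNIPROT_ID = 7, 16, 40, 41


def human_uniprot_rows(lines):
    """Yield rows that have a UniProt ID and a 'Homo sapiens' target (snippet [3])."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= UNIPROT_ID:
            continue  # short or malformed line
        if row[UNIPROT_ID] != "" and row[ORGANISM] == "Homo sapiens":
            yield row


def count_by(rows, column):
    """Count rows per value of one column, most frequent first (snippets [4], [5])."""
    return Counter(row[column] for row in rows).most_common()


# Possible usage (not executed here):
#   with open("BindingDB_All.tsv", newline="", encoding="utf-8") as handle:
#       rows = list(human_uniprot_rows(handle))
#   print(count_by(rows, SOURCE))
```

The header line is dropped by the same test that drops non-human rows, since its organism field holds the column title rather than "Homo sapiens".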

@rjawesome
Collaborator

I am currently working on this issue

@rjawesome
Collaborator

I made a basic parser for this: https://github.com/rjawesome/BindingDB_parser

I realized that some rows have more than one chemical -> protein relationship (indicated by "Number of Protein Chains in Target (>1 implies a multichain complex)"), so I have split those into separate documents.

At the moment I am filtering by checking whether the primary UniProt (SwissProt) entry name ends with _HUMAN. I checked one record that has Homo sapiens as the species but lists CGH2_SHV21 (ID Q01043) as the protein. According to the UniProt website it does not appear to be a human protein, but I could be mistaken?

Also, the parser takes around 2 minutes to run so I am not sure if I need to optimize my code or if this is just a really big data file.
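
The row splitting and the _HUMAN check described above could look roughly like the sketch below. It is not the code in the linked parser. It assumes that each row starts with 37 ligand and measurement columns followed by repeated groups of 12 target-chain columns, which is inferred from the key counts in the sample record and is consistent with the 193 columns reported earlier (37 + 13 × 12). The function names are hypothetical.

```python
LIGAND_COLUMNS = 37  # assumed: columns before the first target chain
CHAIN_COLUMNS = 12   # assumed: columns describing each target chain

ENTRY_NAME_KEY = "UniProt (SwissProt) Entry Name of Target Chain"


def split_chains(header, row):
    """Yield one document per non-empty target chain found in a row."""
    ligand = dict(zip(header[:LIGAND_COLUMNS], row[:LIGAND_COLUMNS]))
    for start in range(LIGAND_COLUMNS, len(row), CHAIN_COLUMNS):
        names = header[start:start + CHAIN_COLUMNS]
        values = row[start:start + CHAIN_COLUMNS]
        if any(values):  # skip chain groups that are entirely empty
            yield {"object": dict(ligand), "subject": dict(zip(names, values))}


def is_human_entry(doc):
    """True when the chain's SwissProt entry name carries the _HUMAN suffix."""
    return doc["subject"].get(ENTRY_NAME_KEY, "").endswith("_HUMAN")
```

Filtering per chain rather than per row means a complex annotated as Homo sapiens can still contribute a non-human chain such as CGH2_SHV21, and that chain is dropped by the suffix check.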

Sample Record:

{
  "object": {
    "BindingDB Reactant_set_id": "143",
    "Ligand SMILES": "Cc1nc(CN2CCN(CC2)c2c(Cl)cnc3[nH]c(nc23)-c2cn(C)nc2C)no1",
    "Ligand InChI": "InChI=1S/C19H22ClN9O/c1-11-13(9-27(3)25-11)18-23-16-17(14(20)8-21-19(16)24-18)29-6-4-28(5-7-29)10-15-22-12(2)30-26-15/h8-9H,4-7,10H2,1-3H3,(H,21,23,24)",
    "Ligand InChI Key": "ZYQKMYRXVHUATB-UHFFFAOYSA-N",
    "BindingDB MonomerID": "247370",
    "BindingDB Ligand Name": "US9447092, 3",
    "Target Name Assigned by Curator or DataSource": "Cytochrome P450 3A4",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Ki (nM)": "",
    "IC50 (nM)": ">50000",
    "Kd (nM)": "",
    "EC50 (nM)": "",
    "kon (M-1-s-1)": "",
    "koff (s-1)": "",
    "pH": "",
    "Temp (C)": "",
    "Curation/DataSource": "US Patent",
    "Article DOI": "",
    "PMID": "",
    "PubChem AID": "",
    "Patent Number": "US9447092",
    "Authors": "Blagg, J; Bavetsias, V; Moore, AS; Linardopoulos, S",
    "Institution": "Cancer Research Technology Limited",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=247370",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=2127&target=Cytochrome+P450+3A4&column=ki&startPg=0&Increment=50&submit=Search",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=247370&enzyme=Cytochrome+P450+3A4&column=ki&startPg=0&Increment=50&submit=Search",
    "Ligand HET ID in PDB": "",
    "PDB ID(s) for Ligand-Target Complex": "",
    "PubChem CID": "71463198",
    "PubChem SID": "346541913",
    "ChEBI ID of Ligand": "",
    "ChEMBL ID of Ligand": "",
    "DrugBank ID of Ligand": "",
    "IUPHAR_GRAC ID of Ligand": "",
    "KEGG ID of Ligand": "",
    "ZINC ID of Ligand": "",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "1"
  },
  "subject": {
    "BindingDB Target Chain  Sequence": "MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPLPFLGNILSYHKGFCMFDMECHKKYGKVWGFYDGQQPVLAITDPDMIKTVLVKECYSVFTNRRPFGPVGFMKSAISIAEDEEWKRLRSLLSPTFTSGKLKEMVPIIAQYGDVLVRNLRREAETGKPVTLKDVFGAYSMDVITSTSFGVNIDSLNNPQDPFVENTKKLLRFDFLDPFFLSITVFPFLIPILEVLNICVFPREVTNFLRKSVKRMKESRLEDTQKHRVDFLQLMIDSQNSKETESHKALSDLELVAQSIIFIFAGYETTSSVLSFIMYELATHPDVQQKLQEEIDAVLPNKAPPTYDTVLQMEYLDMVVNETLRLFPIAMRLERVCKKDVEINGMFIPKGVVVMIPSYALHRDPKYWTEPEKFLPERFSKKNKDNIDPYIYTPFGSGPRNCIGMRFALMNMKLALIRVLQNFSFKPCKETQIPLKLSLGGLLQPEKPVVLKVESRDGTVSGA",
    "PDB ID(s) of Target Chain": "1W0E,1W0F,1W0G,2J0D,2V0M,3NXU,4NY4,7LXL",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Cytochrome P450 3A4",
    "UniProt (SwissProt) Entry Name of Target Chain": "CP3A4_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P08684",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "P05184,Q16757,Q9UK50",
    "UniProt (SwissProt) Alternative ID(s) of Target Chain": "",
    "UniProt (TrEMBL) Submitted Name of Target Chain": "",
    "UniProt (TrEMBL) Entry Name of Target Chain": "",
    "UniProt (TrEMBL) Primary ID of Target Chain": "Q6GRK0",
    "UniProt (TrEMBL) Secondary ID(s) of Target Chain": "",
    "UniProt (TrEMBL) Alternative ID(s) of Target Chain": ""
  }
}

@rjawesome
Collaborator

Updated Sample Document

{
  "object": {
    "BindingDB Reactant_set_id": 199,
    "Ligand SMILES": "CN(Cc1ccc(s1)C(=O)N[C@@H](CC(O)=O)C(=O)CSCc1ccccc1Cl)Cc1ccc(O)c(c1)C(O)=O",
    "Ligand InChI": "InChI=1S/C27H27ClN2O7S2/c1-30(12-16-6-8-22(31)19(10-16)27(36)37)13-18-7-9-24(39-18)26(35)29-21(11-25(33)34)23(32)15-38-14-17-4-2-3-5-20(17)28/h2-10,21,31H,11-15H2,1H3,(H,29,35)(H,33,34)(H,36,37)/t21-/m0/s1",
    "Ligand InChI Key": "FIEQQFOHZKVJLV-NRFANRHFSA-N",
    "BindingDB MonomerID": 219,
    "BindingDB Ligand Name": "5-({[(5-{[(2S)-1-carboxy-4-{[(2-chlorophenyl)methyl]sulfanyl}-3-oxobutan-2-yl]carbamoyl}thiophen-2-yl)methyl](methyl)amino}methyl)-2-hydroxybenzoic acid::Inhibitor 47c::Thiophene Scaffold 47c",
    "PubChem CID": 5327301,
    "PubChem SID": 8030144,
    "ZINC ID of Ligand": "ZINC14942804",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": 1
  },
  "subject": {
    "Target Name Assigned by Curator or DataSource": "Caspase-3",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=1072&target=Caspase-3&column=ki&startPg=0&Increment=50&submit=Search",
    "BindingDB Target Chain  Sequence": "MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRNLKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKITNFFRGDRCRSLTGKPKLFIIQACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVNRKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH",
    "PDB ID(s) of Target Chain": [
      "1GFW",
      "1I3O",
      "1NME",
      "1PAU",
      "1RE1",
      "1RHJ",
      "1RHK",
      "1RHM",
      "1RHQ",
      "1RHR",
      "1RHU",
      "2C1E",
      "2C2K",
      "2C2M",
      "2C2O",
      "2CDR",
      "2CJX",
      "2CJY",
      "2CNK",
      "2CNL",
      "2CNN",
      "2CNO",
      "2DKO",
      "2H5I",
      "2H5J",
      "2H65",
      "2J30",
      "2XYG",
      "2XYH",
      "2XYP",
      "2XZD",
      "2XZT",
      "2Y0B",
      "3EDQ",
      "3GJQ",
      "3GJR",
      "3GJS",
      "3GJT",
      "3H0E",
      "3KJF",
      "4DCJ",
      "4DCO",
      "4DCP",
      "4JJE",
      "4PRY",
      "4PS0",
      "5IC4",
      "6BDV",
      "6BFJ",
      "6BFK",
      "6BFL",
      "6BFO",
      "6BG0",
      "6BG1",
      "6BG4",
      "6BGK",
      "6BGQ",
      "6BGR",
      "6BGS",
      "6BH9",
      "6BHA",
      "6CKZ",
      "6CL0",
      "6X8I",
      "6X8K",
      "7RN7",
      "7RN8",
      "7RN9",
      "7RNB",
      "7RND",
      "7RNE",
      "7RNF",
      "7SEO"
    ],
    "UniProt (SwissProt) Recommended Name of Target Chain": "Caspase-3",
    "UniProt (SwissProt) Entry Name of Target Chain": "CASP3_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P42574",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": [
      "A8K5M2",
      "D3DP53",
      "Q96AN1",
      "Q96KP2"
    ]
  },
  "relation": {
    "Ki (nM)": " 90",
    "pH": "7.4",
    "Temp (C)": "25.00 C",
    "Curation/DataSource": "Curated from the literature by BindingDB",
    "Article DOI": "10.1021/jm020230j",
    "PMID": "12408711",
    "Authors": "Choong, IC; Lew, W; Lee, D; Pham, P; Burdett, MT; Lam, JW; Wiesmann, C; Luong, TN; Fahr, B; DeLano, WL; McDowell, RS; Allen, DA; Erlanson, DA; Gordon, EM; O'Brien, T",
    "Institution": "Sunesis Pharmaceuticals",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=219",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=219&enzyme=Caspase-3&column=ki&startPg=0&Increment=50&submit=Search"
  }
}
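
Compared with the first sample, the updated document drops empty fields, stores a few identifiers as integers, and turns comma-separated ID strings into arrays. A minimal sketch of that clean-up follows. The two key sets are read off the sample above and are assumptions about which columns get which treatment, and `clean` is a hypothetical helper name.

```python
# Keys that appear as integers in the updated sample document (assumed complete).
INT_KEYS = {
    "BindingDB Reactant_set_id",
    "BindingDB MonomerID",
    "PubChem CID",
    "PubChem SID",
    "Number of Protein Chains in Target (>1 implies a multichain complex)",
}

# Keys that appear as arrays in the updated sample document (assumed complete).
LIST_KEYS = {
    "PDB ID(s) of Target Chain",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain",
}


def clean(fields):
    """Drop empty values, convert known integer keys, split known list keys."""
    cleaned = {}
    for key, value in fields.items():
        if value == "":
            continue
        if key in INT_KEYS and value.isdigit():
            cleaned[key] = int(value)
        elif key in LIST_KEYS:
            cleaned[key] = value.split(",")
        else:
            cleaned[key] = value  # e.g. "Ki (nM)" stays a string such as " 90"
    return cleaned
```

Measurement fields are left as strings here because values like ">50000" in the first sample cannot be parsed as plain numbers.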

@andrewsu
Member

Just a note/caveat. I've pasted a (transposed) snippet of the data file which shows four records that are exactly the same except for the Ki. In this case, best to collapse these four input records into a single output record, where the Ki is an array. We should be careful to identify any other columns that need similar treatment. (Possibly to help that effort, we should add an "_id" key based on BindingDB MonomerID and UniProt (SwissProt) Primary ID of Target Chain -- in the example above, it would be 219-P42574.)
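
One way to picture the suggestion above is the sketch below: build the "_id" from the MonomerID and the SwissProt primary ID, then collect the Ki values of documents that share it. It assumes each document is already split into "object", "subject" and "relation" as in the samples, and the helper names are hypothetical.

```python
def make_id(doc):
    """Build the proposed _id, e.g. 219-P42574 for the Caspase-3 sample."""
    monomer = doc["object"]["BindingDB MonomerID"]
    uniprot = doc["subject"]["UniProt (SwissProt) Primary ID of Target Chain"]
    return f"{monomer}-{uniprot}"


def collapse_ki(docs):
    """Merge documents with the same _id, gathering their Ki values in a list."""
    merged = {}
    for doc in docs:
        _id = make_id(doc)
        if _id not in merged:
            # Copy the first document and start an empty Ki array for it.
            relation = {**doc["relation"], "Ki (nM)": []}
            merged[_id] = {**doc, "_id": _id, "relation": relation}
        ki = doc["relation"].get("Ki (nM)")
        if ki is not None:
            merged[_id]["relation"]["Ki (nM)"].append(ki)
    return list(merged.values())
```

All other fields are taken from the first document seen for each _id, so this only works as-is when the remaining columns really are identical.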

image

@rjawesome
Collaborator

rjawesome commented Jun 30, 2022

Ok, I will check for that. Just curious, what software are you using to view the file? It seems like it would be tricky to view the file given its large size.

(Just off the first few, I found documents with the same ID according to your construction had different IC50 (nM) and/or Author/Institution/PMID/etc.)

In this case would we turn the Author/Institution/etc. into an array or would we separate those entries? ... in the case we separate those entities, we couldn't use your idea for the _id key

@andrewsu
Member

andrewsu commented Jul 1, 2022

> Ok, I will check for that. Just curious, what software are you using to view the file? It seems like it would be tricky to view the file given its large size.

That's a screenshot in Excel. I typically use command-line tools (like awk) to extract a very small subset of the file before trying to load it...

> (Just off the first few, I found documents with the same ID according to your construction had different IC50 (nM) and/or Author/Institution/PMID/etc.)
>
> In this case would we turn the Author/Institution/etc. into an array or would we separate those entries? ... in the case we separate those entities, we couldn't use your idea for the _id key

I think Ki, IC50, Kd, EC50, kon, and koff can all be treated the same -- put them in an array. I think that'd be fine for Author and Institution too, but can you post a couple examples, and if possible, a count of the number of times this occurs? Depending on those answers, we might add one level of grouping (under something like references) where those fields are grouped together before assembling into an array.

@rjawesome
Collaborator

rjawesome commented Jul 1, 2022

Here is the number of entries that share an ID but have differing values, per key (computed over the first 300k entries in the table):

{
"BindingDB Reactant_set_id":37936,
"IC50 (nM)":19365,
"PMID":7147,
"pH":6401,
"Authors":9407,
"Article DOI":6694,
"Temp (C)":6480,
"Institution":6862,
"Ki (nM)":9881,
"Target Name Assigned by Curator or DataSource":725,
"Link to Ligand-Target Pair in BindingDB":725,
"Link to Target in BindingDB":726,
"Kd (nM)":1138,
"Curation/DataSource":1753,
"Patent Number":4203,
"EC50 (nM)":7173,
"Number of Protein Chains in Target (>1 implies a multichain complex)":433,
"UniProt (TrEMBL) Primary ID of Target Chain":298,
"PubChem AID":11479,
"Target Source Organism According to Curator or DataSource":106,
"PDB ID(s) for Ligand-Target Complex":29,
"kon (M-1-s-1)":68,
"koff (s-1)":48
}
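
Counts like these could be produced along the following lines. This is a guess at the method, not the script that was actually run: it compares every later document against the first one seen with the same "_id" and tallies each key whose value differs. The function names are hypothetical.

```python
from collections import Counter


def flatten(doc):
    """Merge the subject, object and relation sections into one flat dict."""
    flat = {}
    for section in ("subject", "object", "relation"):
        flat.update(doc.get(section, {}))
    return flat


def count_differing_keys(docs):
    """Count, per key, how often a document differs from the first with its _id."""
    first_seen = {}
    counts = Counter()
    for doc in docs:
        flat = flatten(doc)
        reference = first_seen.setdefault(doc["_id"], flat)
        if reference is flat:
            continue  # first document with this _id, nothing to compare yet
        for key in reference.keys() | flat.keys():
            if reference.get(key) != flat.get(key):
                counts[key] += 1
    return counts
```

A key that is present in one document and missing in the other counts as a difference, which is how a field such as PMID would show up in the tally.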

Here are some example documents where the ID is the same but the authors are different

{
  "subject": {
    "Target Name Assigned by Curator or DataSource": "Protein kinase C alpha type",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=599&target=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
    "BindingDB Target Chain  Sequence": "MADVFPGNDSTASQDVANRFARKGALRQKNVHEVKDHKFIARFFKQPTFCSHCTDFIWGFGKQGFQCQVCCFVVHKRCHEFVTFSCPGADKGPDTDDPRSKHKFKIHTYGSPTFCDHCGSLLYGLIHQGMKCDTCDMNVHKQCVINVPSLCGMDHTEKRGRIYLKAEVADEKLHVTVRDAKNLIPMDPNGLSDPYVKLKLIPDPKNESKQKTKTIRSTLNPQWNESFTFKLKPSDKDRRLSVEIWDWDRTTRNDFMGSLSFGVSELMKMPASGWYKLLNQEEGEYYNVPIPEGDEEGNMELRQKFEKAKLGPAGNKVISPSEDRKQPSNNLDRVKLTDFNFLMVLGKGSFGKVMLADRKGTEELYAIKILKKDVVIQDDDVECTMVEKRVLALLDKPPFLTQLHSCFQTVDRLYFVMEYVNGGDLMYHIQQVGKFKEPQAVFYAAEISIGLFFLHKRGIIYRDLKLDNVMLDSEGHIKIADFGMCKEHMMDGVTTRTFCGTPDYIAPEIIAYQPYGKSVDWWAYGVLLYEMLAGQPPFDGEDEDELFQSIMEHNVSYPKSLSKEAVSVCKGLMTKHPAKRLGCGPEGERDVREHAFFRRIDWEKLENREIQPPFKPKVCGKGAENFDKFFTRGQPVLTPPDQLVIANIDQSDFEGFSYVNPQFVHPILQSAV",
    "PDB ID(s) of Target Chain": "4DNL,4RA4",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Protein kinase C alpha type",
    "UniProt (SwissProt) Entry Name of Target Chain": "KPCA_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P17252",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "B5BU22,Q15137,Q32M72,Q96RE4",
    "UniProt (TrEMBL) Primary ID of Target Chain": "L7RSM7"
  },
  "object": {
    "BindingDB Reactant_set_id": "4491",
    "Ligand SMILES": "OC(=O)c1cccc(O)c1C(=O)c1c(O)cc(cc1O)C(=O)O[C@@H]1CCCNC[C@H]1NC(=O)c1ccc(O)cc1",
    "Ligand InChI": "InChI=1S/C28H26N2O10/c31-16-8-6-14(7-9-16)26(36)30-18-13-29-10-2-5-22(18)40-28(39)15-11-20(33)24(21(34)12-15)25(35)23-17(27(37)38)3-1-4-19(23)32/h1,3-4,6-9,11-12,18,22,29,31-34H,2,5,10,13H2,(H,30,36)(H,37,38)/t18-,22-/m1/s1",
    "Ligand InChI Key": "XYUFCXJZFZPEJD-XMSQKQJNSA-N",
    "BindingDB MonomerID": "3149",
    "BindingDB Ligand Name": "2-{[2,6-dihydroxy-4-({[(3R,4R)-3-[(4-hydroxybenzene)amido]azepan-4-yl]oxy}carbonyl)phenyl]carbonyl}-3-hydroxybenzoic acid::Acyclic Balanol Analog (-)-1::Balanol analog 1::Balanol, 1::CHEMBL60254",        
    "PubChem CID": "5287736",
    "PubChem SID": "8032894",
    "ChEMBL ID of Ligand": "CHEMBL60254",
    "ZINC ID of Ligand": "ZINC03871640",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "1"
  },
  "relation": {
    "IC50 (nM)": " 30",
    "Curation/DataSource": "Curated from the literature by BindingDB",
    "Article DOI": "10.1021/jm960581w",
    "PMID": "8978850",
    "Authors": "Defauw, JM; Murphy, MM; Jagdmann, GE; Hu, H; Lampe, JW; Hollinshead, SP; Mitchell, TJ; Crane, HM; Heerding, JM; Mendoza, JS; Davis, JE; Darges, JW; Hubbard, FR; Hall, SE",
    "Institution": "Sphinx Laboratories",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=3149",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=3149&enzyme=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
    "Ligand HET ID in PDB": "BA1",
    "PDB ID(s) for Ligand-Target Complex": "1BX6,3KRX,3KRW"
  },
  "_id": "3149-P17252"
}
{
  "subject": {
    "Target Name Assigned by Curator or DataSource": "Protein kinase C alpha type",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=599&target=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
    "BindingDB Target Chain  Sequence": "MADVFPGNDSTASQDVANRFARKGALRQKNVHEVKDHKFIARFFKQPTFCSHCTDFIWGFGKQGFQCQVCCFVVHKRCHEFVTFSCPGADKGPDTDDPRSKHKFKIHTYGSPTFCDHCGSLLYGLIHQGMKCDTCDMNVHKQCVINVPSLCGMDHTEKRGRIYLKAEVADEKLHVTVRDAKNLIPMDPNGLSDPYVKLKLIPDPKNESKQKTKTIRSTLNPQWNESFTFKLKPSDKDRRLSVEIWDWDRTTRNDFMGSLSFGVSELMKMPASGWYKLLNQEEGEYYNVPIPEGDEEGNMELRQKFEKAKLGPAGNKVISPSEDRKQPSNNLDRVKLTDFNFLMVLGKGSFGKVMLADRKGTEELYAIKILKKDVVIQDDDVECTMVEKRVLALLDKPPFLTQLHSCFQTVDRLYFVMEYVNGGDLMYHIQQVGKFKEPQAVFYAAEISIGLFFLHKRGIIYRDLKLDNVMLDSEGHIKIADFGMCKEHMMDGVTTRTFCGTPDYIAPEIIAYQPYGKSVDWWAYGVLLYEMLAGQPPFDGEDEDELFQSIMEHNVSYPKSLSKEAVSVCKGLMTKHPAKRLGCGPEGERDVREHAFFRRIDWEKLENREIQPPFKPKVCGKGAENFDKFFTRGQPVLTPPDQLVIANIDQSDFEGFSYVNPQFVHPILQSAV",
    "PDB ID(s) of Target Chain": "4DNL,4RA4",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Protein kinase C alpha type",
    "UniProt (SwissProt) Entry Name of Target Chain": "KPCA_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P17252",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "B5BU22,Q15137,Q32M72,Q96RE4",
    "UniProt (TrEMBL) Primary ID of Target Chain": "L7RSM7"
  },
  "object": {
    "BindingDB Reactant_set_id": "4239",
    "Ligand SMILES": "OC(=O)c1cccc(O)c1C(=O)c1c(O)cc(cc1O)C(=O)O[C@@H]1CCCNC[C@H]1NC(=O)c1ccc(O)cc1",
    "Ligand InChI": "InChI=1S/C28H26N2O10/c31-16-8-6-14(7-9-16)26(36)30-18-13-29-10-2-5-22(18)40-28(39)15-11-20(33)24(21(34)12-15)25(35)23-17(27(37)38)3-1-4-19(23)32/h1,3-4,6-9,11-12,18,22,29,31-34H,2,5,10,13H2,(H,30,36)(H,37,38)/t18-,22-/m1/s1",
    "Ligand InChI Key": "XYUFCXJZFZPEJD-XMSQKQJNSA-N",
    "BindingDB MonomerID": "3149",
    "BindingDB Ligand Name": "2-{[2,6-dihydroxy-4-({[(3R,4R)-3-[(4-hydroxybenzene)amido]azepan-4-yl]oxy}carbonyl)phenyl]carbonyl}-3-hydroxybenzoic acid::Acyclic Balanol Analog (-)-1::Balanol analog 1::Balanol, 1::CHEMBL60254",        
    "PubChem CID": "5287736",
    "PubChem SID": "8032894",
    "ChEMBL ID of Ligand": "CHEMBL60254",
    "ZINC ID of Ligand": "ZINC03871640",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "1"
  },
  "relation": {
    "IC50 (nM)": " 74",
    "pH": "7.5",
    "Temp (C)": "30.00 C",
    "Curation/DataSource": "Curated from the literature by BindingDB",
    "Article DOI": "10.1016/0960-894X(95)00365-Z",
    "Authors": "Lai, YS; Menaldino, DS; Nichols, JB; Jagdmann , GE; Mylott, F; Gillespie, J; Hall, SE",
    "Institution": "Sphinx Laboratories",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=3149",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=3149&enzyme=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
    "Ligand HET ID in PDB": "BA1",
    "PDB ID(s) for Ligand-Target Complex": "1BX6,3KRX,3KRW"
  },
  "_id": "3149-P17252"
}

For these two documents, the differing fields were 'Authors', 'PMID', 'Temp (C)', 'BindingDB Reactant_set_id', 'pH', 'IC50 (nM)', and 'Article DOI'.

Also, just a note: if we are combining documents, I believe the code would have to keep all previous documents in memory, which could cause high RAM usage (possibly around the size of the TSV itself).
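
The memory concern can be sidestepped if the documents arrive grouped by "_id", because then only one group needs to be held at a time. The sketch below assumes such an ordering. The thread does not say the raw file is sorted that way, so a separate sort step on disk would likely be needed first. `merge_sorted` is a hypothetical name, and turning "relation" into a list of objects is just one possible output shape.

```python
from itertools import groupby
from operator import itemgetter


def merge_sorted(docs):
    """Merge consecutive documents that share an _id.

    Assumes `docs` is ordered so that equal _id values are adjacent.
    The merged document keeps the first subject/object and stores every
    relation as one entry of a list.
    """
    for _id, group in groupby(docs, key=itemgetter("_id")):
        group = list(group)  # only the current group is held in memory
        merged = dict(group[0])
        merged["relation"] = [doc["relation"] for doc in group]
        yield merged
```

Because this is a generator, each merged document can be written out as soon as its group ends instead of accumulating the whole result.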

@rjawesome
Collaborator

Also, this doesn't apply only to fields in the relation; fields in the subject can also differ between documents with the same ID. Here is an example where the fields 'Link to Target in BindingDB', 'Link to Ligand-Target Pair in BindingDB', 'IC50 (nM)', and 'Target Name Assigned by Curator or DataSource' are different.

{
  "subject": {
    "Target Name Assigned by Curator or DataSource": "CDK2/CycE",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=81&target=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search",
    "BindingDB Target Chain  Sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
    "PDB ID(s) of Target Chain": "1CKP,1DI8,1DM2,1E1V,1E1X,1E9H,1F5Q,1FIN,1FQ1,1FVT,1FVV,1G5S,1GIH,1GY3,1GZ8,1H00,1H01,1H07,1H08,1H0V,1H0W,1H1P,1H1Q,1H1R,1H1S,1H24,1H25,1H26,1H27,1H28,1HCK,1HCL,1JST,1JSU,1JSV,1JVP,1KE5,1KE6,1KE7,1KE8,1KE9,1OGU,1OI9,1OIQ,1OIR,1OIU,1OIY,1OKV,1OKW,1OL1,1OL2,1P2A,1P5E,1PF8,1PKD,1PW2,1PXI,1PXJ,1PXK,1PXL,1PXM,1PXN,1PXO,1PXP,1PYE,1QMZ,1R78,1URC,1URW,1V1K,1VYW,1VYZ,1W0X,1W8C,1W98,1WCC,1Y8Y,1Y91,1YKR,2A0C,2A4L,2B52,2B53,2B54,2B55,2BHE,2BHH,2BKZ,2BPM,2BTR,2BTS,2C4G,2C5N,2C5O,2C5V,2C5X,2C5Y,2C68,2C69,2C6I,2C6K,2C6L,2C6M,2C6O,2C6T,2CCH,2CCI,2CJM,2CLX,2DUV,2EXM,2FVD,2G9X,2I40,2J9M,2JGZ,2R3F,2R3G,2R3H,2R3I,2R3J,2R3K,2R3L,2R3M,2R3N,2R3O,2R3P,2R3Q,2R3R,2R64,2UUE,2UZB,2UZD,2UZE,2UZL,2UZN,2UZO,2V0D,2V22,2VTA,2VTH,2VTI,2VTJ,2VTL,2VTM,2VTN,2VTO,2VTP,2VTQ,2VTR,2VTS,2VTT,2VU3,2VV9,2W05,2W06,2W17,2W1H,2WEV,2WFY,2WHB,2WIH,2WIP,2WMA,2WMB,2WPA,2WXV,2X1N,2XMY,2XNB,3BHT,3BHU,3BHV,3DDP,3DDQ,3DOG,3EID,3EJ1,3EOC,3EZR,3EZV,3F5X,3FZ1,3IG7,3IGG,3LE6,3LFN,3LFQ,3LFS,3MY5,3NS9,3PJ8,3PXF,3PXQ,3PXR,3PXY,3PXZ,3PY0,3PY1,3QHR,3QHW,3QL8,3QQF,3QQG,3QQH,3QQJ,3QQK,3QQL,3QRT,3QRU,3QTQ,3QTR,3QTS,3QTU,3QTW,3QTX,3QTZ,3QU0,3QWJ,3QWK,3QX2,3QX4,3QXO,3QXP,3QZF,3QZG,3QZH,3QZI,3R1Q,3R1S,3R1Y,3R28,3R6X,3R71,3R73,3R7E,3R7I,3R7U,3R7V,3R7Y,3R83,3R8L,3R8M,3R8P,3R8U,3R8V,3R8Z,3R9D,3R9H,3R9N,3R9O,3RAH,3RAI,3RAK,3RAL,3RJC,3RK5,3RK7,3RK9,3RKB,3RM6,3RM7,3RMF,3RNI,3ROY,3RPO,3RPR,3RPV,3RPY,3RZB,3S00,3S0O,3S1H,3S2P,3SQQ,3SW4,3SW7,3TI1,3TIY,3TIZ,3TNW,3ULI,3UNJ,3UNK,3WBL,4ACM,4BCK,4BCM,4BCN,4BCO,4BCP,4BCQ,4BGH,4BZD,4CFM,4CFN,4CFU,4CFV,4CFW,4CFX,4D1X,4D1Z,4EK3,4EK4,4EK5,4EK6,4EK8,4EOQ,4EOR,4EOS,4ERW,4EZ3,4EZ7,4FKG,4FKI,4FKJ,4FKL,4FKO,4FKP,4FKQ,4FKR,4FKS,4FKT,4FKU,4FKV,4FKW,4FX3,4GCJ,4I3Z,4II5,4KD1,4LYN,4NJ3,4RJ3,5A14,5AND,5ANE,5ANG,5ANI,5ANJ,5ANK,5ANO,5CYI,5D1J,5FP5,5FP6,5IEV,5IEX,5IEY,5IF1,5JQ5,5JQ8,5K4J,5L2W,5LMK,5MHQ,5NEV,5OO0,5OSJ,5UQ1,5UQ2,5UQ3,6ATH,6GUB,6GUC,6GUE,6GUF,6GUH,6GUK,6GVA,6INL,6JGM,6OQI,6P3W,6Q3B,6Q3C,6Q3F,6Q48,6Q49,6Q4A,6Q4B,6Q4C,6Q4D,6Q4E,6Q4F,6Q4H,6Q4J,6Q4K,6RIJ,6SG4,7ACK,7B5L,7B5R,7B7S,7E34,7KJS,7M2F,7NVQ,7RA5",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Cyclin-dependent kinase 2",
    "UniProt (SwissProt) Entry Name of Target Chain": "CDK2_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P24941",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "A8K7C6,O75100"
  },
  "object": {
    "BindingDB Reactant_set_id": "10166",
    "Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
    "Ligand InChI": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
    "Ligand InChI Key": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
    "BindingDB MonomerID": "6221",
    "BindingDB Ligand Name": "8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1",
    "PubChem CID": "5330199",
    "PubChem SID": "8035820",
    "ZINC ID of Ligand": "ZINC12354795",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "2"
  },
  "relation": {
    "IC50 (nM)": " 410",
    "Curation/DataSource": "Curated from the literature by BindingDB",
    "Article DOI": "10.1021/jm000271k",
    "PMID": "11101352",
    "Authors": "Barvian, M; Boschelli, DH; Cossrow, J; Dobrusin, E; Fattaey, A; Fritsch, A; Fry, D; Harvey, P; Keller, P; Garrett, M; La, F; Leopold, W; McNamara, D; Quin, M; Trumpp-Kallmeyer, S; Toogood, P; Wu, Z; Zhang, E",
    "Institution": "Parke-Davis Pharmaceutical Research",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search"
  },
  "_id": "6221-P24941"
}
{
  "subject": {
    "Target Name Assigned by Curator or DataSource": "Cyclin-Dependent Kinase 2 (CDK2)",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=97&target=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",       
    "BindingDB Target Chain  Sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
    "PDB ID(s) of Target Chain": "1CKP,1DI8,1DM2,1E1V,1E1X,1E9H,1F5Q,1FIN,1FQ1,1FVT,1FVV,1G5S,1GIH,1GY3,1GZ8,1H00,1H01,1H07,1H08,1H0V,1H0W,1H1P,1H1Q,1H1R,1H1S,1H24,1H25,1H26,1H27,1H28,1HCK,1HCL,1JST,1JSU,1JSV,1JVP,1KE5,1KE6,1KE7,1KE8,1KE9,1OGU,1OI9,1OIQ,1OIR,1OIU,1OIY,1OKV,1OKW,1OL1,1OL2,1P2A,1P5E,1PF8,1PKD,1PW2,1PXI,1PXJ,1PXK,1PXL,1PXM,1PXN,1PXO,1PXP,1PYE,1QMZ,1R78,1URC,1URW,1V1K,1VYW,1VYZ,1W0X,1W8C,1W98,1WCC,1Y8Y,1Y91,1YKR,2A0C,2A4L,2B52,2B53,2B54,2B55,2BHE,2BHH,2BKZ,2BPM,2BTR,2BTS,2C4G,2C5N,2C5O,2C5V,2C5X,2C5Y,2C68,2C69,2C6I,2C6K,2C6L,2C6M,2C6O,2C6T,2CCH,2CCI,2CJM,2CLX,2DUV,2EXM,2FVD,2G9X,2I40,2J9M,2JGZ,2R3F,2R3G,2R3H,2R3I,2R3J,2R3K,2R3L,2R3M,2R3N,2R3O,2R3P,2R3Q,2R3R,2R64,2UUE,2UZB,2UZD,2UZE,2UZL,2UZN,2UZO,2V0D,2V22,2VTA,2VTH,2VTI,2VTJ,2VTL,2VTM,2VTN,2VTO,2VTP,2VTQ,2VTR,2VTS,2VTT,2VU3,2VV9,2W05,2W06,2W17,2W1H,2WEV,2WFY,2WHB,2WIH,2WIP,2WMA,2WMB,2WPA,2WXV,2X1N,2XMY,2XNB,3BHT,3BHU,3BHV,3DDP,3DDQ,3DOG,3EID,3EJ1,3EOC,3EZR,3EZV,3F5X,3FZ1,3IG7,3IGG,3LE6,3LFN,3LFQ,3LFS,3MY5,3NS9,3PJ8,3PXF,3PXQ,3PXR,3PXY,3PXZ,3PY0,3PY1,3QHR,3QHW,3QL8,3QQF,3QQG,3QQH,3QQJ,3QQK,3QQL,3QRT,3QRU,3QTQ,3QTR,3QTS,3QTU,3QTW,3QTX,3QTZ,3QU0,3QWJ,3QWK,3QX2,3QX4,3QXO,3QXP,3QZF,3QZG,3QZH,3QZI,3R1Q,3R1S,3R1Y,3R28,3R6X,3R71,3R73,3R7E,3R7I,3R7U,3R7V,3R7Y,3R83,3R8L,3R8M,3R8P,3R8U,3R8V,3R8Z,3R9D,3R9H,3R9N,3R9O,3RAH,3RAI,3RAK,3RAL,3RJC,3RK5,3RK7,3RK9,3RKB,3RM6,3RM7,3RMF,3RNI,3ROY,3RPO,3RPR,3RPV,3RPY,3RZB,3S00,3S0O,3S1H,3S2P,3SQQ,3SW4,3SW7,3TI1,3TIY,3TIZ,3TNW,3ULI,3UNJ,3UNK,3WBL,4ACM,4BCK,4BCM,4BCN,4BCO,4BCP,4BCQ,4BGH,4BZD,4CFM,4CFN,4CFU,4CFV,4CFW,4CFX,4D1X,4D1Z,4EK3,4EK4,4EK5,4EK6,4EK8,4EOQ,4EOR,4EOS,4ERW,4EZ3,4EZ7,4FKG,4FKI,4FKJ,4FKL,4FKO,4FKP,4FKQ,4FKR,4FKS,4FKT,4FKU,4FKV,4FKW,4FX3,4GCJ,4I3Z,4II5,4KD1,4LYN,4NJ3,4RJ3,5A14,5AND,5ANE,5ANG,5ANI,5ANJ,5ANK,5ANO,5CYI,5D1J,5FP5,5FP6,5IEV,5IEX,5IEY,5IF1,5JQ5,5JQ8,5K4J,5L2W,5LMK,5MHQ,5NEV,5OO0,5OSJ,5UQ1,5UQ2,5UQ3,6ATH,6GUB,6GUC,6GUE,6GUF,6GUH,6GUK,6GVA,6INL,6JGM,6OQI,6P3W,6Q3B,6Q3C,6Q3F,6Q48,6Q49,6Q4A,6Q4B,6Q4C,6Q4D,6Q4E,6Q4F,6Q4H,6Q4J,6Q4K,6RIJ,6SG4,7ACK,7B5L,7B5R,7B7S,7E34,7KJS,7M2F,7NVQ,7RA5",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Cyclin-dependent kinase 2",
    "UniProt (SwissProt) Entry Name of Target Chain": "CDK2_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P24941",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "A8K7C6,O75100"
  },
  "object": {
    "BindingDB Reactant_set_id": "10159",
    "Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
    "Ligand InChI": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
    "Ligand InChI Key": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
    "BindingDB MonomerID": "6221",
    "BindingDB Ligand Name": "8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1",
    "PubChem CID": "5330199",
    "PubChem SID": "8035820",
    "ZINC ID of Ligand": "ZINC12354795",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "2"
  },
  "relation": {
    "IC50 (nM)": " 129",
    "Curation/DataSource": "Curated from the literature by BindingDB",
    "Article DOI": "10.1021/jm000271k",
    "PMID": "11101352",
    "Authors": "Barvian, M; Boschelli, DH; Cossrow, J; Dobrusin, E; Fattaey, A; Fritsch, A; Fry, D; Harvey, P; Keller, P; Garrett, M; La, F; Leopold, W; McNamara, D; Quin, M; Trumpp-Kallmeyer, S; Toogood, P; Wu, Z; Zhang, E",
    "Institution": "Parke-Davis Pharmaceutical Research",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search"
  },
  "_id": "6221-P24941"
}

@colleenXu

Just a note (Andrew and you are the decision-makers): is there a way to organize the documents by relationship? Maybe by what fields are present, or something else (values in the fields, organization of database, names of files)?

For example, if a document has "inhibition-specific" fields like IC50 and Ki, does that mean it will definitely lack the fields for EC50 (which can be "agonist / stimulator" but can also be more general "effect") and Kd (more general to receptor-ligand binding) (this source is helpful)? To me, that would imply that the document is representing an "inhibition" relationship that is more specific than "this binds to that"...

It would help the ingestion into BTE a lot if we could put a keyword under relation that defined the relationship this document actually represents. Like "inhibition", "stimulates", "binds to".....something like that.

@rjawesome
Collaborator

rjawesome commented Jul 2, 2022

I could probably look into other things, but I have found that some documents (~2000 of the first 300000) have both Ki and EC50, so I don't believe your method would work unless we allow multiple relationships per document. Alternatively, we could omit the relationship for documents that have more than one of these fields.
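
A check of this kind, counting which measurement fields are filled in together, might look like the sketch below. It assumes documents that still have one string value per field (that is, before any merging into arrays), and the function names are hypothetical.

```python
from collections import Counter

# Measurement columns named in this thread.
AFFINITY_KEYS = (
    "Ki (nM)", "IC50 (nM)", "Kd (nM)", "EC50 (nM)", "kon (M-1-s-1)", "koff (s-1)",
)


def measured_fields(relation):
    """Return the measurement keys that hold a non-blank string value."""
    return tuple(key for key in AFFINITY_KEYS if relation.get(key, "").strip())


def count_combinations(docs):
    """Count how often each combination of measurement fields occurs."""
    return Counter(measured_fields(doc["relation"]) for doc in docs)
```

Any combination with more than one key, such as ("Ki (nM)", "EC50 (nM)"), marks documents where a single relationship label derived from field presence would be ambiguous.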

@andrewsu
Member

andrewsu commented Jul 5, 2022

For now then, I think we should not worry about trying to characterize the relationship in more detail. Let's just add a top-level key for predicate with a value of 'physically interacts with'.

Also @rjawesome, can you add a mapping table between the original column names and the corresponding key to use in the JSON? For example, PDB ID(s) of Target Chain can be converted to just pdb, Ligand SMILES -> smiles, PubChem CID -> pubchem_cid. If you can create a document with the original column names we are using in the output JSON, I can provide the appropriate mapped values...

@rjawesome
Collaborator

rjawesome commented Jul 5, 2022

All column names are located in this sample doc (#70 (comment)). I will add the predicate and start on coding a mapping table, but I was wondering what our decision was relating to the documents with duplicate IDs?

@andrewsu
Member

andrewsu commented Jul 6, 2022

In cases where the _id is duplicated, let's combine the relation info as objects in an array. Using the example you describe in this comment, the updated document would look something like this:

{
  "subject": {
    "Target Name Assigned by Curator or DataSource": "Protein kinase C alpha type",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=599&target=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
    "BindingDB Target Chain  Sequence": "MADVFPGNDSTASQDVANRFARKGALRQKNVHEVKDHKFIARFFKQPTFCSHCTDFIWGFGKQGFQCQVCCFVVHKRCHEFVTFSCPGADKGPDTDDPRSKHKFKIHTYGSPTFCDHCGSLLYGLIHQGMKCDTCDMNVHKQCVINVPSLCGMDHTEKRGRIYLKAEVADEKLHVTVRDAKNLIPMDPNGLSDPYVKLKLIPDPKNESKQKTKTIRSTLNPQWNESFTFKLKPSDKDRRLSVEIWDWDRTTRNDFMGSLSFGVSELMKMPASGWYKLLNQEEGEYYNVPIPEGDEEGNMELRQKFEKAKLGPAGNKVISPSEDRKQPSNNLDRVKLTDFNFLMVLGKGSFGKVMLADRKGTEELYAIKILKKDVVIQDDDVECTMVEKRVLALLDKPPFLTQLHSCFQTVDRLYFVMEYVNGGDLMYHIQQVGKFKEPQAVFYAAEISIGLFFLHKRGIIYRDLKLDNVMLDSEGHIKIADFGMCKEHMMDGVTTRTFCGTPDYIAPEIIAYQPYGKSVDWWAYGVLLYEMLAGQPPFDGEDEDELFQSIMEHNVSYPKSLSKEAVSVCKGLMTKHPAKRLGCGPEGERDVREHAFFRRIDWEKLENREIQPPFKPKVCGKGAENFDKFFTRGQPVLTPPDQLVIANIDQSDFEGFSYVNPQFVHPILQSAV",
    "PDB ID(s) of Target Chain": "4DNL,4RA4",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Protein kinase C alpha type",
    "UniProt (SwissProt) Entry Name of Target Chain": "KPCA_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P17252",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "B5BU22,Q15137,Q32M72,Q96RE4",
    "UniProt (TrEMBL) Primary ID of Target Chain": "L7RSM7"
  },
  "object": {
    "BindingDB Reactant_set_id": "4491",
    "Ligand SMILES": "OC(=O)c1cccc(O)c1C(=O)c1c(O)cc(cc1O)C(=O)O[C@@H]1CCCNC[C@H]1NC(=O)c1ccc(O)cc1",
    "Ligand InChI": "InChI=1S/C28H26N2O10/c31-16-8-6-14(7-9-16)26(36)30-18-13-29-10-2-5-22(18)40-28(39)15-11-20(33)24(21(34)12-15)25(35)23-17(27(37)38)3-1-4-19(23)32/h1,3-4,6-9,11-12,18,22,29,31-34H,2,5,10,13H2,(H,30,36)(H,37,38)/t18-,22-/m1/s1",
    "Ligand InChI Key": "XYUFCXJZFZPEJD-XMSQKQJNSA-N",
    "BindingDB MonomerID": "3149",
    "BindingDB Ligand Name": "2-{[2,6-dihydroxy-4-({[(3R,4R)-3-[(4-hydroxybenzene)amido]azepan-4-yl]oxy}carbonyl)phenyl]carbonyl}-3-hydroxybenzoic acid::Acyclic Balanol Analog (-)-1::Balanol analog 1::Balanol, 1::CHEMBL60254",        
    "PubChem CID": "5287736",
    "PubChem SID": "8032894",
    "ChEMBL ID of Ligand": "CHEMBL60254",
    "ZINC ID of Ligand": "ZINC03871640",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "1"
  },
  "relation": [
    {
      "BindingDB Reactant_set_id": "4491",
      "IC50 (nM)": " 30",
      "Curation/DataSource": "Curated from the literature by BindingDB",
      "Article DOI": "10.1021/jm960581w",
      "PMID": "8978850",
      "Authors": "Defauw, JM; Murphy, MM; Jagdmann, GE; Hu, H; Lampe, JW; Hollinshead, SP; Mitchell, TJ; Crane, HM; Heerding, JM; Mendoza, JS; Davis, JE; Darges, JW; Hubbard, FR; Hall, SE",
      "Institution": "Sphinx Laboratories",
      "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=3149",
      "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=3149&enzyme=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
      "Ligand HET ID in PDB": "BA1",
      "PDB ID(s) for Ligand-Target Complex": "1BX6,3KRX,3KRW"
    }, {
      "BindingDB Reactant_set_id": "4239",
      "IC50 (nM)": " 74",
      "pH": "7.5",
      "Temp (C)": "30.00 C",
      "Curation/DataSource": "Curated from the literature by BindingDB",
      "Article DOI": "10.1016/0960-894X(95)00365-Z",
      "Authors": "Lai, YS; Menaldino, DS; Nichols, JB; Jagdmann , GE; Mylott, F; Gillespie, J; Hall, SE",
      "Institution": "Sphinx Laboratories",
      "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=3149",
      "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=3149&enzyme=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
      "Ligand HET ID in PDB": "BA1",
      "PDB ID(s) for Ligand-Target Complex": "1BX6,3KRX,3KRW"
    }
  ],
  "_id": "3149-P17252"
}

Note also that the BindingDB Reactant_set_id should be moved from the object section to the relation section.
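
A rough sketch of that merge step, in case it helps. It assumes each parsed row has already been turned into a record with `_id`, `subject`, `object`, and a single `relation` dict; `merge_by_id` is a hypothetical helper name, and the sample records are abbreviated from the example above.

```python
def merge_by_id(records):
    """Combine records sharing an _id; their relation dicts are collected in a list."""
    merged = {}
    for record in records:
        doc = merged.get(record["_id"])
        if doc is None:
            # First time this _id is seen: start a document with an empty relation list.
            doc = {
                "_id": record["_id"],
                "subject": record["subject"],
                "object": record["object"],
                "relation": [],
            }
            merged[record["_id"]] = doc
        doc["relation"].append(record["relation"])
    return list(merged.values())


# Abbreviated versions of the two rows behind the example document.
sample_records = [
    {
        "_id": "3149-P17252",
        "subject": {"UniProt (SwissProt) Primary ID of Target Chain": "P17252"},
        "object": {"BindingDB MonomerID": "3149"},
        "relation": {"BindingDB Reactant_set_id": "4491", "IC50 (nM)": " 30"},
    },
    {
        "_id": "3149-P17252",
        "subject": {"UniProt (SwissProt) Primary ID of Target Chain": "P17252"},
        "object": {"BindingDB MonomerID": "3149"},
        "relation": {"BindingDB Reactant_set_id": "4239", "IC50 (nM)": " 74"},
    },
]

docs = merge_by_id(sample_records)
print(len(docs), len(docs[0]["relation"]))
```

This version keeps the subject and object of the first record it sees for each `_id`; it does not reconcile subject or object fields that differ between rows.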

@rjawesome
Collaborator

rjawesome commented Jul 6, 2022

I can do this, but as I pointed out earlier, there are other fields that would seem to best fit in the object/subject which are also duplicated, such as Target Name Assigned by Curator or DataSource, Link to Target in BindingDB, UniProt (TrEMBL) Primary ID of Target Chain, and Target Source Organism According to Curator or DataSource. Should I move those to the relation?

@andrewsu
Member

andrewsu commented Jul 6, 2022

All the ones you explicitly listed in your last comment -- Target Name Assigned by Curator or DataSource, Link to Target in BindingDB, UniProt (TrEMBL) Primary ID of Target Chain, and Target Source Organism According to Curator or DataSource -- should remain in the subject section, and those values should be converted to arrays if they differ between records.

I also just noticed that Link to Ligand in BindingDB shows up in relation -- that should be moved to object.

Post here if there are any other fields whose behavior needs discussion...
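
One possible way to express the "convert to arrays if they differ" rule, as a sketch only. `merge_field` is a hypothetical helper, and it assumes both values are present (no handling of missing fields).

```python
def merge_field(existing, incoming):
    """Merge two values for one field; return a list only if the values differ."""
    values = list(existing) if isinstance(existing, list) else [existing]
    if incoming not in values:
        values.append(incoming)
    return values[0] if len(values) == 1 else values


# Same organism on both rows: the value stays a plain string.
print(merge_field("Homo sapiens", "Homo sapiens"))

# Different target names on the two rows: the value becomes a list.
print(merge_field("Cyclin-Dependent Kinase 2 (CDK2)", "CDK2/CycE"))
```

The helper can be called repeatedly as more rows with the same _id arrive, since it accepts either a scalar or an already-merged list as `existing`.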

@rjawesome
Collaborator

rjawesome commented Jul 6, 2022

Alright, I've updated the parser. Here is a new sample document (all fields that could have duplicates have been turned into arrays):

{
  "subject": {
    "Target Name Assigned by Curator or DataSource": [
      "Cyclin-Dependent Kinase 2 (CDK2)",
      "CDK2/CycE"
    ],
    "Target Source Organism According to Curator or DataSource": [
      "Homo sapiens"
    ],
    "Link to Target in BindingDB": [
      "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=97&target=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
      "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=81&target=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search"
    ],
    "BindingDB Target Chain  Sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
    "PDB ID(s) of Target Chain": [
      [
        "1CKP",
        "1DI8",
        ...
        "7M2F",
        "7NVQ",
        "7RA5"
      ]
    ],
    "UniProt (SwissProt) Recommended Name of Target Chain": [
      "Cyclin-dependent kinase 2"
    ],
    "UniProt (SwissProt) Entry Name of Target Chain": "CDK2_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P24941",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": [
      "A8K7C6",
      "O75100"
    ]
  },
  "object": {
    "Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
    "Ligand InChI": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
    "Ligand InChI Key": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
    "BindingDB MonomerID": 6221,
    "BindingDB Ligand Name": [
      "8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1"
    ],
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221",
    "PubChem CID": 5330199,
    "PubChem SID": 8035820,
    "ZINC ID of Ligand": "ZINC12354795"
  },
  "relation": [
    {
      "BindingDB Reactant_set_id": 10159,
      "IC50 (nM)": " 129",
      "Curation/DataSource": "Curated from the literature by BindingDB",
      "Article DOI": "10.1021/jm000271k",
      "PMID": "11101352",
      "Authors": [
        "Barvian, M",
        "Boschelli, DH",
        "Cossrow, J",
        "Dobrusin, E",
        "Fattaey, A",
        "Fritsch, A",
        "Fry, D",
        "Harvey, P",
        "Keller, P",
        "Garrett, M",
        "La, F",
        "Leopold, W",
        "McNamara, D",
        "Quin, M",
        "Trumpp-Kallmeyer, S",
        "Toogood, P",
        "Wu, Z",
        "Zhang, E"
      ],
      "Institution": "Parke-Davis Pharmaceutical Research",
      "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
      "Number of Protein Chains in Target (>1 implies a multichain complex)": 2
    },
    {
      "BindingDB Reactant_set_id": 10166,
      "IC50 (nM)": " 410",
      "Curation/DataSource": "Curated from the literature by BindingDB",
      "Article DOI": "10.1021/jm000271k",
      "PMID": "11101352",
      "Authors": [
        "Barvian, M",
        "Boschelli, DH",
        "Cossrow, J",
        "Dobrusin, E",
        "Fattaey, A",
        "Fritsch, A",
        "Fry, D",
        "Harvey, P",
        "Keller, P",
        "Garrett, M",
        "La, F",
        "Leopold, W",
        "McNamara, D",
        "Quin, M",
        "Trumpp-Kallmeyer, S",
        "Toogood, P",
        "Wu, Z",
        "Zhang, E"
      ],
      "Institution": "Parke-Davis Pharmaceutical Research",
      "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search",
      "Number of Protein Chains in Target (>1 implies a multichain complex)": 2
    }
  ],
  "_id": "6221-P24941"
}

@rjawesome
Collaborator

Also @andrewsu, you mentioned you wanted the fields to be mapped. If you still want that, could you send what you want each field to be mapped to? (FYI, all fields are located in this document.)

@andrewsu
Member

andrewsu commented Jul 7, 2022

You can use the mapping table below. Note that I also decided to collapse the "swissprot" and "trembl" sets of fields under a single subject.uniprot object, and then we'd add a subject.uniprot.type field that was either "SwissProt" or "TrEMBL", depending on what set of columns the data came from. Let me know if anything here doesn't make sense!

mapping table

| original | mapped |
| --- | --- |
| BindingDB Reactant_set_id | relation.bindingdb_set_id |
| Ligand SMILES | object.smiles |
| Ligand InChI | object.inchi |
| Ligand InChI Key | object.inchikey |
| BindingDB MonomerID | object.monomer_id |
| BindingDB Ligand Name | object.name |
| Target Name Assigned by Curator or DataSource | subject.name |
| Target Source Organism According to Curator or DataSource | subject.organism |
| Ki (nM) | relation.ki_nm |
| IC50 (nM) | relation.ic50_nm |
| Kd (nM) | relation.kd_nm |
| EC50 (nM) | relation.ec50_nm |
| kon (M-1-s-1) | relation.kon |
| koff (s-1) | relation.koff |
| pH | relation.ph |
| Temp (C) | relation.temp_c |
| Curation/DataSource | relation.curation_datasource |
| Article DOI | relation.article_doi |
| PMID | relation.pmid |
| PubChem AID | relation.pubchem_aid |
| Patent Number | relation.patent_number |
| Authors | relation.authors |
| Institution | relation.institution |
| Link to Ligand in BindingDB | object.bindingdb_link |
| Link to Target in BindingDB | subject.bindingdb_link |
| Link to Ligand-Target Pair in BindingDB | relation.bindingdb_link |
| Ligand HET ID in PDB | object.het_id_pdb |
| PDB ID(s) for Ligand-Target Complex | relation.pdb |
| PubChem CID | object.pubchem_cid |
| PubChem SID | object.pubchem_sid |
| ChEBI ID of Ligand | object.chebi |
| ChEMBL ID of Ligand | object.chembl |
| DrugBank ID of Ligand | object.drugbank |
| IUPHAR_GRAC ID of Ligand | object.iuphar_grac_id |
| KEGG ID of Ligand | object.kegg |
| ZINC ID of Ligand | object.zinc |
| Number of Protein Chains in Target (>1 implies a multichain complex) | relation.num_protein_chains |
| BindingDB Target Chain  Sequence | subject.sequence |
| PDB ID(s) of Target Chain | subject.pdb |
| UniProt (SwissProt) Recommended Name of Target Chain | subject.uniprot.fullname |
| UniProt (SwissProt) Entry Name of Target Chain | subject.uniprot.id |
| UniProt (SwissProt) Primary ID of Target Chain | subject.uniprot.accession |
| UniProt (SwissProt) Secondary ID(s) of Target Chain | subject.uniprot.secondary_accession |
| UniProt (SwissProt) Alternative ID(s) of Target Chain | subject.uniprot.alternative_accession |
| UniProt (TrEMBL) Submitted Name of Target Chain | subject.uniprot.fullname |
| UniProt (TrEMBL) Entry Name of Target Chain | subject.uniprot.id |
| UniProt (TrEMBL) Primary ID of Target Chain | subject.uniprot.accession |
| UniProt (TrEMBL) Secondary ID(s) of Target Chain | subject.uniprot.secondary_accession |
| UniProt (TrEMBL) Alternative ID(s) of Target Chain | subject.uniprot.alternative_accession |
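
To illustrate how the dotted names are meant to nest, here is a minimal sketch. It copies only a handful of rows from the table; `set_path` and `apply_mapping` are hypothetical helper names, and it leaves out the subject.uniprot.type handling and the merging of duplicate _id rows.

```python
# A few entries copied from the mapping table; the real mapping has many more.
MAPPING = {
    "Ligand SMILES": "object.smiles",
    "PubChem CID": "object.pubchem_cid",
    "IC50 (nM)": "relation.ic50_nm",
    "Target Name Assigned by Curator or DataSource": "subject.name",
    "UniProt (SwissProt) Entry Name of Target Chain": "subject.uniprot.id",
    "UniProt (SwissProt) Primary ID of Target Chain": "subject.uniprot.accession",
}


def set_path(doc, dotted_key, value):
    """Store value in nested dicts following a dotted key such as 'a.b.c'."""
    *parents, leaf = dotted_key.split(".")
    node = doc
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def apply_mapping(row, mapping=MAPPING):
    """Rename the columns of one TSV row into the nested JSON layout."""
    doc = {}
    for column, value in row.items():
        # Unmapped columns and empty cells are skipped.
        if column in mapping and value not in (None, ""):
            set_path(doc, mapping[column], value)
    return doc


row = {
    "Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
    "PubChem CID": "5330199",
    "IC50 (nM)": " 129",
    "Ki (nM)": "",
    "Target Name Assigned by Curator or DataSource": "Cyclin-Dependent Kinase 2 (CDK2)",
    "UniProt (SwissProt) Entry Name of Target Chain": "CDK2_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P24941",
}
mapped = apply_mapping(row)
print(mapped["subject"]["uniprot"])
```

At this stage `relation` is a single dict per row; it would be collected into an array later, when rows sharing an _id are merged.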

@rjawesome
Collaborator

rjawesome commented Jul 7, 2022

@andrewsu, for the subject.uniprot field, you put the Entry Name as the subject.uniprot.id, so does this mean I should be using the Entry Name in _id instead of the Primary ID? Also, should subject.uniprot be an array to contain both SwissProt and TrEMBL, or should I create separate documents for TrEMBL and SwissProt links?

@andrewsu
Member

andrewsu commented Jul 7, 2022

for the subject.uniprot field, you put the Entry Name as the subject.uniprot.id, so does this mean I should be using the Entry Name in _id instead of the Primary ID?

I think you can continue using the UniProt (SwissProt) Primary ID of Target Chain / subject.uniprot.accession in _id...

Also, should subject.uniprot be an array to contain both SwissProt and TrEMBL or should I create separate documents for TrEMBL and SwissProt links?

separate documents for TrEMBL and SwissProt please...

@rjawesome
Collaborator

rjawesome commented Jul 8, 2022

@andrewsu I've noticed that none of the TrEMBL documents meet my current criterion for determining whether a protein is a human protein (i.e., the entry name ends with _HUMAN). More specifically, none of the TrEMBL documents have their own entry names. We have a few options here:

  • Eliminate TrEMBL Documents
  • Include a TrEMBL Document if the SwissProt entry name contains _HUMAN
  • Broaden our criteria to also include any document with a target species of Homo sapiens as a human protein

@andrewsu
Member

Hmm, looks like for TrEMBL, they populate UniProt (TrEMBL) Primary ID of Target Chain but not UniProt (TrEMBL) Entry Name of Target Chain, which (as you point out) makes the species filtering challenging. We could download a UniProt table to look up the corresponding "Entry Name". But in practice, TrEMBL is a dataset of computationally predicted/annotated proteins, so they are of lesser importance than SwissProt entries.

So, bottom line, let's keep your filtering based on the "Entry Name". In practice, that means that no TrEMBL records will be created. But at least the logic will be in place in case they start populating those columns in the future...
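
In other words, the filter would behave roughly like this sketch. The function name is hypothetical and the column names are taken from the mapping table; the same `_HUMAN` check is applied to both column sets, so TrEMBL records would start appearing only if their entry names get populated.

```python
SWISSPROT_ENTRY = "UniProt (SwissProt) Entry Name of Target Chain"
TREMBL_ENTRY = "UniProt (TrEMBL) Entry Name of Target Chain"


def human_uniprot_types(row):
    """Return which UniProt column sets ('swissprot', 'trembl') pass the _HUMAN check."""
    types = []
    for label, column in (("swissprot", SWISSPROT_ENTRY), ("trembl", TREMBL_ENTRY)):
        entry_name = (row.get(column) or "").strip()
        if entry_name.endswith("_HUMAN"):
            types.append(label)
    return types


# SwissProt entry name present, TrEMBL entry name empty (the situation described above).
print(human_uniprot_types({SWISSPROT_ENTRY: "CDK2_HUMAN", TREMBL_ENTRY: ""}))

# No entry names at all: nothing passes, so no document would be created.
print(human_uniprot_types({SWISSPROT_ENTRY: "", TREMBL_ENTRY: ""}))
```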

@rjawesome
Collaborator

Parser has been updated. New Sample Record below...

{
  "subject": {
    "name": [
      "Cyclin-Dependent Kinase 2 (CDK2)",
      "CDK2/CycE"
    ],
    "organism": [
      "Homo sapiens"
    ],
    "bindingdb_link": [
      "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=97&target=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
      "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=81&target=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search"
    ],
    "uniprot": {
      "type": "swissprot",
      "fullname": [
        "Cyclin-dependent kinase 2"
      ],
      "id": "CDK2_HUMAN",
      "accession": "P24941",
      "secondary_accession": [
        "A8K7C6",
        "O75100"
      ]
    },
    "sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
    "pdb": [
      [
        "1CKP",
        "1DI8",
        ...
        "7NVQ",
        "7RA5"
      ]
    ]
  },
  "object": {
    "smiles": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
    "inchi": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
    "inchikey": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
    "monomer_id": 6221,
    "name": [
      "8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1"
    ],
    "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221",
    "pubchem_cid": 5330199,
    "pubchem_sid": 8035820,
    "zinc": "ZINC12354795"
  },
  "relation": [
    {
      "bindingdb_set_id": 10159,
      "ic50_nm": " 129",
      "curation_datasource": "Curated from the literature by BindingDB",
      "article_doi": "10.1021/jm000271k",
      "pmid": "11101352",
      "authors": [
        "Barvian, M",
        "Boschelli, DH",
        "Cossrow, J",
        "Dobrusin, E",
        "Fattaey, A",
        "Fritsch, A",
        "Fry, D",
        "Harvey, P",
        "Keller, P",
        "Garrett, M",
        "La, F",
        "Leopold, W",
        "McNamara, D",
        "Quin, M",
        "Trumpp-Kallmeyer, S",
        "Toogood, P",
        "Wu, Z",
        "Zhang, E"
      ],
      "institution": "Parke-Davis Pharmaceutical Research",
      "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
      "num_protein_chains": 2
    },
    {
      "bindingdb_set_id": 10166,
      "ic50_nm": " 410",
      "curation_datasource": "Curated from the literature by BindingDB",
      "article_doi": "10.1021/jm000271k",
      "pmid": "11101352",
      "authors": [
        "Barvian, M",
        "Boschelli, DH",
        "Cossrow, J",
        "Dobrusin, E",
        "Fattaey, A",
        "Fritsch, A",
        "Fry, D",
        "Harvey, P",
        "Keller, P",
        "Garrett, M",
        "La, F",
        "Leopold, W",
        "McNamara, D",
        "Quin, M",
        "Trumpp-Kallmeyer, S",
        "Toogood, P",
        "Wu, Z",
        "Zhang, E"
      ],
      "institution": "Parke-Davis Pharmaceutical Research",
      "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search",
      "num_protein_chains": 2
    }
  ],
  "_id": "6221-P24941",
  "predicate": "physically interacts with"
}

@rjawesome
Collaborator

Note: I just fixed a glitch in my parser. Here is another sample record (one that was affected by it).

{
  "subject": {
    "name": "Epidermal growth factor receptor",
    "organism": "Homo sapiens",
    "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=520&target=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
    "uniprot": {
      "type": "swissprot",
      "fullname": "Epidermal growth factor receptor",
      "id": "EGFR_HUMAN",
      "accession": "P00533",
      "secondary_accession": [
        "O00688",
        "O00732",
        "P06268",
        "Q14225",
        "Q68GS5",
        "Q92795",
        "Q9BZS2",
        "Q9GZX1",
        "Q9H2C9",
        "Q9H3C9",
        "Q9UMD7",
        "Q9UMD8",
        "Q9UMG5"
      ]
    },
    "sequence": "MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPSIATGMVGALLLLLVVALGIGLFMRRRHIVRKRTLRRLLQERELVEPLTPSGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQQGFFSSPSTSRTPLLSSLSATSNNSTVACIDRNGLQSCPIKEDSFLQRYSSDPTGALTEDSIDDTFLPVPEYINQSVPKRPAGSVQNPVYHNQPLNPAPSRDPHYQDPHSTAVGNPEYLNTVQPTCVNSTFDSPAHWAQKGSHQISLDNPDYQQDFFPKEAKPNGIFKGSTAENAEYLRVAPQSSEFIGA",
    "pdb": [
      [
        "1IVO",
        "1M14",
        "1M17",
        ...
        "7AEI",
        "7AEM",
        "7OXB"
      ],
      [
        "1IVO",
        "1M14",
        "1M17",
        ...
        "6VHN",
        "6VHP",
        "7AEI",
        "7AEM"
      ]
    ]
  },
  "object": {
    "smiles": "Cc1ccc(cc1)-n1nc(cc1NC(=O)Nc1ccc(OCCN2CCOCC2)c2ccccc12)C(C)(C)C",
    "inchi": "InChI=1S/C31H37N5O3/c1-22-9-11-23(12-10-22)36-29(21-28(34-36)31(2,3)4)33-30(37)32-26-13-14-27(25-8-6-5-7-24(25)26)39-20-17-35-15-18-38-19-16-35/h5-14,21H,15-20H2,1-4H3,(H2,32,33,37)",
    "inchikey": "MVCOAUNKQVWQHZ-UHFFFAOYSA-N",
    "monomer_id": 13533,
    "name": "1-[2-(4-methylphenyl)-5-tert-butyl-pyrazol-3-yl]-3-[4-(2-morpholin-4-ylethoxy)naphthalen-1-yl]urea::1-[5-tert-butyl-2-(4-methylphenyl)-3-pyrazolyl]-3-[4-[2-(4-morpholinyl)ethoxy]-1-naphthalenyl]urea::1-[5-tert-butyl-2-(4-methylphenyl)pyrazol-3-yl]-3-[4-(2-morpholin-4-ylethoxy)naphthalen-1-yl]urea::1-[5-tert-butyl-2-(p-tolyl)pyrazol-3-yl]-3-[4-(2-morpholinoethoxy)-1-naphthyl]urea::3-[2-(4-methylphenyl)-5-tert-butyl-pyrazol-3-yl]-1-[4-(2-morpholin-4-ylethoxy)naphthalen-1-yl]urea::3-[3-tert-butyl-1-(4-methylphenyl)-1H-pyrazol-5-yl]-1-{4-[2-(morpholin-4-yl)ethoxy]naphthalen-1-yl}urea::BIRB 796::BIRB-796::BIRB-796, 3::CHEMBL103667::Doramapimod::US8933228, BIRB 796::US9187470, 43 (BIRB-796)::US9242960, BIRB 796::US9260410, BIRB796::cid_156422::diaryl urea compound 10",
    "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=13533",
    "het_id_pdb": "B96",
    "pubchem_cid": 156422,
    "pubchem_sid": 46513934,
    "chembl": "CHEMBL103667",
    "drugbank": "DB03044",
    "iuphar_grac_id": "5668",
    "zinc": "ZINC24044436"
  },
  "relation": [
    {
      "bindingdb_set_id": 65378,
      "kd_nm": " 7000",
      "curation_datasource": "PubChem",
      "pubchem_aid": "aid1433",
      "authors": [
        "PubChem, PC"
      ],
      "institution": "Ambit Biosciences",
      "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=13533&enzyme=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
      "pdb": [
        "4JVG",
        "1KV2",
        "6GTT",
        "5N66",
        "4TWN",
        "3NPC",
        "3FZS"
      ],
      "num_protein_chains": 1
    },
    {
      "bindingdb_set_id": 65395,
      "kd_nm": " 9100",
      "curation_datasource": "PubChem",
      "pubchem_aid": "aid1433",
      "authors": [
        "PubChem, PC"
      ],
      "institution": "Ambit Biosciences",
      "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=13533&enzyme=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
      "pdb": [
        "4JVG",
        "1KV2",
        "6GTT",
        "5N66",
        "4TWN",
        "3NPC",
        "3FZS"
      ],
      "num_protein_chains": 1
    },
    ...
    {
      "bindingdb_set_id": 50208407,
      "ic50_nm": ">20000",
      "curation_datasource": "ChEMBL",
      "article_doi": "10.1021/jm020057r",
      "pmid": "12086485",
      "authors": [
        "Regan, J",
        "Breitfelder, S",
        "Cirillo, P",
        "Gilmore, T",
        "Graham, AG",
        "Hickey, E",
        "Klaus, B",
        "Madwed, J",
        "Moriak, M",
        "Moss, N",
        "Pargellis, C",
        "Pav, S",
        "Proto, A",
        "Swinamer, A",
        "Tong, L",
        "Torcellini, C"
      ],
      "institution": "Boehringer Ingelheim Pharmaceuticals Inc",
      "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=13533&enzyme=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
      "pdb": [
        "4JVG",
        "1KV2",
        "6GTT",
        "5N66",
        "4TWN",
        "3NPC",
        "3FZS"
      ],
      "num_protein_chains": 1
    }
  ],
  "_id": "13533-P00533",
  "predicate": "physically interacts with"
}

@andrewsu
Member

Looks great, nice work @rjawesome! I think this one is also ready to pass off to @erikyao for API creation...

@erikyao
Contributor

erikyao commented Jul 28, 2022

API published to https://biothings.ncats.io/bindingdb

@erikyao
Contributor

erikyao commented Jul 28, 2022

Hi @rjawesome, I forked your repo to https://github.com/biothings/BindingDB and made some changes. The significant change is to move the content of your mappings.json into parser.py. The reason is that a data plugin is dynamically imported by importlib internally so the relative path "./mappings.json" no longer works.
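
For context, a relative path like "./mappings.json" is resolved against the process's current working directory, not against the location of parser.py. A general Python pattern for this situation (not what the fork does) is to build the path from the module's own file, as in the sketch below. `load_mappings` is a hypothetical helper, and whether `__file__` is set as expected under the plugin loader has not been verified here, so embedding the mapping in parser.py remains the safer choice.

```python
import json
from pathlib import Path


def load_mappings(module_file, filename="mappings.json"):
    """Load a JSON file that sits in the same directory as the given module file."""
    path = Path(module_file).resolve().parent / filename
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


# Inside parser.py the call would look like:
#     MAPPINGS = load_mappings(__file__)
```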

@colleenXu

Noting that the next step is writing a SmartAPI yaml for this API. An intern can try to do this, or I'll stick it in my to-do list...

@rjawesome
Collaborator

rjawesome commented Jul 28, 2022

@colleenXu I could work on that

@rjawesome
Collaborator

@colleenXu do you know how I can determine which fields should be put in the bte-response-mapping (or should I put all of them)? List of fields

@colleenXu

colleenXu commented Jul 30, 2022

Just in case, I'll talk about the main fields first:

  • it looks like two-ish operations (depending on how many operations are needed to cover the "object" ligand id-prefixes): Gene -> SmallMolecule and SmallMolecule -> Gene
  • For Gene (aka the "subject" part of the API), the field to grab IDs from is probably subject.uniprot.id. The subject ID-prefix can then be looked up by searching for "uniprot" in the biolink-model (it's UniProtKB; yes, it's spelled exactly that way. Also, it's fine that the id-prefix isn't under Gene: there's Gene/Protein conflation, aka they're interchangeable, so we generally write the operations for Genes)
  • For SmallMolecule (aka the "object" part of the API), you'll need to know some stuff:
    • are there multiple ID-prefixes for the object in 1 record/document? It looks like there probably are...that's tricky because we don't really want redundant querying (aka retrieving the same document in two different sub-queries using different id-namespaces)
    • so...we want a set of id-prefixes that covers as much of the API as we can, without a lot of overlap between them (aka not many records/documents have all of those prefixes). Ideally, there's 1 main id-prefix that's in most of the records....and we can just use that. If there's a bunch, then we'll want to write operations for each (so there'll be more than 2 operations for the API)
    • If a bunch of id-prefixes are equally good (in most of the records), use the id-prefixes list order for SmallMolecule (in biolink-model).
    • I suspect that you will use PUBCHEM.COMPOUND (which is PUBCHEM CID), and the field object.pubchem_cid. Other "good" IDs (aka I know Translator/BTE works well with them) are CHEBI and CHEMBL.COMPOUND (looks like that's what the CHEMBL IDs in the api are).

@colleenXu

For the "other" fields in the response-mapping / retrieved in the fields section of the parameters, I suggest:

  • from subject:
    • subject.name: this looks like it may be different from what's retrieved during BTE's ID-resolution, since it's assigned by the curator/datasource. This looks useful
    • subject.organism: need that info
  • from relation:
    • relation.curation_datasource: this is an example of the "source" I put in response-mapping (not on the top-level next to predicates aka the infores thing). It's useful
    • relation.pmid (and if some records don't have pmid and instead have these other fields, include them: relation.article_doi, relation.patent_number). Outside resources are super useful.
    • relation.bindingdb_link: Again, outside links are super useful

Some other relation fields are kinda interesting, but they seem like "a lot of clutter" / hard to interpret / something people can go to the bindingdb link to learn more about. So I think we can add a comment about them but otherwise not include them for now: relation.ki_nm, relation.ic50_nm, relation.kd_nm, relation.ec50_nm, relation.kon, relation.koff, relation.ph, relation.temp_c, relation.num_protein_chains

@rjawesome
Collaborator

rjawesome commented Jul 30, 2022

While it does have multiple identifiers, I think InChI/InChIKey is the most common identifier in the documents (a lot of them only have InChI/InChIKey), so I will probably just have an operation for that?

@colleenXu

colleenXu commented Aug 1, 2022

Huh....I count more documents with pubchem_cid than inchikey.

pubchem_cid: 1413051
inchikey: 1394153

So if we had to pick 1 id-prefix for the chemical stuff, I'd likely pick pubchem_cid (aka PUBCHEM.COMPOUND).


[EDIT: hmmm so ~1.8% of records or so would not be retrieved if we only used pubchem_cid....maybe that's fine?]

Of the records that don't have pubchem_cid (25858):

(also noting the inverse: of the 44756 records that don't have inchikey, almost all (42125) have a pubchem_cid.)
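
For anyone who wants to redo this kind of count locally, a sketch is below. It is not necessarily how the numbers above were obtained; `id_coverage` is a hypothetical helper, the three sample documents use made-up placeholder values, and the field names follow the object.* keys in the mapping table. The last lines just re-derive the ~1.8% figure from the two counts quoted above.

```python
from collections import Counter


def id_coverage(docs, fields=("pubchem_cid", "inchikey", "chembl")):
    """Count how many documents carry each ligand identifier field under 'object'."""
    counts = Counter()
    for doc in docs:
        ligand = doc.get("object", {})
        for field in fields:
            if ligand.get(field) not in (None, ""):
                counts[field] += 1
    return counts


# Made-up placeholder documents, only to show the shape of the input.
sample_docs = [
    {"object": {"pubchem_cid": 111, "inchikey": "KEY-A"}},
    {"object": {"pubchem_cid": 222, "inchikey": "KEY-B", "chembl": "CHEMBL-X"}},
    {"object": {"inchikey": "KEY-C"}},
]
coverage = id_coverage(sample_docs)
print(dict(coverage))

# Share of records without pubchem_cid, using the counts quoted above.
missing = 25858
total = 1413051 + missing
print(f"{missing / total:.1%}")
```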

@rjawesome
Collaborator

Oh, I missed pubchem_cid. I can change it to that or add a separate operation with the CID. Also, I made a PR with the YAML; right now it is using InChIKey.

@colleenXu

I saw that. I suggest switching to using pubchem_cid (PUBCHEM.COMPOUND)

@colleenXu

@rjawesome I edited the SmartAPI yaml (commit NCATS-Tangerine/translator-api-registry@f3b4ca2) before registering it and hooking it up to BTE.

Notes:

  • check and edit the info.version, info.x-translator.biolink-version (this is whatever version BTE is using, the easiest way to check is to look at recent commit history for the module https://github.com/biothings/biolink-model.js/commits/main), examples for endpoints, and predicates in the x-bte operations.
  • we use the key "pubmed" for fields with PMIDs. This is because BTE has special handling for PMIDs.

@colleenXu

colleenXu commented Aug 17, 2022

We still have to deploy changes to our instances to use this API in our team/general endpoints. We can query this API through the SmartAPI-specific endpoint using its registration ID 38e9e5169a72aee3659c9ddba956790d

Once it is deployed, an example query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:SmallMolecule"],
                    "ids": ["PUBCHEM.COMPOUND:134553288"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:physically_interacts_with"]
                }
            }
        }
    }
}
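
For anyone who wants to try other compounds, here is a rough sketch that builds the same query for an arbitrary PubChem CID and assembles a URL for the SmartAPI-specific endpoint. The function names are hypothetical, and the base URL and `/smartapi/{id}/query` path layout are assumptions about the BTE instance being used, so adjust them to your deployment. Nothing is sent over the network here.

```python
import json

# Registration ID quoted in the comment above.
SMARTAPI_ID = "38e9e5169a72aee3659c9ddba956790d"

# ASSUMPTION: base URL and path layout of the BTE instance; adjust as needed.
BTE_BASE_URL = "https://api.bte.ncats.io/v1"


def build_query(pubchem_cid):
    """Return a query asking which genes a small molecule physically interacts with."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {
                        "categories": ["biolink:SmallMolecule"],
                        "ids": [f"PUBCHEM.COMPOUND:{pubchem_cid}"],
                    },
                    "n1": {"categories": ["biolink:Gene"]},
                },
                "edges": {
                    "e01": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": ["biolink:physically_interacts_with"],
                    }
                },
            }
        }
    }


def smartapi_query_url(smartapi_id, base_url=BTE_BASE_URL):
    """Build the (assumed) SmartAPI-specific query URL."""
    return f"{base_url}/smartapi/{smartapi_id}/query"


query = build_query(134553288)
url = smartapi_query_url(SMARTAPI_ID)
print(url)
print(json.dumps(query, indent=4))
```

The printed JSON matches the example query above and can be POSTed as the request body with any HTTP client.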

And there'd be an edge like this in the response:

                "b82527794b6190f21cfe9da2d11fff93": {
                    "predicate": "biolink:physically_interacts_with",
                    "subject": "PUBCHEM.COMPOUND:134553288",
                    "object": "NCBIGene:187",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": [
                                "infores:biothings-explorer"
                            ],
                            "value_type_id": "biolink:InformationResource"
                        },
                        {
                            "attribute_type_id": "biolink:primary_knowledge_source",
                            "value": [
                                "infores:bindingdb"
                            ],
                            "value_type_id": "biolink:InformationResource"
                        },
                        {
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": [
                                "infores:biothings-bindingdb"
                            ],
                            "value_type_id": "biolink:InformationResource"
                        },
                        {
                            "attribute_type_id": "biolink:original_subject",
                            "value": "Apelin receptor"
                        },
                        {
                            "attribute_type_id": "in_taxon",
                            "value": "Homo sapiens"
                        },
                        {
                            "attribute_type_id": "bindingdb_curation_datasource",
                            "value": "US Patent"
                        },
                        {
                            "attribute_type_id": "bindingdb_url",
                            "value": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=456871&enzyme=Apelin+receptor&column=ki&startPg=0&Increment=50&submit=Search"
                        },
                        {
                            "attribute_type_id": "patent_number",
                            "value": "US10736883"
                        }
                    ]
                }
            }
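
The attributes on an edge like this come back as a flat list, with some attribute_type_id values repeated (for example, biolink:aggregator_knowledge_source appears twice above). A minimal sketch of grouping them for easier lookup, using a trimmed copy of the edge shown above; the helper name `group_attributes` is made up for illustration and is not part of BTE.

```python
from collections import defaultdict


def group_attributes(edge):
    """Map each attribute_type_id to a flat list of its values."""
    grouped = defaultdict(list)
    for attr in edge.get("attributes", []):
        value = attr["value"]
        # Values are sometimes a list and sometimes a scalar; normalise to a list.
        values = value if isinstance(value, list) else [value]
        grouped[attr["attribute_type_id"]].extend(values)
    return dict(grouped)


# Trimmed copy of the edge shown above.
edge = {
    "predicate": "biolink:physically_interacts_with",
    "subject": "PUBCHEM.COMPOUND:134553288",
    "object": "NCBIGene:187",
    "attributes": [
        {"attribute_type_id": "biolink:aggregator_knowledge_source",
         "value": ["infores:biothings-explorer"]},
        {"attribute_type_id": "biolink:primary_knowledge_source",
         "value": ["infores:bindingdb"]},
        {"attribute_type_id": "biolink:aggregator_knowledge_source",
         "value": ["infores:biothings-bindingdb"]},
        {"attribute_type_id": "bindingdb_curation_datasource", "value": "US Patent"},
        {"attribute_type_id": "patent_number", "value": "US10736883"},
    ],
}

attrs = group_attributes(edge)
print(attrs["biolink:aggregator_knowledge_source"])
print(attrs["patent_number"])
```

Grouping into lists rather than a plain dict keeps both aggregator sources instead of letting the second overwrite the first.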


5 participants