Data source: BindingDB #70

newgene · 2022-06-01T23:11:49Z

Name: BindingDB - curated protein-chemical bindings
URL: https://www.bindingdb.org/rwd/bind/index.jsp
Download: https://www.bindingdb.org/rwd/bind/chemsearch/marvin/SDFdownload.jsp?all_download=yes
License: publicly available with no license specified

andrewsu · 2022-06-24T05:06:44Z

Downloaded BindingDB_All_2022m5.tsv.zip
2407382 rows, each with a binding relationship (193 columns of metadata about the relationship) [code snippet 1]
- 2178377 rows with a UniProt ID for the target [2]
- 1243406 rows with a UniProt ID for the target AND target is human protein [3]
  - records by data source [4]

706654 ChEMBL
 420222 US Patent
  63957 PubChem
  42123 Curated from the literature by BindingDB
   8510 PDSP Ki
   1084 D3R
    763 CSAR
     93 Taylor Research Group, UCSD

Records by target (1786 unique targets; top ~30 shown below) [5]

  17347 JAK2_HUMAN
  16717 BTK_HUMAN
  15331 EGFR_HUMAN
  14108 DRD2_HUMAN
  12988 JAK1_HUMAN
  12917 CAH2_HUMAN
  12591 PK3CD_HUMAN
  12345 BACE1_HUMAN
  12233 OX2R_HUMAN
  12004 VGFR2_HUMAN
  11871 KCNH2_HUMAN
  11412 OX1R_HUMAN
  11403 CNR2_HUMAN
  10934 IRAK4_HUMAN
  10804 CAH1_HUMAN
  10574 CNR1_HUMAN
  10231 HDAC1_HUMAN
  10071 OPRM_HUMAN
   9901 PK3CA_HUMAN
   9518 JAK3_HUMAN
   9262 CDK2_HUMAN
   9072 OPRK_HUMAN
   8911 GSK3B_HUMAN
   8693 CP3A4_HUMAN
   8670 MK01_HUMAN
   8380 P2RX3_HUMAN
   8023 AA2AR_HUMAN

Conclusion: To start, let's go with the 1243406 records according to the minimal filtering described above. The goal is to create a JSON file with one record per row (compound - target association). The JSON structure should roughly follow the pattern described in #55

Code snippets

NOTE: alias gawkt='awk -F"\t" -v OFS="\t"'

[1]: wc BindingDB_All.tsv
[2]: cat BindingDB_All.tsv | gawkt '$42!=""' | wc
[3]: cat BindingDB_All.tsv | gawkt '$42!=""&&$8=="Homo sapiens"' > BindingDB_All_humanuniprot.tsv; wc BindingDB_All_humanuniprot.tsv
[4]: cat BindingDB_All_humanuniprot.tsv | gawkt '{print $17}' | sort | uniq -c | sort -k1nr
[5]: cat BindingDB_All_humanuniprot.tsv | gawkt '{print $41}' | sort | uniq -c | sort -k1nr
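
For anyone who prefers Python to awk, the counts from snippets [2] to [5] could be reproduced roughly as follows. This is only a sketch: the column positions (8, 17, 41, 42, 1-based) are copied from the awk commands above, while `field`, `summarize`, `make_row` and the tiny in-memory sample are made up for illustration and are not part of any existing parser.

```python
import csv
import io
from collections import Counter

# 1-based column positions, copied from the awk snippets above.
COL_ORGANISM, COL_SOURCE, COL_ENTRY_NAME, COL_UNIPROT = 8, 17, 41, 42


def field(row, n):
    """Return 1-based column n, or "" if the row is too short (awk behaves the same way)."""
    return row[n - 1] if len(row) >= n else ""


def summarize(handle):
    """Return (rows with UniProt ID, human rows, counts by source, counts by target)."""
    with_uniprot = 0
    human = 0
    by_source = Counter()
    by_target = Counter()
    # QUOTE_NONE keeps quote characters literal, like plain tab splitting in awk.
    # As in the awk commands, a header line is not treated specially.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if field(row, COL_UNIPROT) == "":
            continue
        with_uniprot += 1
        if field(row, COL_ORGANISM) != "Homo sapiens":
            continue
        human += 1
        by_source[field(row, COL_SOURCE)] += 1
        by_target[field(row, COL_ENTRY_NAME)] += 1
    return with_uniprot, human, by_source, by_target


def make_row(organism, source, entry_name, uniprot):
    """Hypothetical helper: build a 42-column TSV line with only four columns filled."""
    row = [""] * 42
    row[COL_ORGANISM - 1] = organism
    row[COL_SOURCE - 1] = source
    row[COL_ENTRY_NAME - 1] = entry_name
    row[COL_UNIPROT - 1] = uniprot
    return "\t".join(row)


# Made-up sample rows, not taken from the real file.
sample = "\n".join([
    make_row("Homo sapiens", "ChEMBL", "CP3A4_HUMAN", "P08684"),
    make_row("Homo sapiens", "US Patent", "CASP3_HUMAN", "P42574"),
    make_row("Rattus norvegicus", "ChEMBL", "", "PLACEHOLDER"),
    make_row("Homo sapiens", "ChEMBL", "", ""),
])

with_uniprot, human, by_source, by_target = summarize(io.StringIO(sample))
print(with_uniprot, human, by_source.most_common(), by_target.most_common())
```

To run it on the real download, the `io.StringIO(sample)` argument would be replaced by an open file handle for the unzipped TSV.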

rjawesome · 2022-06-30T01:43:11Z

I am currently working on this issue

rjawesome · 2022-06-30T17:49:13Z

I made a basic parser for this: https://github.com/rjawesome/BindingDB_parser

I realized that some rows have more than one chemical -> protein relationship (indicated by "Number of Protein Chains in Target (>1 implies a multichain complex)") so I have split those into separate documents

At the moment I am filtering by checking whether the primary UniProt entry name ends with _HUMAN. I checked one of the records with Homo sapiens as the species that listed CGH2_SHV21 (ID Q01043) as the protein; according to the UniProt website it does not seem to be a human protein, but I could be mistaken?
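
To see how often the two ways of deciding "human" disagree, a check along these lines might help. It is a sketch under assumptions: `filter_disagreements` is a made-up name, and the input is simplified to (organism, entry name) pairs rather than full rows.

```python
def filter_disagreements(pairs):
    """Yield (organism, entry_name) pairs where the two human checks disagree."""
    for organism, entry_name in pairs:
        human_by_organism = organism == "Homo sapiens"
        human_by_name = entry_name.endswith("_HUMAN")
        if human_by_organism != human_by_name:
            yield organism, entry_name


# Both pairs use values mentioned in this thread.
pairs = [
    ("Homo sapiens", "CP3A4_HUMAN"),
    ("Homo sapiens", "CGH2_SHV21"),
]
mismatches = list(filter_disagreements(pairs))
print(mismatches)
```

Counting the mismatches over the whole file would show whether the choice between the organism column and the `_HUMAN` suffix matters in practice.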

Also, the parser takes around 2 minutes to run so I am not sure if I need to optimize my code or if this is just a really big data file.

Sample Record:

{
  "object": {
    "BindingDB Reactant_set_id": "143",
    "Ligand SMILES": "Cc1nc(CN2CCN(CC2)c2c(Cl)cnc3[nH]c(nc23)-c2cn(C)nc2C)no1",
    "Ligand InChI": "InChI=1S/C19H22ClN9O/c1-11-13(9-27(3)25-11)18-23-16-17(14(20)8-21-19(16)24-18)29-6-4-28(5-7-29)10-15-22-12(2)30-26-15/h8-9H,4-7,10H2,1-3H3,(H,21,23,24)",
    "Ligand InChI Key": "ZYQKMYRXVHUATB-UHFFFAOYSA-N",
    "BindingDB MonomerID": "247370",
    "BindingDB Ligand Name": "US9447092, 3",
    "Target Name Assigned by Curator or DataSource": "Cytochrome P450 3A4",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Ki (nM)": "",
    "IC50 (nM)": ">50000",
    "Kd (nM)": "",
    "EC50 (nM)": "",
    "kon (M-1-s-1)": "",
    "koff (s-1)": "",
    "pH": "",
    "Temp (C)": "",
    "Curation/DataSource": "US Patent",
    "Article DOI": "",
    "PMID": "",
    "PubChem AID": "",
    "Patent Number": "US9447092",
    "Authors": "Blagg, J; Bavetsias, V; Moore, AS; Linardopoulos, S",
    "Institution": "Cancer Research Technology Limited",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=247370",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=2127&target=Cytochrome+P450+3A4&column=ki&startPg=0&Increment=50&submit=Search",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=247370&enzyme=Cytochrome+P450+3A4&column=ki&startPg=0&Increment=50&submit=Search",
    "Ligand HET ID in PDB": "",
    "PDB ID(s) for Ligand-Target Complex": "",
    "PubChem CID": "71463198",
    "PubChem SID": "346541913",
    "ChEBI ID of Ligand": "",
    "ChEMBL ID of Ligand": "",
    "DrugBank ID of Ligand": "",
    "IUPHAR_GRAC ID of Ligand": "",
    "KEGG ID of Ligand": "",
    "ZINC ID of Ligand": "",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "1"
  },
  "subject": {
    "BindingDB Target Chain  Sequence": "MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPLPFLGNILSYHKGFCMFDMECHKKYGKVWGFYDGQQPVLAITDPDMIKTVLVKECYSVFTNRRPFGPVGFMKSAISIAEDEEWKRLRSLLSPTFTSGKLKEMVPIIAQYGDVLVRNLRREAETGKPVTLKDVFGAYSMDVITSTSFGVNIDSLNNPQDPFVENTKKLLRFDFLDPFFLSITVFPFLIPILEVLNICVFPREVTNFLRKSVKRMKESRLEDTQKHRVDFLQLMIDSQNSKETESHKALSDLELVAQSIIFIFAGYETTSSVLSFIMYELATHPDVQQKLQEEIDAVLPNKAPPTYDTVLQMEYLDMVVNETLRLFPIAMRLERVCKKDVEINGMFIPKGVVVMIPSYALHRDPKYWTEPEKFLPERFSKKNKDNIDPYIYTPFGSGPRNCIGMRFALMNMKLALIRVLQNFSFKPCKETQIPLKLSLGGLLQPEKPVVLKVESRDGTVSGA",
    "PDB ID(s) of Target Chain": "1W0E,1W0F,1W0G,2J0D,2V0M,3NXU,4NY4,7LXL",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Cytochrome P450 3A4",
    "UniProt (SwissProt) Entry Name of Target Chain": "CP3A4_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P08684",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "P05184,Q16757,Q9UK50",
    "UniProt (SwissProt) Alternative ID(s) of Target Chain": "",
    "UniProt (TrEMBL) Submitted Name of Target Chain": "",
    "UniProt (TrEMBL) Entry Name of Target Chain": "",
    "UniProt (TrEMBL) Primary ID of Target Chain": "Q6GRK0",
    "UniProt (TrEMBL) Secondary ID(s) of Target Chain": "",
    "UniProt (TrEMBL) Alternative ID(s) of Target Chain": ""
  }
}

rjawesome · 2022-06-30T18:32:38Z

Updated Sample Document

{
  "object": {
    "BindingDB Reactant_set_id": 199,
    "Ligand SMILES": "CN(Cc1ccc(s1)C(=O)N[C@@H](CC(O)=O)C(=O)CSCc1ccccc1Cl)Cc1ccc(O)c(c1)C(O)=O",
    "Ligand InChI": "InChI=1S/C27H27ClN2O7S2/c1-30(12-16-6-8-22(31)19(10-16)27(36)37)13-18-7-9-24(39-18)26(35)29-21(11-25(33)34)23(32)15-38-14-17-4-2-3-5-20(17)28/h2-10,21,31H,11-15H2,1H3,(H,29,35)(H,33,34)(H,36,37)/t21-/m0/s1",
    "Ligand InChI Key": "FIEQQFOHZKVJLV-NRFANRHFSA-N",
    "BindingDB MonomerID": 219,
    "BindingDB Ligand Name": "5-({[(5-{[(2S)-1-carboxy-4-{[(2-chlorophenyl)methyl]sulfanyl}-3-oxobutan-2-yl]carbamoyl}thiophen-2-yl)methyl](methyl)amino}methyl)-2-hydroxybenzoic acid::Inhibitor 47c::Thiophene Scaffold 47c",
    "PubChem CID": 5327301,
    "PubChem SID": 8030144,
    "ZINC ID of Ligand": "ZINC14942804",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": 1
  },
  "subject": {
    "Target Name Assigned by Curator or DataSource": "Caspase-3",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=1072&target=Caspase-3&column=ki&startPg=0&Increment=50&submit=Search",
    "BindingDB Target Chain  Sequence": "MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRNLKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKITNFFRGDRCRSLTGKPKLFIIQACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVNRKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH",
    "PDB ID(s) of Target Chain": [
      "1GFW",
      "1I3O",
      "1NME",
      "1PAU",
      "1RE1",
      "1RHJ",
      "1RHK",
      "1RHM",
      "1RHQ",
      "1RHR",
      "1RHU",
      "2C1E",
      "2C2K",
      "2C2M",
      "2C2O",
      "2CDR",
      "2CJX",
      "2CJY",
      "2CNK",
      "2CNL",
      "2CNN",
      "2CNO",
      "2DKO",
      "2H5I",
      "2H5J",
      "2H65",
      "2J30",
      "2XYG",
      "2XYH",
      "2XYP",
      "2XZD",
      "2XZT",
      "2Y0B",
      "3EDQ",
      "3GJQ",
      "3GJR",
      "3GJS",
      "3GJT",
      "3H0E",
      "3KJF",
      "4DCJ",
      "4DCO",
      "4DCP",
      "4JJE",
      "4PRY",
      "4PS0",
      "5IC4",
      "6BDV",
      "6BFJ",
      "6BFK",
      "6BFL",
      "6BFO",
      "6BG0",
      "6BG1",
      "6BG4",
      "6BGK",
      "6BGQ",
      "6BGR",
      "6BGS",
      "6BH9",
      "6BHA",
      "6CKZ",
      "6CL0",
      "6X8I",
      "6X8K",
      "7RN7",
      "7RN8",
      "7RN9",
      "7RNB",
      "7RND",
      "7RNE",
      "7RNF",
      "7SEO"
    ],
    "UniProt (SwissProt) Recommended Name of Target Chain": "Caspase-3",
    "UniProt (SwissProt) Entry Name of Target Chain": "CASP3_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P42574",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": [
      "A8K5M2",
      "D3DP53",
      "Q96AN1",
      "Q96KP2"
    ]
  },
  "relation": {
    "Ki (nM)": " 90",
    "pH": "7.4",
    "Temp (C)": "25.00 C",
    "Curation/DataSource": "Curated from the literature by BindingDB",
    "Article DOI": "10.1021/jm020230j",
    "PMID": "12408711",
    "Authors": "Choong, IC; Lew, W; Lee, D; Pham, P; Burdett, MT; Lam, JW; Wiesmann, C; Luong, TN; Fahr, B; DeLano, WL; McDowell, RS; Allen, DA; Erlanson, DA; Gordon, EM; O'Brien, T",
    "Institution": "Sunesis Pharmaceuticals",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=219",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=219&enzyme=Caspase-3&column=ki&startPg=0&Increment=50&submit=Search"
  }
}

andrewsu · 2022-06-30T21:37:47Z

Just a note/caveat. I've pasted a (transposed) snippet of the data file which shows four records that are exactly the same except for the Ki. In this case, best to collapse these four input records into a single output record, where the Ki is an array. We should be careful to identify any other columns that need similar treatment. (Possibly to help that effort, we should add an "_id" key based on BindingDB MonomerID and UniProt (SwissProt) Primary ID of Target Chain -- in the example above, it would be 219-P42574.)
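
A minimal sketch of the collapsing idea, assuming flat row dicts keyed by the original column names. The names `ARRAY_FIELDS`, `make_id` and `collapse` are hypothetical, the second Ki value is made up, and all other fields are simply taken from the first row seen for each `_id`.

```python
# Fields whose values are gathered into an array; extend if other columns need it.
ARRAY_FIELDS = ("Ki (nM)",)


def make_id(row):
    """Build the proposed _id from the MonomerID and the SwissProt primary ID."""
    return f'{row["BindingDB MonomerID"]}-{row["UniProt (SwissProt) Primary ID of Target Chain"]}'


def collapse(rows):
    """Merge rows that share an _id, collecting ARRAY_FIELDS values into lists."""
    merged = {}
    for row in rows:
        _id = make_id(row)
        if _id not in merged:
            # Non-array fields come from the first row seen (a simplification).
            doc = {k: v for k, v in row.items() if k not in ARRAY_FIELDS}
            doc["_id"] = _id
            for name in ARRAY_FIELDS:
                doc[name] = []
            merged[_id] = doc
        for name in ARRAY_FIELDS:
            value = row.get(name, "").strip()
            if value:
                merged[_id][name].append(value)
    return list(merged.values())


rows = [
    {"BindingDB MonomerID": "219",
     "UniProt (SwissProt) Primary ID of Target Chain": "P42574",
     "Ki (nM)": " 90"},
    {"BindingDB MonomerID": "219",
     "UniProt (SwissProt) Primary ID of Target Chain": "P42574",
     "Ki (nM)": " 120"},  # made-up value for illustration
]
docs = collapse(rows)
print(docs)
```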

rjawesome · 2022-06-30T23:38:18Z

Ok, I will check for that. Just curious, what software are you using to view the file? It seems like it would be tricky to view the file given its large size.

(Just off the first few, I found documents with the same ID according to your construction had different IC50 (nM) and/or Author/Institution/PMID/etc.)

In this case would we turn the Author/Institution/etc. into an array or would we separate those entries? ... in the case we separate those entities, we couldn't use your idea for the _id key

andrewsu · 2022-07-01T04:56:54Z

Ok, I will check for that. Just curious, what software are you using to view the file? It seems like it would be tricky to view the file given its large size.

That's a screenshot in excel. I typically use command-line tools (like awk) to extract a very small subset of the file before trying to load it...

(Just off the first few, I found documents with the same ID according to your construction had different IC50 (nM) and/or Author/Institution/PMID/etc.)

In this case would we turn the Author/Institution/etc. into an array or would we separate those entries? ... in the case we separate those entities, we couldn't use your idea for the _id key

I think Ki, IC50, Kd, EC50, kon, and koff can all be treated the same -- put them in an array. I think that'd be fine for Author and Institution too, but can you post a couple examples, and if possible, a count of the number of times this occurs? Depending on those answers, we might add one level of grouping (under something like references) where those fields are grouped together before assembling into an array.

rjawesome · 2022-07-01T05:12:52Z

Here are the # of entries with the same ID but differing values for each key (this was for the first 300k entries on the table)

{
"BindingDB Reactant_set_id":37936,
"IC50 (nM)":19365,
"PMID":7147,
"pH":6401,
"Authors":9407,
"Article DOI":6694,
"Temp (C)":6480,
"Institution":6862,
"Ki (nM)":9881,
"Target Name Assigned by Curator or DataSource":725,
"Link to Ligand-Target Pair in BindingDB":725,
"Link to Target in BindingDB":726,
"Kd (nM)":1138,
"Curation/DataSource":1753,
"Patent Number":4203,
"EC50 (nM)":7173,
"Number of Protein Chains in Target (>1 implies a multichain complex)":433,
"UniProt (TrEMBL) Primary ID of Target Chain":298,
"PubChem AID":11479,
"Target Source Organism According to Curator or DataSource":106,
"PDB ID(s) for Ligand-Target Complex":29,
"kon (M-1-s-1)":68,
"koff (s-1)":48
}
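
One way such per-key counts could be produced is sketched below. How the numbers above were actually computed is not stated, so the approach here (compare every later document against the first one seen with the same `_id`) is an assumption, and `flatten` and `count_differing_keys` are made-up names. The sample is a reduced subset of fields from two documents that share the `_id` 3149-P17252.

```python
from collections import Counter


def flatten(doc):
    """Merge the subject, object and relation sections into one flat dict."""
    flat = {}
    for section in ("subject", "object", "relation"):
        flat.update(doc.get(section, {}))
    return flat


def count_differing_keys(docs):
    """Count, per key, how often a document differs from the first one with its _id."""
    first_seen = {}
    counts = Counter()
    for doc in docs:
        flat = flatten(doc)
        baseline = first_seen.setdefault(doc["_id"], flat)
        if baseline is flat:
            continue  # first document for this _id, nothing to compare against
        for key in baseline.keys() | flat.keys():
            # A missing key is treated like an empty value.
            if baseline.get(key, "") != flat.get(key, ""):
                counts[key] += 1
    return counts


docs = [
    {"_id": "3149-P17252",
     "object": {"BindingDB Reactant_set_id": "4491", "BindingDB MonomerID": "3149"},
     "relation": {"IC50 (nM)": " 30", "Article DOI": "10.1021/jm960581w",
                  "PMID": "8978850", "Institution": "Sphinx Laboratories"}},
    {"_id": "3149-P17252",
     "object": {"BindingDB Reactant_set_id": "4239", "BindingDB MonomerID": "3149"},
     "relation": {"IC50 (nM)": " 74", "pH": "7.5", "Temp (C)": "30.00 C",
                  "Article DOI": "10.1016/0960-894X(95)00365-Z",
                  "Institution": "Sphinx Laboratories"}},
]
counts = count_differing_keys(docs)
print(dict(counts))
```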

Here are some example documents where the ID is the same but the authors are different

{
  "subject": {
    "Target Name Assigned by Curator or DataSource": "Protein kinase C alpha type",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=599&target=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
    "BindingDB Target Chain  Sequence": "MADVFPGNDSTASQDVANRFARKGALRQKNVHEVKDHKFIARFFKQPTFCSHCTDFIWGFGKQGFQCQVCCFVVHKRCHEFVTFSCPGADKGPDTDDPRSKHKFKIHTYGSPTFCDHCGSLLYGLIHQGMKCDTCDMNVHKQCVINVPSLCGMDHTEKRGRIYLKAEVADEKLHVTVRDAKNLIPMDPNGLSDPYVKLKLIPDPKNESKQKTKTIRSTLNPQWNESFTFKLKPSDKDRRLSVEIWDWDRTTRNDFMGSLSFGVSELMKMPASGWYKLLNQEEGEYYNVPIPEGDEEGNMELRQKFEKAKLGPAGNKVISPSEDRKQPSNNLDRVKLTDFNFLMVLGKGSFGKVMLADRKGTEELYAIKILKKDVVIQDDDVECTMVEKRVLALLDKPPFLTQLHSCFQTVDRLYFVMEYVNGGDLMYHIQQVGKFKEPQAVFYAAEISIGLFFLHKRGIIYRDLKLDNVMLDSEGHIKIADFGMCKEHMMDGVTTRTFCGTPDYIAPEIIAYQPYGKSVDWWAYGVLLYEMLAGQPPFDGEDEDELFQSIMEHNVSYPKSLSKEAVSVCKGLMTKHPAKRLGCGPEGERDVREHAFFRRIDWEKLENREIQPPFKPKVCGKGAENFDKFFTRGQPVLTPPDQLVIANIDQSDFEGFSYVNPQFVHPILQSAV",
    "PDB ID(s) of Target Chain": "4DNL,4RA4",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Protein kinase C alpha type",
    "UniProt (SwissProt) Entry Name of Target Chain": "KPCA_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P17252",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "B5BU22,Q15137,Q32M72,Q96RE4",
    "UniProt (TrEMBL) Primary ID of Target Chain": "L7RSM7"
  },
  "object": {
    "BindingDB Reactant_set_id": "4491",
    "Ligand SMILES": "OC(=O)c1cccc(O)c1C(=O)c1c(O)cc(cc1O)C(=O)O[C@@H]1CCCNC[C@H]1NC(=O)c1ccc(O)cc1",
    "Ligand InChI": "InChI=1S/C28H26N2O10/c31-16-8-6-14(7-9-16)26(36)30-18-13-29-10-2-5-22(18)40-28(39)15-11-20(33)24(21(34)12-15)25(35)23-17(27(37)38)3-1-4-19(23)32/h1,3-4,6-9,11-12,18,22,29,31-34H,2,5,10,13H2,(H,30,36)(H,37,38)/t18-,22-/m1/s1",
    "Ligand InChI Key": "XYUFCXJZFZPEJD-XMSQKQJNSA-N",
    "BindingDB MonomerID": "3149",
    "BindingDB Ligand Name": "2-{[2,6-dihydroxy-4-({[(3R,4R)-3-[(4-hydroxybenzene)amido]azepan-4-yl]oxy}carbonyl)phenyl]carbonyl}-3-hydroxybenzoic acid::Acyclic Balanol Analog (-)-1::Balanol analog 1::Balanol, 1::CHEMBL60254",        
    "PubChem CID": "5287736",
    "PubChem SID": "8032894",
    "ChEMBL ID of Ligand": "CHEMBL60254",
    "ZINC ID of Ligand": "ZINC03871640",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "1"
  },
  "relation": {
    "IC50 (nM)": " 30",
    "Curation/DataSource": "Curated from the literature by BindingDB",
    "Article DOI": "10.1021/jm960581w",
    "PMID": "8978850",
    "Authors": "Defauw, JM; Murphy, MM; Jagdmann, GE; Hu, H; Lampe, JW; Hollinshead, SP; Mitchell, TJ; Crane, HM; Heerding, JM; Mendoza, JS; Davis, JE; Darges, JW; Hubbard, FR; Hall, SE",
    "Institution": "Sphinx Laboratories",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=3149",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=3149&enzyme=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
    "Ligand HET ID in PDB": "BA1",
    "PDB ID(s) for Ligand-Target Complex": "1BX6,3KRX,3KRW"
  },
  "_id": "3149-P17252"
}

{
  "subject": {
    "Target Name Assigned by Curator or DataSource": "Protein kinase C alpha type",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=599&target=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
    "BindingDB Target Chain  Sequence": "MADVFPGNDSTASQDVANRFARKGALRQKNVHEVKDHKFIARFFKQPTFCSHCTDFIWGFGKQGFQCQVCCFVVHKRCHEFVTFSCPGADKGPDTDDPRSKHKFKIHTYGSPTFCDHCGSLLYGLIHQGMKCDTCDMNVHKQCVINVPSLCGMDHTEKRGRIYLKAEVADEKLHVTVRDAKNLIPMDPNGLSDPYVKLKLIPDPKNESKQKTKTIRSTLNPQWNESFTFKLKPSDKDRRLSVEIWDWDRTTRNDFMGSLSFGVSELMKMPASGWYKLLNQEEGEYYNVPIPEGDEEGNMELRQKFEKAKLGPAGNKVISPSEDRKQPSNNLDRVKLTDFNFLMVLGKGSFGKVMLADRKGTEELYAIKILKKDVVIQDDDVECTMVEKRVLALLDKPPFLTQLHSCFQTVDRLYFVMEYVNGGDLMYHIQQVGKFKEPQAVFYAAEISIGLFFLHKRGIIYRDLKLDNVMLDSEGHIKIADFGMCKEHMMDGVTTRTFCGTPDYIAPEIIAYQPYGKSVDWWAYGVLLYEMLAGQPPFDGEDEDELFQSIMEHNVSYPKSLSKEAVSVCKGLMTKHPAKRLGCGPEGERDVREHAFFRRIDWEKLENREIQPPFKPKVCGKGAENFDKFFTRGQPVLTPPDQLVIANIDQSDFEGFSYVNPQFVHPILQSAV",
    "PDB ID(s) of Target Chain": "4DNL,4RA4",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Protein kinase C alpha type",
    "UniProt (SwissProt) Entry Name of Target Chain": "KPCA_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P17252",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "B5BU22,Q15137,Q32M72,Q96RE4",
    "UniProt (TrEMBL) Primary ID of Target Chain": "L7RSM7"
  },
  "object": {
    "BindingDB Reactant_set_id": "4239",
    "Ligand SMILES": "OC(=O)c1cccc(O)c1C(=O)c1c(O)cc(cc1O)C(=O)O[C@@H]1CCCNC[C@H]1NC(=O)c1ccc(O)cc1",
    "Ligand InChI": "InChI=1S/C28H26N2O10/c31-16-8-6-14(7-9-16)26(36)30-18-13-29-10-2-5-22(18)40-28(39)15-11-20(33)24(21(34)12-15)25(35)23-17(27(37)38)3-1-4-19(23)32/h1,3-4,6-9,11-12,18,22,29,31-34H,2,5,10,13H2,(H,30,36)(H,37,38)/t18-,22-/m1/s1",
    "Ligand InChI Key": "XYUFCXJZFZPEJD-XMSQKQJNSA-N",
    "BindingDB MonomerID": "3149",
    "BindingDB Ligand Name": "2-{[2,6-dihydroxy-4-({[(3R,4R)-3-[(4-hydroxybenzene)amido]azepan-4-yl]oxy}carbonyl)phenyl]carbonyl}-3-hydroxybenzoic acid::Acyclic Balanol Analog (-)-1::Balanol analog 1::Balanol, 1::CHEMBL60254",        
    "PubChem CID": "5287736",
    "PubChem SID": "8032894",
    "ChEMBL ID of Ligand": "CHEMBL60254",
    "ZINC ID of Ligand": "ZINC03871640",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "1"
  },
  "relation": {
    "IC50 (nM)": " 74",
    "pH": "7.5",
    "Temp (C)": "30.00 C",
    "Curation/DataSource": "Curated from the literature by BindingDB",
    "Article DOI": "10.1016/0960-894X(95)00365-Z",
    "Authors": "Lai, YS; Menaldino, DS; Nichols, JB; Jagdmann , GE; Mylott, F; Gillespie, J; Hall, SE",
    "Institution": "Sphinx Laboratories",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=3149",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=3149&enzyme=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
    "Ligand HET ID in PDB": "BA1",
    "PDB ID(s) for Ligand-Target Complex": "1BX6,3KRX,3KRW"
  },
  "_id": "3149-P17252"
}

For these two documents, the different fields were 'Authors', 'PMID', 'Temp (C)', 'BindingDB Reactant_set_id', 'pH', 'IC50 (nM)', 'Article DOI'

Also, just a note: if we combine documents, I believe the code would have to keep all previously seen documents in memory, which could cause high RAM usage (possibly around the size of the TSV itself).
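
If the input could first be sorted by `_id` (for example with an external sort on disk), the merge would only need one group in memory at a time. The sketch below assumes exactly that and is not how the current parser works; `merge_sorted` is a made-up name, `relation` becomes a list of the per-row relation objects as one possible way to combine them, and `subject` and `object` are taken from the first document of each group as a simplification.

```python
from itertools import groupby
from operator import itemgetter


def merge_sorted(docs):
    """Yield one merged document per _id; docs must already be sorted by _id."""
    for _id, group in groupby(docs, key=itemgetter("_id")):
        first = next(group)
        merged = {
            "_id": _id,
            "subject": first["subject"],
            "object": first["object"],
            "relation": [first["relation"]],
        }
        for doc in group:
            merged["relation"].append(doc["relation"])
        yield merged


# Reduced sample; the third document is made up for illustration.
docs = [
    {"_id": "3149-P17252", "subject": {}, "object": {}, "relation": {"IC50 (nM)": " 30"}},
    {"_id": "3149-P17252", "subject": {}, "object": {}, "relation": {"IC50 (nM)": " 74"}},
    {"_id": "6221-P24941", "subject": {}, "object": {}, "relation": {"IC50 (nM)": " 410"}},
]
merged_docs = list(merge_sorted(docs))
print(merged_docs)
```

Because `merge_sorted` is a generator, each merged document can be written out as soon as its group ends instead of being accumulated.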

rjawesome · 2022-07-01T17:11:34Z

Also, this doesn't apply only to fields in the relation; fields in the subject can also differ even with the same ID. Here is an example where the fields 'Link to Target in BindingDB', 'Link to Ligand-Target Pair in BindingDB', 'IC50 (nM)', 'Target Name Assigned by Curator or DataSource' are different.

{
  "subject": {
    "Target Name Assigned by Curator or DataSource": "CDK2/CycE",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=81&target=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search",
    "BindingDB Target Chain  Sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
    "PDB ID(s) of Target Chain": "1CKP,1DI8,1DM2,1E1V,1E1X,1E9H,1F5Q,1FIN,1FQ1,1FVT,1FVV,1G5S,1GIH,1GY3,1GZ8,1H00,1H01,1H07,1H08,1H0V,1H0W,1H1P,1H1Q,1H1R,1H1S,1H24,1H25,1H26,1H27,1H28,1HCK,1HCL,1JST,1JSU,1JSV,1JVP,1KE5,1KE6,1KE7,1KE8,1KE9,1OGU,1OI9,1OIQ,1OIR,1OIU,1OIY,1OKV,1OKW,1OL1,1OL2,1P2A,1P5E,1PF8,1PKD,1PW2,1PXI,1PXJ,1PXK,1PXL,1PXM,1PXN,1PXO,1PXP,1PYE,1QMZ,1R78,1URC,1URW,1V1K,1VYW,1VYZ,1W0X,1W8C,1W98,1WCC,1Y8Y,1Y91,1YKR,2A0C,2A4L,2B52,2B53,2B54,2B55,2BHE,2BHH,2BKZ,2BPM,2BTR,2BTS,2C4G,2C5N,2C5O,2C5V,2C5X,2C5Y,2C68,2C69,2C6I,2C6K,2C6L,2C6M,2C6O,2C6T,2CCH,2CCI,2CJM,2CLX,2DUV,2EXM,2FVD,2G9X,2I40,2J9M,2JGZ,2R3F,2R3G,2R3H,2R3I,2R3J,2R3K,2R3L,2R3M,2R3N,2R3O,2R3P,2R3Q,2R3R,2R64,2UUE,2UZB,2UZD,2UZE,2UZL,2UZN,2UZO,2V0D,2V22,2VTA,2VTH,2VTI,2VTJ,2VTL,2VTM,2VTN,2VTO,2VTP,2VTQ,2VTR,2VTS,2VTT,2VU3,2VV9,2W05,2W06,2W17,2W1H,2WEV,2WFY,2WHB,2WIH,2WIP,2WMA,2WMB,2WPA,2WXV,2X1N,2XMY,2XNB,3BHT,3BHU,3BHV,3DDP,3DDQ,3DOG,3EID,3EJ1,3EOC,3EZR,3EZV,3F5X,3FZ1,3IG7,3IGG,3LE6,3LFN,3LFQ,3LFS,3MY5,3NS9,3PJ8,3PXF,3PXQ,3PXR,3PXY,3PXZ,3PY0,3PY1,3QHR,3QHW,3QL8,3QQF,3QQG,3QQH,3QQJ,3QQK,3QQL,3QRT,3QRU,3QTQ,3QTR,3QTS,3QTU,3QTW,3QTX,3QTZ,3QU0,3QWJ,3QWK,3QX2,3QX4,3QXO,3QXP,3QZF,3QZG,3QZH,3QZI,3R1Q,3R1S,3R1Y,3R28,3R6X,3R71,3R73,3R7E,3R7I,3R7U,3R7V,3R7Y,3R83,3R8L,3R8M,3R8P,3R8U,3R8V,3R8Z,3R9D,3R9H,3R9N,3R9O,3RAH,3RAI,3RAK,3RAL,3RJC,3RK5,3RK7,3RK9,3RKB,3RM6,3RM7,3RMF,3RNI,3ROY,3RPO,3RPR,3RPV,3RPY,3RZB,3S00,3S0O,3S1H,3S2P,3SQQ,3SW4,3SW7,3TI1,3TIY,3TIZ,3TNW,3ULI,3UNJ,3UNK,3WBL,4ACM,4BCK,4BCM,4BCN,4BCO,4BCP,4BCQ,4BGH,4BZD,4CFM,4CFN,4CFU,4CFV,4CFW,4CFX,4D1X,4D1Z,4EK3,4EK4,4EK5,4EK6,4EK8,4EOQ,4EOR,4EOS,4ERW,4EZ3,4EZ7,4FKG,4FKI,4FKJ,4FKL,4FKO,4FKP,4FKQ,4FKR,4FKS,4FKT,4FKU,4FKV,4FKW,4FX3,4GCJ,4I3Z,4II5,4KD1,4LYN,4NJ3,4RJ3,5A14,5AND,5ANE,5ANG,5ANI,5ANJ,5ANK,5ANO,5CYI,5D1J,5FP5,5FP6,5IEV,5IEX,5IEY,5IF1,5JQ5,5JQ8,5K4J,5L2W,5LMK,5MHQ,5NEV,5OO0,5OSJ,5UQ1,5UQ2,5UQ3,6ATH,6GUB,6GUC,6GUE,6GUF,6GUH,6GUK,6GVA,6INL,6JGM,6OQI,6P3W,6Q3B,6Q3C,6Q3F,6Q48,6Q49,6Q4A,6Q4B,6Q4C,6Q4D,6Q4E,6Q4F,6Q4H,6Q4J,6Q4K,6RIJ,6SG4,7ACK,7B5L,7B5R,7B7S,7E34,7KJS,7M2F,7NVQ,7RA5",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Cyclin-dependent kinase 2",
    "UniProt (SwissProt) Entry Name of Target Chain": "CDK2_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P24941",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "A8K7C6,O75100"
  },
  "object": {
    "BindingDB Reactant_set_id": "10166",
    "Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
    "Ligand InChI": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
    "Ligand InChI Key": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
    "BindingDB MonomerID": "6221",
    "BindingDB Ligand Name": "8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1",
    "PubChem CID": "5330199",
    "PubChem SID": "8035820",
    "ZINC ID of Ligand": "ZINC12354795",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "2"
  },
  "relation": {
    "IC50 (nM)": " 410",
    "Curation/DataSource": "Curated from the literature by BindingDB",
    "Article DOI": "10.1021/jm000271k",
    "PMID": "11101352",
    "Authors": "Barvian, M; Boschelli, DH; Cossrow, J; Dobrusin, E; Fattaey, A; Fritsch, A; Fry, D; Harvey, P; Keller, P; Garrett, M; La, F; Leopold, W; McNamara, D; Quin, M; Trumpp-Kallmeyer, S; Toogood, P; Wu, Z; Zhang, E",
    "Institution": "Parke-Davis Pharmaceutical Research",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search"
  },
  "_id": "6221-P24941"
}

{
  "subject": {
    "Target Name Assigned by Curator or DataSource": "Cyclin-Dependent Kinase 2 (CDK2)",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=97&target=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",       
    "BindingDB Target Chain  Sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
    "PDB ID(s) of Target Chain": "1CKP,1DI8,1DM2,1E1V,1E1X,1E9H,1F5Q,1FIN,1FQ1,1FVT,1FVV,1G5S,1GIH,1GY3,1GZ8,1H00,1H01,1H07,1H08,1H0V,1H0W,1H1P,1H1Q,1H1R,1H1S,1H24,1H25,1H26,1H27,1H28,1HCK,1HCL,1JST,1JSU,1JSV,1JVP,1KE5,1KE6,1KE7,1KE8,1KE9,1OGU,1OI9,1OIQ,1OIR,1OIU,1OIY,1OKV,1OKW,1OL1,1OL2,1P2A,1P5E,1PF8,1PKD,1PW2,1PXI,1PXJ,1PXK,1PXL,1PXM,1PXN,1PXO,1PXP,1PYE,1QMZ,1R78,1URC,1URW,1V1K,1VYW,1VYZ,1W0X,1W8C,1W98,1WCC,1Y8Y,1Y91,1YKR,2A0C,2A4L,2B52,2B53,2B54,2B55,2BHE,2BHH,2BKZ,2BPM,2BTR,2BTS,2C4G,2C5N,2C5O,2C5V,2C5X,2C5Y,2C68,2C69,2C6I,2C6K,2C6L,2C6M,2C6O,2C6T,2CCH,2CCI,2CJM,2CLX,2DUV,2EXM,2FVD,2G9X,2I40,2J9M,2JGZ,2R3F,2R3G,2R3H,2R3I,2R3J,2R3K,2R3L,2R3M,2R3N,2R3O,2R3P,2R3Q,2R3R,2R64,2UUE,2UZB,2UZD,2UZE,2UZL,2UZN,2UZO,2V0D,2V22,2VTA,2VTH,2VTI,2VTJ,2VTL,2VTM,2VTN,2VTO,2VTP,2VTQ,2VTR,2VTS,2VTT,2VU3,2VV9,2W05,2W06,2W17,2W1H,2WEV,2WFY,2WHB,2WIH,2WIP,2WMA,2WMB,2WPA,2WXV,2X1N,2XMY,2XNB,3BHT,3BHU,3BHV,3DDP,3DDQ,3DOG,3EID,3EJ1,3EOC,3EZR,3EZV,3F5X,3FZ1,3IG7,3IGG,3LE6,3LFN,3LFQ,3LFS,3MY5,3NS9,3PJ8,3PXF,3PXQ,3PXR,3PXY,3PXZ,3PY0,3PY1,3QHR,3QHW,3QL8,3QQF,3QQG,3QQH,3QQJ,3QQK,3QQL,3QRT,3QRU,3QTQ,3QTR,3QTS,3QTU,3QTW,3QTX,3QTZ,3QU0,3QWJ,3QWK,3QX2,3QX4,3QXO,3QXP,3QZF,3QZG,3QZH,3QZI,3R1Q,3R1S,3R1Y,3R28,3R6X,3R71,3R73,3R7E,3R7I,3R7U,3R7V,3R7Y,3R83,3R8L,3R8M,3R8P,3R8U,3R8V,3R8Z,3R9D,3R9H,3R9N,3R9O,3RAH,3RAI,3RAK,3RAL,3RJC,3RK5,3RK7,3RK9,3RKB,3RM6,3RM7,3RMF,3RNI,3ROY,3RPO,3RPR,3RPV,3RPY,3RZB,3S00,3S0O,3S1H,3S2P,3SQQ,3SW4,3SW7,3TI1,3TIY,3TIZ,3TNW,3ULI,3UNJ,3UNK,3WBL,4ACM,4BCK,4BCM,4BCN,4BCO,4BCP,4BCQ,4BGH,4BZD,4CFM,4CFN,4CFU,4CFV,4CFW,4CFX,4D1X,4D1Z,4EK3,4EK4,4EK5,4EK6,4EK8,4EOQ,4EOR,4EOS,4ERW,4EZ3,4EZ7,4FKG,4FKI,4FKJ,4FKL,4FKO,4FKP,4FKQ,4FKR,4FKS,4FKT,4FKU,4FKV,4FKW,4FX3,4GCJ,4I3Z,4II5,4KD1,4LYN,4NJ3,4RJ3,5A14,5AND,5ANE,5ANG,5ANI,5ANJ,5ANK,5ANO,5CYI,5D1J,5FP5,5FP6,5IEV,5IEX,5IEY,5IF1,5JQ5,5JQ8,5K4J,5L2W,5LMK,5MHQ,5NEV,5OO0,5OSJ,5UQ1,5UQ2,5UQ3,6ATH,6GUB,6GUC,6GUE,6GUF,6GUH,6GUK,6GVA,6INL,6JGM,6OQI,6P3W,6Q3B,6Q3C,6Q3F,6Q48,6Q49,6Q4A,6Q4B,6Q4C,6Q4D,6Q4E,6Q4F,6Q4H,6Q4J,6Q4K,6RIJ,6SG4,7ACK,7B5L,7B5R,7B7S,7E34,7KJS,7M2F,7NVQ,7RA5",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Cyclin-dependent kinase 2",
    "UniProt (SwissProt) Entry Name of Target Chain": "CDK2_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P24941",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "A8K7C6,O75100"
  },
  "object": {
    "BindingDB Reactant_set_id": "10159",
    "Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
    "Ligand InChI": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
    "Ligand InChI Key": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
    "BindingDB MonomerID": "6221",
    "BindingDB Ligand Name": "8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1",
    "PubChem CID": "5330199",
    "PubChem SID": "8035820",
    "ZINC ID of Ligand": "ZINC12354795",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "2"
  },
  "relation": {
    "IC50 (nM)": " 129",
    "Curation/DataSource": "Curated from the literature by BindingDB",
    "Article DOI": "10.1021/jm000271k",
    "PMID": "11101352",
    "Authors": "Barvian, M; Boschelli, DH; Cossrow, J; Dobrusin, E; Fattaey, A; Fritsch, A; Fry, D; Harvey, P; Keller, P; Garrett, M; La, F; Leopold, W; McNamara, D; Quin, M; Trumpp-Kallmeyer, S; Toogood, P; Wu, Z; Zhang, E",
    "Institution": "Parke-Davis Pharmaceutical Research",
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221",
    "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search"
  },
  "_id": "6221-P24941"
}

colleenXu · 2022-07-02T07:53:16Z

Just a note (Andrew and you are the decision-makers): is there a way to organize the documents by relationship? Maybe by what fields are present, or something else (values in the fields, organization of database, names of files)?

For example, if a document has "inhibition-specific" fields like IC50 and Ki, does that mean it will definitely lack the fields for EC50 (which can be "agonist / stimulator" but can also be more general "effect") and Kd (more general to receptor-ligand binding) (this source is helpful)? To me, that would imply that the document is representing an "inhibition" relationship that is more specific than "this binds to that"...

It would help the ingestion into BTE a lot if we could put a keyword under relation that defined the relationship this document actually represents. Like "inhibition", "stimulates", "binds to".....something like that.

rjawesome · 2022-07-02T23:07:42Z

I could probably look into other approaches, but I have found that a number of documents (~2000 in the first 300000) have both Ki and EC50, so I don't believe your method would work unless we allow multiple relationships. Alternatively, we could leave out the relationship for documents that have more than one of these fields.
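
A count like that could be obtained with a check along these lines. It is a sketch with made-up names (`measured`, `count_with_both`) and made-up sample values; only the field names come from the data file.

```python
MEASUREMENT_FIELDS = (
    "Ki (nM)", "IC50 (nM)", "Kd (nM)", "EC50 (nM)", "kon (M-1-s-1)", "koff (s-1)",
)


def measured(relation):
    """Return the measurement fields that carry a non-empty value."""
    return [name for name in MEASUREMENT_FIELDS if str(relation.get(name, "")).strip()]


def count_with_both(relations, first="Ki (nM)", second="EC50 (nM)"):
    """Count relation objects in which both given fields are filled in."""
    total = 0
    for relation in relations:
        present = measured(relation)
        if first in present and second in present:
            total += 1
    return total


# Made-up values for illustration only.
relations = [
    {"Ki (nM)": " 12", "EC50 (nM)": " 340"},
    {"IC50 (nM)": ">50000", "Ki (nM)": ""},
    {"EC50 (nM)": " 5"},
]
both = count_with_both(relations)
print(both)
```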

andrewsu · 2022-07-05T21:16:28Z

For now then, I think we should not worry about trying to characterize the relationship in more detail. Let's just add a top-level key for predicate with a value of 'physically interacts with'.

Also @rjawesome, can you add a mapping table between the original column names and the corresponding key to use in the JSON? For example, PDB ID(s) of Target Chain can be converted to just pdb, Ligand SMILES -> smiles, PubChem CID -> pubchem_cid. If you can create a document with the original column names we are using in the output JSON, I can provide the appropriate mapped values...
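
As a rough sketch of how such a mapping table and the `predicate` key might be applied: only the three mappings mentioned above are filled in, keys without a mapping are left unchanged, and `KEY_MAP`, `rename_keys` and `finalize` are hypothetical names.

```python
# Original column name -> output key. Only the examples from this thread so far.
KEY_MAP = {
    "PDB ID(s) of Target Chain": "pdb",
    "Ligand SMILES": "smiles",
    "PubChem CID": "pubchem_cid",
}


def rename_keys(section):
    """Rename keys found in KEY_MAP; leave all other keys as they are."""
    return {KEY_MAP.get(key, key): value for key, value in section.items()}


def finalize(doc):
    """Apply the key mapping to each nested section and add the top-level predicate."""
    out = {}
    for key, value in doc.items():
        out[key] = rename_keys(value) if isinstance(value, dict) else value
    out["predicate"] = "physically interacts with"
    return out


doc = {
    "_id": "6221-P24941",
    "subject": {"PDB ID(s) of Target Chain": "1CKP,1DI8"},
    "object": {"Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O", "PubChem CID": "5330199"},
    "relation": {"IC50 (nM)": " 410"},
}
result = finalize(doc)
print(result)
```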

rjawesome · 2022-07-05T23:36:43Z

All column names are located in this sample doc (#70 (comment)). I will add the predicate and start on coding a mapping table, but I was wondering what our decision was relating to the documents with duplicate IDs?

andrewsu · 2022-07-06T03:10:43Z

In cases where the _id is duplicated, then let's combine the relation info as objects in an array. Using the example you describe in this comment, the updated document would look something like this:

{
  "subject": {
    "Target Name Assigned by Curator or DataSource": "Protein kinase C alpha type",
    "Target Source Organism According to Curator or DataSource": "Homo sapiens",
    "Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=599&target=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
    "BindingDB Target Chain  Sequence": "MADVFPGNDSTASQDVANRFARKGALRQKNVHEVKDHKFIARFFKQPTFCSHCTDFIWGFGKQGFQCQVCCFVVHKRCHEFVTFSCPGADKGPDTDDPRSKHKFKIHTYGSPTFCDHCGSLLYGLIHQGMKCDTCDMNVHKQCVINVPSLCGMDHTEKRGRIYLKAEVADEKLHVTVRDAKNLIPMDPNGLSDPYVKLKLIPDPKNESKQKTKTIRSTLNPQWNESFTFKLKPSDKDRRLSVEIWDWDRTTRNDFMGSLSFGVSELMKMPASGWYKLLNQEEGEYYNVPIPEGDEEGNMELRQKFEKAKLGPAGNKVISPSEDRKQPSNNLDRVKLTDFNFLMVLGKGSFGKVMLADRKGTEELYAIKILKKDVVIQDDDVECTMVEKRVLALLDKPPFLTQLHSCFQTVDRLYFVMEYVNGGDLMYHIQQVGKFKEPQAVFYAAEISIGLFFLHKRGIIYRDLKLDNVMLDSEGHIKIADFGMCKEHMMDGVTTRTFCGTPDYIAPEIIAYQPYGKSVDWWAYGVLLYEMLAGQPPFDGEDEDELFQSIMEHNVSYPKSLSKEAVSVCKGLMTKHPAKRLGCGPEGERDVREHAFFRRIDWEKLENREIQPPFKPKVCGKGAENFDKFFTRGQPVLTPPDQLVIANIDQSDFEGFSYVNPQFVHPILQSAV",
    "PDB ID(s) of Target Chain": "4DNL,4RA4",
    "UniProt (SwissProt) Recommended Name of Target Chain": "Protein kinase C alpha type",
    "UniProt (SwissProt) Entry Name of Target Chain": "KPCA_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P17252",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "B5BU22,Q15137,Q32M72,Q96RE4",
    "UniProt (TrEMBL) Primary ID of Target Chain": "L7RSM7"
  },
  "object": {
    "BindingDB Reactant_set_id": "4491",
    "Ligand SMILES": "OC(=O)c1cccc(O)c1C(=O)c1c(O)cc(cc1O)C(=O)O[C@@H]1CCCNC[C@H]1NC(=O)c1ccc(O)cc1",
    "Ligand InChI": "InChI=1S/C28H26N2O10/c31-16-8-6-14(7-9-16)26(36)30-18-13-29-10-2-5-22(18)40-28(39)15-11-20(33)24(21(34)12-15)25(35)23-17(27(37)38)3-1-4-19(23)32/h1,3-4,6-9,11-12,18,22,29,31-34H,2,5,10,13H2,(H,30,36)(H,37,38)/t18-,22-/m1/s1",
    "Ligand InChI Key": "XYUFCXJZFZPEJD-XMSQKQJNSA-N",
    "BindingDB MonomerID": "3149",
    "BindingDB Ligand Name": "2-{[2,6-dihydroxy-4-({[(3R,4R)-3-[(4-hydroxybenzene)amido]azepan-4-yl]oxy}carbonyl)phenyl]carbonyl}-3-hydroxybenzoic acid::Acyclic Balanol Analog (-)-1::Balanol analog 1::Balanol, 1::CHEMBL60254",        
    "PubChem CID": "5287736",
    "PubChem SID": "8032894",
    "ChEMBL ID of Ligand": "CHEMBL60254",
    "ZINC ID of Ligand": "ZINC03871640",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "1"
  },
  "relation": [
    {
      "BindingDB Reactant_set_id": "4491",
      "IC50 (nM)": " 30",
      "Curation/DataSource": "Curated from the literature by BindingDB",
      "Article DOI": "10.1021/jm960581w",
      "PMID": "8978850",
      "Authors": "Defauw, JM; Murphy, MM; Jagdmann, GE; Hu, H; Lampe, JW; Hollinshead, SP; Mitchell, TJ; Crane, HM; Heerding, JM; Mendoza, JS; Davis, JE; Darges, JW; Hubbard, FR; Hall, SE",
      "Institution": "Sphinx Laboratories",
      "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=3149",
      "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=3149&enzyme=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
      "Ligand HET ID in PDB": "BA1",
      "PDB ID(s) for Ligand-Target Complex": "1BX6,3KRX,3KRW"
    }, {
      "BindingDB Reactant_set_id": "4239",
      "IC50 (nM)": " 74",
      "pH": "7.5",
      "Temp (C)": "30.00 C",
      "Curation/DataSource": "Curated from the literature by BindingDB",
      "Article DOI": "10.1016/0960-894X(95)00365-Z",
      "Authors": "Lai, YS; Menaldino, DS; Nichols, JB; Jagdmann , GE; Mylott, F; Gillespie, J; Hall, SE",
      "Institution": "Sphinx Laboratories",
      "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=3149",
      "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=3149&enzyme=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
      "Ligand HET ID in PDB": "BA1",
      "PDB ID(s) for Ligand-Target Complex": "1BX6,3KRX,3KRW"
    }
  ],
  "_id": "3149-P17252"
}

Note also that the BindingDB Reactant_set_id should be moved from the object section to the relation section.
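The merge described above could be sketched roughly as follows. This is only an illustration, not the parser's actual code: it assumes each row has already been turned into a document with a single relation dict, and the function name `merge_by_id` is hypothetical.

```python
def merge_by_id(row_docs):
    """Combine per-row documents that share an _id into one document.

    Each input document carries a single relation dict; the merged document
    carries a list of them, in input order. Subject/object values are taken
    from the first row seen for that _id.
    """
    merged = {}
    for doc in row_docs:
        if doc["_id"] not in merged:
            # First time this _id is seen: copy the doc and start the relation list.
            merged[doc["_id"]] = {**doc, "relation": [doc["relation"]]}
        else:
            merged[doc["_id"]]["relation"].append(doc["relation"])
    return list(merged.values())


# Two rows for the same ligand-target pair, trimmed from the example above.
rows = [
    {"_id": "3149-P17252",
     "relation": {"BindingDB Reactant_set_id": "4491", "IC50 (nM)": " 30"}},
    {"_id": "3149-P17252",
     "relation": {"BindingDB Reactant_set_id": "4239", "IC50 (nM)": " 74"}},
]
merged_docs = merge_by_id(rows)
```

Keeping the Reactant_set_id inside each relation object is what lets the two measurements stay distinguishable after the merge.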

rjawesome · 2022-07-06T04:21:28Z

I can do this, but as I pointed out earlier, other fields that seem to fit best in the object/subject are also duplicated, such as Target Name Assigned by Curator or DataSource, Link to Target in BindingDB, UniProt (TrEMBL) Primary ID of Target Chain, and Target Source Organism According to Curator or DataSource. Should I move those to the relation?

andrewsu · 2022-07-06T04:38:30Z

All the ones you explicitly listed in your last comment -- Target Name Assigned by Curator or DataSource, Link to Target in BindingDB, UniProt (TrEMBL) Primary ID of Target Chain, and Target Source Organism According to Curator or DataSource -- should remain in the subject section, and those values should be converted to arrays if they differ between records.

I also just noticed that Link to Ligand in BindingDB shows up in relation -- that should be moved to object.

Post here if there are any other fields whose behavior needs discussion...
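One possible reading of "converted to arrays if they differ between records", as a small Python sketch. The helper name `merge_value` is hypothetical, and the choice to keep a scalar when all records agree is an assumption about the intended behaviour.

```python
def merge_value(current, incoming):
    """Fold one more record's value into a subject/object field.

    Stays a scalar while every record agrees; becomes a list of the
    distinct values (in first-seen order) once records disagree.
    """
    if incoming is None:
        return current
    if current is None:
        return incoming
    values = list(current) if isinstance(current, list) else [current]
    if incoming not in values:
        values.append(incoming)
    return values if len(values) > 1 else values[0]


same = merge_value("Homo sapiens", "Homo sapiens")
different = merge_value("Cyclin-Dependent Kinase 2 (CDK2)", "CDK2/CycE")
```

Here `same` stays the plain string, while `different` becomes a two-element list holding both target names.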

rjawesome · 2022-07-06T16:57:02Z

Alright, I've updated the parser. Here is a new sample document (all fields that could have duplicates have been turned into arrays):

{
  "subject": {
    "Target Name Assigned by Curator or DataSource": [
      "Cyclin-Dependent Kinase 2 (CDK2)",
      "CDK2/CycE"
    ],
    "Target Source Organism According to Curator or DataSource": [
      "Homo sapiens"
    ],
    "Link to Target in BindingDB": [
      "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=97&target=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
      "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=81&target=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search"
    ],
    "BindingDB Target Chain  Sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
    "PDB ID(s) of Target Chain": [
      [
        "1CKP",
        "1DI8",
        ...
        "7M2F",
        "7NVQ",
        "7RA5"
      ]
    ],
    "UniProt (SwissProt) Recommended Name of Target Chain": [
      "Cyclin-dependent kinase 2"
    ],
    "UniProt (SwissProt) Entry Name of Target Chain": "CDK2_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P24941",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": [
      "A8K7C6",
      "O75100"
    ]
  },
  "object": {
    "Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
    "Ligand InChI": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
    "Ligand InChI Key": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
    "BindingDB MonomerID": 6221,
    "BindingDB Ligand Name": [
      "8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1"
    ],
    "Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221",
    "PubChem CID": 5330199,
    "PubChem SID": 8035820,
    "ZINC ID of Ligand": "ZINC12354795"
  },
  "relation": [
    {
      "BindingDB Reactant_set_id": 10159,
      "IC50 (nM)": " 129",
      "Curation/DataSource": "Curated from the literature by BindingDB",
      "Article DOI": "10.1021/jm000271k",
      "PMID": "11101352",
      "Authors": [
        "Barvian, M",
        "Boschelli, DH",
        "Cossrow, J",
        "Dobrusin, E",
        "Fattaey, A",
        "Fritsch, A",
        "Fry, D",
        "Harvey, P",
        "Keller, P",
        "Garrett, M",
        "La, F",
        "Leopold, W",
        "McNamara, D",
        "Quin, M",
        "Trumpp-Kallmeyer, S",
        "Toogood, P",
        "Wu, Z",
        "Zhang, E"
      ],
      "Institution": "Parke-Davis Pharmaceutical Research",
      "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
      "Number of Protein Chains in Target (>1 implies a multichain complex)": 2
    },
    {
      "BindingDB Reactant_set_id": 10166,
      "IC50 (nM)": " 410",
      "Curation/DataSource": "Curated from the literature by BindingDB",
      "Article DOI": "10.1021/jm000271k",
      "PMID": "11101352",
      "Authors": [
        "Barvian, M",
        "Boschelli, DH",
        "Cossrow, J",
        "Dobrusin, E",
        "Fattaey, A",
        "Fritsch, A",
        "Fry, D",
        "Harvey, P",
        "Keller, P",
        "Garrett, M",
        "La, F",
        "Leopold, W",
        "McNamara, D",
        "Quin, M",
        "Trumpp-Kallmeyer, S",
        "Toogood, P",
        "Wu, Z",
        "Zhang, E"
      ],
      "Institution": "Parke-Davis Pharmaceutical Research",
      "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search",
      "Number of Protein Chains in Target (>1 implies a multichain complex)": 2
    }
  ],
  "_id": "6221-P24941"
}

rjawesome · 2022-07-06T18:57:44Z

Also @andrewsu, you mentioned you wanted the fields to be mapped. If you still want that, could you send what you want each field to be mapped to? (FYI, all fields are located in this document.)

andrewsu · 2022-07-07T04:15:42Z

You can use the mapping table below. Note that I also decided to collapse the "swissprot" and "trembl" sets of fields under a single subject.uniprot object, and then we'd add a subject.uniprot.type field that was either "SwissProt" or "TrEMBL", depending on what set of columns the data came from. Let me know if anything here doesn't make sense!

mapping table

original	mapped
BindingDB Reactant_set_id	relation.bindingdb_set_id
Ligand SMILES	object.smiles
Ligand InChI	object.inchi
Ligand InChI Key	object.inchikey
BindingDB MonomerID	object.monomer_id
BindingDB Ligand Name	object.name
Target Name Assigned by Curator or DataSource	subject.name
Target Source Organism According to Curator or DataSource	subject.organism
Ki (nM)	relation.ki_nm
IC50 (nM)	relation.ic50_nm
Kd (nM)	relation.kd_nm
EC50 (nM)	relation.ec50_nm
kon (M-1-s-1)	relation.kon
koff (s-1)	relation.koff
pH	relation.ph
Temp (C)	relation.temp_c
Curation/DataSource	relation.curation_datasource
Article DOI	relation.article_doi
PMID	relation.pmid
PubChem AID	relation.pubchem_aid
Patent Number	relation.patent_number
Authors	relation.authors
Institution	relation.institution
Link to Ligand in BindingDB	object.bindingdb_link
Link to Target in BindingDB	subject.bindingdb_link
Link to Ligand-Target Pair in BindingDB	relation.bindingdb_link
Ligand HET ID in PDB	object.het_id_pdb
PDB ID(s) for Ligand-Target Complex	relation.pdb
PubChem CID	object.pubchem_cid
PubChem SID	object.pubchem_sid
ChEBI ID of Ligand	object.chebi
ChEMBL ID of Ligand	object.chembl
DrugBank ID of Ligand	object.drugbank
IUPHAR_GRAC ID of Ligand	object.iuphar_grac_id
KEGG ID of Ligand	object.kegg
ZINC ID of Ligand	object.zinc
Number of Protein Chains in Target (>1 implies a multichain complex)	relation.num_protein_chains
BindingDB Target Chain Sequence	subject.sequence
PDB ID(s) of Target Chain	subject.pdb
UniProt (SwissProt) Recommended Name of Target Chain	subject.uniprot.fullname
UniProt (SwissProt) Entry Name of Target Chain	subject.uniprot.id
UniProt (SwissProt) Primary ID of Target Chain	subject.uniprot.accession
UniProt (SwissProt) Secondary ID(s) of Target Chain	subject.uniprot.secondary_accession
UniProt (SwissProt) Alternative ID(s) of Target Chain	subject.uniprot.alternative_accession
UniProt (TrEMBL) Submitted Name of Target Chain	subject.uniprot.fullname
UniProt (TrEMBL) Entry Name of Target Chain	subject.uniprot.id
UniProt (TrEMBL) Primary ID of Target Chain	subject.uniprot.accession
UniProt (TrEMBL) Secondary ID(s) of Target Chain	subject.uniprot.secondary_accession
UniProt (TrEMBL) Alternative ID(s) of Target Chain	subject.uniprot.alternative_accession
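To illustrate how a table like this might be applied, here is a minimal Python sketch that turns dotted keys such as subject.uniprot.accession into nested objects. Only a few rows of the table are included, and the helper names `set_path` and `map_row` are made up for this example rather than taken from the parser.

```python
# A small excerpt of the mapping table above (original column -> dotted key).
MAPPING = {
    "Ligand SMILES": "object.smiles",
    "PubChem CID": "object.pubchem_cid",
    "IC50 (nM)": "relation.ic50_nm",
    "UniProt (SwissProt) Primary ID of Target Chain": "subject.uniprot.accession",
}


def set_path(doc, dotted_key, value):
    """Store value in doc at the nested location named by dotted_key."""
    *parents, leaf = dotted_key.split(".")
    node = doc
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def map_row(row):
    """Build a nested document from one TSV row, skipping empty columns."""
    doc = {}
    for column, dotted_key in MAPPING.items():
        value = row.get(column)
        if value not in (None, ""):
            set_path(doc, dotted_key, value)
    return doc


row = {
    "Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
    "PubChem CID": "5330199",
    "IC50 (nM)": " 129",
    "UniProt (SwissProt) Primary ID of Target Chain": "P24941",
    "Kd (nM)": "",
}
doc = map_row(row)
```

Because the SwissProt and TrEMBL columns map to the same dotted keys, a real implementation would also need to record which set of columns the values came from (the subject.uniprot.type field mentioned above).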

rjawesome · 2022-07-07T17:47:19Z

@andrewsu, for the subject.uniprot field, you put the Entry Name as the subject.uniprot.id. Does this mean I should be using the Entry Name in _id instead of the Primary ID? Also, should subject.uniprot be an array to contain both SwissProt and TrEMBL, or should I create separate documents for TrEMBL and SwissProt links?

andrewsu · 2022-07-07T21:45:24Z

> for the subject.uniprot field, you put the Entry Name as the subject.uniprot.id so does this mean I should be using the Entry Name in _id instead of the Primary ID.

I think you can continue using the UniProt (SwissProt) Primary ID of Target Chain / subject.uniprot.accession in _id...

> Also, should subject.uniprot be an array to contain both SwissProt and TrEMBL or should I create separate documents for TrEMBL and SwissProt links?

separate documents for TrEMBL and SwissProt please...

rjawesome · 2022-07-08T19:48:08Z

@andrewsu I've noticed that none of the TrEMBL documents meet my current criterion for determining whether a protein is human (i.e., the entry name ends with _HUMAN). More specifically, none of the TrEMBL documents have their own entry names. We have a few options here:

- Eliminate TrEMBL Documents
- Include a TrEMBL Document if the SwissProt entry name contains _HUMAN
- Broaden our criteria to also include any document with a target species of Homo sapiens as a human protein

andrewsu · 2022-07-12T04:30:38Z

Hmm, looks like for TrEMBL, they populate UniProt (TrEMBL) Primary ID of Target Chain but not UniProt (TrEMBL) Entry Name of Target Chain, which (as you point out) makes the species filtering challenging. We could download a uniprot table to look up the corresponding "Entry Name". But in practice, TrEMBL is a dataset of computationally predicted/annotated proteins, so they are of lesser importance than SwissProt entries.

So, bottom line, let's keep your filtering based on the "Entry Name". In practice, that means that no TrEMBL records will be created. But at least the logic will be in place in case they start populating those columns in the future...
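A sketch of what that filtering logic might look like in Python, assuming rows are dicts keyed by the original column names. The function name is hypothetical, and the lowercase type values follow the sample records in this thread rather than any confirmed convention.

```python
def human_uniprot_records(row):
    """Yield one uniprot object per UniProt section that passes the human filter.

    A section qualifies only when it has a Primary ID and its Entry Name
    ends with _HUMAN, so TrEMBL sections without an entry name are skipped.
    """
    for kind in ("SwissProt", "TrEMBL"):
        entry_name = (row.get(f"UniProt ({kind}) Entry Name of Target Chain") or "").strip()
        accession = (row.get(f"UniProt ({kind}) Primary ID of Target Chain") or "").strip()
        if accession and entry_name.endswith("_HUMAN"):
            yield {"type": kind.lower(), "id": entry_name, "accession": accession}


# Values taken from the Protein kinase C alpha example earlier in the thread.
row = {
    "UniProt (SwissProt) Entry Name of Target Chain": "KPCA_HUMAN",
    "UniProt (SwissProt) Primary ID of Target Chain": "P17252",
    "UniProt (TrEMBL) Primary ID of Target Chain": "L7RSM7",
}
records = list(human_uniprot_records(row))
```

With this row only the SwissProt record is produced, because the TrEMBL section has a Primary ID but no Entry Name to test against.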

rjawesome · 2022-07-12T18:01:18Z

Parser has been updated. New Sample Record below...

{
  "subject": {
    "name": [
      "Cyclin-Dependent Kinase 2 (CDK2)",
      "CDK2/CycE"
    ],
    "organism": [
      "Homo sapiens"
    ],
    "bindingdb_link": [
      "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=97&target=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
      "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=81&target=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search"
    ],
    "uniprot": {
      "type": "swissprot",
      "fullname": [
        "Cyclin-dependent kinase 2"
      ],
      "id": "CDK2_HUMAN",
      "accession": "P24941",
      "secondary_accession": [
        "A8K7C6",
        "O75100"
      ]
    },
    "sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
    "pdb": [
      [
        "1CKP",
        "1DI8",
        ...
        "7NVQ",
        "7RA5"
      ]
    ]
  },
  "object": {
    "smiles": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
    "inchi": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
    "inchikey": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
    "monomer_id": 6221,
    "name": [
      "8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1"
    ],
    "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221",
    "pubchem_cid": 5330199,
    "pubchem_sid": 8035820,
    "zinc": "ZINC12354795"
  },
  "relation": [
    {
      "bindingdb_set_id": 10159,
      "ic50_nm": " 129",
      "curation_datasource": "Curated from the literature by BindingDB",
      "article_doi": "10.1021/jm000271k",
      "pmid": "11101352",
      "authors": [
        "Barvian, M",
        "Boschelli, DH",
        "Cossrow, J",
        "Dobrusin, E",
        "Fattaey, A",
        "Fritsch, A",
        "Fry, D",
        "Harvey, P",
        "Keller, P",
        "Garrett, M",
        "La, F",
        "Leopold, W",
        "McNamara, D",
        "Quin, M",
        "Trumpp-Kallmeyer, S",
        "Toogood, P",
        "Wu, Z",
        "Zhang, E"
      ],
      "institution": "Parke-Davis Pharmaceutical Research",
      "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
      "num_protein_chains": 2
    },
    {
      "bindingdb_set_id": 10166,
      "ic50_nm": " 410",
      "curation_datasource": "Curated from the literature by BindingDB",
      "article_doi": "10.1021/jm000271k",
      "pmid": "11101352",
      "authors": [
        "Barvian, M",
        "Boschelli, DH",
        "Cossrow, J",
        "Dobrusin, E",
        "Fattaey, A",
        "Fritsch, A",
        "Fry, D",
        "Harvey, P",
        "Keller, P",
        "Garrett, M",
        "La, F",
        "Leopold, W",
        "McNamara, D",
        "Quin, M",
        "Trumpp-Kallmeyer, S",
        "Toogood, P",
        "Wu, Z",
        "Zhang, E"
      ],
      "institution": "Parke-Davis Pharmaceutical Research",
      "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search",
      "num_protein_chains": 2
    }
  ],
  "_id": "6221-P24941",
  "predicate": "physically interacts with"
}

rjawesome · 2022-07-15T20:21:35Z

Note, just fixed a glitch in my parser. Another sample record (that was affected).

{
  "subject": {
    "name": "Epidermal growth factor receptor",
    "organism": "Homo sapiens",
    "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=520&target=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
    "uniprot": {
      "type": "swissprot",
      "fullname": "Epidermal growth factor receptor",
      "id": "EGFR_HUMAN",
      "accession": "P00533",
      "secondary_accession": [
        "O00688",
        "O00732",
        "P06268",
        "Q14225",
        "Q68GS5",
        "Q92795",
        "Q9BZS2",
        "Q9GZX1",
        "Q9H2C9",
        "Q9H3C9",
        "Q9UMD7",
        "Q9UMD8",
        "Q9UMG5"
      ]
    },
    "sequence": "MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPSIATGMVGALLLLLVVALGIGLFMRRRHIVRKRTLRRLLQERELVEPLTPSGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQQGFFSSPSTSRTPLLSSLSATSNNSTVACIDRNGLQSCPIKEDSFLQRYSSDPTGALTEDSIDDTFLPVPEYINQSVPKRPAGSVQNPVYHNQPLNPAPSRDPHYQDPHSTAVGNPEYLNTVQPTCVNSTFDSPAHWAQKGSHQISLDNPDYQQDFFPKEAKPNGIFKGSTAENAEYLRVAPQSSEFIGA",
    "pdb": [
      [
        "1IVO",
        "1M14",
        "1M17",
        ...
        "7AEI",
        "7AEM",
        "7OXB"
      ],
      [
        "1IVO",
        "1M14",
        "1M17",
        ...
        "6VHN",
        "6VHP",
        "7AEI",
        "7AEM"
      ]
    ]
  },
  "object": {
    "smiles": "Cc1ccc(cc1)-n1nc(cc1NC(=O)Nc1ccc(OCCN2CCOCC2)c2ccccc12)C(C)(C)C",
    "inchi": "InChI=1S/C31H37N5O3/c1-22-9-11-23(12-10-22)36-29(21-28(34-36)31(2,3)4)33-30(37)32-26-13-14-27(25-8-6-5-7-24(25)26)39-20-17-35-15-18-38-19-16-35/h5-14,21H,15-20H2,1-4H3,(H2,32,33,37)",
    "inchikey": "MVCOAUNKQVWQHZ-UHFFFAOYSA-N",
    "monomer_id": 13533,
    "name": "1-[2-(4-methylphenyl)-5-tert-butyl-pyrazol-3-yl]-3-[4-(2-morpholin-4-ylethoxy)naphthalen-1-yl]urea::1-[5-tert-butyl-2-(4-methylphenyl)-3-pyrazolyl]-3-[4-[2-(4-morpholinyl)ethoxy]-1-naphthalenyl]urea::1-[5-tert-butyl-2-(4-methylphenyl)pyrazol-3-yl]-3-[4-(2-morpholin-4-ylethoxy)naphthalen-1-yl]urea::1-[5-tert-butyl-2-(p-tolyl)pyrazol-3-yl]-3-[4-(2-morpholinoethoxy)-1-naphthyl]urea::3-[2-(4-methylphenyl)-5-tert-butyl-pyrazol-3-yl]-1-[4-(2-morpholin-4-ylethoxy)naphthalen-1-yl]urea::3-[3-tert-butyl-1-(4-methylphenyl)-1H-pyrazol-5-yl]-1-{4-[2-(morpholin-4-yl)ethoxy]naphthalen-1-yl}urea::BIRB 796::BIRB-796::BIRB-796, 3::CHEMBL103667::Doramapimod::US8933228, BIRB 796::US9187470, 43 (BIRB-796)::US9242960, BIRB 796::US9260410, BIRB796::cid_156422::diaryl urea compound 10",
    "bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=13533",
    "het_id_pdb": "B96",
    "pubchem_cid": 156422,
    "pubchem_sid": 46513934,
    "chembl": "CHEMBL103667",
    "drugbank": "DB03044",
    "iuphar_grac_id": "5668",
    "zinc": "ZINC24044436"
  },
  "relation": [
    {
      "bindingdb_set_id": 65378,
      "kd_nm": " 7000",
      "curation_datasource": "PubChem",
      "pubchem_aid": "aid1433",
      "authors": [
        "PubChem, PC"
      ],
      "institution": "Ambit Biosciences",
      "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=13533&enzyme=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
      "pdb": [
        "4JVG",
        "1KV2",
        "6GTT",
        "5N66",
        "4TWN",
        "3NPC",
        "3FZS"
      ],
      "num_protein_chains": 1
    },
    {
      "bindingdb_set_id": 65395,
      "kd_nm": " 9100",
      "curation_datasource": "PubChem",
      "pubchem_aid": "aid1433",
      "authors": [
        "PubChem, PC"
      ],
      "institution": "Ambit Biosciences",
      "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=13533&enzyme=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
      "pdb": [
        "4JVG",
        "1KV2",
        "6GTT",
        "5N66",
        "4TWN",
        "3NPC",
        "3FZS"
      ],
      "num_protein_chains": 1
    },
    ...
    {
      "bindingdb_set_id": 50208407,
      "ic50_nm": ">20000",
      "curation_datasource": "ChEMBL",
      "article_doi": "10.1021/jm020057r",
      "pmid": "12086485",
      "authors": [
        "Regan, J",
        "Breitfelder, S",
        "Cirillo, P",
        "Gilmore, T",
        "Graham, AG",
        "Hickey, E",
        "Klaus, B",
        "Madwed, J",
        "Moriak, M",
        "Moss, N",
        "Pargellis, C",
        "Pav, S",
        "Proto, A",
        "Swinamer, A",
        "Tong, L",
        "Torcellini, C"
      ],
      "institution": "Boehringer Ingelheim Pharmaceuticals Inc",
      "bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=13533&enzyme=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
      "pdb": [
        "4JVG",
        "1KV2",
        "6GTT",
        "5N66",
        "4TWN",
        "3NPC",
        "3FZS"
      ],
      "num_protein_chains": 1
    }
  ],
  "_id": "13533-P00533",
  "predicate": "physically interacts with"
}

andrewsu · 2022-07-15T21:00:43Z

Looks great, nice work @rjawesome! I think this one is also ready to pass off to @erikyao for API creation...

erikyao · 2022-07-28T00:05:01Z

API published to https://biothings.ncats.io/bindingdb

erikyao · 2022-07-28T00:09:17Z

Hi @rjawesome, I forked your repo to https://github.com/biothings/BindingDB and made some changes. The significant change is to move the content of your mappings.json into parser.py. The reason is that a data plugin is dynamically imported by importlib internally so the relative path "./mappings.json" no longer works.
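For reference, a common alternative to inlining the mappings is to resolve the JSON path relative to the module file instead of the working directory. The sketch below shows that general Python technique only; it is an assumption that it would behave correctly under the plugin loader, and it is not what the fork does. The function name `load_mappings` is hypothetical.

```python
import json
import os


def load_mappings(module_file, filename="mappings.json"):
    """Load a JSON file that sits next to the given module file.

    module_file is normally the caller's __file__, so the lookup does not
    depend on the process's current working directory.
    """
    module_dir = os.path.dirname(os.path.abspath(module_file))
    with open(os.path.join(module_dir, filename)) as handle:
        return json.load(handle)


# Inside parser.py this would be called as: load_mappings(__file__)
```

Moving the mappings into parser.py, as done in the fork, avoids the question entirely and keeps the plugin to a single file.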

colleenXu · 2022-07-28T05:50:11Z

Noting that the next step is writing a SmartAPI yaml for this API. An intern can try to do this, or I'll stick it in my to-do list...

rjawesome · 2022-07-28T16:33:14Z

@colleenXu I could work on that

rjawesome · 2022-07-29T20:11:09Z

@colleenXu do you know how I can determine which fields should be put in the bte-response-mapping (or should I put all of them)? List of fields

colleenXu · 2022-07-30T04:54:34Z

Just in case, I'll talk about the main fields first:

- It looks like two-ish operations (depending on how many operations are needed to cover the "object" ligand id-prefixes): Gene -> SmallMolecule and SmallMolecule -> Gene
- For Gene (aka the "subject" part of the API), the field to grab IDs from is probably subject.uniprot.id. The subject ID-prefix can then be looked up by searching for "uniprot" in the biolink-model (it's UniProtKB, and yes, it's spelled exactly that way. Also, it's fine that the id-prefix isn't under Gene. There's Gene/Protein conflation, aka they're interchangeable, so we generally write the operations for Genes)
- For SmallMolecule (aka the "object" part of the API), you'll need to know some stuff:
  - Are there multiple ID-prefixes for the object in 1 record/document? It looks like there probably are...that's tricky because we don't really want redundant querying (aka retrieving the same document in two different sub-queries using different id-namespaces)
  - So...we want a set of id-prefixes that covers as much of the API as we can, without a lot of overlap between them (aka not many records/documents have all of those prefixes). Ideally, there's 1 main id-prefix that's in most of the records....and we can just use that. If there's a bunch, then we'll want to write operations for each (so there'll be more than 2 operations for the API)
    - I check this by doing queries like this to check how many records have a particular field, then comparing it to the number of total records in the API, which I can see on the API's main page: https://pending.biothings.io/bindingdb/query?q=_exists_:object.pubchem_cid
  - If a bunch of id-prefixes are equally good (in most of the records), use the id-prefixes list order for SmallMolecule (in biolink-model).
  - I suspect that you will use PUBCHEM.COMPOUND (which is PUBCHEM CID), and the field object.pubchem_cid. Other "good" IDs (aka I know Translator/BTE works well with them) are CHEBI and CHEMBL.COMPOUND (looks like that's what the CHEMBL IDs in the api are).
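The field-coverage check described above can be scripted. This Python sketch only builds the query URLs, reusing the base URL from the example query; the helper name is hypothetical, and reading the record count from a total field in the JSON response is an assumption about the API's response format.

```python
from urllib.parse import urlencode

# Base URL copied from the example query above.
BASE_URL = "https://pending.biothings.io/bindingdb/query"


def exists_query_url(field):
    """Build a query URL that matches every record where the field is present."""
    # safe=":" keeps the colon readable, matching the hand-written example URL.
    return f"{BASE_URL}?{urlencode({'q': f'_exists_:{field}'}, safe=':')}"


candidate_fields = ["object.pubchem_cid", "object.inchikey", "object.chembl", "object.chebi"]
urls = {field: exists_query_url(field) for field in candidate_fields}
```

Fetching each URL and comparing the reported number of hits against the total record count gives the per-field coverage used to choose an id-prefix.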

colleenXu · 2022-07-30T05:11:47Z

For the "other" fields in the response-mapping / retrieved in the fields section of the parameters, I suggest:

from subject:
- subject.name: this looks like it may be different from what's retrieved during BTE's ID-resolution, since it's assigned by the curator/datasource. This looks useful
- subject.organism: need that info
from relation:
- relation.curation_datasource: this is an example of the "source" I put in response-mapping (not on the top-level next to predicates aka the infores thing). It's useful
- relation.pmid (and if some records don't have pmid and instead have these other fields, include them: relation.article_doi, relation.patent_number). Outside resources are super useful.
- relation.bindingdb_link: Again, outside links are super useful

Some other relation fields are kind of interesting, but seem like "a lot of clutter", are hard to interpret, or can be looked up through the bindingdb link. So I think we can add a comment about them but otherwise not include them for now: relation.ki_nm, relation.ic50_nm, relation.kd_nm, relation.ec50_nm, relation.kon, relation.koff, relation.ph, relation.temp_c, relation.num_protein_chains

rjawesome · 2022-07-30T05:45:47Z

While it does have multiple identifiers, I think InChI/InChIKey is the most common identifier in the documents (a lot of them only have InChI/InChIKey), so I will probably just have an operation for that?

colleenXu · 2022-08-01T18:42:10Z

Huh....I count more documents with pubchem_cid than inchikey.

pubchem_cid: 1413051
inchikey: 1394153

So if we had to pick 1 id-prefix for the chemical stuff, I'd likely pick pubchem_cid (aka PUBCHEM.COMPOUND).

[EDIT: hmmm, so ~1.8% of records would not be retrieved if we only used pubchem_cid....maybe that's fine?]

Of the records that don't have pubchem_cid (25858):

- nearly all (25856) have SMILES, but that's not a valid ID-namespace...
- many have inchikey (23227), so including operations for inchikey should cover a lot of the records that don't have pubchem_cid...
- only 2631 records lack both a pubchem_cid and an inchikey. These records don't seem to have chebi or chembl IDs either....maybe just SMILES and bindingdb-monomer-id?

(also noting the inverse: of the 44756 records that don't have inchikey, almost all (42125) have a pubchem_cid.)
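The arithmetic behind these figures, written out as a short Python check using only the counts quoted in this comment (the variable names are illustrative):

```python
# Counts quoted above.
with_cid = 1413051
without_cid = 25858
with_inchikey = 1394153
without_inchikey = 44756
without_cid_with_inchikey = 23227
without_inchikey_with_cid = 42125

# Both ways of splitting the records should give the same total.
total = with_cid + without_cid
total_via_inchikey = with_inchikey + without_inchikey

# Share of records missed if only pubchem_cid is used for lookups.
missed_fraction = without_cid / total

# Records reachable by neither identifier, computed from each side.
missing_both = without_cid - without_cid_with_inchikey
missing_both_check = without_inchikey - without_inchikey_with_cid

print(f"total={total}, missed={missed_fraction:.2%}, missing_both={missing_both}")
```

The two totals agree (1438909), the missed share rounds to about 1.8%, and both subtractions give the same 2631 records that have neither identifier.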

rjawesome · 2022-08-01T19:00:04Z

Oh, I missed pubchem_cid. I can change it to that or add a separate operation with the cid. Also, I made a PR with the YAML -- right now it is using InChIKey.

colleenXu · 2022-08-01T19:01:07Z

I saw that. I suggest switching to using pubchem_cid (PUBCHEM.COMPOUND)

colleenXu · 2022-08-17T02:49:35Z

@rjawesome I edited the SmartAPI yaml (commit NCATS-Tangerine/translator-api-registry@f3b4ca2) before registering it and hooking it up to BTE.

Notes:

- check and edit the info.version, info.x-translator.biolink-version (this is whatever version BTE is using, the easiest way to check is to look at recent commit history for the module https://github.com/biothings/biolink-model.js/commits/main), examples for endpoints, and predicates in the x-bte operations.
- we use the key "pubmed" for fields with PMIDs. This is because BTE has special handling for PMIDs.

colleenXu · 2022-08-17T02:53:00Z

We still have to deploy changes to our instances to use this API in our team/general endpoints. We can query this API through the SmartAPI-specific endpoint using its registration ID 38e9e5169a72aee3659c9ddba956790d

Once it is deployed, an example query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:SmallMolecule"],
                    "ids": ["PUBCHEM.COMPOUND:134553288"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:physically_interacts_with"]
                }
            }
        }
    }
}
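
For scripting, the same query can be built and serialized in Python. This is a sketch only: `build_query` and `smartapi_query_url` are hypothetical helpers, the host is a placeholder, and the `/smartapi/<id>/query` path layout is an assumption about how a BTE instance exposes its SmartAPI-specific endpoint, so check it against the deployment you are using. No request is sent here.

```python
import json

# Registration ID quoted in the comment above
SMARTAPI_ID = "38e9e5169a72aee3659c9ddba956790d"

def build_query(compound_curie):
    """Build a one-hop TRAPI query: SmallMolecule -physically_interacts_with-> Gene."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {
                        "categories": ["biolink:SmallMolecule"],
                        "ids": [compound_curie],
                    },
                    "n1": {"categories": ["biolink:Gene"]},
                },
                "edges": {
                    "e01": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": ["biolink:physically_interacts_with"],
                    }
                },
            }
        }
    }

def smartapi_query_url(base_url, smartapi_id=SMARTAPI_ID):
    """Assumed path layout for a SmartAPI-specific endpoint (not confirmed here)."""
    return f"{base_url.rstrip('/')}/smartapi/{smartapi_id}/query"

query = build_query("PUBCHEM.COMPOUND:134553288")
url = smartapi_query_url("https://bte.example.org/v1")  # placeholder host
body = json.dumps(query, indent=4)  # payload to POST as application/json
```

Swapping the CURIE passed to `build_query` is enough to ask the same question about a different compound.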

And there'd be an edge like this in the response:

                "b82527794b6190f21cfe9da2d11fff93": {
                    "predicate": "biolink:physically_interacts_with",
                    "subject": "PUBCHEM.COMPOUND:134553288",
                    "object": "NCBIGene:187",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": [
                                "infores:biothings-explorer"
                            ],
                            "value_type_id": "biolink:InformationResource"
                        },
                        {
                            "attribute_type_id": "biolink:primary_knowledge_source",
                            "value": [
                                "infores:bindingdb"
                            ],
                            "value_type_id": "biolink:InformationResource"
                        },
                        {
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": [
                                "infores:biothings-bindingdb"
                            ],
                            "value_type_id": "biolink:InformationResource"
                        },
                        {
                            "attribute_type_id": "biolink:original_subject",
                            "value": "Apelin receptor"
                        },
                        {
                            "attribute_type_id": "in_taxon",
                            "value": "Homo sapiens"
                        },
                        {
                            "attribute_type_id": "bindingdb_curation_datasource",
                            "value": "US Patent"
                        },
                        {
                            "attribute_type_id": "bindingdb_url",
                            "value": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=456871&enzyme=Apelin+receptor&column=ki&startPg=0&Increment=50&submit=Search"
                        },
                        {
                            "attribute_type_id": "patent_number",
                            "value": "US10736883"
                        }
                    ]
                }
            }
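
When reading such an edge programmatically, note that the same `attribute_type_id` can occur more than once (here, two `biolink:aggregator_knowledge_source` entries), so a plain dict keyed by type would silently drop values. Below is a minimal sketch that groups values into lists, assuming the edge has already been parsed into a Python dict. `group_attributes` is a hypothetical helper, and the edge is abridged from the example above.

```python
from collections import defaultdict

def group_attributes(edge):
    """Map each attribute_type_id to a flat list of all its values."""
    grouped = defaultdict(list)
    for attr in edge.get("attributes", []):
        value = attr["value"]
        # A value may be a scalar or a list; normalize to a list before extending.
        values = value if isinstance(value, list) else [value]
        grouped[attr["attribute_type_id"]].extend(values)
    return dict(grouped)

# Abridged copy of the example edge
edge = {
    "predicate": "biolink:physically_interacts_with",
    "subject": "PUBCHEM.COMPOUND:134553288",
    "object": "NCBIGene:187",
    "attributes": [
        {"attribute_type_id": "biolink:aggregator_knowledge_source",
         "value": ["infores:biothings-explorer"]},
        {"attribute_type_id": "biolink:primary_knowledge_source",
         "value": ["infores:bindingdb"]},
        {"attribute_type_id": "biolink:aggregator_knowledge_source",
         "value": ["infores:biothings-bindingdb"]},
        {"attribute_type_id": "bindingdb_curation_datasource", "value": "US Patent"},
        {"attribute_type_id": "patent_number", "value": "US10736883"},
    ],
}

attrs = group_attributes(edge)
print(attrs["biolink:aggregator_knowledge_source"])
print(attrs["patent_number"])
```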

newgene added the "data source" label Jun 1, 2022

andrewsu assigned erikyao Jul 15, 2022

colleenXu closed this as completed Aug 17, 2022

andrewsu mentioned this issue Oct 5, 2022

Data source: FooDB #65

Closed

This was referenced Dec 20, 2022

Create creative mode templates for "What chemicals [qualified] affect (increases/decreases) a given protein/gene?" biothings/biothings_explorer#532

Closed

Discuss BindingDB issues: related to parser / API? #99

Closed

colleenXu mentioned this issue Jun 30, 2023

API Therapeutic Target Database (TTD) Deployment #123

Closed

This was referenced Sep 6, 2023

Issues with BioThings BindingDB object fields biothings/biothings_explorer#717

Open

BioThings BindingDB: can the relationship be more specific? biothings/biothings_explorer#718

Open

colleenXu mentioned this issue May 22, 2024

Problems with BioThings BindingDB #201

Open
