Data source: BindingDB #70
Comments
Conclusion: To start, let's go with the 1243406 records according to the minimal filtering described above. The goal is to create a JSON file with one record per row (compound-target association). The JSON structure should roughly follow the pattern described in #55.
Code snippets
NOTE: alias gawkt='awk -F"\t" -v OFS="\t"' |
I am currently working on this issue |
I made a basic parser for this: https://github.com/rjawesome/BindingDB_parser
I realized that some rows have more than one chemical -> protein relationship (indicated by "Number of Protein Chains in Target (>1 implies a multichain complex)"), so I have split those into separate documents.
At the moment I am filtering by checking whether the primary UniProt entry name ends with _HUMAN. I checked one of the records that has Homo sapiens as the species but lists CGH2_SHV21 (ID Q01043) as the protein; from the UniProt website it does not seem to be a human protein, but I could be mistaken.
Also, the parser takes around 2 minutes to run, so I am not sure whether I need to optimize my code or whether this is just a really big data file.
Sample Record: {
"object": {
"BindingDB Reactant_set_id": "143",
"Ligand SMILES": "Cc1nc(CN2CCN(CC2)c2c(Cl)cnc3[nH]c(nc23)-c2cn(C)nc2C)no1",
"Ligand InChI": "InChI=1S/C19H22ClN9O/c1-11-13(9-27(3)25-11)18-23-16-17(14(20)8-21-19(16)24-18)29-6-4-28(5-7-29)10-15-22-12(2)30-26-15/h8-9H,4-7,10H2,1-3H3,(H,21,23,24)",
"Ligand InChI Key": "ZYQKMYRXVHUATB-UHFFFAOYSA-N",
"BindingDB MonomerID": "247370",
"BindingDB Ligand Name": "US9447092, 3",
"Target Name Assigned by Curator or DataSource": "Cytochrome P450 3A4",
"Target Source Organism According to Curator or DataSource": "Homo sapiens",
"Ki (nM)": "",
"IC50 (nM)": ">50000",
"Kd (nM)": "",
"EC50 (nM)": "",
"kon (M-1-s-1)": "",
"koff (s-1)": "",
"pH": "",
"Temp (C)": "",
"Curation/DataSource": "US Patent",
"Article DOI": "",
"PMID": "",
"PubChem AID": "",
"Patent Number": "US9447092",
"Authors": "Blagg, J; Bavetsias, V; Moore, AS; Linardopoulos, S",
"Institution": "Cancer Research Technology Limited",
"Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=247370",
"Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=2127&target=Cytochrome+P450+3A4&column=ki&startPg=0&Increment=50&submit=Search",
"Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=247370&enzyme=Cytochrome+P450+3A4&column=ki&startPg=0&Increment=50&submit=Search",
"Ligand HET ID in PDB": "",
"PDB ID(s) for Ligand-Target Complex": "",
"PubChem CID": "71463198",
"PubChem SID": "346541913",
"ChEBI ID of Ligand": "",
"ChEMBL ID of Ligand": "",
"DrugBank ID of Ligand": "",
"IUPHAR_GRAC ID of Ligand": "",
"KEGG ID of Ligand": "",
"ZINC ID of Ligand": "",
"Number of Protein Chains in Target (>1 implies a multichain complex)": "1"
},
"subject": {
"BindingDB Target Chain Sequence": "MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPLPFLGNILSYHKGFCMFDMECHKKYGKVWGFYDGQQPVLAITDPDMIKTVLVKECYSVFTNRRPFGPVGFMKSAISIAEDEEWKRLRSLLSPTFTSGKLKEMVPIIAQYGDVLVRNLRREAETGKPVTLKDVFGAYSMDVITSTSFGVNIDSLNNPQDPFVENTKKLLRFDFLDPFFLSITVFPFLIPILEVLNICVFPREVTNFLRKSVKRMKESRLEDTQKHRVDFLQLMIDSQNSKETESHKALSDLELVAQSIIFIFAGYETTSSVLSFIMYELATHPDVQQKLQEEIDAVLPNKAPPTYDTVLQMEYLDMVVNETLRLFPIAMRLERVCKKDVEINGMFIPKGVVVMIPSYALHRDPKYWTEPEKFLPERFSKKNKDNIDPYIYTPFGSGPRNCIGMRFALMNMKLALIRVLQNFSFKPCKETQIPLKLSLGGLLQPEKPVVLKVESRDGTVSGA",
"PDB ID(s) of Target Chain": "1W0E,1W0F,1W0G,2J0D,2V0M,3NXU,4NY4,7LXL",
"UniProt (SwissProt) Recommended Name of Target Chain": "Cytochrome P450 3A4",
"UniProt (SwissProt) Entry Name of Target Chain": "CP3A4_HUMAN",
"UniProt (SwissProt) Primary ID of Target Chain": "P08684",
"UniProt (SwissProt) Secondary ID(s) of Target Chain": "P05184,Q16757,Q9UK50",
"UniProt (SwissProt) Alternative ID(s) of Target Chain": "",
"UniProt (TrEMBL) Submitted Name of Target Chain": "",
"UniProt (TrEMBL) Entry Name of Target Chain": "",
"UniProt (TrEMBL) Primary ID of Target Chain": "Q6GRK0",
"UniProt (TrEMBL) Secondary ID(s) of Target Chain": "",
"UniProt (TrEMBL) Alternative ID(s) of Target Chain": ""
}
} |
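The entry-name filter described in this comment could look roughly like the sketch below. It is only an illustration, not the code from the linked parser: the function name `is_human_target` and the tiny `subjects` list are hypothetical, while the column name and the two entry names come from the record and discussion above.

```python
# Column name as it appears in the BindingDB sample record above.
ENTRY_NAME_KEY = "UniProt (SwissProt) Entry Name of Target Chain"


def is_human_target(subject: dict) -> bool:
    """Return True when the UniProt entry name carries the _HUMAN suffix."""
    entry_name = (subject.get(ENTRY_NAME_KEY) or "").strip()
    return entry_name.endswith("_HUMAN")


# Hypothetical mini-input: one human entry, one viral entry, one empty entry name.
subjects = [
    {ENTRY_NAME_KEY: "CP3A4_HUMAN"},
    {ENTRY_NAME_KEY: "CGH2_SHV21"},
    {ENTRY_NAME_KEY: ""},
]

human_subjects = [s for s in subjects if is_human_target(s)]
print([s[ENTRY_NAME_KEY] for s in human_subjects])
```

Filtering on the entry name rather than on the curator-assigned organism column is what excludes a record like CGH2_SHV21 even when its species is listed as Homo sapiens.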
Updated Sample Document {
"object": {
"BindingDB Reactant_set_id": 199,
"Ligand SMILES": "CN(Cc1ccc(s1)C(=O)N[C@@H](CC(O)=O)C(=O)CSCc1ccccc1Cl)Cc1ccc(O)c(c1)C(O)=O",
"Ligand InChI": "InChI=1S/C27H27ClN2O7S2/c1-30(12-16-6-8-22(31)19(10-16)27(36)37)13-18-7-9-24(39-18)26(35)29-21(11-25(33)34)23(32)15-38-14-17-4-2-3-5-20(17)28/h2-10,21,31H,11-15H2,1H3,(H,29,35)(H,33,34)(H,36,37)/t21-/m0/s1",
"Ligand InChI Key": "FIEQQFOHZKVJLV-NRFANRHFSA-N",
"BindingDB MonomerID": 219,
"BindingDB Ligand Name": "5-({[(5-{[(2S)-1-carboxy-4-{[(2-chlorophenyl)methyl]sulfanyl}-3-oxobutan-2-yl]carbamoyl}thiophen-2-yl)methyl](methyl)amino}methyl)-2-hydroxybenzoic acid::Inhibitor 47c::Thiophene Scaffold 47c",
"PubChem CID": 5327301,
"PubChem SID": 8030144,
"ZINC ID of Ligand": "ZINC14942804",
"Number of Protein Chains in Target (>1 implies a multichain complex)": 1
},
"subject": {
"Target Name Assigned by Curator or DataSource": "Caspase-3",
"Target Source Organism According to Curator or DataSource": "Homo sapiens",
"Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=1072&target=Caspase-3&column=ki&startPg=0&Increment=50&submit=Search",
"BindingDB Target Chain Sequence": "MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRNLKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKITNFFRGDRCRSLTGKPKLFIIQACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVNRKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH",
"PDB ID(s) of Target Chain": [
"1GFW",
"1I3O",
"1NME",
"1PAU",
"1RE1",
"1RHJ",
"1RHK",
"1RHM",
"1RHQ",
"1RHR",
"1RHU",
"2C1E",
"2C2K",
"2C2M",
"2C2O",
"2CDR",
"2CJX",
"2CJY",
"2CNK",
"2CNL",
"2CNN",
"2CNO",
"2DKO",
"2H5I",
"2H5J",
"2H65",
"2J30",
"2XYG",
"2XYH",
"2XYP",
"2XZD",
"2XZT",
"2Y0B",
"3EDQ",
"3GJQ",
"3GJR",
"3GJS",
"3GJT",
"3H0E",
"3KJF",
"4DCJ",
"4DCO",
"4DCP",
"4JJE",
"4PRY",
"4PS0",
"5IC4",
"6BDV",
"6BFJ",
"6BFK",
"6BFL",
"6BFO",
"6BG0",
"6BG1",
"6BG4",
"6BGK",
"6BGQ",
"6BGR",
"6BGS",
"6BH9",
"6BHA",
"6CKZ",
"6CL0",
"6X8I",
"6X8K",
"7RN7",
"7RN8",
"7RN9",
"7RNB",
"7RND",
"7RNE",
"7RNF",
"7SEO"
],
"UniProt (SwissProt) Recommended Name of Target Chain": "Caspase-3",
"UniProt (SwissProt) Entry Name of Target Chain": "CASP3_HUMAN",
"UniProt (SwissProt) Primary ID of Target Chain": "P42574",
"UniProt (SwissProt) Secondary ID(s) of Target Chain": [
"A8K5M2",
"D3DP53",
"Q96AN1",
"Q96KP2"
]
},
"relation": {
"Ki (nM)": " 90",
"pH": "7.4",
"Temp (C)": "25.00 C",
"Curation/DataSource": "Curated from the literature by BindingDB",
"Article DOI": "10.1021/jm020230j",
"PMID": "12408711",
"Authors": "Choong, IC; Lew, W; Lee, D; Pham, P; Burdett, MT; Lam, JW; Wiesmann, C; Luong, TN; Fahr, B; DeLano, WL; McDowell, RS; Allen, DA; Erlanson, DA; Gordon, EM; O'Brien, T",
"Institution": "Sunesis Pharmaceuticals",
"Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=219",
"Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=219&enzyme=Caspase-3&column=ki&startPg=0&Increment=50&submit=Search"
}
} |
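Compared with the first sample, the updated document drops empty fields, stores numeric IDs as integers, and splits comma-separated ID lists into arrays. A minimal sketch of that kind of clean-up is shown below; the helper name `clean_fields` and the choice of which columns go into `LIST_KEYS` and `INT_KEYS` are assumptions inferred from the two samples, not taken from the parser itself.

```python
# Columns that hold comma-separated ID lists in the raw TSV (assumed from the samples).
LIST_KEYS = {
    "PDB ID(s) of Target Chain",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain",
}
# Columns that appear as integers in the updated sample document.
INT_KEYS = {
    "BindingDB Reactant_set_id",
    "BindingDB MonomerID",
    "PubChem CID",
    "PubChem SID",
}


def clean_fields(raw: dict) -> dict:
    """Drop empty values, split list columns, and cast numeric ID columns."""
    cleaned = {}
    for key, value in raw.items():
        if value == "":
            continue  # empty columns are omitted from the output document
        if key in LIST_KEYS:
            cleaned[key] = value.split(",")
        elif key in INT_KEYS:
            cleaned[key] = int(value)
        else:
            cleaned[key] = value
    return cleaned


# Values taken from the first sample record in this thread.
raw = {
    "BindingDB MonomerID": "247370",
    "Ki (nM)": "",
    "IC50 (nM)": ">50000",
    "UniProt (SwissProt) Secondary ID(s) of Target Chain": "P05184,Q16757,Q9UK50",
}
print(clean_fields(raw))
```

Measurement columns such as IC50 are left as strings here because values like ">50000" are not plain numbers.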
Just a note/caveat. I've pasted a (transposed) snippet of the data file which shows four records that are exactly the same except for the Ki. In this case, best to collapse these four input records into a single output record, where the Ki is an array. We should be careful to identify any other columns that need similar treatment. (Possibly to help that effort, we should add an |
Ok, I will check for that. Just curious, what software are you using to view the file? It seems like it would be tricky to view the file given its large size. (Looking at just the first few, I found that documents with the same ID according to your construction had different IC50 (nM) and/or Author/Institution/PMID/etc.) In this case, would we turn the Author/Institution/etc. into an array, or would we separate those entries? ... If we separate those entries, we couldn't use your idea for the _id key. |
That's a screenshot in Excel. I typically use command-line tools (like awk) to extract a very small subset of the file before trying to load it...
I think Ki, IC50, Kd, EC50, kon, and koff can all be treated the same -- put them in an array. I think that'd be fine for Author and Institution too, but can you post a couple examples, and if possible, a count of the number of times this occurs? Depending on those answers, we might add one level of grouping (under something like |
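The idea of collapsing rows that differ only in a measurement column could be sketched as follows. This is an assumption-laden illustration: the function name `collapse_measurements`, the shortened "UniProt ID" key, and the Ki values are placeholders, not data from the file.

```python
MEASUREMENT_KEYS = (
    "Ki (nM)", "IC50 (nM)", "Kd (nM)", "EC50 (nM)", "kon (M-1-s-1)", "koff (s-1)",
)


def collapse_measurements(rows):
    """Merge rows whose non-measurement columns match; collect measurements in lists."""
    merged = {}
    for row in rows:
        fixed = {k: v for k, v in row.items() if k not in MEASUREMENT_KEYS}
        # Rows count as duplicates when every non-measurement column is identical.
        identity = tuple(sorted(fixed.items()))
        target = merged.setdefault(identity, fixed)
        for key in MEASUREMENT_KEYS:
            value = (row.get(key) or "").strip()
            if value:
                target.setdefault(key, []).append(value)
    return list(merged.values())


# Placeholder rows: identical except for the Ki value.
rows = [
    {"BindingDB MonomerID": "219", "UniProt ID": "P42574", "Ki (nM)": "10", "IC50 (nM)": ""},
    {"BindingDB MonomerID": "219", "UniProt ID": "P42574", "Ki (nM)": "20", "IC50 (nM)": ""},
    {"BindingDB MonomerID": "219", "UniProt ID": "P42574", "Ki (nM)": "30", "IC50 (nM)": ""},
]
collapsed = collapse_measurements(rows)
print(collapsed)
```

Rows that also differ in Authors, PMID, or similar columns would not be merged by this sketch, which is the case the follow-up comments discuss.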
Here is the number of entries with the same ID but differing values for each key (computed over the first 300k entries of the table): {
"BindingDB Reactant_set_id":37936,
"IC50 (nM)":19365,
"PMID":7147,
"pH":6401,
"Authors":9407,
"Article DOI":6694,
"Temp (C)":6480,
"Institution":6862,
"Ki (nM)":9881,
"Target Name Assigned by Curator or DataSource":725,
"Link to Ligand-Target Pair in BindingDB":725,
"Link to Target in BindingDB":726,
"Kd (nM)":1138,
"Curation/DataSource":1753,
"Patent Number":4203,"EC50 (nM)":7173,
"Number of Protein Chains in Target (>1 implies a multichain complex)":433,
"UniProt (TrEMBL) Primary ID of Target Chain":298,
"PubChem AID":11479,
"Target Source Organism According to Curator or DataSource":106,
"PDB ID(s) for Ligand-Target Complex":29,
"kon (M-1-s-1)":68,
"koff (s-1)":48
} Here are some example documents where the ID is the same but the authors are different {
"subject": {
"Target Name Assigned by Curator or DataSource": "Protein kinase C alpha type",
"Target Source Organism According to Curator or DataSource": "Homo sapiens",
"Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=599&target=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
"BindingDB Target Chain Sequence": "MADVFPGNDSTASQDVANRFARKGALRQKNVHEVKDHKFIARFFKQPTFCSHCTDFIWGFGKQGFQCQVCCFVVHKRCHEFVTFSCPGADKGPDTDDPRSKHKFKIHTYGSPTFCDHCGSLLYGLIHQGMKCDTCDMNVHKQCVINVPSLCGMDHTEKRGRIYLKAEVADEKLHVTVRDAKNLIPMDPNGLSDPYVKLKLIPDPKNESKQKTKTIRSTLNPQWNESFTFKLKPSDKDRRLSVEIWDWDRTTRNDFMGSLSFGVSELMKMPASGWYKLLNQEEGEYYNVPIPEGDEEGNMELRQKFEKAKLGPAGNKVISPSEDRKQPSNNLDRVKLTDFNFLMVLGKGSFGKVMLADRKGTEELYAIKILKKDVVIQDDDVECTMVEKRVLALLDKPPFLTQLHSCFQTVDRLYFVMEYVNGGDLMYHIQQVGKFKEPQAVFYAAEISIGLFFLHKRGIIYRDLKLDNVMLDSEGHIKIADFGMCKEHMMDGVTTRTFCGTPDYIAPEIIAYQPYGKSVDWWAYGVLLYEMLAGQPPFDGEDEDELFQSIMEHNVSYPKSLSKEAVSVCKGLMTKHPAKRLGCGPEGERDVREHAFFRRIDWEKLENREIQPPFKPKVCGKGAENFDKFFTRGQPVLTPPDQLVIANIDQSDFEGFSYVNPQFVHPILQSAV",
"PDB ID(s) of Target Chain": "4DNL,4RA4",
"UniProt (SwissProt) Recommended Name of Target Chain": "Protein kinase C alpha type",
"UniProt (SwissProt) Entry Name of Target Chain": "KPCA_HUMAN",
"UniProt (SwissProt) Primary ID of Target Chain": "P17252",
"UniProt (SwissProt) Secondary ID(s) of Target Chain": "B5BU22,Q15137,Q32M72,Q96RE4",
"UniProt (TrEMBL) Primary ID of Target Chain": "L7RSM7"
},
"object": {
"BindingDB Reactant_set_id": "4491",
"Ligand SMILES": "OC(=O)c1cccc(O)c1C(=O)c1c(O)cc(cc1O)C(=O)O[C@@H]1CCCNC[C@H]1NC(=O)c1ccc(O)cc1",
"Ligand InChI": "InChI=1S/C28H26N2O10/c31-16-8-6-14(7-9-16)26(36)30-18-13-29-10-2-5-22(18)40-28(39)15-11-20(33)24(21(34)12-15)25(35)23-17(27(37)38)3-1-4-19(23)32/h1,3-4,6-9,11-12,18,22,29,31-34H,2,5,10,13H2,(H,30,36)(H,37,38)/t18-,22-/m1/s1",
"Ligand InChI Key": "XYUFCXJZFZPEJD-XMSQKQJNSA-N",
"BindingDB MonomerID": "3149",
"BindingDB Ligand Name": "2-{[2,6-dihydroxy-4-({[(3R,4R)-3-[(4-hydroxybenzene)amido]azepan-4-yl]oxy}carbonyl)phenyl]carbonyl}-3-hydroxybenzoic acid::Acyclic Balanol Analog (-)-1::Balanol analog 1::Balanol, 1::CHEMBL60254",
"PubChem CID": "5287736",
"PubChem SID": "8032894",
"ChEMBL ID of Ligand": "CHEMBL60254",
"ZINC ID of Ligand": "ZINC03871640",
"Number of Protein Chains in Target (>1 implies a multichain complex)": "1"
},
"relation": {
"IC50 (nM)": " 30",
"Curation/DataSource": "Curated from the literature by BindingDB",
"Article DOI": "10.1021/jm960581w",
"PMID": "8978850",
"Authors": "Defauw, JM; Murphy, MM; Jagdmann, GE; Hu, H; Lampe, JW; Hollinshead, SP; Mitchell, TJ; Crane, HM; Heerding, JM; Mendoza, JS; Davis, JE; Darges, JW; Hubbard, FR; Hall, SE",
"Institution": "Sphinx Laboratories",
"Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=3149", "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=3149&enzyme=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
"Ligand HET ID in PDB": "BA1",
"PDB ID(s) for Ligand-Target Complex": "1BX6,3KRX,3KRW"
},
"_id": "3149-P17252"
} {
"subject": {
"Target Name Assigned by Curator or DataSource": "Protein kinase C alpha type",
"Target Source Organism According to Curator or DataSource": "Homo sapiens",
"Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=599&target=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
"BindingDB Target Chain Sequence": "MADVFPGNDSTASQDVANRFARKGALRQKNVHEVKDHKFIARFFKQPTFCSHCTDFIWGFGKQGFQCQVCCFVVHKRCHEFVTFSCPGADKGPDTDDPRSKHKFKIHTYGSPTFCDHCGSLLYGLIHQGMKCDTCDMNVHKQCVINVPSLCGMDHTEKRGRIYLKAEVADEKLHVTVRDAKNLIPMDPNGLSDPYVKLKLIPDPKNESKQKTKTIRSTLNPQWNESFTFKLKPSDKDRRLSVEIWDWDRTTRNDFMGSLSFGVSELMKMPASGWYKLLNQEEGEYYNVPIPEGDEEGNMELRQKFEKAKLGPAGNKVISPSEDRKQPSNNLDRVKLTDFNFLMVLGKGSFGKVMLADRKGTEELYAIKILKKDVVIQDDDVECTMVEKRVLALLDKPPFLTQLHSCFQTVDRLYFVMEYVNGGDLMYHIQQVGKFKEPQAVFYAAEISIGLFFLHKRGIIYRDLKLDNVMLDSEGHIKIADFGMCKEHMMDGVTTRTFCGTPDYIAPEIIAYQPYGKSVDWWAYGVLLYEMLAGQPPFDGEDEDELFQSIMEHNVSYPKSLSKEAVSVCKGLMTKHPAKRLGCGPEGERDVREHAFFRRIDWEKLENREIQPPFKPKVCGKGAENFDKFFTRGQPVLTPPDQLVIANIDQSDFEGFSYVNPQFVHPILQSAV",
"PDB ID(s) of Target Chain": "4DNL,4RA4",
"UniProt (SwissProt) Recommended Name of Target Chain": "Protein kinase C alpha type",
"UniProt (SwissProt) Entry Name of Target Chain": "KPCA_HUMAN",
"UniProt (SwissProt) Primary ID of Target Chain": "P17252",
"UniProt (SwissProt) Secondary ID(s) of Target Chain": "B5BU22,Q15137,Q32M72,Q96RE4",
"UniProt (TrEMBL) Primary ID of Target Chain": "L7RSM7"
},
"object": {
"BindingDB Reactant_set_id": "4239",
"Ligand SMILES": "OC(=O)c1cccc(O)c1C(=O)c1c(O)cc(cc1O)C(=O)O[C@@H]1CCCNC[C@H]1NC(=O)c1ccc(O)cc1",
"Ligand InChI": "InChI=1S/C28H26N2O10/c31-16-8-6-14(7-9-16)26(36)30-18-13-29-10-2-5-22(18)40-28(39)15-11-20(33)24(21(34)12-15)25(35)23-17(27(37)38)3-1-4-19(23)32/h1,3-4,6-9,11-12,18,22,29,31-34H,2,5,10,13H2,(H,30,36)(H,37,38)/t18-,22-/m1/s1",
"Ligand InChI Key": "XYUFCXJZFZPEJD-XMSQKQJNSA-N",
"BindingDB MonomerID": "3149",
"BindingDB Ligand Name": "2-{[2,6-dihydroxy-4-({[(3R,4R)-3-[(4-hydroxybenzene)amido]azepan-4-yl]oxy}carbonyl)phenyl]carbonyl}-3-hydroxybenzoic acid::Acyclic Balanol Analog (-)-1::Balanol analog 1::Balanol, 1::CHEMBL60254",
"PubChem CID": "5287736",
"PubChem SID": "8032894",
"ChEMBL ID of Ligand": "CHEMBL60254",
"ZINC ID of Ligand": "ZINC03871640",
"Number of Protein Chains in Target (>1 implies a multichain complex)": "1"
},
"relation": {
"IC50 (nM)": " 74",
"pH": "7.5",
"Temp (C)": "30.00 C",
"Curation/DataSource": "Curated from the literature by BindingDB",
"Article DOI": "10.1016/0960-894X(95)00365-Z",
"Authors": "Lai, YS; Menaldino, DS; Nichols, JB; Jagdmann , GE; Mylott, F; Gillespie, J; Hall, SE",
"Institution": "Sphinx Laboratories",
"Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=3149", "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=3149&enzyme=Protein+kinase+C+alpha+type&column=ki&startPg=0&Increment=50&submit=Search",
"Ligand HET ID in PDB": "BA1",
"PDB ID(s) for Ligand-Target Complex": "1BX6,3KRX,3KRW"
},
"_id": "3149-P17252"
} For these two documents, the differing fields were 'Authors', 'PMID', 'Temp (C)', 'BindingDB Reactant_set_id', 'pH', 'IC50 (nM)', and 'Article DOI'.
Also, just a note: if we are combining documents together, I believe that would force the code to keep all previous documents in memory, which could cause high RAM usage (possibly around the size of the TSV itself). |
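The thread does not show how the per-key counts above were computed. One way such a tally might be produced is sketched below, comparing each document against the first document seen with the same `_id`; the helper names `flatten` and `count_differing_fields` are hypothetical, and the two documents are abbreviated versions of the protein kinase C examples.

```python
from collections import Counter


def flatten(doc: dict) -> dict:
    """Merge the subject, object, and relation sections into one flat dict."""
    flat = {}
    for section in ("subject", "object", "relation"):
        flat.update(doc.get(section, {}))
    return flat


def count_differing_fields(docs) -> Counter:
    """Count, per field, how often a document disagrees with the first one sharing its _id."""
    first_seen = {}
    differing = Counter()
    for doc in docs:
        flat = flatten(doc)
        reference = first_seen.setdefault(doc["_id"], flat)
        if reference is flat:
            continue  # first document with this _id, nothing to compare against
        for key in reference.keys() | flat.keys():
            if reference.get(key) != flat.get(key):
                differing[key] += 1
    return differing


# Abbreviated versions of the two example documents above.
docs = [
    {"_id": "3149-P17252",
     "object": {"BindingDB Reactant_set_id": "4491", "BindingDB MonomerID": "3149"},
     "relation": {"IC50 (nM)": " 30", "PMID": "8978850"}},
    {"_id": "3149-P17252",
     "object": {"BindingDB Reactant_set_id": "4239", "BindingDB MonomerID": "3149"},
     "relation": {"IC50 (nM)": " 74"}},
]
counts = count_differing_fields(docs)
print(dict(counts))
```

This keeps only one reference document per `_id` in memory, so the footprint grows with the number of distinct IDs rather than with the number of rows.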
Also, this doesn't just apply to fields in the relation; fields in the subject can also differ between documents with the same ID. Here is an example where the fields 'Link to Target in BindingDB', 'Link to Ligand-Target Pair in BindingDB', 'IC50 (nM)', and 'Target Name Assigned by Curator or DataSource' are different: {
"subject": {
"Target Name Assigned by Curator or DataSource": "CDK2/CycE",
"Target Source Organism According to Curator or DataSource": "Homo sapiens",
"Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=81&target=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search",
"BindingDB Target Chain Sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
"PDB ID(s) of Target Chain": "1CKP,1DI8,1DM2,1E1V,1E1X,1E9H,1F5Q,1FIN,1FQ1,1FVT,1FVV,1G5S,1GIH,1GY3,1GZ8,1H00,1H01,1H07,1H08,1H0V,1H0W,1H1P,1H1Q,1H1R,1H1S,1H24,1H25,1H26,1H27,1H28,1HCK,1HCL,1JST,1JSU,1JSV,1JVP,1KE5,1KE6,1KE7,1KE8,1KE9,1OGU,1OI9,1OIQ,1OIR,1OIU,1OIY,1OKV,1OKW,1OL1,1OL2,1P2A,1P5E,1PF8,1PKD,1PW2,1PXI,1PXJ,1PXK,1PXL,1PXM,1PXN,1PXO,1PXP,1PYE,1QMZ,1R78,1URC,1URW,1V1K,1VYW,1VYZ,1W0X,1W8C,1W98,1WCC,1Y8Y,1Y91,1YKR,2A0C,2A4L,2B52,2B53,2B54,2B55,2BHE,2BHH,2BKZ,2BPM,2BTR,2BTS,2C4G,2C5N,2C5O,2C5V,2C5X,2C5Y,2C68,2C69,2C6I,2C6K,2C6L,2C6M,2C6O,2C6T,2CCH,2CCI,2CJM,2CLX,2DUV,2EXM,2FVD,2G9X,2I40,2J9M,2JGZ,2R3F,2R3G,2R3H,2R3I,2R3J,2R3K,2R3L,2R3M,2R3N,2R3O,2R3P,2R3Q,2R3R,2R64,2UUE,2UZB,2UZD,2UZE,2UZL,2UZN,2UZO,2V0D,2V22,2VTA,2VTH,2VTI,2VTJ,2VTL,2VTM,2VTN,2VTO,2VTP,2VTQ,2VTR,2VTS,2VTT,2VU3,2VV9,2W05,2W06,2W17,2W1H,2WEV,2WFY,2WHB,2WIH,2WIP,2WMA,2WMB,2WPA,2WXV,2X1N,2XMY,2XNB,3BHT,3BHU,3BHV,3DDP,3DDQ,3DOG,3EID,3EJ1,3EOC,3EZR,3EZV,3F5X,3FZ1,3IG7,3IGG,3LE6,3LFN,3LFQ,3LFS,3MY5,3NS9,3PJ8,3PXF,3PXQ,3PXR,3PXY,3PXZ,3PY0,3PY1,3QHR,3QHW,3QL8,3QQF,3QQG,3QQH,3QQJ,3QQK,3QQL,3QRT,3QRU,3QTQ,3QTR,3QTS,3QTU,3QTW,3QTX,3QTZ,3QU0,3QWJ,3QWK,3QX2,3QX4,3QXO,3QXP,3QZF,3QZG,3QZH,3QZI,3R1Q,3R1S,3R1Y,3R28,3R6X,3R71,3R73,3R7E,3R7I,3R7U,3R7V,3R7Y,3R83,3R8L,3R8M,3R8P,3R8U,3R8V,3R8Z,3R9D,3R9H,3R9N,3R9O,3RAH,3RAI,3RAK,3RAL,3RJC,3RK5,3RK7,3RK9,3RKB,3RM6,3RM7,3RMF,3RNI,3ROY,3RPO,3RPR,3RPV,3RPY,3RZB,3S00,3S0O,3S1H,3S2P,3SQQ,3SW4,3SW7,3TI1,3TIY,3TIZ,3TNW,3ULI,3UNJ,3UNK,3WBL,4ACM,4BCK,4BCM,4BCN,4BCO,4BCP,4BCQ,4BGH,4BZD,4CFM,4CFN,4CFU,4CFV,4CFW,4CFX,4D1X,4D1Z,4EK3,4EK4,4EK5,4EK6,4EK8,4EOQ,4EOR,4EOS,4ERW,4EZ3,4EZ7,4FKG,4FKI,4FKJ,4FKL,4FKO,4FKP,4FKQ,4FKR,4FKS,4FKT,4FKU,4FKV,4FKW,4FX3,4GCJ,4I3Z,4II5,4KD1,4LYN,4NJ3,4RJ3,5A14,5AND,5ANE,5ANG,5ANI,5ANJ,5ANK,5ANO,5CYI,5D1J,5FP5,5FP6,5IEV,5IEX,5IEY,5IF1,5JQ5,5JQ8,5K4J,5L2W,5LMK,5MHQ,5NEV,5OO0,5OSJ,5UQ1,5UQ2,5UQ3,6ATH,6GUB,6GUC,6GUE,6GUF,6GUH,6GUK,6GVA,6INL,6JGM,6OQI,6P3W,6Q3B,6Q3C,6Q3F,6Q48,6Q49,6Q4A,6Q4B,6Q4C,6Q4D,6Q4E,6Q4F,6Q4H,6Q4J,6Q4K,6RIJ,6SG4,7ACK,7B5L,7B5R,7B7S,
7E34,7KJS,7M2F,7NVQ,7RA5",
"UniProt (SwissProt) Recommended Name of Target Chain": "Cyclin-dependent kinase 2",
"UniProt (SwissProt) Entry Name of Target Chain": "CDK2_HUMAN",
"UniProt (SwissProt) Primary ID of Target Chain": "P24941",
"UniProt (SwissProt) Secondary ID(s) of Target Chain": "A8K7C6,O75100"
},
"object": {
"BindingDB Reactant_set_id": "10166",
"Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
"Ligand InChI": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
"Ligand InChI Key": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
"BindingDB MonomerID": "6221",
"BindingDB Ligand Name": "8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1",
"PubChem CID": "5330199",
"PubChem SID": "8035820",
"ZINC ID of Ligand": "ZINC12354795",
"Number of Protein Chains in Target (>1 implies a multichain complex)": "2"
},
"relation": {
"IC50 (nM)": " 410",
"Curation/DataSource": "Curated from the literature by BindingDB",
"Article DOI": "10.1021/jm000271k",
"PMID": "11101352",
"Authors": "Barvian, M; Boschelli, DH; Cossrow, J; Dobrusin, E; Fattaey, A; Fritsch, A; Fry, D; Harvey, P; Keller, P; Garrett, M; La, F; Leopold, W; McNamara, D; Quin, M; Trumpp-Kallmeyer, S; Toogood, P; Wu, Z; Zhang, E",
"Institution": "Parke-Davis Pharmaceutical Research",
"Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221", "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search"
},
"_id": "6221-P24941"
} {
"subject": {
"Target Name Assigned by Curator or DataSource": "Cyclin-Dependent Kinase 2 (CDK2)",
"Target Source Organism According to Curator or DataSource": "Homo sapiens",
"Link to Target in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=97&target=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
"BindingDB Target Chain Sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
"PDB ID(s) of Target Chain": "1CKP,1DI8,1DM2,1E1V,1E1X,1E9H,1F5Q,1FIN,1FQ1,1FVT,1FVV,1G5S,1GIH,1GY3,1GZ8,1H00,1H01,1H07,1H08,1H0V,1H0W,1H1P,1H1Q,1H1R,1H1S,1H24,1H25,1H26,1H27,1H28,1HCK,1HCL,1JST,1JSU,1JSV,1JVP,1KE5,1KE6,1KE7,1KE8,1KE9,1OGU,1OI9,1OIQ,1OIR,1OIU,1OIY,1OKV,1OKW,1OL1,1OL2,1P2A,1P5E,1PF8,1PKD,1PW2,1PXI,1PXJ,1PXK,1PXL,1PXM,1PXN,1PXO,1PXP,1PYE,1QMZ,1R78,1URC,1URW,1V1K,1VYW,1VYZ,1W0X,1W8C,1W98,1WCC,1Y8Y,1Y91,1YKR,2A0C,2A4L,2B52,2B53,2B54,2B55,2BHE,2BHH,2BKZ,2BPM,2BTR,2BTS,2C4G,2C5N,2C5O,2C5V,2C5X,2C5Y,2C68,2C69,2C6I,2C6K,2C6L,2C6M,2C6O,2C6T,2CCH,2CCI,2CJM,2CLX,2DUV,2EXM,2FVD,2G9X,2I40,2J9M,2JGZ,2R3F,2R3G,2R3H,2R3I,2R3J,2R3K,2R3L,2R3M,2R3N,2R3O,2R3P,2R3Q,2R3R,2R64,2UUE,2UZB,2UZD,2UZE,2UZL,2UZN,2UZO,2V0D,2V22,2VTA,2VTH,2VTI,2VTJ,2VTL,2VTM,2VTN,2VTO,2VTP,2VTQ,2VTR,2VTS,2VTT,2VU3,2VV9,2W05,2W06,2W17,2W1H,2WEV,2WFY,2WHB,2WIH,2WIP,2WMA,2WMB,2WPA,2WXV,2X1N,2XMY,2XNB,3BHT,3BHU,3BHV,3DDP,3DDQ,3DOG,3EID,3EJ1,3EOC,3EZR,3EZV,3F5X,3FZ1,3IG7,3IGG,3LE6,3LFN,3LFQ,3LFS,3MY5,3NS9,3PJ8,3PXF,3PXQ,3PXR,3PXY,3PXZ,3PY0,3PY1,3QHR,3QHW,3QL8,3QQF,3QQG,3QQH,3QQJ,3QQK,3QQL,3QRT,3QRU,3QTQ,3QTR,3QTS,3QTU,3QTW,3QTX,3QTZ,3QU0,3QWJ,3QWK,3QX2,3QX4,3QXO,3QXP,3QZF,3QZG,3QZH,3QZI,3R1Q,3R1S,3R1Y,3R28,3R6X,3R71,3R73,3R7E,3R7I,3R7U,3R7V,3R7Y,3R83,3R8L,3R8M,3R8P,3R8U,3R8V,3R8Z,3R9D,3R9H,3R9N,3R9O,3RAH,3RAI,3RAK,3RAL,3RJC,3RK5,3RK7,3RK9,3RKB,3RM6,3RM7,3RMF,3RNI,3ROY,3RPO,3RPR,3RPV,3RPY,3RZB,3S00,3S0O,3S1H,3S2P,3SQQ,3SW4,3SW7,3TI1,3TIY,3TIZ,3TNW,3ULI,3UNJ,3UNK,3WBL,4ACM,4BCK,4BCM,4BCN,4BCO,4BCP,4BCQ,4BGH,4BZD,4CFM,4CFN,4CFU,4CFV,4CFW,4CFX,4D1X,4D1Z,4EK3,4EK4,4EK5,4EK6,4EK8,4EOQ,4EOR,4EOS,4ERW,4EZ3,4EZ7,4FKG,4FKI,4FKJ,4FKL,4FKO,4FKP,4FKQ,4FKR,4FKS,4FKT,4FKU,4FKV,4FKW,4FX3,4GCJ,4I3Z,4II5,4KD1,4LYN,4NJ3,4RJ3,5A14,5AND,5ANE,5ANG,5ANI,5ANJ,5ANK,5ANO,5CYI,5D1J,5FP5,5FP6,5IEV,5IEX,5IEY,5IF1,5JQ5,5JQ8,5K4J,5L2W,5LMK,5MHQ,5NEV,5OO0,5OSJ,5UQ1,5UQ2,5UQ3,6ATH,6GUB,6GUC,6GUE,6GUF,6GUH,6GUK,6GVA,6INL,6JGM,6OQI,6P3W,6Q3B,6Q3C,6Q3F,6Q48,6Q49,6Q4A,6Q4B,6Q4C,6Q4D,6Q4E,6Q4F,6Q4H,6Q4J,6Q4K,6RIJ,6SG4,7ACK,7B5L,7B5R,7B7S,
7E34,7KJS,7M2F,7NVQ,7RA5",
"UniProt (SwissProt) Recommended Name of Target Chain": "Cyclin-dependent kinase 2",
"UniProt (SwissProt) Entry Name of Target Chain": "CDK2_HUMAN",
"UniProt (SwissProt) Primary ID of Target Chain": "P24941",
"UniProt (SwissProt) Secondary ID(s) of Target Chain": "A8K7C6,O75100"
},
"object": {
"BindingDB Reactant_set_id": "10159",
"Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
"Ligand InChI": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
"Ligand InChI Key": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
"BindingDB MonomerID": "6221",
"BindingDB Ligand Name": "8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1",
"PubChem CID": "5330199",
"PubChem SID": "8035820",
"ZINC ID of Ligand": "ZINC12354795",
"Number of Protein Chains in Target (>1 implies a multichain complex)": "2"
},
"relation": {
"IC50 (nM)": " 129",
"Curation/DataSource": "Curated from the literature by BindingDB",
"Article DOI": "10.1021/jm000271k",
"PMID": "11101352",
"Authors": "Barvian, M; Boschelli, DH; Cossrow, J; Dobrusin, E; Fattaey, A; Fritsch, A; Fry, D; Harvey, P; Keller, P; Garrett, M; La, F; Leopold, W; McNamara, D; Quin, M; Trumpp-Kallmeyer, S; Toogood, P; Wu, Z; Zhang, E",
"Institution": "Parke-Davis Pharmaceutical Research",
"Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221", "Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search"
},
"_id": "6221-P24941"
} |
Just a note (Andrew and you are the decision-makers): is there a way to organize the documents by relationship? Maybe by what fields are present, or something else (values in the fields, organization of database, names of files)? For example, if a document has "inhibition-specific" fields like IC50 and Ki, does that mean it will definitely lack the fields for EC50 (which can be "agonist / stimulator" but can also be more general "effect") and Kd (more general to receptor-ligand binding) (this source is helpful)? To me, that would imply that the document is representing an "inhibition" relationship that is more specific than "this binds to that"... It would help the ingestion into BTE a lot if we could put a keyword under relation that defined the relationship this document actually represents. Like "inhibition", "stimulates", "binds to".....something like that. |
I could probably look into other things, but I have found that some documents (~2000 in the first 300000) have both Ki and EC50, so I don't believe your method would work unless we could have multiple relationships. Alternatively, we could leave out the relationship for documents that have more than one of these fields. |
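A check like the one described in this comment (how often two measurement types appear in the same document) might be done along these lines. The function name `cooccurrence_counts` is hypothetical, and the example relations mix values from the thread with a made-up EC50 value.

```python
from collections import Counter
from itertools import combinations

MEASUREMENT_KEYS = (
    "Ki (nM)", "IC50 (nM)", "Kd (nM)", "EC50 (nM)", "kon (M-1-s-1)", "koff (s-1)",
)


def cooccurrence_counts(relations) -> Counter:
    """Count how often each pair of measurement fields is populated in the same relation."""
    counts = Counter()
    for relation in relations:
        present = sorted(
            key for key in MEASUREMENT_KEYS if (relation.get(key) or "").strip()
        )
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


relations = [
    {"Ki (nM)": " 90", "EC50 (nM)": " 12"},  # EC50 value is made up for illustration
    {"IC50 (nM)": " 410"},
    {"IC50 (nM)": " 129", "Ki (nM)": ""},    # empty Ki does not count as present
]
pair_counts = cooccurrence_counts(relations)
print(pair_counts)
```

Each pair is stored as an alphabetically sorted tuple, so the Ki/EC50 combination is looked up as `("EC50 (nM)", "Ki (nM)")`.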
For now then, I think we should not worry about trying to characterize the relationship in more detail. Let's just add a top-level key for Also @rjawesome, can you add a mapping table between the original column names and the corresponding key to use in the JSON? For example, |
All column names are located in this sample doc (#70 (comment)). I will add the predicate and start on coding a mapping table, but I was wondering what our decision was relating to the documents with duplicate IDs? |
In cases where the
Note also that the |
I can do this, but as I pointed out earlier, there are other fields that would seem to best fit in the object/subject which are also duplicated, such as Target Name Assigned by Curator or DataSource, Link to Target in BindingDB, UniProt (TrEMBL) Primary ID of Target Chain, and Target Source Organism According to Curator or DataSource. Should I move those to the relation? |
All the ones you explicitly listed in your last comment -- I also just noticed that Post here if there are any other fields whose behavior needs discussion... |
Alright, I've updated the parser. Here is a new sample document (all fields that could have duplicates have been turned into arrays): {
"subject": {
"Target Name Assigned by Curator or DataSource": [
"Cyclin-Dependent Kinase 2 (CDK2)",
"CDK2/CycE"
],
"Target Source Organism According to Curator or DataSource": [
"Homo sapiens"
],
"Link to Target in BindingDB": [
"http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=97&target=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
"http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=81&target=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search"
],
"BindingDB Target Chain Sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
"PDB ID(s) of Target Chain": [
[
"1CKP",
"1DI8",
...
"7M2F",
"7NVQ",
"7RA5"
]
],
"UniProt (SwissProt) Recommended Name of Target Chain": [
"Cyclin-dependent kinase 2"
],
"UniProt (SwissProt) Entry Name of Target Chain": "CDK2_HUMAN",
"UniProt (SwissProt) Primary ID of Target Chain": "P24941",
"UniProt (SwissProt) Secondary ID(s) of Target Chain": [
"A8K7C6",
"O75100"
]
},
"object": {
"Ligand SMILES": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
"Ligand InChI": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
"Ligand InChI Key": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
"BindingDB MonomerID": 6221,
"BindingDB Ligand Name": [
"8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1"
],
"Link to Ligand in BindingDB": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221",
"PubChem CID": 5330199,
"PubChem SID": 8035820,
"ZINC ID of Ligand": "ZINC12354795"
},
"relation": [
{
"BindingDB Reactant_set_id": 10159,
"IC50 (nM)": " 129",
"Curation/DataSource": "Curated from the literature by BindingDB",
"Article DOI": "10.1021/jm000271k",
"PMID": "11101352",
"Authors": [
"Barvian, M",
"Boschelli, DH",
"Cossrow, J",
"Dobrusin, E",
"Fattaey, A",
"Fritsch, A",
"Fry, D",
"Harvey, P",
"Keller, P",
"Garrett, M",
"La, F",
"Leopold, W",
"McNamara, D",
"Quin, M",
"Trumpp-Kallmeyer, S",
"Toogood, P",
"Wu, Z",
"Zhang, E"
],
"Institution": "Parke-Davis Pharmaceutical Research",
"Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
"Number of Protein Chains in Target (>1 implies a multichain complex)": 2
},
{
"BindingDB Reactant_set_id": 10166,
"IC50 (nM)": " 410",
"Curation/DataSource": "Curated from the literature by BindingDB",
"Article DOI": "10.1021/jm000271k",
"PMID": "11101352",
"Authors": [
"Barvian, M",
"Boschelli, DH",
"Cossrow, J",
"Dobrusin, E",
"Fattaey, A",
"Fritsch, A",
"Fry, D",
"Harvey, P",
"Keller, P",
"Garrett, M",
"La, F",
"Leopold, W",
"McNamara, D",
"Quin, M",
"Trumpp-Kallmeyer, S",
"Toogood, P",
"Wu, Z",
"Zhang, E"
],
"Institution": "Parke-Davis Pharmaceutical Research",
"Link to Ligand-Target Pair in BindingDB": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search",
"Number of Protein Chains in Target (>1 implies a multichain complex)": 2
}
],
"_id": "6221-P24941"
} |
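For readers following along, the one-document-per-pair layout in the record above can be sketched as below. This is only an illustration: the flat row shape (`monomer_id`, `accession`, `relation`) and the function name `group_rows` are hypothetical, not taken from the parser. Only the `_id` pattern (`<MonomerID>-<UniProt accession>`) and the `relation` list come from the sample record.

```python
def group_rows(rows):
    """Collect flat ligand/target rows into one document per (ligand, target) pair.

    Each input row is a hypothetical dict with 'monomer_id', 'accession' and a
    'relation' dict holding the per-measurement fields.
    """
    docs = {}
    for row in rows:
        # Same _id pattern as the sample record, e.g. "6221-P24941".
        doc_id = f"{row['monomer_id']}-{row['accession']}"
        doc = docs.setdefault(doc_id, {"_id": doc_id, "relation": []})
        doc["relation"].append(row["relation"])
    return list(docs.values())


rows = [
    {"monomer_id": 6221, "accession": "P24941",
     "relation": {"BindingDB Reactant_set_id": 10159, "IC50 (nM)": " 129"}},
    {"monomer_id": 6221, "accession": "P24941",
     "relation": {"BindingDB Reactant_set_id": 10166, "IC50 (nM)": " 410"}},
]
docs = group_rows(rows)
```

With the two rows above, both measurements end up in the `relation` list of a single document.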
Also @andrewsu, you mentioned wanting the fields to be mapped. If you still want that, could you send what each field should be mapped to? (FYI, all fields are listed in this document.) |
You can use the mapping table below. Note that I also decided to collapse the "swissprot" and "trembl" sets of fields under a single mapping table
|
@andrewsu, for the |
I think you can continue using the
separate documents for TrEMBL and SwissProt please... |
@andrewsu I've noticed that none of the TrEMBL documents meet my current criterion for determining whether a protein is human (i.e., the entry name ends with _HUMAN). More specifically, none of the TrEMBL documents have their own entry names. We have a few options here:
|
Hmm, looks like for TrEMBL, they populate So, bottom line, let's keep your filtering based on the "Entry Name". In practice, that means that no TrEMBL records will be created. But at least the logic will be in place in case they start populating those columns in the future... |
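A minimal sketch of the "Entry Name" check discussed above, assuming the entry name arrives as a string or as `None`/empty when the column is unpopulated. The function name `is_human_entry` is hypothetical and the actual parser may do this differently.

```python
def is_human_entry(entry_name):
    """Return True when a UniProt entry name ends with the _HUMAN suffix.

    Missing or empty entry names (as with the TrEMBL columns discussed above)
    return False, so no record is created for them.
    """
    return bool(entry_name) and entry_name.endswith("_HUMAN")


examples = ["CDK2_HUMAN", "EGFR_HUMAN", "CGH2_SHV21", "", None]
human_flags = [is_human_entry(name) for name in examples]
```

Because an empty entry name fails the check, the logic stays in place and would start admitting TrEMBL records if those columns get populated later.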
Parser has been updated. New Sample Record below... {
"subject": {
"name": [
"Cyclin-Dependent Kinase 2 (CDK2)",
"CDK2/CycE"
],
"organism": [
"Homo sapiens"
],
"bindingdb_link": [
"http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=97&target=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
"http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=com&complexid=81&target=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search"
],
"uniprot": {
"type": "swissprot",
"fullname": [
"Cyclin-dependent kinase 2"
],
"id": "CDK2_HUMAN",
"accession": "P24941",
"secondary_accession": [
"A8K7C6",
"O75100"
]
},
"sequence": "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL",
"pdb": [
[
"1CKP",
"1DI8",
...
"7NVQ",
"7RA5"
]
]
},
"object": {
"smiles": "CCn1c2nc(Nc3ccccc3)ncc2ccc1=O",
"inchi": "InChI=1S/C15H14N4O/c1-2-19-13(20)9-8-11-10-16-15(18-14(11)19)17-12-6-4-3-5-7-12/h3-10H,2H2,1H3,(H,16,17,18)",
"inchikey": "WSZLNFZLFQJSAJ-UHFFFAOYSA-N",
"monomer_id": 6221,
"name": [
"8-Ethyl-2-phenylamino-8H-pyrido[2,3-d]pyrimidin-7-one::8-ethyl-2-(phenylamino)-7H,8H-pyrido[2,3-d]pyrimidin-7-one::C2 Pyrido[2,3-d]pyrimidin-7-one deriv. 1"
],
"bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=6221",
"pubchem_cid": 5330199,
"pubchem_sid": 8035820,
"zinc": "ZINC12354795"
},
"relation": [
{
"bindingdb_set_id": 10159,
"ic50_nm": " 129",
"curation_datasource": "Curated from the literature by BindingDB",
"article_doi": "10.1021/jm000271k",
"pmid": "11101352",
"authors": [
"Barvian, M",
"Boschelli, DH",
"Cossrow, J",
"Dobrusin, E",
"Fattaey, A",
"Fritsch, A",
"Fry, D",
"Harvey, P",
"Keller, P",
"Garrett, M",
"La, F",
"Leopold, W",
"McNamara, D",
"Quin, M",
"Trumpp-Kallmeyer, S",
"Toogood, P",
"Wu, Z",
"Zhang, E"
],
"institution": "Parke-Davis Pharmaceutical Research",
"bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=Cyclin-Dependent+Kinase+2+%28CDK2%29&column=ki&startPg=0&Increment=50&submit=Search",
"num_protein_chains": 2
},
{
"bindingdb_set_id": 10166,
"ic50_nm": " 410",
"curation_datasource": "Curated from the literature by BindingDB",
"article_doi": "10.1021/jm000271k",
"pmid": "11101352",
"authors": [
"Barvian, M",
"Boschelli, DH",
"Cossrow, J",
"Dobrusin, E",
"Fattaey, A",
"Fritsch, A",
"Fry, D",
"Harvey, P",
"Keller, P",
"Garrett, M",
"La, F",
"Leopold, W",
"McNamara, D",
"Quin, M",
"Trumpp-Kallmeyer, S",
"Toogood, P",
"Wu, Z",
"Zhang, E"
],
"institution": "Parke-Davis Pharmaceutical Research",
"bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=6221&enzyme=CDK2%2FCycE&column=ki&startPg=0&Increment=50&submit=Search",
"num_protein_chains": 2
}
],
"_id": "6221-P24941",
"predicate": "physically interacts with"
} |
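The renaming visible between the old and new sample records can be sketched as a simple key map. The pairs below are read off the two sample records in this thread (relation fields only); the helper name `rename_keys` and the choice to leave unmapped keys unchanged are assumptions for illustration.

```python
# Old BindingDB column name -> new field name, as seen in the sample records.
RELATION_KEY_MAP = {
    "BindingDB Reactant_set_id": "bindingdb_set_id",
    "IC50 (nM)": "ic50_nm",
    "Curation/DataSource": "curation_datasource",
    "Article DOI": "article_doi",
    "PMID": "pmid",
    "Authors": "authors",
    "Institution": "institution",
    "Link to Ligand-Target Pair in BindingDB": "bindingdb_link",
    "Number of Protein Chains in Target (>1 implies a multichain complex)": "num_protein_chains",
}


def rename_keys(record, key_map):
    """Return a copy of record with keys renamed; unmapped keys are kept as-is (assumption)."""
    return {key_map.get(key, key): value for key, value in record.items()}


old_relation = {
    "BindingDB Reactant_set_id": 10159,
    "IC50 (nM)": " 129",
    "PMID": "11101352",
}
new_relation = rename_keys(old_relation, RELATION_KEY_MAP)
```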
Note, just fixed a glitch in my parser. Another sample record (that was affected). {
"subject": {
"name": "Epidermal growth factor receptor",
"organism": "Homo sapiens",
"bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=pol&polymerid=520&target=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
"uniprot": {
"type": "swissprot",
"fullname": "Epidermal growth factor receptor",
"id": "EGFR_HUMAN",
"accession": "P00533",
"secondary_accession": [
"O00688",
"O00732",
"P06268",
"Q14225",
"Q68GS5",
"Q92795",
"Q9BZS2",
"Q9GZX1",
"Q9H2C9",
"Q9H3C9",
"Q9UMD7",
"Q9UMD8",
"Q9UMG5"
]
},
"sequence": "MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPSIATGMVGALLLLLVVALGIGLFMRRRHIVRKRTLRRLLQERELVEPLTPSGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQQGFFSSPSTSRTPLLSSLSATSNNSTVACIDRNGLQSCPIKEDSFLQRYSSDPTGALTEDSIDDTFLPVPEYINQSVPKRPAGSVQNPVYHNQPLNPAPSRDPHYQDPHSTAVGNPEYLNTVQPTCVNSTFDSPAHWAQKGSHQISLDNPDYQQDFFPKEAKPNGIFKGSTAENAEYLRVAPQSSEFIGA",
"pdb": [
[
"1IVO",
"1M14",
"1M17",
...
"7AEI",
"7AEM",
"7OXB"
],
[
"1IVO",
"1M14",
"1M17",
...
"6VHN",
"6VHP",
"7AEI",
"7AEM"
]
]
},
"object": {
"smiles": "Cc1ccc(cc1)-n1nc(cc1NC(=O)Nc1ccc(OCCN2CCOCC2)c2ccccc12)C(C)(C)C",
"inchi": "InChI=1S/C31H37N5O3/c1-22-9-11-23(12-10-22)36-29(21-28(34-36)31(2,3)4)33-30(37)32-26-13-14-27(25-8-6-5-7-24(25)26)39-20-17-35-15-18-38-19-16-35/h5-14,21H,15-20H2,1-4H3,(H2,32,33,37)",
"inchikey": "MVCOAUNKQVWQHZ-UHFFFAOYSA-N",
"monomer_id": 13533,
"name": "1-[2-(4-methylphenyl)-5-tert-butyl-pyrazol-3-yl]-3-[4-(2-morpholin-4-ylethoxy)naphthalen-1-yl]urea::1-[5-tert-butyl-2-(4-methylphenyl)-3-pyrazolyl]-3-[4-[2-(4-morpholinyl)ethoxy]-1-naphthalenyl]urea::1-[5-tert-butyl-2-(4-methylphenyl)pyrazol-3-yl]-3-[4-(2-morpholin-4-ylethoxy)naphthalen-1-yl]urea::1-[5-tert-butyl-2-(p-tolyl)pyrazol-3-yl]-3-[4-(2-morpholinoethoxy)-1-naphthyl]urea::3-[2-(4-methylphenyl)-5-tert-butyl-pyrazol-3-yl]-1-[4-(2-morpholin-4-ylethoxy)naphthalen-1-yl]urea::3-[3-tert-butyl-1-(4-methylphenyl)-1H-pyrazol-5-yl]-1-{4-[2-(morpholin-4-yl)ethoxy]naphthalen-1-yl}urea::BIRB 796::BIRB-796::BIRB-796, 3::CHEMBL103667::Doramapimod::US8933228, BIRB 796::US9187470, 43 (BIRB-796)::US9242960, BIRB 796::US9260410, BIRB796::cid_156422::diaryl urea compound 10",
"bindingdb_link": "http://www.bindingdb.org/bind/chemsearch/marvin/MolStructure.jsp?monomerid=13533",
"het_id_pdb": "B96",
"pubchem_cid": 156422,
"pubchem_sid": 46513934,
"chembl": "CHEMBL103667",
"drugbank": "DB03044",
"iuphar_grac_id": "5668",
"zinc": "ZINC24044436"
},
"relation": [
{
"bindingdb_set_id": 65378,
"kd_nm": " 7000",
"curation_datasource": "PubChem",
"pubchem_aid": "aid1433",
"authors": [
"PubChem, PC"
],
"institution": "Ambit Biosciences",
"bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=13533&enzyme=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
"pdb": [
"4JVG",
"1KV2",
"6GTT",
"5N66",
"4TWN",
"3NPC",
"3FZS"
],
"num_protein_chains": 1
},
{
"bindingdb_set_id": 65395,
"kd_nm": " 9100",
"curation_datasource": "PubChem",
"pubchem_aid": "aid1433",
"authors": [
"PubChem, PC"
],
"institution": "Ambit Biosciences",
"bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=13533&enzyme=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
"pdb": [
"4JVG",
"1KV2",
"6GTT",
"5N66",
"4TWN",
"3NPC",
"3FZS"
],
"num_protein_chains": 1
},
...
{
"bindingdb_set_id": 50208407,
"ic50_nm": ">20000",
"curation_datasource": "ChEMBL",
"article_doi": "10.1021/jm020057r",
"pmid": "12086485",
"authors": [
"Regan, J",
"Breitfelder, S",
"Cirillo, P",
"Gilmore, T",
"Graham, AG",
"Hickey, E",
"Klaus, B",
"Madwed, J",
"Moriak, M",
"Moss, N",
"Pargellis, C",
"Pav, S",
"Proto, A",
"Swinamer, A",
"Tong, L",
"Torcellini, C"
],
"institution": "Boehringer Ingelheim Pharmaceuticals Inc",
"bindingdb_link": "http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_ki.jsp?energyterm=kJ/mole&tag=r21&monomerid=13533&enzyme=Epidermal+growth+factor+receptor&column=ki&startPg=0&Increment=50&submit=Search",
"pdb": [
"4JVG",
"1KV2",
"6GTT",
"5N66",
"4TWN",
"3NPC",
"3FZS"
],
"num_protein_chains": 1
}
],
"_id": "13533-P00533",
"predicate": "physically interacts with"
} |
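The measurement fields in these records are strings such as `" 129"` or `">20000"`. A consumer that needs numbers could split them into a qualifier and a value, roughly as sketched here. This is not part of the parser; the function name `parse_measurement` and the `"="` default qualifier are assumptions.

```python
import re

# Optional < or > qualifier, then a plain decimal number, with optional padding.
_MEASUREMENT_RE = re.compile(r"^\s*([<>]?)\s*(\d+(?:\.\d+)?)\s*$")


def parse_measurement(raw):
    """Split a value like ' 129' or '>20000' into (qualifier, float).

    Returns None when the string does not match the simple pattern above.
    """
    match = _MEASUREMENT_RE.match(raw)
    if match is None:
        return None
    qualifier = match.group(1) or "="  # no sign is treated as an exact value
    return qualifier, float(match.group(2))


parsed = [parse_measurement(value) for value in [" 129", ">20000", " 7000", "n/a"]]
```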
Looks great, nice work @rjawesome! I think this one is also ready to pass off to @erikyao for API creation... |
API published to https://biothings.ncats.io/bindingdb |
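A rough sketch of how a query URL for this API might be built. It assumes the service follows the usual BioThings `/query?q=field:value` convention and that the field paths match the sample records above (`object.pubchem_cid`, etc.); both are assumptions, and the deployed field names may differ. The function name `build_query_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://biothings.ncats.io/bindingdb"


def build_query_url(pubchem_cid, fields=("subject.uniprot.accession", "relation.bindingdb_link")):
    """Build a query URL that looks up documents by PubChem CID.

    The /query path and the field names are assumptions based on the sample
    records in this thread, not confirmed against the live API.
    """
    params = {
        "q": f"object.pubchem_cid:{pubchem_cid}",
        "fields": ",".join(fields),
    }
    return f"{BASE_URL}/query?{urlencode(params)}"


url = build_query_url(156422)
```

The URL can then be fetched with any HTTP client; no request is made in the sketch itself.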
Hi @rjawesome, I forked your repo to https://github.com/biothings/BindingDB and made some changes. The significant change is to move the content of your |
Noting that the next step is writing a SmartAPI yaml for this API. An intern can try to do this, or I'll stick it in my to-do list... |
@colleenXu I could work on that |
@colleenXu do you know how I can determine which fields should be put in the bte-response-mapping (or should I put all of them)? List of fields |
Just in case, I'll talk about the main fields first:
|
For the "other" fields in the response-mapping / retrieved in the fields section of the parameters, I suggest:
Some other relation fields are kind of interesting, but they seem like "a lot of clutter", are hard to interpret, and anyone curious can follow the bindingdb link to learn more. So I think we can add a comment about them but otherwise not include them for now: relation.ki_nm, relation.ic50_nm, relation.kd_nm, relation.ec50_nm, relation.kon, relation.koff, relation.ph, relation.temp_c, relation.num_protein_chains |
While it does have multiple identifiers, I think InChi/InChiKey is the most common identifier in the documents (a lot of them only have InChi/InChiKey), so I will probably just have an operation for that? |
Huh... I count more documents with pubchem.cid than inchikey. pubchem_cid: 1413051. So if we had to pick 1 id-prefix for the chemical stuff, I'd likely pick pubchem_cid (aka PUBCHEM.COMPOUND). [EDIT: hmm, so ~1.8% of records or so would not be retrieved if we only used pubchem_cid... maybe that's fine?] Of the records that don't have pubchem_cid (25858):
(also noting the inverse: of the 44756 records that don't have inchikey, almost all (42125) have a pubchem_cid.) |
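The ~1.8% figure can be reproduced from the counts quoted above, assuming the records with and without pubchem_cid together make up the full set (that total is an inference, not a number stated in the thread):

```python
with_cid = 1413051         # documents that have pubchem_cid
without_cid = 25858        # documents that lack pubchem_cid
without_inchikey = 44756   # documents that lack inchikey
no_inchikey_but_cid = 42125  # of those, documents that do have pubchem_cid

total = with_cid + without_cid  # assumed to be the full document count
missing_cid_pct = 100 * without_cid / total
missing_inchikey_pct = 100 * without_inchikey / total
missing_both = without_inchikey - no_inchikey_but_cid

print(f"{missing_cid_pct:.1f}% of {total} documents lack pubchem_cid")
print(f"{missing_inchikey_pct:.1f}% lack inchikey; {missing_both} lack both")
```

Under that assumption, fewer documents are lost by keying on pubchem_cid than on inchikey.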
Oh, I missed pubchem_cid. I can change it to that or add a separate operation with the cid. Also, I made a PR with the YAML -- right now it is using InChiKey. |
I saw that. I suggest switching to using pubchem_cid (PUBCHEM.COMPOUND) |
@rjawesome I edited the SmartAPI yaml (commit NCATS-Tangerine/translator-api-registry@f3b4ca2) before registering it and hooking it up to BTE. Notes:
|
We still have to deploy changes to our instances to use this API in our team/general endpoints. We can query this API through the SmartAPI-specific endpoint using its registration ID 38e9e5169a72aee3659c9ddba956790d. Once it is deployed, Example Query:
And there'd be an edge like this in the response:
|
Name: BindingDB - curated protein-chemical bindings
URL: https://www.bindingdb.org/rwd/bind/index.jsp
Download: https://www.bindingdb.org/rwd/bind/chemsearch/marvin/SDFdownload.jsp?all_download=yes
License: publicly available with no license specified