Data source: RaMP #69

Closed
newgene opened this issue Jun 1, 2022 · 10 comments

@newgene
Member

newgene commented Jun 1, 2022

name: RaMP (Relational database of Metabolomic Pathways)
url: https://rampdb.nih.gov/
download:
https://rampdb.nih.gov/about
https://figshare.com/articles/dataset/RaMP_Database_MySQL_Dump_v2_0_7_20220428/19674540
license: GPL v2

@newgene added the "data source" label (Data source pending to create a new API) Jun 1, 2022
@andrewsu
Member

From https://rampdb.nih.gov/about, I'm wondering if we already get many/most of these resources from the primary source?
image

@andrewsu
Member

@colleenXu pointed out that RaMP has an API. Given Translator interest in this data source, let's look at generating a SmartAPI annotation...

@colleenXu

With earlier post in that issue biothings/biothings_explorer#372 (comment)

@colleenXu

colleenXu commented Mar 17, 2023

Info

RaMP's github repo has two SmartAPI yamls; it's not clear which one we'd want to edit.

Also there's an older (alpha-version??) version of RaMP that was registered: http://smart-api.info/registry?q=3bfd9cecbcf799f800539ce24df1d754. Perhaps that registration needs removing / adjusting?

Other endpoints

There are also endpoints that look interesting, but we can't annotate them because they don't use IDs as inputs or as outputs

  • analytes-from-pathways: takes pathway names as input, not IDs. Output: Gene or SmallMolecule (chemical/metabolite). Should retrieve same info/opposite-direction compared to pathways-from-analytes.
  • ontologies-from-metabolites / metabolites-from-ontologies: interesting, but the ontologies are human-readable labels with no IDs
    • health condition
    • found in what subcellular location
    • found in what tissue/substructure, organ/component
    • perhaps more a node attribute (wouldn't annotate for associations)
      • found in what biofluid/excreta
      • found in what source in the world
      • used in what industrial application

@colleenXu

colleenXu commented Mar 17, 2023

Issues writing x-bte annotation

There are endpoints that meet our criteria (relationships between entities, entities have IDs), but we encounter issues parsing their responses.

I think these issues can be addressed with post-query processing, perhaps with the api-response-transform module (custom handler) or JQ (which hasn't been incorporated into BTE yet... )

pathways-from-analytes

  • ISSUE 1: the output ID field data.pathwayId can hold a WIKIPATHWAYS, REACT, or KEGG.PATHWAY ID (I'm not sure whether other namespaces are possible). BTE then has trouble processing this output correctly, similar to "CTD processing 3: handling output IDs when multiple ID prefixes are possible" (biothings_explorer#585)
    • note: these IDs don't have prefixes, but the value of data.pathwaySource seems to correspond to the ID namespace for each record (values seem to be wiki, reactome, or kegg). Perhaps a JQ list-filter could help in this particular case
  • ISSUE 2: this endpoint would have the batch-query processing issue described in "CTD processing 2: batch-queries" (biothings_explorer#584). The matching input is provided in the data.inputId field (using RaMP's format for prefix-spelling / capitalization)
Example of API response with 3 different output ID-namespaces
    {
      "pathwayName": "7q11.23 copy number variation syndrome",
      "pathwaySource": "wiki",
      "pathwayId": "WP4932",
      "inputId": "hmdb:HMDB0000148",
      "commonName": "Glutamate; L-Glutamic acid"
    },
    {
      "pathwayName": "Activation of AMPA receptors",
      "pathwaySource": "reactome",
      "pathwayId": "R-HSA-399710",
      "inputId": "hmdb:HMDB0000148",
      "commonName": "Glutamate; L-Glutamic acid"
    },

    {
      "pathwayName": "Glycine, serine and threonine metabolism",
      "pathwaySource": "kegg",
      "pathwayId": "map00260",
      "inputId": "hmdb:HMDB0000148",
      "commonName": "Glutamate; L-Glutamic acid"
    },
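A minimal sketch of the post-query fix for ISSUE 1, assuming the three pathwaySource values seen above (wiki, reactome, kegg) map onto the WIKIPATHWAYS, REACT, and KEGG.PATHWAY prefixes. The names SOURCE_TO_PREFIX and pathway_curie are hypothetical and not part of BTE or RaMP; other source values may exist and would need their own entries.

```python
# Hypothetical mapping from RaMP's pathwaySource values to CURIE prefixes.
# Only the three values observed in this thread are covered.
SOURCE_TO_PREFIX = {
    "wiki": "WIKIPATHWAYS",
    "reactome": "REACT",
    "kegg": "KEGG.PATHWAY",
}


def pathway_curie(record):
    """Return a prefixed pathway ID, or None if the source is unrecognized."""
    prefix = SOURCE_TO_PREFIX.get(record.get("pathwaySource"))
    if prefix is None:
        return None
    return f"{prefix}:{record['pathwayId']}"


# The three records from the example response above, trimmed to the
# two fields this sketch needs.
records = [
    {"pathwaySource": "wiki", "pathwayId": "WP4932"},
    {"pathwaySource": "reactome", "pathwayId": "R-HSA-399710"},
    {"pathwaySource": "kegg", "pathwayId": "map00260"},
]
curies = [pathway_curie(rec) for rec in records]
```

Returning None for an unknown source lets the caller decide whether to drop the record or log it, rather than guessing a prefix.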
Example of API response starting with 2 different input IDs
    {
      "pathwayName": "Glycine, serine and threonine metabolism",
      "pathwaySource": "kegg",
      "pathwayId": "map00260",
      "inputId": "hmdb:HMDB0000064",
      "commonName": "Creatine"
    },
    {
      "pathwayName": "Glycine, serine and threonine metabolism",
      "pathwaySource": "kegg",
      "pathwayId": "map00260",
      "inputId": "hmdb:HMDB0000148",
      "commonName": "Glutamate; L-Glutamic acid"
    },
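For ISSUE 2, the records of a batch response could be regrouped by the input they belong to before edges are built. The sketch below is one way to do that. The helper names are hypothetical, and upper-casing the prefix is an assumption that holds for hmdb in this thread but would not turn RaMP's entrez prefix into NCBIGene.

```python
from collections import defaultdict


def normalize_input_id(ramp_id):
    """Upper-case the prefix: 'hmdb:HMDB0000064' -> 'HMDB:HMDB0000064'.

    Assumption: upper-casing is enough. Prefixes such as 'entrez' would
    need an explicit lookup table instead.
    """
    prefix, sep, local = ramp_id.partition(":")
    return f"{prefix.upper()}:{local}" if sep else ramp_id


def group_by_input(records, input_field="inputId"):
    """Group response records under the normalized ID of their input."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[normalize_input_id(rec[input_field])].append(rec)
    return dict(grouped)


# The two records from the example above: same pathway, different inputs.
records = [
    {"pathwayId": "map00260", "inputId": "hmdb:HMDB0000064", "commonName": "Creatine"},
    {"pathwayId": "map00260", "inputId": "hmdb:HMDB0000148", "commonName": "Glutamate; L-Glutamic acid"},
]
grouped = group_by_input(records)
```

The input_field parameter is there because common-reaction-analytes reports the matching input in a differently named field.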

common-reaction-analytes

The endpoint seems to provide gene -> chem (gene2met) and chem -> gene (met2gene) associations between entities that are involved in the same reaction. We'd want to confirm this (note that the response doesn't say which reaction/pathway they're both in).

  • ISSUE 3: the output ID field data.rxn_partner_ids value is a ;-delimited string of the entity's IDs in multiple ID-namespaces, all using RaMP's ID-prefix spellings
    • would like to separate these IDs by namespace. May involve custom processing with JQ or code
  • also would have ISSUE 2 (batch-querying). The matching input is provided in the data.input_analyte field (using RaMP's format for prefix-spelling / capitalization)
Example output from two chemical input IDs
    {
      "query_relation": "met2gene",
      "input_analyte": "hmdb:HMDB0000148",
      "input_common_names": "Glutamate; L-Glutamic acid",
      "rxn_partner_common_name": "PPAT",
      "rxn_partner_ids": "ensembl:ENSG00000128059; entrez:5471; gene_symbol:PPAT; hmdb:HMDBP00331; uniprot:A8K4H7; uniprot:D6RCC8; uniprot:D6RE15; uniprot:Q06203"
    },


    {
      "query_relation": "met2gene",
      "input_analyte": "hmdb:HMDB0000064",
      "input_common_names": "Creatine",
      "rxn_partner_common_name": "CKMT2",
      "rxn_partner_ids": "ensembl:ENSG00000131730; entrez:1160; gene_symbol:CKMT2; hmdb:HMDBP00719; uniprot:A0A024RAK5; uniprot:D6R998; uniprot:D6RHV3; uniprot:P17540"
    },
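The custom processing mentioned for ISSUE 3 could look something like the sketch below, which splits the ;-delimited string and groups the local IDs under RaMP's own prefix spellings. The function name split_ids_by_namespace is hypothetical, and translating RaMP's prefixes into the ones BTE expects is left out.

```python
def split_ids_by_namespace(id_string):
    """Split 'prefix:id; prefix:id; ...' into {prefix: [id, ...]}."""
    by_namespace = {}
    for item in id_string.split(";"):
        item = item.strip()
        prefix, sep, local = item.partition(":")
        if not sep:
            # Skip empty fragments or values without a prefix.
            continue
        by_namespace.setdefault(prefix, []).append(local)
    return by_namespace


# rxn_partner_ids value from the PPAT record above.
ppat_ids = (
    "ensembl:ENSG00000128059; entrez:5471; gene_symbol:PPAT; "
    "hmdb:HMDBP00331; uniprot:A8K4H7; uniprot:D6RCC8; "
    "uniprot:D6RE15; uniprot:Q06203"
)
ppat_by_namespace = split_ids_by_namespace(ppat_ids)
```

With the IDs separated this way, an annotation could pick a single namespace (for example the entrez value) as the output ID.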

@colleenXu

colleenXu commented Mar 18, 2023

I have the work I've done so far in this fork: https://github.com/colleenXu/RaMP-Client/blob/x-bte-annotation/libs/features/ramp/ramp-api/src/assets/data/ramp_openapi_with_extensions.yml

I annotated the pathways-from-analytes endpoint for HMDB SmallMolecule -> REACT Pathway and NCBIGene Gene -> REACT Pathway. I tested it with prod/test code and main branch code for BTE and it "works", BUT...

  • it has ISSUE 1 described in the previous post (so the responses have nodes with incorrectly-formatted IDs because the IDs are actually KEGG.PATHWAY or WIKIPATHWAYS and not REACT). UPDATE: See examples in the collapsed sections of the next post
  • UPDATE 3-21: I've figured out how to direct BTE to write batch-queries to this API. However, I encounter ISSUE 2 described in the previous post. I explain it in more detail in the collapsed section below
walking through the batch-query processing issue
  • In a local copy of the yaml, set supportBatch: true for the chemical2pathway_1 operation.
  • Then set up your local BTE instance to use this yaml.
  • Query just this API with this TRAPI query that has two chemical IDs:
    • HMDB:HMDB0000064 creatine
    • HMDB:HMDB0000148 glutamate
BTE query
{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["HMDB:HMDB0000064", "HMDB:HMDB0000148"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Pathway"]
                }
            }
        }
    }
}
BTE correctly sets up the sub-query; these are the console logs I see
  bte:call-apis:query using template builder +0ms
  bte:call-apis:query {
  bte:call-apis:query   url: 'https://rampdb.nih.gov/api/pathways-from-analytes',
  bte:call-apis:query   params: {},
  bte:call-apis:query   data: { analytes: [ 'hmdb:HMDB0000064', 'hmdb:HMDB0000148' ] },
  bte:call-apis:query   method: 'post',
  bte:call-apis:query   timeout: 50000,
  bte:call-apis:query   headers: { 'User-Agent': 'BTE/dev Node/v16.18.0 darwin' }
  bte:call-apis:query } +7ms

That sub-query will return pathways linked to both IDs. When I query RaMP directly for each ID, I can see that there are 12 pathways linked to creatine (hmdb:HMDB0000064) and 231 pathways linked to glutamate (hmdb:HMDB0000148).
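For anyone who wants to repeat the direct query, here is a rough sketch that builds the same POST request as the sub-query in the console log, using only the Python standard library. The URL and the analytes body come from that log. The Content-Type header and the helper name build_ramp_request are assumptions.

```python
import json
import urllib.request

RAMP_URL = "https://rampdb.nih.gov/api/pathways-from-analytes"


def build_ramp_request(analytes):
    """Build (but do not send) a POST request for pathways-from-analytes."""
    body = json.dumps({"analytes": list(analytes)}).encode("utf-8")
    return urllib.request.Request(
        RAMP_URL,
        data=body,
        # Assumption: the API expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_ramp_request(["hmdb:HMDB0000064"])

# To send it (needs network access), something like:
# with urllib.request.urlopen(req, timeout=50) as resp:
#     records = json.load(resp)["data"]
```

Sending one request per ID and counting the returned records is how the per-chemical totals can be checked against the batch response.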

Example of objects in the response, one linked to creatine and the other to glutamate
{
    "data": [

    {
      "pathwayName": "Glycine, serine and threonine metabolism",
      "pathwaySource": "kegg",
      "pathwayId": "map00260",
      "inputId": "hmdb:HMDB0000064",
      "commonName": "Creatine"
    },
    {
      "pathwayName": "Glycine, serine and threonine metabolism",
      "pathwaySource": "kegg",
      "pathwayId": "map00260",
      "inputId": "hmdb:HMDB0000148",
      "commonName": "Glutamate; L-Glutamic acid"
    },

But BTE's response has 234 edges (not 243 = 12 to creatine + 231 to glutamate) and all edges say their input ID is creatine (PUBCHEM.COMPOUND:586 / HMDB:HMDB0000064)...which isn't right.

I think there are fewer edges than expected because some pathways were linked to both chemicals, but after the records for glutamate were incorrectly assigned to creatine, those records were merged (notice how the map00260 from the raw response above shows up in the console logs below as having two records bound to that result).

portion of the console logs:

  bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:586_&_n1-REACT:WP1495 has 2 +0ms
  bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:586_&_n1-REACT:map00260 has 2 +0ms
  bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:586_&_n1-REACT:R-HSA-388396 has 1 +0ms
  bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:586_&_n1-REACT:R-HSA-500792 has 1 +0ms
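A toy illustration of the suspected merge, under the assumption that edges are de-duplicated on the (input, pathway) pair. The function count_edges and the three trimmed records are illustrative only and are not BTE code. If the explanation above is right, the gap 243 - 234 = 9 would be the number of pathways shared by the two chemicals.

```python
def count_edges(records, subject_of):
    """Count distinct (subject, pathway) pairs, i.e. edges after merging."""
    return len({(subject_of(rec), rec["pathwayId"]) for rec in records})


# Toy data: map00260 is linked to both inputs, R-HSA-399710 only to glutamate.
records = [
    {"inputId": "hmdb:HMDB0000064", "pathwayId": "map00260"},
    {"inputId": "hmdb:HMDB0000148", "pathwayId": "map00260"},
    {"inputId": "hmdb:HMDB0000148", "pathwayId": "R-HSA-399710"},
]

# Each record keeps its own input: 3 edges.
correct = count_edges(records, lambda rec: rec["inputId"])

# Every record is assigned to creatine: the two map00260 records collapse
# into one edge, leaving 2.
collapsed = count_edges(records, lambda rec: "hmdb:HMDB0000064")
```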

@colleenXu

colleenXu commented Mar 21, 2023

My fork's yaml has been registered https://smart-api.info/registry?q=ac9c2ad11c5c442a1a1271223468ced1, so RaMP is accessible through BTE using an api-specific endpoint.

For now, sending POST-queries to the dev/ci instances of BTE is preferred (for the node label support). To query specifically RaMP through dev-BTE, POST to this url: https://api.bte.ncats.io/v1/smartapi/ac9c2ad11c5c442a1a1271223468ced1/query

Example query for Chemical -> Pathway

In the request-body:

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["HMDB:HMDB0000148", "HMDB:HMDB0000064"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Pathway"]
                }
            }
        }
    }
}

The response will have 242 results. Some nodes will have some incorrect curies (will have the REACT prefix but the ID is actually KEGG.PATHWAY or WIKIPATHWAYS)

Correct prefix (this is a REACT ID)
Screen Shot 2023-03-21 at 12 48 54 PM

Incorrect prefix (this is actually a WIKIPATHWAYS ID but has the wrong prefix)
Screen Shot 2023-03-21 at 12 49 02 PM

Example query for Gene -> Pathway

In the request-body:

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["NCBIGene:5241", "NCBIGene:4193"],
                    "categories": ["biolink:Gene"]
                },
                "n1": {
                    "categories": ["biolink:Pathway"]
                }
            }
        }
    }
}

The response will have 114 results. Some nodes will have incorrect curies (the REACT prefix on an ID that is actually KEGG.PATHWAY or WIKIPATHWAYS), like REACT:WP4262 (actually WIKIPATHWAYS).
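The two request bodies above differ only in the input IDs and the subject category, so they can be generated from one small helper. This is a sketch with a hypothetical function name, build_trapi_query. It reproduces the one-hop query graph used in this thread and nothing more.

```python
def build_trapi_query(subject_ids, subject_category, object_category,
                      predicate="biolink:related_to"):
    """Build a one-hop TRAPI query body shaped like the examples above."""
    return {
        "message": {
            "query_graph": {
                "edges": {
                    "e01": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": [predicate],
                    }
                },
                "nodes": {
                    "n0": {
                        "ids": list(subject_ids),
                        "categories": [subject_category],
                    },
                    "n1": {"categories": [object_category]},
                },
            }
        }
    }


gene_query = build_trapi_query(
    ["NCBIGene:5241", "NCBIGene:4193"], "biolink:Gene", "biolink:Pathway"
)
```

The resulting dictionary can be serialized with json.dumps and sent as the POST body to the api-specific BTE URL given earlier in this comment.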

Notes:

  • incorrect curies is related to ISSUE 1 above
  • edges are missing a primary_knowledge_source. This will be fixed when RaMP is added to the API_LIST with the primarySource tag (like below)
        {
            id: 'ac9c2ad11c5c442a1a1271223468ced1',
            name: 'RaMP API v1.0.1',
            primarySource: true
        },

@colleenXu

colleenXu commented Mar 22, 2023

Note that I've updated this post because I figured out how to get BTE to generate batch-queries and I was able to test how BTE processed the responses

(the yaml was updated colleenXu/RaMP-Client@456c022)

@andrewsu
Member

More info below from the RaMP developers:

The updated analytes-from-pathways endpoint is now in our RaMP production API. Below is information on using the analytes-from-pathways endpoint.

Here’s the url for the endpoint.

https://rampdb.nih.gov/api/analytes-from-pathways

It’s a post. Here’s a sample post body:

{
  "pathway": [
    "WP1601", "WP4846"   
  ],
  "analyte_type": "both",
  "names_or_ids": "ids",
  "match": "exact",
  "max_pathway_size": 500
}

The pathway argument can be an array of pathway ids, for Wikipathways, or Reactome pathways.

We don’t license KEGG. We do have some KEGG ‘maps’ (map ids), but it’s not comprehensive.

The analyte_type can be set to ‘metabolite’, ‘gene’ or ‘both’. The geneOrCompound field in the returned JSON will be either ‘gene’ or ‘compound’ (‘compound’ is the value used for metabolites).

The ‘names_or_ids’ parameter specifies if you are searching by IDs or by pathway names.

The ‘match’ parameter is set to ‘exact’ internally if the search is working on an ID list. Otherwise, the ‘match’ parameter can be set to ‘exact’ for an exact pathway name match or ‘fuzzy’.

Here fuzzy really just indicates that you can have a partial match on the names. That’s so that people might look for pathways related to TCA Cycle and just want to search on TCA.

For instance, if you wanted to get all pathways related to covid, and you wanted all genes and metabolites you could use this query body:

{
  "pathway": [
    "covid"   
  ],
  "analyte_type": "both",
  "names_or_ids": "names",
  "match": "fuzzy",
  "max_pathway_size": 500
}

The list of returned entities would be structured like this example:

        {
            "analyteName": "ACE2",
            "sourceAnalyteIDs": "ensembl:ENSG00000130234; entrez:59272; gene_symbol:ACE2; hmdb:HMDBP08177; hmdb:HMDBP13364; hmdb:HMDBP13365; uniprot:A0A7I2V2E9; uniprot:A0A7I2V3N4; uniprot:A0A7I2V3X6; uniprot:A0A7I2V4H0; uniprot:A0A7I2V5W5; uniprot:Q56NL1; uniprot:Q5EGZ1; uniprot:Q9BYF1",
            "geneOrCompound": "gene",
            "pathwayName": "COVID-19 adverse outcome pathway",
            "pathwayId": "WP4891",
            "pathwayCategory": "",
            "pathwayType": "wiki"
        }

The input parameter of max_pathway_size limits the pathway size (number of genes + metabolites) to be returned.

For instance, some pathways have an all-encompassing pathway called ‘Metabolism’ which really isn’t informative and contains a few thousand analytes.

If no limit is set, the default is that pathways with up to 1000 analytes will be returned.

API swagger documentation is here:

https://rampdb.nih.gov/api

*Note that this endpoint’s documentation has to be updated on this api swagger page to add the names_or_ids field, match, and max_pathway_size parameter descriptions.

The query bodies shown above will work on the swagger page, but we don’t have these new parameters described there.
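A small sketch of how the request body described above might be assembled on the client side. The function name build_analytes_request and its validation are hypothetical. The parameter names and allowed values are taken from the description above. Forcing match to exact for ID searches mirrors what the server is said to do internally.

```python
def build_analytes_request(pathways, names_or_ids="ids", analyte_type="both",
                           match="exact", max_pathway_size=None):
    """Build a POST body for the analytes-from-pathways endpoint."""
    if names_or_ids not in ("names", "ids"):
        raise ValueError("names_or_ids must be 'names' or 'ids'")
    if analyte_type not in ("metabolite", "gene", "both"):
        raise ValueError("analyte_type must be 'metabolite', 'gene' or 'both'")
    if match not in ("exact", "fuzzy"):
        raise ValueError("match must be 'exact' or 'fuzzy'")
    if names_or_ids == "ids":
        # ID searches are always exact, as described above.
        match = "exact"
    body = {
        "pathway": list(pathways),
        "analyte_type": analyte_type,
        "names_or_ids": names_or_ids,
        "match": match,
    }
    if max_pathway_size is not None:
        # Omitting the key leaves the server-side default in place.
        body["max_pathway_size"] = max_pathway_size
    return body


id_body = build_analytes_request(["WP1601", "WP4846"], max_pathway_size=500)
name_body = build_analytes_request(
    ["covid"], names_or_ids="names", match="fuzzy", max_pathway_size=500
)
```

These two calls reproduce the two sample bodies shown above.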

@colleenXu

Closing in favor of biothings/biothings_explorer#705, because (1) we seem to have decided to NOT make a pending BioThings API from this data and (2) it's easier to track this effort using the BioThings Explorer repo's tags.

However, the info in this issue is the basis of that issue.

@colleenXu closed this as not planned Aug 18, 2023