Access to GMrepo using R through RESTful APIs
- Table of contents
- Install and load required libraries
- Phenotypes
- Get all phenotypes and statistics
- Get statistics on a phenotype
- Get associated species of a phenotype
- Get associated genera of a phenotype
- Calculate species/genera prevalence
- Get associated projects
- Get curated projects
- Get associated runs
- Get relative species/genus abundances in samples/runs associated with a phenotype
- Species/genera
- Get an overview of the species/genera
- Get summary information of the prevalence and relative abundance of a species/genus in all associated phenotypes
- Get detailed information of the prevalence and relative abundance of a species/genus in all associated phenotypes
- Get relative species/genus abundances for a sample/run
- Get relative species/genus abundances for a project
- Projects and runs
Please install the following required libraries:
## -- install httr
if( !requireNamespace("httr") ){
install.packages( "httr" );
}
## -- install jsonlite
if( !requireNamespace("jsonlite") ){
install.packages( "jsonlite" );
}
## -- install xml2
if( !requireNamespace("xml2") ){
install.packages( "xml2" );
}
## -- alternatively, install all of them at once:
install.packages( c( "httr", "jsonlite", "xml2" ) );
Load required libraries:
library("httr");      ## library() stops with an error if a package is missing
library("jsonlite");
library("xml2");
input: none,
output: a list
### -- get all phenotypes --
pheno_01 <- POST("https://gmrepo.humangut.info/api/get_all_phenotypes", body = list());
pheno_01_cont <- content(pheno_01);
all_phenotypes <- fromJSON( xml_text( pheno_01_cont ));
## -- a list --
str(all_phenotypes);
head(all_phenotypes$phenotypes);
The data.frame phenotypes in the retrieved list all_phenotypes (accessible as all_phenotypes$phenotypes) contains a list of phenotypes and related statistics, as shown at https://gmrepo.humangut.info/phenotypes.
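The three steps used above (POST, content, then fromJSON on the extracted text) recur in every call in this document. As a convenience they could be wrapped in a small helper. The following is only a minimal sketch: gmrepo_url and gmrepo_query are hypothetical names that are not part of GMrepo, httr, or jsonlite, the base URL is taken from the examples in this document, and it assumes that every endpoint accepts a JSON-encoded body.

```r
## -- hypothetical helpers (not part of GMrepo, httr, or jsonlite) --

## build the full URL of an API endpoint
gmrepo_url <- function( endpoint, base_url = "https://gmrepo.humangut.info/api" ){
  paste0( sub( "/+$", "", base_url ), "/", endpoint );
}

## send a POST request and parse the reply in the same way as above
## (assumption: the endpoint accepts a JSON-encoded body)
gmrepo_query <- function( endpoint, body = list() ){
  response <- httr::POST( gmrepo_url( endpoint ), body = body, encode = "json" );
  httr::stop_for_status( response );  ## fail early on HTTP errors
  jsonlite::fromJSON( xml2::xml_text( httr::content( response ) ) );
}
```

With such a helper, the first call would read all_phenotypes <- gmrepo_query( "get_all_phenotypes" ), and a call with parameters would read gmrepo_query( "getAssociatedSpeciesByMeshID", list( "mesh_id" = "D006262" ) ).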
Using the corresponding MeSH ID (e.g. D006262 for Health), users can first obtain summary statistics for the phenotype, including:
- number of associated species/genera
- number of total/processed/qualified runs
- and other information
input: a MeSH ID,
output: a list
## -- get summary information by mesh_id
pheno_02_query <- list( "mesh_id" = "D006262"); ## -- to get statistics on MeSH ID D006262
pheno_02 <- POST("https://gmrepo.humangut.info/api/getStatisticsByProjectsByMeshID", body = pheno_02_query, encode = "json");
pheno_02_cont <- content( pheno_02 );
phenotyp_stats <- fromJSON( xml_text( pheno_02_cont ));
## show data structure of the resulting list
str(phenotyp_stats);
Again, the MeSH ID D006262 is used as the input:
input: a MeSH ID,
output: a data.frame
## -- all associated species --
## -- please note only species that are found in >= 2 runs & with median relative abundance >= 0.01% will be retrieved
pheno_03 <- POST("https://gmrepo.humangut.info/api/getAssociatedSpeciesByMeshID", body = list( "mesh_id" = "D006262"), encode = "json");
pheno_03_cont <- content( pheno_03 );
phenotyp_assoc_species <- fromJSON( xml_text( pheno_03_cont ));
## -- the resulting variable is a data.frame --
head( phenotyp_assoc_species );
input: a MeSH ID,
output: a data.frame
## -- all associated genera --
## -- please note only genera that are found in >= 2 runs & with median relative abundance >= 0.01% will be retrieved
pheno_04 <- POST("https://gmrepo.humangut.info/api/getAssociatedGeneraByMeshID", body = list( "mesh_id" = "D006262"), encode = "json");
pheno_04_cont <- content( pheno_04 );
phenotyp_assoc_genera <- fromJSON( xml_text( pheno_04_cont ));
## -- the resulting variable is a data.frame --
head( phenotyp_assoc_genera );
Prevalence refers to the percentage of runs in which a species/genus is found out of the total number of valid runs; the latter can be found in phenotyp_stats$stats$nr_valid_samples.
## calculate species prevalence for all species associated with Health (D006262):
phenotyp_assoc_species$species_prevalence <- phenotyp_assoc_species$samples / phenotyp_stats$stats$nr_valid_samples * 100;
## then plot
plot( density( phenotyp_assoc_species$species_prevalence ) );
## calculate genus prevalence for all genera associated with Health (D006262):
phenotyp_assoc_genera$genus_prevalence <- phenotyp_assoc_genera$samples / phenotyp_stats$stats$nr_valid_samples * 100;
## then plot
plot( density( phenotyp_assoc_genera$genus_prevalence ) );
Users can also use the above data to find, for example, the top 10 most prevalent species/genera:
if( !requireNamespace("dplyr") ){
  install.packages( "dplyr" );
}
library(dplyr);
## -- sort by species_prevalence in decreasing order and keep the first 10 rows
## -- head() is used because top_n() without an explicit weight column ranks by the last column of the table
phenotyp_assoc_species %>% arrange( desc( species_prevalence ) ) %>% head( 10 );
input: a MeSH ID,
output: a data.frame
## -- all associated projects --
pheno_05 <- POST("https://gmrepo.humangut.info/api/getAssociatedProjectsByMeshID", body = list( "mesh_id" = "D006262"), encode = "json");
pheno_05_cont <- content( pheno_05 );
phenotyp_assoc_pros <- fromJSON( xml_text( pheno_05_cont ));
## -- the resulting variable is a data.frame --
head( phenotyp_assoc_pros );
Please note that a single project often contains samples/runs of multiple phenotypes.
input: none,
output: a data.frame
## -- all curated projects --
query <- POST("https://gmrepo.humangut.info/api/getCuratedProjectsList", body = list(), encode = "json");
retrieved_contents <- content( query );
curated_pros <- fromJSON( xml_text( retrieved_contents ));
## -- the resulting variable is a data.frame --
head( curated_pros );
Some phenotypes, e.g. Health (D006262), are associated with tens of thousands of runs, which is too many to retrieve in a single call. Retrieving all runs associated with a phenotype therefore takes two steps.
First, count the number of runs associated with a phenotype:
input: a MeSH ID,
output: a vector
## -- count associated runs --
pheno_07 <- POST("https://gmrepo.humangut.info/api/countAssociatedRunsByPhenotypeMeshID", body = list( "mesh_id" = "D006262"), encode = "json");
pheno_07_cont <- content( pheno_07 );
phenotyp_nr_assoc_runs <- fromJSON( xml_text( pheno_07_cont ));
## -- the resulting variable is a vector --
head( phenotyp_nr_assoc_runs );
Then users can use a loop to retrieve the associated runs, 100 runs at a time:
input: a MeSH ID, the number of records to skip, the number of records to retrieve; see below.
output: a data.frame
## -- get associated runs, one page at a time --
## use skip = 0, limit = 100 to retrieve the first 100 runs, then
## skip = 100, limit = 100 to retrieve the next 100 runs ....
params <- list( "mesh_id" = "D006262", "skip" = 0, "limit" = 100 );
pheno_08 <- POST("https://gmrepo.humangut.info/api/getAssociatedRunsByPhenotypeMeshIDLimit", body = params, encode = "json");
pheno_08_cont <- content( pheno_08 );
phenotyp_a_page_of_assoc_runs <- fromJSON( xml_text( pheno_08_cont ));
## -- the resulting variable is a data.frame --
head( phenotyp_a_page_of_assoc_runs );
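The loop itself is not shown above, so here is one possible sketch of it. The names fetch_all_runs and fetch_runs_page are hypothetical helpers introduced for illustration. The sketch assumes that every page comes back as a data.frame with the same flat columns, so that the pages can be combined with rbind; this is worth checking with str() on one page first.

```r
## -- hypothetical helper: collect all pages of a paged query --
## fetch_page is any function( skip, limit ) that returns a data.frame
fetch_all_runs <- function( fetch_page, total, page_size = 100 ){
  pages <- list();
  skip <- 0;
  while( skip < total ){
    page <- fetch_page( skip, page_size );
    if( is.null( page ) || NROW( page ) == 0 ) break;  ## stop on an empty page
    pages[[ length( pages ) + 1 ]] <- page;
    skip <- skip + page_size;
  }
  if( length( pages ) == 0 ) return( data.frame() );
  do.call( rbind, pages );  ## assumption: all pages share the same columns
}

## -- hypothetical helper: one page of runs, using the same call as above --
fetch_runs_page <- function( skip, limit, mesh_id = "D006262" ){
  params <- list( "mesh_id" = mesh_id, "skip" = skip, "limit" = limit );
  response <- httr::POST( "https://gmrepo.humangut.info/api/getAssociatedRunsByPhenotypeMeshIDLimit", body = params, encode = "json" );
  jsonlite::fromJSON( xml2::xml_text( httr::content( response ) ) );
}
```

A call would then look like all_runs <- fetch_all_runs( fetch_runs_page, total = n ), where n is the run count obtained in the first step. Passing the page-fetching function as an argument keeps the loop independent of the network call, so it can be tried on local data first.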
To get the related information, two input parameters are required:
- the MeSH ID of interest, e.g. D003093 for Colitis, Ulcerative,
- the NCBI taxonomy ID of the species/genus of interest, e.g. 40520 for Blautia obeum (species).
params <- list( "mesh_id" = "D003093", "ncbi_taxon_id" = "40520" );
query <- POST("https://gmrepo.humangut.info/api/getMicrobeAbundancesByPhenotypeMeshIDAndNCBITaxonID", body = params, encode = "json");
retrieved_contents <- content( query );
data <- fromJSON( xml_text( retrieved_contents ));
## -- the resulting variable is a list --
str( data );
The resulting data is a list containing:
- hist_data_for_phenotype: a data frame containing the distribution of the relative abundances of the species/genus of interest in all samples of the current phenotype,
- hist_data_for_health: if the current phenotype is not Health, the distribution of the relative abundances of the species/genus of interest in all samples of Health is also retrieved,
- abundant_data_for_disease: a numeric vector containing the relative abundances of the species/genus of interest in all samples of the current phenotype,
- abundant_data_for_health: if the current phenotype is not Health, the relative abundances of the species/genus of interest in all samples of Health are also retrieved,
- taxon: NCBI taxonomy information for the current taxonomy ID,
- disease: details of the current phenotype,
- abundance_and_meta_data: runs in which the current taxon is found, with related metadata,
- co_occurred_taxa: taxa that co-occur with the taxon of interest in the current phenotype.
See https://gmrepo.humangut.info/phenotypes/D003093/40520 for more details.
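One possible use of the two numeric vectors in the returned list is a quick comparison between the phenotype and Health. The sketch below uses synthetic numbers in place of API results, compare_abundances is a hypothetical helper, and the Wilcoxon rank-sum test is chosen here for illustration only; it is not stated to be the method used by GMrepo.

```r
## -- hypothetical helper: compare relative abundances between two groups --
compare_abundances <- function( disease, health ){
  test <- wilcox.test( disease, health, exact = FALSE );  ## two-sided rank-sum test
  list(
    median_disease = median( disease ),
    median_health  = median( health ),
    p_value        = test$p.value
  );
}

## synthetic numbers for illustration only, not GMrepo data
set.seed( 1 );
fake_disease <- rlnorm( 50, meanlog = 0 );
fake_health  <- rlnorm( 60, meanlog = 0.5 );
res <- compare_abundances( fake_disease, fake_health );
str( res );
```

With real data, the call would be compare_abundances( data$abundant_data_for_disease, data$abundant_data_for_health ), which only makes sense when the queried phenotype is not Health.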
input: none,
output: a list.
### --- get all species and genera that are present in >= 2 runs with median relative abundance >= 0.01%
query <- POST("https://gmrepo.humangut.info/api/get_all_gut_microbes", body = NULL, encode = "json");
retrieved_contents <- content( query );
data <- fromJSON( xml_text( retrieved_contents ));
## --- a list --
str(data);
The retrieved data is a list containing:
- all_species: a data.frame that contains all species present in >= 2 runs with median relative abundance >= 0.01%,
- all_genus: a data.frame that contains all genera present in >= 2 runs with median relative abundance >= 0.01%,
- metadata: a list containing additional statistics:
  - loaded_samples: number of qualified runs for which relative abundance data are available,
  - all_species_count: number of all species,
  - retrieved_species_count: number of species in the data.frame all_species,
  - all_genus_count: number of all genera,
  - retrieved_genus_count: number of genera in the data.frame all_genus.
With the retrieved data, users can plot the species prevalence in phenotypes and species prevalence in samples, as shown below:
## -- species prevalence in samples
plot(density( data$all_species$pct_of_all_samples ));
## -- genus prevalence in samples
plot(density( data$all_genus$pct_of_all_samples ));
## -- species prevalence in phenotypes --
plot(density( data$all_species$nr_phenotypes ));
## -- genus prevalence in phenotypes --
plot(density( data$all_genus$nr_phenotypes ));
See https://gmrepo.humangut.info/species for more details.
Get summary information of the prevalence and relative abundance of a species/genus in all associated phenotypes
data input: NCBI taxonomy ID of a species/genus,
data output: a data.frame
query <- POST("https://gmrepo.humangut.info/api/getPhenotypesAndAbundanceSummaryOfAAssociatedTaxon", body = list( "ncbi_taxon_id" = 40520 ), encode = "json");
retrieved_contents <- content( query );
data <- fromJSON( xml_text( retrieved_contents ));
## --- a data.frame --
head(data);
See the first table at https://gmrepo.humangut.info/species/40520 for details.
Get detailed information of the prevalence and relative abundance of a species/genus in all associated phenotypes
data input: NCBI taxonomy ID of a species/genus,
data output: a list
query <- POST("https://gmrepo.humangut.info/api/getAssociatedPhenotypesAndAbundancesOfATaxon", body = list( "ncbi_taxon_id" = 40520 ), encode = "json");
retrieved_contents <- content( query );
data <- fromJSON( xml_text( retrieved_contents ));
## --- a list --
str(data);
The retrieved data is a list containing:
- phenotypes_associated_with_taxon: a data.frame containing summary information on the associated phenotypes,
- taxon: a list containing detailed information about this taxon, such as scientific name and taxonomic level,
- density_data_groupped: a list of data.frames, each containing abundance information of the current taxon in one associated phenotype; the number of data.frames equals the number of phenotypes the current taxon is associated with.
The retrieved data can be used to generate the plots at https://gmrepo.humangut.info/species/40520.
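Because density_data_groupped is a list with one data.frame per phenotype, plotting is often easier after stacking them into a single table. The sketch below shows one way to do that in base R on synthetic data. stack_frames is a hypothetical helper, the column name phenotype is made up for illustration, and the sketch assumes that all data.frames in the list share the same columns.

```r
## -- hypothetical helper: stack a list of data.frames into one table --
## the list names (or positions, if unnamed) are stored in a new column
stack_frames <- function( frames, id_col = "phenotype" ){
  if( length( frames ) == 0 ) return( data.frame() );
  ids <- names( frames );
  if( is.null( ids ) ) ids <- as.character( seq_along( frames ) );
  parts <- Map( function( df, id ){
    df[[ id_col ]] <- rep( id, nrow( df ) );
    df;
  }, frames, ids );
  out <- do.call( rbind, unname( parts ) );  ## assumption: identical columns
  rownames( out ) <- NULL;
  out;
}

## synthetic list for illustration only, not GMrepo data
fake_groups <- list(
  A = data.frame( x = c( 0.1, 0.2 ) ),
  B = data.frame( x = c( 0.3, 0.4, 0.5 ) )
);
stacked <- stack_frames( fake_groups );
str( stacked );
```

With real data, the call would be stack_frames( data$density_data_groupped ); whether the list elements are named depends on the API reply, which can be checked with names( data$density_data_groupped ).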
Two APIs are available here, namely getRunDetailsByRunID and getFullTaxonomicProfileByRunID. Their usage is the same (see below). However, getFullTaxonomicProfileByRunID returns the full taxonomic profiles at the species and genus levels, while getRunDetailsByRunID returns only the ten most abundant taxa and merges the rest into an 'Others' category.
input: run ID, e.g. ERR475468,
output: a list, see below:
query <- POST("https://gmrepo.humangut.info/api/getFullTaxonomicProfileByRunID", body = list( "run_id" = "ERR475468" ), encode = "json");
retrieved_contents <- content( query );
data <- fromJSON( xml_text( retrieved_contents ));
## --- a list --
str(data);
The retrieved data is a list containing:
- run: a list containing run metadata,
- species: a data.frame containing relative abundances of all species,
- genus: a data.frame containing relative abundances of all genera.
See https://gmrepo.humangut.info/data/run/ERR475468 for details.
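The "top ten plus Others" summary described for getRunDetailsByRunID can also be derived locally from a full profile. The sketch below illustrates the idea on synthetic data. collapse_to_top_n is a hypothetical helper, and the column names scientific_name and relative_abundance are assumptions made for illustration; the actual column names should be checked with str( data$species ).

```r
## -- hypothetical helper: keep the n most abundant taxa, merge the rest --
collapse_to_top_n <- function( profile, n = 10,
                               name_col = "scientific_name",
                               abund_col = "relative_abundance" ){
  ordered <- profile[ order( profile[[ abund_col ]], decreasing = TRUE ), , drop = FALSE ];
  is_top <- seq_len( nrow( ordered ) ) <= n;
  out <- data.frame( name = ordered[[ name_col ]][ is_top ],
                     abundance = ordered[[ abund_col ]][ is_top ],
                     stringsAsFactors = FALSE );
  if( any( !is_top ) ){
    ## sum everything below the cutoff into a single 'Others' row
    others <- data.frame( name = "Others",
                          abundance = sum( ordered[[ abund_col ]][ !is_top ] ),
                          stringsAsFactors = FALSE );
    out <- rbind( out, others );
  }
  out;
}

## synthetic profile for illustration only, not GMrepo data
fake_profile <- data.frame(
  scientific_name = paste0( "taxon_", 1:12 ),
  relative_abundance = c( 30, 20, 10, 9, 8, 7, 5, 4, 3, 2, 1.5, 0.5 ),
  stringsAsFactors = FALSE
);
collapsed <- collapse_to_top_n( fake_profile );
print( collapsed );
```

The result has at most n + 1 rows, and the abundances still add up to the total of the input profile.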
data input: project ID, e.g. PRJNA489760, and a MeSH ID,
data output: a list
# Choose ONE of the two params definitions below; as written, the second overrides the first.
# Option 1: get relative species/genus abundances for all phenotypes in the project
params <- list( "project_id"="PRJNA489760","mesh_id" = "" )
# Option 2: get relative species/genus abundances for one of the phenotypes in the project
params <- list( "project_id"="PRJNA489760","mesh_id" = "D006262" )
# Query data
query <- POST("https://gmrepo.humangut.info/api/getMicrobeAbundancesByPhenotypeMeshIDAndProjectID", body = params, encode = "json")
retrieved_contents <- content( query )
data <- fromJSON( xml_text( retrieved_contents ))
## --- a list --
str(data);
The retrieved data is a list containing:
- project_info: a list containing project information,
- disease_info: a list containing disease information,
- abundance_and_meta_data: a data.frame containing the relative abundances for the project.
Although it is possible to download projects and runs through our RESTful API, it is highly recommended to download them from our website, or use the following URLs:
- download all projects: https://gmrepo.humangut.info/Downloads/AllSummaryData/all_projects_metadata.tsv.gz,
- download all runs associated with a project: https://gmrepo.humangut.info/Downloads/RunsByProjectID/all_runs_in_project_PRJEB6070.tsv.gz; please replace PRJEB6070 with any other project ID of interest,
- for other downloads, please consult the Data downloads section of the Help page: https://gmrepo.humangut.info/help.
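The per-project download URL follows a fixed pattern, so it can be built from a project ID and the file read directly into R. The following is a minimal sketch: project_runs_url and read_project_runs are hypothetical helpers, and it assumes that the downloaded file is a tab-separated table with a header row, as the .tsv.gz extension suggests.

```r
## -- hypothetical helpers for the per-project download URL listed above --

## build the download URL for all runs of a project
project_runs_url <- function( project_id ){
  sprintf( "https://gmrepo.humangut.info/Downloads/RunsByProjectID/all_runs_in_project_%s.tsv.gz", project_id );
}

## download the file to a temporary location and read it
## (assumption: tab-separated with a header row)
read_project_runs <- function( project_id ){
  tmp <- tempfile( fileext = ".tsv.gz" );
  on.exit( unlink( tmp ) );
  download.file( project_runs_url( project_id ), tmp, mode = "wb", quiet = TRUE );
  read.delim( tmp, stringsAsFactors = FALSE );  ## gzip files are decompressed transparently
}
```

For example, runs <- read_project_runs( "PRJEB6070" ) would fetch the file named in the list above.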