General data usage tips
Efrat Muller edited this page Jun 27, 2022 · 5 revisions
Here are some general tips about how to use the data:
- Most of the datasets are from case-control studies, i.e. they consist of samples from individuals with a studied disease and samples from "healthy" controls. We call these two (or sometimes more) groups "study groups", and they are reported in each `metadata.tsv` file. Users should consider these study groups in any analysis they perform.
- Some of the datasets are from longitudinal studies, meaning that they include multiple samples per subject. Depending on the analysis, users may want to handle such samples differently, for example by keeping a single sample per subject or by explicitly accounting for the repeated measures.
- To relate metabolites across studies, users can use either HMDB or KEGG IDs, given in the `mtb.map` tables.
  - Note that some HMDB/KEGG annotations are marked as `High.Confidence.Annotation = FALSE`, indicating that the metabolite's identification should be used with caution. See Data processing details for more about the `High.Confidence.Annotation` flag.
  - Additionally, metabolite values (or presence/absence) cannot be compared directly across datasets, due to differences between metabolomic platforms. See the Limitations for further discussion of this topic.
- To compare microbial taxa across studies, genera tables can be used as is. When analyzing only shotgun datasets, species tables can also be used as is. All genus and species names follow the GTDB taxonomy.
- A simple example of a cross-study comparison using this data collection can be found in the following R notebook: meta-analysis_of_genus_metabolite_associations.Rmd. The rendered HTML of the R notebook can be viewed here.
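As an illustration of the study-group and longitudinal tips above, here is a minimal Python sketch that keeps one sample per subject and then counts samples per study group. The inline table stands in for a `metadata.tsv` file, and the column names `Sample`, `Subject` and `Study.Group` are assumptions made for this example; check the actual header of each dataset. Keeping the first listed sample is only one possible way to handle repeated samples, not a recommendation from the data collection.

```python
import csv
import io
from collections import Counter

# Made-up excerpt standing in for a metadata.tsv file.
# Column names (Sample, Subject, Study.Group) are assumptions for this sketch.
METADATA_TSV = (
    "Sample\tSubject\tStudy.Group\n"
    "S1\tP1\tControl\n"
    "S2\tP1\tControl\n"
    "S3\tP2\tCase\n"
    "S4\tP3\tCase\n"
    "S5\tP3\tCase\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def first_sample_per_subject(rows):
    """Keep only the first listed sample of each subject."""
    seen = set()
    kept = []
    for row in rows:
        if row["Subject"] not in seen:
            seen.add(row["Subject"])
            kept.append(row)
    return kept


rows = read_tsv(METADATA_TSV)
one_per_subject = first_sample_per_subject(rows)

# Count the remaining samples in each study group.
group_counts = Counter(row["Study.Group"] for row in one_per_subject)
print(dict(group_counts))
```

For analyses that make use of the repeated measures, the full table would be kept instead and the subject identifier carried into the model.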
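The metabolite-matching tip above can be sketched in the same way: read the `mtb.map` table of two datasets, drop annotations flagged `High.Confidence.Annotation = FALSE`, and intersect the remaining HMDB IDs. The two inline tables are made-up stand-ins, and the `Compound` and `HMDB` column names, the `TRUE`/`FALSE` encoding of the flag, and `NA` as the missing-value marker are assumptions for this example.

```python
import csv
import io

# Made-up excerpts standing in for the mtb.map tables of two datasets.
# Compound labels and ID pairings are illustrative only.
MAP_A = (
    "Compound\tHMDB\tHigh.Confidence.Annotation\n"
    "mtb_a1\tHMDB0000039\tTRUE\n"
    "mtb_a2\tHMDB0000123\tFALSE\n"
    "mtb_a3\tHMDB0000619\tTRUE\n"
    "mtb_a4\tNA\tTRUE\n"
)
MAP_B = (
    "Compound\tHMDB\tHigh.Confidence.Annotation\n"
    "mtb_b1\tHMDB0000039\tTRUE\n"
    "mtb_b2\tHMDB0000123\tTRUE\n"
    "mtb_b3\tHMDB0000251\tTRUE\n"
)


def high_confidence_ids(map_text, id_column="HMDB"):
    """Map each high-confidence ID to the compound labels annotated with it."""
    ids = {}
    for row in csv.DictReader(io.StringIO(map_text), delimiter="\t"):
        identifier = row[id_column].strip()
        if identifier in ("", "NA"):
            continue  # no annotation for this compound
        if row["High.Confidence.Annotation"].strip().upper() != "TRUE":
            continue  # skip annotations flagged as low confidence
        ids.setdefault(identifier, []).append(row["Compound"])
    return ids


ids_a = high_confidence_ids(MAP_A)
ids_b = high_confidence_ids(MAP_B)

# IDs annotated with high confidence in both datasets.
shared = sorted(set(ids_a) & set(ids_b))
for identifier in shared:
    print(identifier, ids_a[identifier], ids_b[identifier])
```

Passing `id_column="KEGG"` would match on KEGG IDs instead, assuming the table has a column with that name. A shared ID only says that both datasets report the same metabolite; as noted above, the values themselves are still not directly comparable across datasets.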