Skip to content

SUmmarizing Multiple Enrichment analysis Results

Notifications You must be signed in to change notification settings

bzhanglab/sumer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SUMER

An R package for summarizing multiple enrichment analysis results

Installation

# install.packages("devtools")
devtools::install_github("bzhanglab/sumer")

Usage

The main function of the package is sumer(). This function takes two parameters. The first parameter is the name of a file in json format that, among other things, specifies the location of input data files for each omics platform. The second parameter specifies the location of output files. Currently, enrichment results from up to 7 platforms can be summarized with sumer().

sumer("/path/to/config.json", "/path/to/output_dir")

A sample configuration file is shown below:

{
  "project":"My study",
  "top_num": 50,
  "similarity": "Jaccard",
  "data": [
    {
      "platform_name":"rna-seq",
      "platform_abbr":"rna",
      "gmt_file":"/data/rna_seq.gmt",
      "score_file":"/data/rna_seq_score.txt"
    },
    {
      "platform_name":"proteomics",
      "platform_abbr": "pro",
      "gmt_file":"/data/proteomics.gmt",
      "score_file":"/data/proteomics_score.txt"
    },
    {
      "platform_name":"phospho-proteomics",
      "platform_abbr": "phospho",
      "gmt_file":"/data/phospho_proteomics.gmt",
      "score_file":"/data/phospho_proteomics_score.txt"
    }
  ]
}

The keys for the configuration file are explained in the following table.

name description
project a brief description for your project
top_num maximum number of top sets selected by set cover algorithm
similarity (optional) Similarity measure for gene sets, default is "Jaccard". The other supported measures are "Simpson", "Dice".
data an array of objects that describe the data for each platform
platform_name name of platform
platform_abbr abbreviation of the platform name
gmt_file path to the gmt file
score_file path to the score file

The gmt_file specifies a tab delimited file that provides gene set information. Each row describes a gene set where the first column lists the name of gene set and the optional second column a brief description of the set. The rest of columns list the genes included in the set. This is the same format as described here.

The score_file lists the measurement of significance for each gene set. There is one row for each gene set. The first column is the gene set name and second column is the score for the set. Typical examples of the score value are signed or signed . Columns are separated by tab.

The gene set names in gmt_file and score_file should match.

To run the sample data provided by the package, first make sure the working directory (/path/to/your_work_dir) and output directory (/path/to/your_output_dir) exist in your system.

> setwd("/path/to/your_work_dir")
> file.copy(system.file("data", "sample.tgz", package="sumer"), ".")
> untar("sample.tgz")
> sumer("config.json", "/path/to/your_output_dir")

The final output files are in your_output_dir. You can open index.html to explore the summarized results.

Citation

If you use the sumer package in your research, please cite the following paper:

Savage, S. R., Shi, Z., Liao, Y., Zhang, B. (2019). "Graph algorithms for condensing and consolidating gene set analysis results." Molecular & Cellular Proteomics, 18(8): S141-S152.

BibTeX Citation

@article{savage2019graph,
  title     = {Graph algorithms for condensing and consolidating gene set analysis results},
  author    = {Savage, Sara R and Shi, Zhiao and Liao, Yuxing and Zhang, Bing},
  journal   = {Molecular \& Cellular Proteomics},
  volume    = {18},
  number    = {8},
  pages     = {S141--S152},
  year      = {2019},
  publisher = {ASBMB}
}