Skip to content

pliu55/pRSEM_demo

Repository files navigation

README for pRSEM demo

Table of Contents


Introduction

Prior-enhanced RSEM (pRSEM) is an RNA-seq quantification method that utilizes external data for the task of transcript abundance estimation. The workflow of pRSEM is illustrated in the following figure.

alt text

This repository is a mini-example for running pRSEM. It contains all the required software packages, input files, and scripts. More details is described below. This demo runs in 4 threads. The installation and running will take about 20 to 30 minutes on a 4 x 2.4GHz core machine depending on which R/Bioconductor libraries users have already installed.

Download

There are two ways to download this demo and all three required submodules

  • Download the tar ball or zip file via https. They are also listed on the release page with name pRSEM_demo_and_AllSubmodules
  • Use a git command: git clone --recursive git@github.com:pliu55/pRSEM_demo

System Requirements

  • Linux
  • Hard drive space > 2.5G
  • Perl version >= 5.8.8
  • Python version >= 2.7.3
  • R version >= 3.3.1
  • POSIX Threads
  • Boost (C++ libraries)

Submodules

This demo requires three submodules:

  • pRSEM: prior-enhanced RSEM
  • STAR: aligner for RNA-seq reads, version 2.5.2b
  • Bowtie: aligned for ChIP-seq reads, version 1.0.1

Usage

Go to this demo's folder and type

./run_pRSEM_demo.sh

This script will carry out the following tasks:

  1. Install Bowtie, STAR, RSEM, and all required libraries not yet installed by users
  2. Prepare genome references for Bowtie, STAR, and RSEM
  3. Derive prior parameters from RNA Polymerase II ChIP-seq data and use them to quantify RNA-seq data
  4. Derive prior parameters from a combination of four histone modfication ChIP-seq data sets and use them to quantify RNA-seq data
  5. Perform a testing procedure using a combination of four histone modification ChIP-seq data sets as the external data
  6. Perform a testing procedure using RNA Polymerase II ChIP-seq peaks as the external data

Input

All of the following data sets are under the folder input/. The RNA-seq and PolII ChIP-seq data were derived from ENCODE2 mouse Mel cell line. Although they are derived from a cell line rather than from tissue, we named them with keyword mmliver just to be consistent with the examples given in RSEM's documentation. The four histone modification ChIP-seq data sets were derived from Lara-Astiaso and Weiner et al. Science 2014 345:943.

Output

All output files will be stored in the following four folders under output/:

  • genome_references/: genomic index files for Bowtie, STAR, and RSEM

  • quantification_results_PolII/: output files from pRSEM's expression-calculation step by using prior from Pol II ChIP-seq data. They are organized in the same way as RSEM's output. The only difference is that there are five files specifically generated by pRSEM under folder output/quantification_results_PolII/demo.stat/:

    • demo_prsem.all_tr_features: isofrom features to derive and assign pRSEM prior. The first line is a header and the rest is one isoform per line. The description for each column is:

      column name descrption
      trid transcript ID from input annotation
      geneid gene ID from input anntation
      chrom isoform's chromosome name
      strand isoform's strand name
      start TSS if isoform is on '+' strand; TES if isoform is on '-' strand; where TSS stands for isoform's transcription start site, i.e. 5'-end, and TES stands for isoform's transcript end site, i.e. 3'-end
      end TES if isoform is on '+' strand; TSS if isoform is on '-' strand
      tss_mpp average mappability of [TSS-500bp, TSS+500bp]
      body_mpp average mappability of (TSS+500bp, TES-500bp)
      tes_mpp average mappability of [TES-500bp, TES+500bp]
      pme_count isoform's fragment or read count from RSEM's posterior mean estimates
      tss isoform's TSS
      tss_pk equal to 1 if isoform's [TSS-500bp, TSS+500bp] region overlaps with a RNA Pol II peak; 0 otherwise
      is_training equal to 1 if isoform is in the training set where Pol II prior is learned; 0 otherwise
      partition group number that isoform was partitioned to
    • demo_prsem.all_tr_prior: prior parameters for every isoform. This file does not have a header. Each line contains an isoform's prior parameter and its transcript ID delimited by #

    • demo_prsem.pval_LL: contains a p-value indicating if Pol II ChIP-seq data is informative for RNA-seq quantification and a log-likelihood denoting the fitness of pRSEM's Dirichlet-multinomial model to Pol II-partitioned training set isoforms

    • demo_uniform_prior_1.gene.results: RSEM's posterior mean estimates on the gene level with an initial pseudo-count of one for every isoform

    • demo_uniform_prior_1.isoform.results: RSEM's posterior mean estimates on the isoform level with an initial pseudo-count of one for every isoform

  • quantification_results_histone/: output files from pRSEM's expression-calculation step by using prior from four histone modification ChIP-seq data. They are very similar to those in quantification_results_PolII/. The only difference are in two files:

    • demo_prsem.all_tr_features: similar to quantification_results_PolII/demo.stat/demo_prsem.all_tr_features. It does not have the column tss_pk, but a few more new columns as shown below:

      column name description
      pme_TPM isoform abundance from RSEM's posterior mean estimates
      H3K27Ac normalized H3K27Ac signals (in RPKM) of isoform's [TSS-500, TSS+500]
      H3K4me1 normalized H3K4me1 signals (in RPKM) of isoform's [TSS-500, TSS+500]
      H3K4me2 normalized H3K4me2 signals (in RPKM) of isoform's [TSS-500, TSS+500]
      H3K4me3 normalized H3K4me3 signals (in RPKM) of isoform's [TSS-500, TSS+500]
      prd_expr_prob probability of being expressed from a logistic regression model
      prior isoform's prior parameter
    • demo_prsem.lgt_mdl.RData: It stores an R logistic regression object, which was obtained based on four histone modification signals and fragment counts of isoforms in the training set.

  • testing_procedure_histone_PolII/: output files from pRSEM's testing procedure. Test results are saved in demo.all.pval_LL. All the other files are similar to those in quantification_results_PolII/ and quantification_results_histone/.

    • demo.all.pval__LL: contains three lines
      • line 1: Header
      • line 2: Testing procedure result when using four histone modification ChIP-seq data as the external data for RNA-seq quantification
      • line 3: Testing procedure result when using Pol II peaks as the external data for RNA-seq quantification

Please note that, in order to shorten the running time as much as possible, the input ChIP-seq and RNA-seq files were prepared in extremely small (and unrealistic) sizes, and Gibbs sampling were set to run in just 100 instead of the default 1000 steps. As a result, the final quantification results may vary from time to time.

Contact

Got a question? Please post it at RSEM's GitHub Issues page with @pliu55 mentioned.

License

This demo is licensed under the GNU General Public License v3.