Zhiao Shi edited this page Jul 20, 2021 · 8 revisions

Introduction

This pipeline processes RNA-Seq data generated with a total RNA-Seq strategy. One advantage of total RNA-Seq is that it captures and sequences both linear and circular mRNA isoforms in a single run. Most existing gene expression quantification tools and pipelines focus on polyA-enriched RNA-Seq data and do not consider circRNAs. To address this gap, we designed this total RNA-Seq analysis pipeline. It first identifies circRNAs from the total RNA-Seq data, then distributes the RNA-Seq reads among linear and circular mRNA isoforms to quantify their expression.

Pipeline summary

The pipeline performs the following steps:

  1. Build BWA index
  2. Map RNA-Seq reads to the genome using BWA. The resulting SAM file is used for circRNA calling
  3. Call circRNAs using CIRI
  4. Add gene names to the CIRI output using an in-house script
  5. Add circRNAs to the gene annotation in GTF format using an in-house script
  6. Extract both linear and circRNA transcripts using RSEM
  7. Convert the linear transcript of each circRNA to a pseudo-linear transcript. This step also removes transcripts shorter than the read length and any transcripts containing “N”
  8. Generate the transcript-to-gene mapping table for the RSEM index
  9. Build the RSEM index using the transcripts from step 7 and the mapping table from step 8
  10. Run RSEM to perform gene and transcript quantification
  11. Summarize RSEM output
  12. Combine and summarize the results from all samples
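
Step 7 can be illustrated with a minimal Python sketch. It is not the pipeline's in-house script. The function name `to_pseudo_linear` and the input layout are hypothetical, and the way the pseudo-linear transcript is built (appending the first `read_length - 1` bases to the end so that reads spanning the back-splice junction can align) is an assumption, since this page does not describe the conversion itself. The sketch also assumes the length and “N” filters apply to every transcript, linear or circular.

```python
def to_pseudo_linear(transcripts, read_length):
    """Filter transcripts and extend circular ones across the junction.

    transcripts: dict mapping a transcript name to (sequence, is_circular).
    Returns a dict mapping each kept transcript name to its final sequence.
    """
    kept = {}
    for name, (seq, is_circular) in transcripts.items():
        seq = seq.upper()
        # Filters described in step 7: drop transcripts shorter than the
        # read length and transcripts containing an ambiguous base "N".
        if len(seq) < read_length or "N" in seq:
            continue
        if is_circular:
            # ASSUMPTION: repeat the first (read_length - 1) bases at the end
            # so that a read crossing the back-splice junction has a target.
            seq = seq + seq[: read_length - 1]
        kept[name] = seq
    return kept


# Toy input with a read length of 5 (real reads are much longer).
example = {
    "circ1": ("ACGTACGT", True),
    "lin1": ("ACGTAC", False),
    "short": ("ACG", False),
    "withN": ("ACGNACGT", True),
}
result = to_pseudo_linear(example, read_length=5)
print(result)
```

In this toy run, `short` and `withN` are dropped, `lin1` is kept unchanged, and `circ1` is extended by its first four bases.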

Inputs

The pipeline can process FASTQ files stored in GDC, on the local file system, or in S3.

  • Input files are stored in GDC.

    • Sample catalog file. If this file is not provided, the pipeline uses the default catalog file for CPTAC3 samples. Otherwise, the user should prepare a catalog file in a similar format.
    • Case ID file. A list of the case IDs to be processed, one ID per row. The IDs must match the case column in the catalog file.
  • Input files are stored in the local file system or S3.

    • Sample catalog file in the following TSV (tab-separated) format.

      sample_name RNAseq_R1 RNAseq_R2
      sample1 /path/to/sample1_r1.fastq.gz /path/to/sample1_r2.fastq.gz
      sample2 /path/to/sample2_r1.fastq.gz /path/to/sample2_r2.fastq.gz
      ... ... ...

      Note: each path can be an absolute path in the local file system or an S3 path.

    • No case ID file is needed.
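
Before launching a run on local or S3 inputs, it can help to check that the sample catalog is well formed. The following Python sketch reads a tab-separated catalog with the three columns shown above. The function name `read_catalog` and the duplicate-name check are illustrative assumptions and are not part of the pipeline.

```python
import csv
import io

# Column names taken from the catalog format shown above.
REQUIRED_COLUMNS = ("sample_name", "RNAseq_R1", "RNAseq_R2")


def read_catalog(handle):
    """Parse a tab-separated sample catalog from an open text handle.

    Returns a dict mapping each sample name to its (R1, R2) FASTQ paths.
    Raises ValueError on missing columns, empty fields, or duplicate samples.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError("catalog is missing columns: " + ", ".join(missing))

    samples = {}
    for row in reader:
        name, r1, r2 = (row[col] for col in REQUIRED_COLUMNS)
        if not (name and r1 and r2):
            raise ValueError("catalog row has an empty field: %r" % (row,))
        if name in samples:
            # Hypothetical safeguard: treat repeated sample names as an error.
            raise ValueError("duplicate sample name: " + name)
        samples[name] = (r1, r2)
    return samples


# Usage with an in-memory catalog; for a real file, pass
# open(path, newline="") instead of the StringIO object.
catalog_text = (
    "sample_name\tRNAseq_R1\tRNAseq_R2\n"
    "sample1\t/path/to/sample1_r1.fastq.gz\t/path/to/sample1_r2.fastq.gz\n"
    "sample2\t/path/to/sample2_r1.fastq.gz\t/path/to/sample2_r2.fastq.gz\n"
)
catalog = read_catalog(io.StringIO(catalog_text))
print(sorted(catalog))
```

The sketch only checks the layout of the file. It does not verify that the local or S3 paths exist.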