Skip to content

A reproducible pipeline for processing NGS data from stranded workflows including RNA-seq and NET-seq

Notifications You must be signed in to change notification settings

mikewolfe/Strandedseq_pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Snakemake PEP compatible

A portable and reproducible pipeline for analyzing stranded read coverage

The goal of this project is to create a reproducible analysis pipeline for sequencing data that gives stranded information. This is a fork of my ChIPseq_pipeline but has independent development.

The goals of this pipeline are to support data from:

  • bisulfite sequencing
  • NET-seq
  • RNA-seq read coverage
  • Any other stranded NGS datatype

This pipeline is focused on:

  • Short-read illumina sequencing data
  • Smaller genome sizes such as bacterial genomes

What the pipeline does

Starting from raw fastq files this pipeline does the following:

  • preprocessing to remove adapters and low quality sequences using cutadapt and trimmomatic
  • alignment to (a) reference genome(s) using bowtie2
  • read coverage calculation with a variety of normalization options within or between samples with deeptools and custom python scripts
  • peak calling using either macs2 or a custom python implementation of CMARRT
  • variant calling against a reference genome on ChIP input samples using breseq

In addition this pipeline uses multiqc to compile the following quality control metrics into an interactive html report:

  • Read QC using fastqc both before and after preprocessing
  • A large number of ChIP quality control metrics calculated by deeptools

Installing the pipeline

This pipeline has three dependencies:

  • The package manager miniconda
  • The workflow management system snakemake
  • The API for working with Portable Encaspulated Projects (PEP) peppy

miniconda can be installed following the installation instructions for your system here.

Once miniconda is installed, both snakemake and peppy can be installed in their own environment easily using:

conda create -n ChIPseq_pipeline snakemake=5.24.2 peppy
conda activate ChIPseq_pipeline

Note If you are using a computational cluster that requires job management software, you may want to install that with your environment as well. For example, if you are using an htcondor-managed server you would instead create your environment like so:

conda create -n ChIPseq_pipeline snakemake=5.24.2 peppy htcondor=8.9.5
conda activate ChIPseq_pipeline

Now you can pull the pipeline from github using:

git clone --recurse-submodules https://github.com/mikewolfe/ChIPseq_pipeline/

And you can change into the newly cloned ChIPseq_pipeline directory and test your installation with:

snakemake --use-conda --cores 10

Or if using a cluster with job management software you can run this with an environment-specific profile. For example:

snakemake --use-conda --cores 10 --profile htcondor

This will run the entire pipeline using the provided test data consisting of small example fastqs heavily downsampled from real ChIP data.

The first time you run the pipeline it will need to create dedicated conda environments for each module which will take some time. Afterwards, it will run quickly. For more information on using conda with snakemake including how to set things up to run offline check out the documentation here.

If everything runs smoothly you can then clean up and remove the results from the test data using:

snakemake clean_all --cores 1

Running the pipeline

This pipeline uses snakemake to manage the workflow and familiarity with snakemake will help with getting the most out of the pipeline. Fortunately, snakemake has an excellent tutorial that can be found here if you are unfamiliar with this workflow management system.

Input

This pipeline takes as input a Portable Encaspulated Project (PEP) which is essentially a .csv of samples and metadata together with a .yaml file allowing for extensions to the existing metadata.

The following required fields are needed for this pipeline

  • sample_name - a unique identifier for each sample
  • filenameR1 - the base file name for read1 of a set of paired fastq files
  • filenameR2 - the base file name for read2 of a set of paired fastq files
  • file_path - the path to where the files for a given sample live
  • input_sample - which unique sample_name acts as the input for each extracted sample. Input samples should leave this field blank.
  • genome - what reference genome should this sample be aligned to

An example of a sample sheet is included at pep/test_samples.csv.

Additionally, the sample sheet can be augmented with a required config.yaml file. In the included test example this is used to replace the file_path field with a specific location. This example can be found at pep/config.yaml.

The pep/config.yaml should be edited to point towards your pep/samples.csv file. If you create your samples.csv file with excel be sure to save it as comma seperated values not any of the other encodings for .csv.

Configuration

The pipeline itself, including parameters controlling specific tools, is controlled by a .yaml file in config/config.yaml. The included config.yaml has all possible options specified with comments describing what those options control.

Rules

The pipeline is organized into modules each of which runs a specific task needed for ChIP-seq analysis.

Each of these rules can be run individually using:

snakemake run_module_name --use-conda --cores 10

For example:

snakemake run_preprocessing --use-conda --cores 10

Additionally to remove the output of a given module run:

snakemake clean_module_name --use-conda --cores 1

For example:

snakemake clean_preprocessing --use-conda --cores 1

Many of later modules are dependent on the earlier modules and running a later module will run the required rules in an earlier module automatically.

Issues with the pipeline

If you run into any issues with the pipeline and would like help please submit it to the Issues page. Please include your config/config.yaml file, your pep/config.yaml file, your pep/samples.yaml file, and the output from snakemake that includes your error.

Version history

Currently at version 0.0.1

See the Changelog for version history and upcoming features.

About

A reproducible pipeline for processing NGS data from stranded workflows including RNA-seq and NET-seq

Resources

Stars

Watchers

Forks

Packages

No packages published