The goal of this project is to create a reproducible analysis pipeline for sequencing data that yields stranded information. This is a fork of my ChIPseq_pipeline but is developed independently.
This pipeline aims to support data from:
- bisulfite sequencing
- NET-seq
- RNA-seq read coverage
- Any other stranded NGS datatype
This pipeline is focused on:
- Short-read illumina sequencing data
- Smaller genome sizes such as bacterial genomes
Starting from raw fastq files, this pipeline does the following:
- preprocessing to remove adapters and low-quality sequences using cutadapt and trimmomatic
- alignment to (a) reference genome(s) using bowtie2
- read coverage calculation, with a variety of normalization options within or between samples, using deeptools and custom python scripts
- peak calling using either macs2 or a custom python implementation of CMARRT
- variant calling against a reference genome on ChIP input samples using breseq
In addition, this pipeline uses multiqc to compile the following quality control metrics into an interactive html report:
- Read QC using fastqc both before and after preprocessing
- A large number of ChIP quality control metrics calculated by deeptools
This pipeline has three dependencies:
- The package manager miniconda
- The workflow management system snakemake
- The API for working with Portable Encapsulated Projects (PEP), peppy

miniconda can be installed following the installation instructions for your system here. Once miniconda is installed, both snakemake and peppy can be installed into their own environment using:
conda create -n ChIPseq_pipeline snakemake=5.24.2 peppy
conda activate ChIPseq_pipeline
Note: If you are using a computational cluster that requires job management software, you may want to install that in your environment as well. For example, if you are using an htcondor-managed server, you would instead create your environment like so:
conda create -n ChIPseq_pipeline snakemake=5.24.2 peppy htcondor=8.9.5
conda activate ChIPseq_pipeline
Now you can pull the pipeline from github using:
git clone --recurse-submodules https://github.com/mikewolfe/ChIPseq_pipeline/
You can then change into the newly cloned ChIPseq_pipeline directory and test your installation with:
snakemake --use-conda --cores 10
Or, if using a cluster with job management software, you can run this with a cluster-specific profile. For example:
snakemake --use-conda --cores 10 --profile htcondor
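Here `htcondor` names a profile: a directory (for example `~/.config/snakemake/htcondor/`) containing a `config.yaml` of default command-line options. Below is a minimal sketch with hypothetical values; the Snakemake-Profiles project provides a full cookiecutter template for htcondor that also handles job submission:

```yaml
# ~/.config/snakemake/htcondor/config.yaml -- hypothetical minimal profile.
# Keys are long-form snakemake command-line options; values are their arguments.
use-conda: true   # always build/use the per-module conda environments
jobs: 100         # cap on the number of jobs submitted to the cluster at once
```

Either invocation runs the same workflow; the profile only changes the default options and how jobs are dispatched.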
This will run the entire pipeline using the provided test data consisting of small example fastqs heavily downsampled from real ChIP data.
The first time you run the pipeline it will need to create dedicated conda environments for each module, which will take some time. Afterwards, it will run quickly. For more information on using conda with snakemake, including how to set things up to run offline, check out the documentation here.
If everything runs smoothly you can then clean up and remove the results from the test data using:
snakemake clean_all --cores 1
This pipeline uses snakemake to manage the workflow, and familiarity with snakemake will help you get the most out of the pipeline. If you are unfamiliar with this workflow management system, snakemake has an excellent tutorial that can be found here.
This pipeline takes as input a Portable Encapsulated Project (PEP), which is essentially a .csv of samples and metadata together with a .yaml file allowing for extensions to the existing metadata.
The following fields are required for this pipeline:
- sample_name - a unique identifier for each sample
- filenameR1 - the base file name for read 1 of a set of paired fastq files
- filenameR2 - the base file name for read 2 of a set of paired fastq files
- file_path - the path to where the files for a given sample live
- input_sample - which unique sample_name acts as the input for each extracted sample. Input samples should leave this field blank.
- genome - which reference genome this sample should be aligned to
An example of a sample sheet is included at pep/test_samples.csv.
Additionally, the sample sheet can be augmented with a required config.yaml
file. In the included test example this is used to replace the file_path
field with a specific location. This example can be found at pep/config.yaml.
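For illustration, here is a minimal sketch of what such a PEP config can look like, using PEP's `sample_modifiers.derive` mechanism, in which values in a sample-sheet column are keys that get replaced by the matching `sources` entry. The file names and the `local` source key here are hypothetical; see pep/config.yaml for the real example:

```yaml
# pep/config.yaml -- hypothetical sketch of a PEP 2.0 project config,
# assuming a samples.csv whose header row is:
#   sample_name,filenameR1,filenameR2,file_path,input_sample,genome
pep_version: 2.0.0
sample_table: samples.csv    # path to your sample sheet, relative to this file
sample_modifiers:
  derive:
    attributes: [file_path]  # columns whose values are treated as source keys
    sources:
      # every sample whose file_path column reads "local" gets this path
      local: "/path/to/your/fastq/directory/"
```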
The pep/config.yaml should be edited to point towards your pep/samples.csv file. If you create your samples.csv file with Excel, be sure to save it as comma separated values and not any of the other encodings for .csv.
The pipeline itself, including the parameters for specific tools, is controlled by a .yaml file at config/config.yaml. The included config.yaml has all possible options specified, with comments describing what each option controls.
The pipeline is organized into modules each of which runs a specific task needed for ChIP-seq analysis.
- workflow/rules/preprocessing.smk includes rules for trimming raw reads for adapters and quality
- workflow/rules/alignment.smk includes rules for aligning samples to their reference genome
- workflow/rules/coverage_and_norm.smk includes rules for calculating read coverage over the genome and performing within and between sample normalization
- workflow/rules/peak_calling.smk includes rules for calling ChIP-seq peaks
- workflow/rules/quality_control.smk includes rules for performing and summarizing quality control on the reads themselves, as well as ChIP-seq-specific quality control
- workflow/rules/postprocessing.smk includes rules for getting summaries of ChIP signals over specified regions or locations
- workflow/rules/variant_calling.smk includes rules for checking for mutations against the reference genome. Typically run on ChIP input samples. Will take a while to run.
Each of these modules can be run individually using:
snakemake run_module_name --use-conda --cores 10
For example:
snakemake run_preprocessing --use-conda --cores 10
Additionally, to remove the output of a given module, run:
snakemake clean_module_name --use-conda --cores 1
For example:
snakemake clean_preprocessing --use-conda --cores 1
Many of the later modules depend on earlier modules, and running a later module will automatically run the required rules from the earlier modules. For example, `snakemake run_peak_calling --use-conda --cores 10` will first run any preprocessing, alignment, and coverage rules it depends on.
If you run into any issues with the pipeline and would like help, please submit them to the Issues page. Please include your config/config.yaml file, your pep/config.yaml file, your pep/samples.csv file, and the output from snakemake that includes your error.
Currently at version 0.0.1.
See the Changelog for version history and upcoming features.