-
Notifications
You must be signed in to change notification settings - Fork 40
ChIP seq
This document explains how to run piPipes ChIP-seq pipeline and how to interpret the output.
This pipeline provides analysis on the abundance of a specific chromatin signal on genes and transposons using single-end or paired-end ChIP-seq reads generated by Next Generation Sequencing.
The ChIP-seq pipeline contains two modes: single-sample mode and dual-sample mode. Please see the examples below.
# require internet access
piPipes install -g dm3
ChIP-seq usually contains two libraries for each experiment: input control and immunoprecipitates.
# use fastq-dump from SRATools (http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software)
# to download data and convert to fastq; require internet access
# --split-3 split the left and right read from paired-end ChIP-seq data
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.input.nos_piwi SRR646594
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.H3K9me3.nos_piwi SRR646595
piPipes chip
# or
piPipes chip -h
# -l: left read of ChIP-seq IP
# -r: right read of ChIP-seq IP
# -L: left read of ChIP-seq input
# -R: right read of ChIP-seq input
# -g: genome to use
# -o: output directory
piPipes chip \
-l Hannon.ChIP.H3K9me3.nos_piwi_1.fastq.gz \
-r Hannon.ChIP.H3K9me3.nos_piwi_2.fastq.gz \
-L Hannon.ChIP.input.nos_piwi_1.fastq.gz \
-R Hannon.ChIP.input.nos_piwi_2.fastq.gz \
-g dm3 \
-B \
-o Hannon.ChIP.H3k9me3.nos_piwi.piPipes_out \
1> Hannon.ChIP.H3k9me3.nos_piwi.piPipes.stdout \
2> Hannon.ChIP.H3k9me3.nos_piwi.piPipes.stderr
# -B: broad peak; used for H3K9me3; for transcriptional factor, don't use this option
# -x: extend this many nucleotide on both sides of genomic features for TSS/TES analysis
# Note that for meta-analysis of gene body, the extended length is proportional to the size
# of the gene body (1:1:1) so this parameter does not influence meta-gene
# -M: Path to BED files for meta-plot analysis on user-defined regions.
# The files should be delimited by comma.
# the following three options toggle the usage of multi-mappers (unambiguous mappers)
# -u: Only use unique mappers. default: on
# -m: Use both unique and multi-mappers. For multi-mappers, let Bowtie2 randomly report one locus
# from the best alignment pool. default: off
# -e: Use both unique and multi-mappers. For multi-mappers, use Expectation–Maximization algorithm
# implemented by CSEM to allocate them. Only alignments passing 0.5 CSEM posterior are kept.
piPipes chip \
-l Hannon.ChIP.H3K9me3.nos_piwi_1.fastq.gz \
-r Hannon.ChIP.H3K9me3.nos_piwi_2.fastq.gz \
-L Hannon.ChIP.input.nos_piwi_1.fastq.gz \
-R Hannon.ChIP.input.nos_piwi_2.fastq.gz \
-g dm3 \
-B \
-x 100 \
-m \
-M region_of_interest1.bed,region_of_interest2.bed \
-o Hannon.ChIP.H3k9me3.nos_piwi.piPipes_out \
1> Hannon.ChIP.H3k9me3.nos_piwi.piPipes.stdout \
2> Hannon.ChIP.H3k9me3.nos_piwi.piPipes.stderr
The output folder should contain the following folders:
# folder raw data for mega analysis
aggregate_output/
# bigWig files for UCSC genome browser
bigWig/
# BAM files for genome mapping
genome_mapping/
# outputs from MACS2, the peak calling program used
macs2_peaks_calling/
# log
nos_piwi.callpeak.log
# pdfs for meta-analysis
pdfs/
genome_mapping/
contains bam
files for genomic alignments. Note that input and IP are aligned separately.
** Based on the option of -u
-m
-e
, the bam file could contain different conteins**
- if
-u
is used (by default), reads are mapped to the genome bybowtie2
, the bam file contains only the unique mapper - if
-m
is used, for each of the multi-mapper reads,bowtie2
will randomly report one alignment from the best alignments. - if
-e
is used, reads are aligned to the genome usingbowtie
with-a -m 100 --best --strata
(to report best alignments for multi-mappers with <= 100 loci). Then the bam file will be used bycsem
, which uses Expectation–maximization algorithm to allocate multi-mappers.piPipes
then filter the alignments and only keep the ones with crem posterior higher than 0.5. The reason to usebowtie
instead ofbowtie2
is thatcrem
currently is incompatible withbowtie2
output.
Note that bowtie
is fundamentally a string matching algorithm, it can allow up to 3 mismatches and no indels.
bowtie2
uses score matrix based alignment algorithm and are tolerable with more mismatches and indels.
Thus for long reads (>70), we recommend using -u
. For targets enriched in repetitive region, including H3K9me2/3 or Rhino,
we recommend to use -m
so that we don't lose important information.
# alignments for the input
Hannon.ChIP.H3K9me3.dm3.Input.b2.log
Hannon.ChIP.H3K9me3.dm3.Input.b2.sorted.bam
Hannon.ChIP.H3K9me3.dm3.Input.b2.sorted.bam.bai
# alignments for the IP
Hannon.ChIP.H3K9me3.dm3.IP.b2.log
Hannon.ChIP.H3K9me3.dm3.IP.b2.sorted.bam
Hannon.ChIP.H3K9me3.dm3.IP.b2.sorted.bam.bai
macs2_peaks_calling/
contains output files from MACS2
.
# bedGraph files for IP and input alignments
Hannon.ChIP.H3K9me3_treat_pileup.bdg
Hannon.ChIP.H3K9me3_control_lambda.bdg
# loci of peaks called by MACS2
Hannon.ChIP.H3K9me3_peaks.xls
# information on the "broad" regions called by MACS2 in BED format
Hannon.ChIP.H3K9me3_peaks.broadPeak
# information on both the "broad" regions and narrow peaks
Hannon.ChIP.H3K9me3_peaks.gappedPeak
# log for peak calling, --SPMR option is used so the signal was normalized
Hannon.ChIP.H3K9me3.callpeak.SPMR.log
# bedGraph files for signal enrichment (IP over input) with three different methods
Hannon.ChIP.H3K9me3.ppois.bdg
Hannon.ChIP.H3K9me3.FE.bdg
Hannon.ChIP.H3K9me3.logLR.bdg
bigWig/
contains bigWig
files for UCSC genome browser
# bigWig files for IP/input signal enrichment from three different methods.
Hannon.ChIP.H3K9me3.ppois.bigWig
Hannon.ChIP.H3K9me3.FE.bigWig
Hannon.ChIP.H3K9me3.logLR.bigWig
aggregate_output/
contains accumulated signal around start site (TSS), end site (TES), and the whole region (meta).
They are generated by bwtool aggregate
.
By default, piPipes extends 100 bp around the start/end and calculate the accumulated signals. The size can be toggled using -x
option.
For meta-gene analysis, the extension length is the same as the gene body, so it is not affected by -x
.
# piPipes used three different methods to calculate the signal enrichment of IP/input,
# Poisson P value, Folder Enrichment and logLR.
# for each position from -100 to 100
# the data is used to draw pdf
$ head piRNA_Cluster_42AB.starts.txt
piRNA_Cluster_42ABstart Poisson P value -100 0.121330
piRNA_Cluster_42ABstart Fold Enriched -100 0.452620
piRNA_Cluster_42ABstart logLR -100 -0.115620
piRNA_Cluster_42ABstart Poisson P value -99 0.121330
piRNA_Cluster_42ABstart Fold Enriched -99 0.452620
piRNA_Cluster_42ABstart logLR -99 -0.115620
piRNA_Cluster_42ABstart Poisson P value -98 0.121330
piRNA_Cluster_42ABstart Fold Enriched -98 0.452620
piRNA_Cluster_42ABstart logLR -98 -0.115620
# download all the data
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.input.nos_white SRR646592
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.H3K9me3.nos_white SRR646593
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.input.nos_piwi SRR646594
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.H3K9me3.nos_piwi SRR646595
# run single-sample pipeline for control sample
piPipes chip \
-l Hannon.ChIP.H3K9me3.nos_white_1.fastq.gz \
-r Hannon.ChIP.H3K9me3.nos_white_2.fastq.gz \
-L Hannon.ChIP.input.nos_white_1.fastq.gz \
-R Hannon.ChIP.input.nos_white_2.fastq.gz \
-g dm3 \
-c 24 \
-B \
-x 100 \
-o Hannon.ChIP.H3K9me3.nos_white.piPipes_out \
1> Hannon.ChIP.H3K9me3.nos_white.piPipes.stdout \
2> Hannon.ChIP.H3K9me3.nos_white.piPipes.stderr
# run single sample pipeline for experimental sample
piPipes chip \
-l Hannon.ChIP.H3K9me3.nos_piwi_1.fastq.gz \
-r Hannon.ChIP.H3K9me3.nos_piwi_2.fastq.gz \
-L Hannon.ChIP.input.nos_piwi_1.fastq.gz \
-R Hannon.ChIP.input.nos_piwi_2.fastq.gz \
-g dm3 \
-c 24 \
-B \
-x 100 \
-o Hannon.ChIP.H3K9me3.nos_piwi.piPipes_out \
1> Hannon.ChIP.H3K9me3.nos_piwi.piPipes.stdout \
2> Hannon.ChIP.H3K9me3.nos_piwi.piPipes.stderr
piPipes chip2 \
-a Hannon.ChIP.H3K9me3.nos_white.piPipes_out \
-b Hannon.ChIP.H3K9me3.nos_piwi.piPipes_out \
-g dm3 \
-c 24 \
-o Hannon.ChIP.H3K9me3.nos_white.vs.nos_piwi \
-A nos_white \
-B nos_piwi \
1> Hannon.ChIP.H3K9me3.nos_white.vs.nos_piwi.stdout \
2> Hannon.ChIP.H3K9me3.nos_white.vs.nos_piwi.stderr
# pdfs output
pdfs/
# peak calling output from sample 1, without in-library normalization
macs2_peaks_calling_no_normalization_nos_white/
# peak calling output from sample 2, without in-library normalization
macs2_peaks_calling_no_normalization_nos_piwi/
# differential peaks calling between sample 1 and sample 2
differential_peaks_calling/
# aggregate analysis on differentially expressed peaks identified
aggregate_output/
pdfs/
folder contains the TSS/TES/meta analysis on the newly identified regions that are
differentially enriched/depleted in mutant.
macs2_peaks_calling_no_normalization_nos_white/
contains peak calling of MACS2
on control sample
macs2_peaks_calling_no_normalization_nos_piwi/
contains peak calling of MACS2
on experimental sample
differential_peaks_calling/
contains outputs of MACS2
on differential peak calling
# log file
nos_.nos_white_vs_nos_piwi.bdgdiff.log
# common peaks enriched in IP over input for both samples, in BED format
nos_.nos_white_vs_nos_piwi_c3.0_common.bed
# peaks enriched in sample A, in BED format
nos_.nos_white_vs_nos_piwi_c3.0_cond1.bed
# peaks enriched in sample B, in BED format
nos_.nos_white_vs_nos_piwi_c3.0_cond2.bed
aggregate_output/
contains text outputs for TSS/TES/meta analysis on the newly identified regions that are
differentially enriched/depleted in mutant. The data has been used to draw the figures in pdfs/
folder.
Please see example figures from the Github Wiki page or our manuscript.
##Flowchart and example figures from our manuscript