This file contains instructions for manually implementing the first two steps of our pipeline (see pipeline.description.pdf for descriptions of the steps). Note that scAPA.shell.script.R automatically performs all five steps of the pipeline.
- FASTA and chromosome length files for human (hg19) and/or mouse (mm10). These can be downloaded from the UCSC website.
- Samtools version 0.1.19-96b5f2294a or above.
- Bedtools version v2.27.1-17-gf59480f or above.
- Homer version v4.8.3 or above. Homer's "makeTagDirectory" and "findPeaks" commands are used.
- UMI-tools
- Drop-seq_tools-2.2.0
- R version 3.5.3, or above.
The files used for this example are from Lukassen et al. The original FASTQ files can be obtained from GSE104556. Please note that the BAM files here have been downsampled to contain 25% of the reads, so the results will differ from those obtained by analyzing the full BAM files. Download the folder down.sampled.spermatogenesis from this link.
Here we make use of the scAPA package. Install and load it:
install.packages("devtools")
require(devtools)
devtools::install_github("ElkonLab/scAPA/scAPA")
require(scAPA)
- PCR duplicates are removed using UMI-tools dedup. As UMI-tools dedup requires that each line in the BAM file have a molecular barcode tag, Drop-seq tools is first used to filter the BAMs, leaving only reads for which Cell Ranger count produced a corrected molecular barcode (UB) tag.
for sample in down.sampled.SRR6129050 down.sampled.SRR6129051 ; do FilterBam TAG_RETAIN=UB I=${sample}.bam O=UB.${sample}.bam; done
- UMI tools requires indexed bams:
for sample in down.sampled.SRR6129050 down.sampled.SRR6129051 ; do samtools index UB.${sample}.bam; done
- UMI-tools dedup is then run with "--method=unique" so that the Cell Ranger-corrected molecular barcodes are used:
for sample in down.sampled.SRR6129050 down.sampled.SRR6129051 ;do umi_tools dedup -I UB.${sample}.bam -S dedup.${sample}.bam --method=unique --extract-umi-method=tag --umi-tag=UB --cell-tag=CB; done
- Homer is used to create a tag directory (Tagdirectory) from the duplicate-removed BAMs:
makeTagDirectory Tagdirectory dedup.down.sampled.SRR6129050.bam dedup.down.sampled.SRR6129051.bam
- Homer's findPeaks is used to identify peaks. By default, findPeaks adjusts reads to the center of their fragment. To avoid this, fragLength is set to the average read length. To find peaks of variable width, Homer is set to find peaks of width 50 nt with a minimum distance of 1 nt between peaks.
findPeaks Tagdirectory -size 50 -fragLength 100 -minDist 1 -strand separate -o Peakfile
- The function merge_peaks from the scAPA package uses bedtools to merge peaks that are less than 100 nt apart:
scAPA::merge_peaks(bedtools.path = bedtools.path, peaks.file = "Peakfile", path = "./")
Here bedtools.path is the path to the bedtools executable.
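For intuition, the merging that merge_peaks delegates to bedtools collapses same-strand peaks separated by fewer than 100 nt. A minimal pure-R sketch of that logic (illustrative only; merge_close_peaks is a hypothetical helper, not part of scAPA):

```r
# Hypothetical helper illustrating the merge rule: intervals on the same
# strand whose gap is less than max.gap nt are collapsed into one peak.
merge_close_peaks <- function(starts, ends, max.gap = 100) {
  o <- order(starts)
  starts <- starts[o]
  ends <- ends[o]
  m.start <- starts[1]
  m.end <- ends[1]
  out <- NULL
  for (i in seq_along(starts)[-1]) {
    if (starts[i] - m.end < max.gap) {
      m.end <- max(m.end, ends[i])          # close enough: extend the peak
    } else {
      out <- rbind(out, c(m.start, m.end))  # emit the finished merged peak
      m.start <- starts[i]
      m.end <- ends[i]
    }
  }
  rbind(out, c(m.start, m.end))
}
```

In the actual pipeline, bedtools performs this merge per strand, producing merge.peakfile.bed.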
- The scAPA function intersect_peaks uses bedtools to intersect Peakfile with a 3' UTR BED file to create a BED file of the 3' UTR peaks. The peaks are annotated according to their 3' UTR and their position within it.
peaks.bed <- scAPA::intersect_peaks(org = "Mm",
bed.name = "./merge.peakfile.bed",
path = "", bedtools.path = bedtools.path,
introns = F)
scAPA::write.bed(.x = peaks.bed, f = "./peaks.bed")
Adjacent pA sites may result in a single peak. To detect and separate such peaks, the R package mclust is used to fit a Gaussian finite mixture model with two components to the UMI counts distribution in the interval of each peak. The input to mclust, the UMI counts distribution, is prepared as follows:
- To obtain the union of reads from all samples, the duplicate-removed BAM files are merged.
samtools merge merged.bam dedup.down.sampled.SRR6129050.bam dedup.down.sampled.SRR6129051.bam
- BedGraph files are produced from this BAM, one for the plus strand and one for the minus strand, using bedtools genomecov. The following two commands produce a single file containing both strands and convert it into a BED-like format:
bedtools genomecov -ibam merged.bam -bg -strand + | awk 'BEGIN {OFS = "\t"}{print $1, $2, $3, $4, ".", "+"}' > merged.wig
bedtools genomecov -ibam merged.bam -bg -strand - | awk 'BEGIN {OFS = "\t"}{print $1, $2, $3, $4, ".", "-"}' >> merged.wig
- bedtools intersect is used to intersect the merged coverage file with the peaks' BED file.
bedtools intersect -s -wb -b peaks.bed -a merged.wig > intersected.wig
- The intersected file is read into R and converted to a list such that each element, corresponding to a specific peak, is a numeric vector whose values represent the read coverage observed across the peak. mclust is used to fit an equal-variance Gaussian finite mixture model with two components (G = 2, modelNames = "E") to each list element. If the predicted means of the two fitted Gaussian components are separated by more than three standard deviations and by at least 75 nt, the peak is split in two according to the mclust classification, and the peaks' BED file is edited accordingly (with corrected peak indices). All this is done by the function creat_mclus from the scAPA package.
peaks.wig <- read.delim(file = "intersected.wig", header = F)
peaks.wig <- split(x = peaks.wig, f = peaks.wig$V10, drop = T)
bed <- plyr::rbind.fill(lapply(1:length(peaks.wig),
FUN = scAPA::creat_mclus))
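The split rule itself can be sketched as a small decision function. In scAPA, the two means and the shared standard deviation come from the mclust fit (G = 2, modelNames = "E"); the helper below is hypothetical and not part of the package, and only restates the thresholds given above:

```r
# Hypothetical helper: decide whether a peak should be split in two,
# given the two fitted component means and the shared (equal-variance)
# standard deviation returned by mclust. Per the text, the means must
# be more than 3 SD apart and at least 75 nt apart.
split_decision <- function(mu1, mu2, sigma, min.sep.nt = 75, n.sd = 3) {
  sep <- abs(mu2 - mu1)
  (sep > n.sd * sigma) && (sep >= min.sep.nt)
}
```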
This step uses featureCounts to count the reads that overlap each peak in each cluster ("cell type"). A separate BAM file for each cell cluster is generated. First, Drop-seq tools is used to split the reads in each sample BAM into separate BAMs that correspond to the different clusters. This is done using FilterBamByTag together with text files, where each file contains the cell barcodes that belong to a certain cell cluster from a certain sample.
- The following is used to write the text files.
annotation.files <- c("clusters_down.sampled.SRR6129050.txt",
"clusters_down.sampled.SRR6129051.txt")
for (j in annotation.files) {
  sample <- gsub(pattern = "^clusters_|\\.txt$", replacement = "", x = j)
  annotations <- read.delim(file = j, header = F)
  list.cells <- annotations
  list.cells[, 1] <- as.character(list.cells[, 1])
  list.cells <- split(x = list.cells, f = list.cells[, 2], drop = T)
  for (i in 1:length(list.cells)) {
    cluster <- names(list.cells)[i]
    cluster.cells <- as.data.frame(list.cells[[i]][, 1])
    tagValuefile <- paste0("tag_value_file.", sample, "_", cluster, ".txt")
    scAPA::write.bed(.x = cluster.cells, f = tagValuefile)
  }
}
The object annotation.files contains the paths to the cell cluster annotation files.
- Now use FilterBamByTag:
for sample in SRR6129050 SRR6129051; do for cluster in ES RS SC; do FilterBamByTag TAG=CB I=dedup.${sample}.bam tagValuefile=tag_value_file.down.sampled.${sample}_${cluster}.txt O=down.sampled.${sample}_${cluster}.bam; done; done
- Next, the BAMs of the same cluster are merged:
for cluster in ES RS SC; do samtools merge ${cluster}.bam down.sampled.SRR6129050_${cluster}.bam down.sampled.SRR6129051_${cluster}.bam; done
We end up with 3 BAM files: ES.bam, RS.bam, and SC.bam.
The featureCounts function from the Rsubread package is used.
- The annotation file is a SAF data.frame edited from the peaks bed file.
utr.saf <- bed[, c(4, 1, 2, 3, 6)]
colnames(utr.saf) <- c("GeneID", "Chr", "Start", "End", "Strand")
largestOverlap = TRUE is specified so that reads spanning two peaks are counted according to their largest overlap.
counts <- Rsubread::featureCounts(files = c("ES.bam", "RS.bam", "SC.bam"), isGTFAnnotationFile = F,
strandSpecific = 1, annot.ext = utr.saf,
largestOverlap = T)
- The object counts is edited for downstream analysis:
co <- cbind.data.frame(rownames(counts$counts), counts$counts)
colnames(co) <- c("Peak_ID", "ES", "RS", "SC")
- We also produce an object storing the 200 nt downstream sequence of each peak, using the scAPA function read_down.seq.
aseq <- scAPA::read_down.seq(saf = utr.saf,
char.length.path = char.length.path,
fasta.path = fasta.path, chr.modify = T)
aseq <- aseq[,c(4, 6)]
Here char.length.path is the path to the chromosome lengths file, and fasta.path is the path to the FASTA file. We also produce an object with information regarding the genomic locations of the peaks:
row.Data <- counts$annotation[,c(2,3,4,1,6,5)]
- Finally, we use set_scAPAList from the scAPA package to create a scAPAList object for downstream analysis:
a <- scAPA::set_scAPAList(.clus.counts = co, .row.Data = row.Data,
.down.seq = aseq)
There is no need to match the order of the peaks in the three objects, as set_scAPAList does so automatically. For the next steps of the analysis, see the scAPA package vignette.
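Conceptually, the automatic matching performed by set_scAPAList amounts to reordering each object by its peak identifier, e.g. (illustrative only; align_by_peak is a hypothetical helper, not the package's internal code):

```r
# Hypothetical illustration: reorder row.data so its peaks appear in the
# same order as in the counts table, keyed on a shared peak ID column.
align_by_peak <- function(counts, row.data, id.col = "Peak_ID") {
  row.data[match(counts[[id.col]], row.data[[id.col]]), , drop = FALSE]
}
```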