This is a performance refactor of the original HMMRATAC by Evan D. Tarbell and Tao Liu at https://github.com/LiuLabUB/HMMRATAC This refactor does NOT affect output accuracy and output should be the same as with the original program. Model training speed has been increased by a factor 70 and memory usage issues have been fixed. New features have been added and the original features have been fixed where needed. This includes support for .csi files, processing large files, more output options and non-binary model files.
Accurate Data Wanted To increase program accuracy, files with fully known peaks are highly appreciated to use for further testing and accuracy increases.
Quick Start
Assume that you have a BAM file from aligner such as bwa mem
named ATACseq.bam
.
-
Sort the BAM file to get a
ATACseq.sorted.bam
file:samtools sort ATACseq.bam -o ATACseq.sorted.bam
-
Make index from the BAM file to get a
ATACseq.sorted.bam.bai
file:samtools index ATACseq.sorted.bam ATACseq.sorted.bam.bai
-
Make genome information (chromosome sizes) from the BAM file to get a
genome.info
file:samtools view -H ATACseq.sorted.bam| perl -ne 'if(/^@SQ.*?SN:(\w+)\s+LN:(\d+)/){print $1,"\t",$2,"\n"}' > genome.info
-
Run HMMRATAC on the sorted BAM
ATACseq.sorted.bam
, the BAM index fileATACseq.sorted.bam.bai
, and the genome information filegenome.info
:java -jar HMMRATAC_V2.0.0.jar -b ATACseq.sorted.bam -i ATACseq.sorted.bam.bai -g genome.info
-
Filter HMMRATAC output by the score, if desired. Score threshold will depend on dataset, score type and user preference. A threshold of 10 would be:
awk -v OFS="\t" '$13>=10 {print}' NAME_peaks.gappedPeak > NAME.filteredPeaks.gappedPeak
To filter the summit file by the same threshold:
awk -v OFS="\t" '$5>=10 {print}' NAME_summits.bed > NAME.filteredSummits.bed
NOTE: HMMRATAC will report all peaks that match the structure defined by the model, including weak peaks. Filtering by score can be used to retain stronger peaks. Lower score = higher sensitivity and lower precision, Higher score = lower sensitivity and higher precision.
Samtools can be downloaded here: http://www.htslib.org/download/
Be sure to run HMMRATAC using the executable file, found here: https://github.com/Mikxox/HMMRATAC/releases For details on HOW to run HMMRATAC, see HMMRATAC_Guide.md, which contains a thorough runthrough of all parameters, output files and input requirements and troubleshooting.
HMMRATAC requires paired-end data. Single-end data will not work. HMMRATAC is designed to process ATAC-seq data that hasn't undergone
any size selection, either physical or in silico. This should be standard practice for any ATAC-seq analysis. Size selected data can be
processed by HMMRATAC (see HMMRATAC_Guide.md on --trim
option).
If you use HMMRATAC in your research, please cite the following paper:
- Evan D. Tarbell and Tao Liu, HMMRATAC: a Hidden Markov ModeleR for ATAC-seq, Nucleic Acids Res. 2019 Jun 14. doi: 10.1093/nar/gkz533
- Adding an extra reference to this performance refactor is appreciated but not required.