Releases: jsh58/Genrich
Version 0.6.1
Version 0.6
- New default peak-calling parameters:
-p 0.01 -a 200.0
. This also means that, by default, q-values are not calculated. - In ATAC-seq mode, the ends of fragments are adjusted by +5/-5 to account for Tn5 transposase occupancy, when determining the centers of cut site intervals. This adjustment can be avoided with
-D
. - The number of warning messages of reads extending off the ends of chromosomes has been limited to 128 (per input file).
- A convenience option (
-S
) was added, which prevents Genrich from checking the sort order of the input SAM/BAM file. The input file is still parsed under the assumption that it is sorted by queryname, so this option should be used judiciously.
Version 0.5
-
The null statistical model was changed to the log-normal distribution. This was determined to be a more appropriate choice than other distributions (including the previously used exponential) from examining ChIP-seq control samples.
-
The minimum length (
-l
) is now a regular peak-calling parameter, not an alternative peak-calling method. The default value is 0bp. Those who prefer peak-calling by minimum length and not AUC can run-a 0 -l <int>
. -
Two convenience options were added:
-X
: option to skip peak-calling, but perform all alignment parsing (including identifying PCR duplicates)-P
: option to call peaks directly from a bedgraph-ish log file (-f
) produced by a previous Genrich run
Version 0.4
-
Option to remove PCR duplicates (
-r
). The procedure is explained here. Incorporates ideas fromfindDups9.py
, in particular the process of evaluating multimapping reads/fragments.- A log file listing duplicates can be produced via
-R <file>
.
- A log file listing duplicates can be produced via
-
A reasonably complete README was added to the homepage.
-
Input SAM/BAM files are now required to be queryname-sorted (
samtools sort -n
). -
A bug in alignment parsing (paired secondary alignments with matching positions) was fixed.
Version 0.3
-
New default peak-calling method: area under the curve (AUC). For a peak to be called, the total significance of the region must exceed a minimum value (
-a <float>
, default 20.0).- The total significance is calculated as the sum of the -log(q) values above the
-q
threshold over the length of the region (i.e. the area under the -log(q) "curve"). If a-p
threshold is specified, the area under the -log(p) curve is calculated. - The maximum gap parameter (
-g <int>
) still allows multiple regions to be linked. - No minimum length is required for a peak to be called.
- Can be overridden by specifying
-l <int>
, in which case peak-calling reverts to the previous method, with the given minimum length for peaks.
- The total significance is calculated as the sum of the -log(q) values above the
-
Option to provide a BED file of genomic regions to exclude from analysis (
-E <file>
).- The regions will affect peak calls, such that no peak may extend into or around an excluded region.
- The regions' lengths will be subtracted from the genome length calculated by the program.
- In the output log files, excluded regions will have treatment/control pileup values of
0.0
and p-/q-values ofNA
. - Multiple BED files can be specified, comma-separated (or space-separated, in quotes).
-
Accessory script
findNs.py
to produce a BED file of 'N' homopolymers from a fasta file (e.g. a reference genome). The output can (and should) be given to Genrich via-E
(above). -
Option to keep unpaired alignments, with lengths changed to a given value, has been changed to
-w <int>
(formerly-a <int>
).
Version 0.2
-
Multiple replicates are now analyzed separately, with p-values calculated for each. At each position, the multiple p-values are then combined by Fisher's method, before conversion to q-values.
-
Three optional output log files:
-f <file>
: with one replicate, it lists treatment/control pileups, p- and q-values, and significance for each interval; with multiple replicates, it lists p-values of each replicate, combined p-value, q-value, and significance for each interval.-k <file>
: for each replicate, sequentially, it lists a header line (# treatment file: <name>; control file: <name>
), followed by treatment/control pileups and a p-value for each interval. This is the way to examine pileup values with multiple replicates, since the-f
file will not supply them.-b <file>
: an unsorted BED file of the reads/fragments analyzed. The 4th column gives the read name, number of alignments, 'T'reatment or 'C'ontrol, and sample number (0-based), e.g.SRR5427885.57_2_T_0
.
-
Option to analyze reads in ATAC-seq mode (
-j
). Instead of analyzing full fragments, the program uses "cut site intervals" centered on the ends of fragments. The interval lengths are determined by the-d
parameter (def. 100bp). -
Control sample pileups are now scaled to match the treatment, based on fragment/interval lengths.
-
Fixed bug related to sorting alignments by scores.
-
Fixed bugs related to reference sequences (chromosomes) missing from a SAM/BAM file. Reads aligning to chromosomes either on the "skipped" list (
-e
argument) or not appearing in the header of the treatment SAM/BAM file will not be analyzed.
Version 0.1
- Analyzes secondary alignments:
- Keeps alignments whose alignment scores (
AS
) are within a specified value (-s <float>
) of the best alignment. The default value of-s 0
means that only alignments judged as equivalent by the aligner will be kept. - Each of the
n
alignments for a read/fragment is counted as1/n
for the pileup. - To avoid excessive memory usage and the imprecision inherent in floating-point values, a maximum of 10 alignments per read is analyzed. Reads with more than 10 alignments will be subsampled based on the best alignment scores; in the case of ties, alignments appearing first in the SAM/BAM are favored.
- Keeps alignments whose alignment scores (
- Input SAM/BAM files are required to be name sorted (
samtools sort -n
) - A simple hashtable is implemented for efficiently compiling p-values to calculate q-values. The q-value calculation is still based on the Benjamini-Hochberg procedure.
Version 0.0
Peak-calling for genomic enrichment assays of treatment sample(s) (-t [<file>]+
), with or without control sample(s) (-c [<file>]+
). Input files must be in SAM/BAM format. Incorporates ideas from SAMtoBED
, removeChrom
, and MACS2 v2.1.2_dev
.
- Control of analysis of alignments:
- Properly paired alignments only (default); fragments are inferred appropriately
- Also keeping unpaired alignments:
- As they appear in the SAM/BAM file (
-y
) - Length increased to specified value (
-a <int>
) - Length increased to average value of paired alignments (
-x
)
- As they appear in the SAM/BAM file (
- Filtering options:
- Chromosomes (reference sequences) to ignore (
-e <arg>
) - Minimum mapping quality (
-m <int>
)
- Chromosomes (reference sequences) to ignore (
- Peak-calling options:
- Maximum q-value (
-q <float>
, def. 0.05) - Maximum p-value (
-p <float>
) - Minimum length of a peak (
-l <int>
, def. 100bp) - Maximum distance between significant sites (
-g <int>
, def. 100bp)
- Maximum q-value (
- Output options:
- Output peak file, in ENCODE narrowPeak format (
-o <file
) - Output bedgraph file listing treatment/control pileups and p/q values (
-b <file>
)
- Output peak file, in ENCODE narrowPeak format (