update filterdup, hmmratac, pileup md
taoliu committed Nov 7, 2023
1 parent 0b79f06 commit 9c42a26
Showing 4 changed files with 91 additions and 27 deletions.
4 changes: 2 additions & 2 deletions bin/macs3
@@ -1,5 +1,5 @@
#!/usr/bin/env python
# Time-stamp: <2023-11-03 11:55:39 Tao Liu>
# Time-stamp: <2023-11-07 14:37:18 Tao Liu>

"""Description: MACS v3 main executable.
Expand Down Expand Up @@ -652,7 +652,7 @@ def add_pileup_parser( subparsers ):
argparser_pileup = subparsers.add_parser( "pileup",
help = "Pileup aligned reads with a given extension size (fragment size or d in MACS language). Note there will be no step for duplicate reads filtering or sequencing depth scaling, so you may need to do certain pre/post-processing." )
argparser_pileup.add_argument( "-i", "--ifile", dest = "ifile", type = str, required = True, nargs = "+",
help = "Alignment file. If multiple files are given as '-t A B C', then they will all be read and combined. Note that pair-end data is not supposed to work with this command. REQUIRED." )
help = "Alignment file. If multiple files are given as '-t A B C', then they will all be read and combined. REQUIRED." )
argparser_pileup.add_argument( "-o", "--ofile", dest = "outputfile", type = str, required = True,
help = "Output bedGraph file name. If not specified, will write to standard output. REQUIRED." )
add_outdir_option( argparser_pileup )
7 changes: 7 additions & 0 deletions docs/filterdup.md
@@ -84,3 +84,10 @@ genome size is set to `hs` (Homo Sapiens), the format of the input
file is determined automatically, and the program keeps only one
duplicate.

Here is an example to convert BAMPE file into BEDPE. Please note that
`-f BAMPE` and `--keep-dup all` are both necessary for format
conversion:

```bash
macs3 filterdup -i input.bam -o output.bedpe -f BAMPE --keep-dup all
```
35 changes: 28 additions & 7 deletions docs/hmmratac.md
@@ -1,5 +1,7 @@
# hmmratac

## Description

HMMRATAC (`macs3 hmmratac`) is a dedicated peak calling algorithm
based on Hidden Markov Model for ATAC-seq data. The basic idea behind
HMMRATAC is to digest ATAC-seq data according to the fragment length
Expand Down Expand Up @@ -33,22 +35,41 @@ then they will all be read and pooled together. REQUIRED.

### `--outdir OUTDIR`

If specified all output files will be written to that directory. Default: the current working directory
If specified all output files will be written to that
directory. Default: the current working directory

### `-n NAME`/ `--name NAME`
Name for this experiment, which will be used as a prefix to generate output file names. DEFAULT: "NA"
Name for this experiment, which will be used as a prefix to generate
output file names. DEFAULT: "NA"

### `--modelonly`
This option will only generate the HMM model as a JSON file and quit. This model can then be applied using the `--model` option. Default: False
This option will only generate the HMM model as a JSON file and
quit. This model can then be applied using the `--model`
option. Default: False

### `--model`
If provided, HMM training will be skipped and a JSON file generated from a previous HMMRATAC run will be used instead of creating new one. Default: NA
If provided, HMM training will be skipped and a JSON file generated
from a previous HMMRATAC run will be used instead of creating new
one. Default: NA

### `-t HMM_TRAINING_REGIONS` / `--training HMM_TRAINING_REGIONS`
Customized training regions can be provided through this option. `-t` takes the filename of training regions (previously was BED_file) to use for training HMM, instead of using foldchange settings to select. Default: NA
Customized training regions can be provided through this option. `-t`
takes the filename of training regions (previously was BED_file) to
use for training HMM, instead of using foldchange settings to
select. Default: NA

### `--min-frag-p MIN_FRAG_P`
We will exclude the abnormal fragments that can't be assigned to any of the four signal tracks. After we use EM to find the means and stddevs of the four distributions, we will calculate the likelihood that a given fragment length fit any of the four using normal distribution. The criteria we will use is that if a fragment length has less than MIN_FRAG_P probability to be like either of short, mono, di, or tri-nuc fragment, we will exclude it while generating the four signal tracks for later HMM training and prediction. The value should be between 0 and 1. Larger the value, more abnormal fragments will be allowed. So if you want to include more 'ideal' fragments, make this value smaller. Default = 0.001
We will exclude the abnormal fragments that can't be assigned to any
of the four signal tracks. After we use EM to find the means and
stddevs of the four distributions, we will calculate the likelihood
that a given fragment length fit any of the four using normal
distribution. The criteria we will use is that if a fragment length
has less than MIN_FRAG_P probability to be like either of short,
mono, di, or tri-nuc fragment, we will exclude it while generating
the four signal tracks for later HMM training and prediction. The
value should be between 0 and 1. Larger the value, more abnormal
fragments will be allowed. So if you want to include more 'ideal'
fragments, make this value smaller. Default = 0.001
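
The exclusion rule described above can be sketched roughly as follows. This is an illustration only, not the MACS3 implementation: it assumes the "probability" is the normal density evaluated at the fragment length, and the means and standard deviations are placeholder values standing in for whatever EM would estimate. The names `normal_pdf` and `keep_fragment` are hypothetical.

```python
import math

# Placeholder parameters, assumed for illustration only. In a real run
# these would be the values estimated by the EM step.
MEANS = {"short": 50.0, "mono": 200.0, "di": 400.0, "tri": 600.0}
STDDEVS = {"short": 20.0, "mono": 20.0, "di": 20.0, "tri": 20.0}


def normal_pdf(x, mean, stddev):
    """Density of a normal distribution with the given mean and stddev at x."""
    z = (x - mean) / stddev
    return math.exp(-0.5 * z * z) / (stddev * math.sqrt(2.0 * math.pi))


def keep_fragment(length, min_frag_p=0.001):
    """Keep a fragment if at least one of the four distributions scores
    its length at or above min_frag_p; otherwise it is excluded."""
    return any(
        normal_pdf(length, MEANS[name], STDDEVS[name]) >= min_frag_p
        for name in MEANS
    )


if __name__ == "__main__":
    for length in (200, 125, 1000):
        print(length, keep_fragment(length))
```

Under these assumed parameters, a fragment length close to one of the four means is kept, while a length that falls far from all of them is dropped before the signal tracks are built.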

### `--cutoff-analysis-only`

@@ -57,7 +78,7 @@ Name for this experiment, which will be used as a prefix to generate output file
the three crucial parameters for `-l`, `-u`, and `-c`. So it's highly
recommended to run this first! Please read the report and
instructions in `Choices of cutoff values` on how to decide the three
crucial parameters
crucial parameters.

### `-u HMM_UPPER` / `--upper HMM_UPPER`

72 changes: 54 additions & 18 deletions docs/pileup.md
@@ -1,37 +1,73 @@
# Pileup
# pileup

## Overview
The `pileup` command is part of the MACS3 suite of tools and is used to pile up alignment files. It is particularly useful in ChIP-Seq analysis where summarizing the read depth at each genomic location is required.
The `pileup` command is part of the MACS3 suite of tools and is used
to pile up alignment files. It is a fast algorithm to generate
coverage track from alignment file -- either single-end or paired-end
data.

## Detailed Description

The `pileup` command takes in one or multiple input files and produces an output file with the piled-up alignments. It uses an efficient algorithm to pile up the alignments, improving the quality of your data for further analysis.
The `pileup` command takes in one or multiple input files and produces
an output file with the piled-up genomic coverage. It uses an
efficient algorithm to pile up the alignments.

Pileup aligned reads with a given extension size (fragment size or d in MACS language). Note there will be no step for duplicate reads filtering or sequencing depth scaling, so you may need to do certain pre/post-processing.
![Pileup Algorithm](./pileup.jpeg)

Pileup aligned reads with a given extension size (fragment size or d
in MACS language). Note there will be no step for duplicate reads
filtering or sequencing depth scaling, so you may need to do certain
pre/post-processing.

## Command Line Options

The command line options for `pileup` are defined in `/MACS3/Commands/pileup_cmd.py` and `/bin/macs3` files. Here is a brief overview of these options:
Here is a brief overview of the command line options for `pileup`:

- `-i` or `--ifile`: Alignment file. If multiple files are given as '-t A B C', then they will all be read and combined. Note that pair-end data is not supposed to work with this command. REQUIRED.
- `-o` or `--ofile`: Output bedGraph file name. If not specified, will write to standard output. REQUIRED.
- `--outdir`: If specified, all output files will be written to that directory. Default: the current working directory
- `-i` or `--ifile`: Alignment file. If multiple files are given as
'-t A B C', then they will all be read and combined. REQUIRED.
- `-o` or `--ofile`: Output bedGraph file name. If not specified, will
write to standard output. REQUIRED.
- `--outdir`: If specified, all output files will be written to that
directory. Default: the current working directory
- `-f ` or `--format`: Format of the tag file.
- `AUTO`: MACS3 will pick a format from "AUTO", "BED", "ELAND", "ELANDMULTI", "ELANDEXPORT", "SAM", "BAM", and "BOWTIE". If the format is BAMPE or BEDPE, please specify it explicitly.
- `BAMPE` or `BEDPE`: When the format is BAMPE or BEDPE, the -B and --extsize options would be ignored.
- Other options correspond to specific formats.
- `-B` or `--both-direction`: By default, any read will be extended towards the downstream direction by the extension size. If this option is set, aligned reads will be extended in both upstream and downstream directions by the extension size. This option will be ignored when the format is set as BAMPE or BEDPE. DEFAULT: False
- `--extsize`: The extension size in bps. Each alignment read will become an EXTSIZE of the fragment, then be piled up. Check description for -B for details. This option will be ignored when the format is set as BAMPE or BEDPE. DEFAULT: 200
- `--buffer-size`: Buffer size for incrementally increasing the internal array size to store read alignment information. In most cases, you don't have to change this parameter. However, if there are a large number of chromosomes/contigs/scaffolds in your alignment, it's recommended to specify a smaller buffer size in order to decrease memory usage (but it will take longer time to read alignment files). Minimum memory requested for reading an alignment file is about # of CHROMOSOME * BUFFER_SIZE * 8 Bytes. DEFAULT: 100000
- `--verbose`: Set verbose level. 0: only show critical messages, 1: show additional warning messages, 2: show process information, 3: show debug messages. If you want to know where are the duplicate reads, use 3. DEFAULT: 2

- `AUTO`: MACS3 will pick a format from "AUTO", "BED", "ELAND",
"ELANDMULTI", "ELANDEXPORT", "SAM", "BAM", and "BOWTIE". If the
format is BAMPE or BEDPE, please specify it explicitly.
- `BAMPE` or `BEDPE`: When the format is BAMPE or BEDPE, the -B and
--extsize options would be ignored.
- Other options correspond to specific formats.
- `-B` or `--both-direction`: By default, any read will be extended
towards the downstream direction by the extension size. If this
option is set, aligned reads will be extended in both upstream and
downstream directions by the extension size. This option will be
ignored when the format is set as BAMPE or BEDPE. DEFAULT: False
- `--extsize`: The extension size in bps. Each alignment read will
become an EXTSIZE of the fragment, then be piled up. Check
description for -B for details. This option will be ignored when the
format is set as BAMPE or BEDPE. DEFAULT: 200
- `--buffer-size`: Buffer size for incrementally increasing the
internal array size to store read alignment information. In most
cases, you don't have to change this parameter. However, if there
are a large number of chromosomes/contigs/scaffolds in your
alignment, it's recommended to specify a smaller buffer size in
order to decrease memory usage (but it will take longer time to read
alignment files). Minimum memory requested for reading an alignment
file is about # of CHROMOSOME * BUFFER_SIZE * 8 Bytes. DEFAULT:
100000
- `--verbose`: Set verbose level. 0: only show critical messages, 1:
show additional warning messages, 2: show process information, 3:
show debug messages. If you want to know where are the duplicate
reads, use 3. DEFAULT: 2
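
To get a feel for the `--buffer-size` memory estimate above (number of chromosomes * BUFFER_SIZE * 8 bytes), here is a small worked calculation. The function name `min_read_memory_bytes` and the chromosome counts are made up for illustration; only the formula and the default of 100000 come from the option description.

```python
def min_read_memory_bytes(n_chromosomes, buffer_size=100000):
    """Approximate minimum memory for reading one alignment file,
    following the formula: chromosomes * buffer size * 8 bytes."""
    return n_chromosomes * buffer_size * 8


if __name__ == "__main__":
    # Hypothetical assemblies: a few chromosomes vs. many scaffolds.
    for n_chrom, buf in ((25, 100000), (100000, 100000), (100000, 1000)):
        mb = min_read_memory_bytes(n_chrom, buf) / 1e6
        print(f"{n_chrom} chromosomes, buffer {buf}: about {mb:.0f} MB")
```

With the default buffer, 25 chromosomes need about 20 MB, whereas an assembly with 100000 scaffolds would need about 80000 MB. Lowering the buffer size to 1000 brings the latter down to about 800 MB, which is why a smaller buffer is suggested for highly fragmented assemblies.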

## Example Usage

Here is an example of how to use the `pileup` command:

```bash
macs3 pileup -i treatment.bam -o piledup.bedGraph -f BAM -g hs -n experiment1
macs3 pileup -i treatment.bam -o piledup.bedGraph -f BAM --extsize 147
```

In this example, the program will pile up the alignments in the `treatment.bam` file and write the result to `piledup.bedGraph`. The input file is in BAM format, the genome size is set to 'hs' (human), and the name of the experiment is 'experiment1'.
In this example, the program will pile up the alignments in the
`treatment.bam` file and write the result to `piledup.bedGraph`. The
input file is in BAM format, and we extend each sequencing tag into a
147bps fragment for pileup.
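
As a rough sketch of what extending tags and piling them up means, the following toy code extends each read by `extsize` and then sweeps over fragment starts and ends to produce bedGraph-like intervals. It is not the MACS3 implementation. It assumes each read is given by its 5' position and strand, that downstream extension runs towards higher coordinates on the `+` strand and lower coordinates on the `-` strand, and that only covered intervals are reported. The function names `extend_read` and `pileup` are hypothetical.

```python
def extend_read(five_prime, strand, extsize, both_direction=False):
    """Turn a read (5' position and strand) into a fragment (start, end)."""
    if both_direction:
        # Assumed behaviour of -B: extend by extsize on both sides.
        start, end = five_prime - extsize, five_prime + extsize
    elif strand == "+":
        start, end = five_prime, five_prime + extsize
    else:
        start, end = five_prime - extsize, five_prime
    return max(start, 0), end


def pileup(fragments):
    """Return (start, end, depth) intervals for all covered positions."""
    # Net change in depth at each coordinate: +1 at a start, -1 at an end.
    events = {}
    for start, end in fragments:
        events[start] = events.get(start, 0) + 1
        events[end] = events.get(end, 0) - 1

    intervals = []
    depth = 0
    prev = None
    for pos in sorted(events):
        if events[pos] == 0:
            continue  # depth does not change here, so keep the interval open
        if prev is not None and depth > 0:
            intervals.append((prev, pos, depth))
        depth += events[pos]
        prev = pos
    return intervals


if __name__ == "__main__":
    reads = [(100, "+"), (150, "+"), (400, "-")]
    fragments = [extend_read(pos, strand, 147) for pos, strand in reads]
    for start, end, depth in pileup(fragments):
        print(f"chr1\t{start}\t{end}\t{depth}")
```

The chromosome name `chr1` and the three reads are invented for the demonstration. Note that, as stated earlier, no duplicate filtering or depth scaling happens in this step.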
