Adding docs for 2.3.0

fulcrumgenomics · Jul 16, 2024 · b268df0
1 parent 7190b7e
commit b268df0
Showing 57 changed files with 3,251 additions and 4 deletions.
diff --git a/index.md b/index.md
@@ -11,6 +11,8 @@
 
 fgbio is a command line toolkit for working with genomic and particularly next generation sequencing data.
 
+See the [latest available tools here](tools/latest).
+
 ## Quick Installation
 
 The [conda](https://conda.io/) package manager (configured with [bioconda channels](https://bioconda.github.io/)) can be used to quickly install fgbio:
@@ -39,8 +41,8 @@ If the reported version on the first line starts with `1.8` or higher, you are a
 
 Once you have Java installed and a release downloaded you can run:
 
-* Run `java -jar fgbio-2.2.1.jar` to get a list of available tools
-* Run `java -jar fgbio-2.2.1.jar <Tool Name>` to see detailed usage instructions on any tool
+* Run `java -jar fgbio-2.3.0.jar` to get a list of available tools
+* Run `java -jar fgbio-2.3.0.jar <Tool Name>` to see detailed usage instructions on any tool
 
 When running tools we recommend the following set of Java options as a starting point though individual tools may need more or less memory depending on the input data:
 

diff --git a/metrics/2.3.0/metrics.md b/metrics/2.3.0/metrics.md
diff --git a/metrics/latest b/metrics/latest
@@ -1 +1 @@
-2.2.1
+2.3.0
diff --git a/tools/2.3.0/AnnotateBamWithUmis.md b/tools/2.3.0/AnnotateBamWithUmis.md
@@ -0,0 +1,53 @@
+---
+title: AnnotateBamWithUmis
+---
+
+# AnnotateBamWithUmis
+
+## Overview
+**Group:** SAM/BAM
+
+Annotates existing BAM files with UMIs (Unique Molecular Indices, aka Molecular IDs,
+Molecular barcodes) from separate FASTQ files. Takes an existing BAM file and either
+one FASTQ file with UMI reads or multiple FASTQs if there are multiple UMIs per template,
+matches the reads between the files based on read names, and produces an output BAM file
+where each record is annotated with an optional tag (specified by `attribute`) that
+contains the read sequence of the UMI.  Trailing read numbers (`/1` or `/2`) are
+removed from FASTQ read names, as is any text after whitespace, before matching.
+If multiple UMI segments are specified (see `--read-structure`) across one or more FASTQs,
+they are delimited in the same order as FASTQs are specified on the command line.
+The delimiter is controlled by the `--delimiter` option.
+
+The `--read-structure` option may be used to specify which bases in the FASTQ contain UMI
+bases.  Otherwise it is assumed the FASTQ contains only UMI bases.
+
+The `--sorted` option may be used to indicate that the FASTQ has the same reads and is
+sorted in the same order as the BAM file.
+
+At the end of execution, reports how many records were processed and how many were
+missing UMIs. If any read from the BAM file did not have a matching UMI read in the
+FASTQ file, the program will exit with a non-zero exit status.  The `--fail-fast` option
+may be specified to cause the program to terminate the first time it finds a record
+without a matching UMI.
+
+In order to avoid sorting the input files, the entire UMI fastq file(s) is read into
+memory. As a result the program needs to be run with memory proportional to the size of
+the (uncompressed) fastq(s).  Use the `--sorted` option to traverse the UMI fastq and BAM
+files assuming they are in the same order.  More precisely, the UMI fastq file will be
+traversed first, reading in the next set of BAM reads with same read name as the
+UMI's read name.  Those BAM reads will be annotated.  If no BAM reads exist for the UMI,
+no logging or error will be reported.
+
+## Arguments
+
+|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)|
+|----|----|----|-----------|---------|---------------|----------------|
+|input|i|PathToBam|The input SAM or BAM file.|Required|1||
+|fastq|f|PathToFastq|Input FASTQ(s) with UMI reads.|Required|Unlimited||
+|output|o|PathToBam|Output BAM file to write.|Required|1||
+|attribute|t|String|The BAM attribute to store UMI bases in.|Optional|1|RX|
+|qual-attribute|q|String|The BAM attribute to store UMI qualities in.|Optional|1||
+|read-structure|r|ReadStructure|The read structure for the FASTQ, otherwise all bases will be used.|Required|Unlimited|+M|
+|sorted|s|Boolean|Whether the FASTQ file is sorted in the same order as the BAM.|Optional|1|false|
+|fail-fast||Boolean|If set, fail on the first missing UMI.|Optional|1|false|
+
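The read-name matching and UMI joining described in the AnnotateBamWithUmis overview can be sketched in Python. This is an illustration only, not fgbio's implementation: the function names `normalize_read_name` and `join_umis` are hypothetical, the `-` delimiter is an assumed example value, and the leading `@` of the FASTQ header is assumed to have been stripped already.

```python
import re


def normalize_read_name(fastq_name: str) -> str:
    """Normalize a FASTQ read name before matching it to a BAM read name."""
    # Keep only the text before the first whitespace character.
    fields = fastq_name.split(None, 1)
    name = fields[0] if fields else ""
    # Remove a trailing read number (/1 or /2), if present.
    return re.sub(r"/[12]$", "", name)


def join_umis(umis, delimiter="-"):
    """Join UMI segments in the order their FASTQs were given.

    The default delimiter here is an assumption made for this sketch.
    """
    return delimiter.join(umis)
```

With names normalized this way, a dictionary keyed by read name is enough to look up the UMI for each BAM record in the unsorted case, which is also why memory use grows with the size of the FASTQ(s).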
diff --git a/tools/2.3.0/AssessPhasing.md b/tools/2.3.0/AssessPhasing.md
@@ -0,0 +1,40 @@
+---
+title: AssessPhasing
+---
+
+# AssessPhasing
+
+## Overview
+**Group:** VCF/BCF
+
+Assess the accuracy of phasing for a set of variants.
+
+All phased genotypes should be annotated with the `PS` (phase set) `FORMAT` tag, which by convention is the
+position of the first variant in the phase set (see the VCF specification).  Furthermore, the alleles of a phased
+genotype should use the `|` separator instead of the `/` separator, where the latter indicates the genotype is
+unphased.
+
+The input VCFs are assumed to be single sample: the genotype from the first sample is used.
+
+Only bi-allelic heterozygous SNPs are considered.
+
+The input known phased variants can be subsetted using the known interval list, for example to keep only variants
+from high-confidence regions.
+
+If the intervals argument is supplied, only the set of chromosomes specified will be analyzed.  Note that the full
+chromosome will be analyzed and start/stop positions will be ignored.
+
+## Arguments
+
+|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)|
+|----|----|----|-----------|---------|---------------|----------------|
+|called-vcf|c|PathToVcf|The VCF with called phased variants.|Required|1||
+|truth-vcf|t|PathToVcf|The VCF with known phased variants.|Required|1||
+|output|o|PathPrefix|The output prefix for all output files.|Required|1||
+|known-intervals|k|PathToIntervals|The interval list over which known phased variants should be kept.|Optional|1||
+|allow-missing-fields-in-vcf-header|m|Boolean|Allow missing fields in the VCF header.|Optional|1|true|
+|skip-mismatching-alleles|s|Boolean|Skip sites where the truth and call are both called but do not share the same alleles.|Optional|1|true|
+|intervals|l|PathToIntervals|Analyze only the given chromosomes in the interval list.  The entire chromosome will be analyzed (start and end ignored).|Optional|1||
+|modify-blocks|b|Boolean|Remove enclosed phased blocks and truncate overlapping blocks.|Optional|1|true|
+|debug-vcf|d|Boolean|Output a VCF with the called variants annotated with whether their phase matches the truth.|Optional|1|false|
+
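The genotype conventions in the AssessPhasing overview (`|` for phased, `/` for unphased, and only bi-allelic heterozygous SNPs considered) can be illustrated with a short Python sketch. The helper names are hypothetical and the check is a simplification written for this note; it is not how fgbio itself filters variants.

```python
def is_phased(gt: str) -> bool:
    """A genotype is phased when its alleles are separated by '|'."""
    return "|" in gt


def is_biallelic_het_snp(ref: str, alts: list, gt: str) -> bool:
    """True for a single-base REF, one single-base ALT, and a 0/1 or 1/0 genotype."""
    alleles = gt.replace("|", "/").split("/")
    return (
        len(ref) == 1
        and len(alts) == 1
        and len(alts[0]) == 1
        and sorted(alleles) == ["0", "1"]
    )
```

For example, `is_phased("0|1")` is true while `is_phased("0/1")` is false, and a site with two ALT alleles fails the bi-allelic check regardless of genotype.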
diff --git a/tools/2.3.0/AssignPrimers.md b/tools/2.3.0/AssignPrimers.md
@@ -0,0 +1,71 @@
+---
+title: AssignPrimers
+---
+
+# AssignPrimers
+
+## Overview
+**Group:** SAM/BAM
+
+Assigns reads to primers post-alignment. Takes in a BAM file of aligned reads and a tab-delimited file with five columns
+(`chrom`, `left_start`, `left_end`, `right_start`, and `right_end`) which provide the 1-based inclusive start and
+end positions of the primers for each amplicon.  The primer file must include headers, e.g.:
+
+```
+chrom  left_start  left_end  right_start right_end
+chr1   1010873     1010894   1011118     1011137
+```
+
+Optionally, a sixth column `id` may be given with a unique name for the amplicon.  If not given, the
+coordinates of the amplicon's primers will be used:
+  `<chrom>:<left_start>-<left_end>,<chrom>:<right_start>-<right_end>`
+
+Each read is assigned independently of its mate (for paired end reads). The primer for a read is assumed to be
+located at the start of the read in 5' sequencing order.  Therefore, a positive strand
+read will use its aligned start position to match against the amplicon's left-most coordinate, while a negative
+strand read will use its aligned end position to match against the amplicon's right-most coordinate.
+
+For paired end reads, the assignment for the mate will also be stored in the current read, using the same procedure as
+above but using the mate's coordinates.  This requires the input BAM have the mate-cigar ("MC") SAM tag.  Read
+pairs must have both ends mapped in forward/reverse configuration to have an assignment.  Furthermore, the amplicon
+assignment may be different for a read and its mate.  This may occur, for example, if tiling nearby amplicons and
+a large deletion occurs over a given primer, thereby "skipping" an amplicon.  This may also occur if there are
+translocations across amplicons.
+
+The output will have the following tags added (default tag names shown; each can be changed with its `--*-tag` option):
+- rp: the assigned primer coordinates (ex. `chr1:1010873-1010894`)
+- mp: the mate's assigned primer coordinates (ex. `chr1:1011118-1011137`)
+- ra: the assigned amplicon id
+- ma: the mate's assigned amplicon id (or `=` if the same as the assigned amplicon)
+
+The read sequence of the primer is not checked against the expected reference sequence at the primer's genomic
+coordinates.
+
+In some cases, large deletions within one end of a read pair may cause a primary and a supplementary alignment to be
+produced by the aligner, with the supplementary alignment containing the primer end of the read (5' sequencing order).
+In this case, the primer may not be assigned for this end of the read pair.  Therefore, it is recommended to prefer
+or choose the primary alignment that has the closest aligned read base to the 5' end of the read in sequencing order.
+For example, from `bwa` version `0.7.16` onwards, the `-5` option may be used.  Consider also using the `-q` option 
+for `bwa` `0.7.16` as well, which is standard in `0.7.17` onwards when the `-5` option is used.
+
+The `--annotate-all` option may be used to annotate all alignments for a given read end (e.g. R1) with
+the same assignment.  If the assignment differs across alignments for the same read end, no assignment is given.
+Furthermore, if the input BAM is neither `queryname` sorted nor `query` grouped, it will be sorted into queryname
+order to assign all alignments across a template simultaneously.  The output is written in coordinate order.
+
+## Arguments
+
+|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)|
+|----|----|----|-----------|---------|---------------|----------------|
+|input|i|PathToBam|Input BAM file.|Required|1||
+|output|o|PathToBam|Output BAM file.|Required|1||
+|metrics|m|FilePath|Output metrics file.|Required|1||
+|primers|p|FilePath|File with primer locations.|Required|1||
+|slop|S|Int|Match to primer locations +/- this many bases.|Optional|1|5|
+|unclipped-coordinates|U|Boolean|True to match based on the unclipped coordinates (adjusted for hard/soft clipping), otherwise based on the aligned bases.|Optional|1|true|
+|primer-coordinates-tag||String|The SAM tag for the assigned primer coordinate.|Optional|1|rp|
+|mate-primer-coordinates-tag||String|The SAM tag for the mate's assigned primer coordinate.|Optional|1|mp|
+|amplicon-identifier-tag||String|The SAM tag for the assigned amplicon identifier.|Optional|1|ra|
+|mate-amplicon-identifier-tag||String|The SAM tag for the mate's assigned amplicon identifier.|Optional|1|ma|
+|annotate-all||Boolean|Annotate all R1 (or R2) with same value.|Optional|1|false|
+
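The strand-aware matching rule in the AssignPrimers overview can be sketched as follows. This is a minimal Python illustration under stated assumptions, not fgbio's code: the `Amplicon` tuple and `matches` function are hypothetical names, coordinates are 1-based inclusive as in the primer file, and clipping adjustments (`--unclipped-coordinates`) are not modeled.

```python
from typing import NamedTuple


class Amplicon(NamedTuple):
    chrom: str
    left_start: int
    left_end: int
    right_start: int
    right_end: int


def matches(amplicon: Amplicon, chrom: str, positive_strand: bool,
            start: int, end: int, slop: int = 5) -> bool:
    """Check whether a read's 5' end falls within `slop` bases of a primer.

    Positive-strand reads compare their start to the amplicon's left-most
    coordinate; negative-strand reads compare their end to the right-most one.
    """
    if chrom != amplicon.chrom:
        return False
    if positive_strand:
        return abs(start - amplicon.left_start) <= slop
    return abs(end - amplicon.right_end) <= slop
```

A linear scan over amplicons with this predicate is enough to show the idea; a real tool would likely index primers by position to avoid checking every amplicon per read.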
diff --git a/tools/2.3.0/AutoGenerateReadGroupsByName.md b/tools/2.3.0/AutoGenerateReadGroupsByName.md
@@ -0,0 +1,43 @@
+---
+title: AutoGenerateReadGroupsByName
+---
+
+# AutoGenerateReadGroupsByName
+
+## Overview
+**Group:** SAM/BAM
+
+Adds read groups to a BAM file for a single sample by parsing the read names.
+
+Will add one or more read groups by parsing the read names.  The read names should be of the form:
+
+```
+<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>
+```
+
+Each unique combination of `<instrument>:<run number>:<flowcell ID>:<lane>` will be its own read group. The ID of the
+read group will be an integer and the platform unit will be `<flowcell ID>.<lane>`.
+
+The input is assumed to contain reads for one sample and library.  Therefore, the sample and library must be given
+and will be applied to all read groups.  Read groups will be replaced if present.
+
+Two passes will be performed on the input: first to gather all the read groups, and second to write the output BAM
+file.
+
+## Arguments
+
+|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)|
+|----|----|----|-----------|---------|---------------|----------------|
+|input|i|PathToBam|Input SAM or BAM file|Required|1||
+|output|o|PathToBam|Output SAM or BAM file|Required|1||
+|sample|s|String|The sample to insert into the read groups|Required|1||
+|library|l|String|The library to insert into the read groups|Required|1||
+|sequencing-center||String|The sequencing center from which the data originated|Optional|1||
+|predicted-insert-size||Integer|Predicted median insert size, to insert into the read groups|Optional|1||
+|program-group||String|Program group to insert into the read groups|Optional|1||
+|platform-model||String|Platform model to insert into the groups (free-form text providing further details of the platform/technology used)|Optional|1||
+|description||String|Description inserted into the read groups|Optional|1||
+|run-date||Iso8601Date|Date the run was produced (ISO 8601: `YYYY-MM-DD` ), to insert into the read groups|Optional|1||
+|comments||String|Comment(s) to include in the merged output file's header.|Optional|Unlimited||
+|sort-order||SamOrder|The sort order for the output sam/bam file.|Optional|1||
+
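The grouping rule in the AutoGenerateReadGroupsByName overview can be sketched in Python. The function names are hypothetical, and numbering the integer IDs from 1 in order of first appearance is an assumption made for this sketch; the documentation only states that the ID is an integer.

```python
def read_group_key(read_name: str) -> tuple:
    """Return (instrument, run number, flowcell ID, lane) from a 7-field read name."""
    fields = read_name.split(":")
    if len(fields) != 7:
        raise ValueError(f"Expected 7 colon-separated fields, got {len(fields)}: {read_name}")
    return tuple(fields[:4])


def assign_read_groups(read_names) -> dict:
    """First pass: collect one read group per unique key.

    Each value holds an integer-like ID (assumed to start at 1) and the
    platform unit formed as flowcell.lane.
    """
    groups = {}
    for name in read_names:
        key = read_group_key(name)
        if key not in groups:
            _, _, flowcell, lane = key
            groups[key] = {"ID": str(len(groups) + 1), "PU": f"{flowcell}.{lane}"}
    return groups
```

This mirrors the two-pass design: the read groups must all be known before the output header can be written, so the reads are only rewritten on a second traversal.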
diff --git a/tools/2.3.0/CallDuplexConsensusReads.md b/tools/2.3.0/CallDuplexConsensusReads.md
@@ -0,0 +1,83 @@
+---
+title: CallDuplexConsensusReads
+---
+
+# CallDuplexConsensusReads
+
+## Overview
+**Group:** Unique Molecular Identifiers (UMIs)
+
+Calls duplex consensus sequences from reads generated from the same _double-stranded_ source molecule. Prior
+to running this tool, reads must have been grouped with `GroupReadsByUmi` using the `paired` strategy. Doing
+so will apply (by default) MI tags to all reads of the form `*/A` and `*/B` where the /A and /B suffixes
+with the same identifier denote reads that are derived from opposite strands of the same source duplex molecule.
+
+Reads from the same unique molecule are first partitioned by source strand and assembled into single
+strand consensus molecules as described by CallMolecularConsensusReads.  Subsequently, for molecules that
+have at least one observation of each strand, duplex consensus reads are assembled by combining the evidence
+from the two single strand consensus reads.
+
+Because of the nature of duplex sequencing, this tool does not support fragment reads - if found in the
+input they are _ignored_.  Similarly, read pairs for which consensus reads cannot be generated for one or
+other read (R1 or R2) are omitted from the output.
+
+The consensus reads produced are unaligned, due to the difficulty and error-prone nature of inferring the consensus
+alignment.  Consensus reads should therefore be aligned afterwards, which should not be too expensive as there are
+likely far fewer consensus reads than input raw reads.  Please see how best to use this tool within the best-practice
+pipeline: https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md
+
+Consensus reads have a number of additional optional tags set in the resulting BAM file.  The tag names follow
+a pattern where the first letter (a, b or c) denotes that the tag applies to the first single strand consensus (a),
+second single-strand consensus (b) or the final duplex consensus (c).  The second letter is intended to capture
+the meaning of the tag (e.g. d=depth, m=min depth, e=errors/error-rate) and is upper case for values that are
+one per read and lower case for values that are one per base.
+
+The tags break down into those that are single-valued per read:
+
+```
+consensus depth      [aD,bD,cD] (int)  : the maximum depth of raw reads at any point in the consensus reads
+consensus min depth  [aM,bM,cM] (int)  : the minimum depth of raw reads at any point in the consensus reads
+consensus error rate [aE,bE,cE] (float): the fraction of bases in raw reads disagreeing with the final consensus calls
+```
+
+And those that have a value per base (duplex values are not generated, but can be generated by summing):
+
+```
+consensus depth  [ad,bd] (short[]): the count of bases contributing to each single-strand consensus read at each position
+consensus errors [ae,be] (short[]): the count of bases from raw reads disagreeing with the final single-strand consensus base
+consensus bases  [ac,bc] (string) : the single-strand consensus bases
+consensus quals  [aq,bq] (string) : the single-strand consensus qualities
+```
+
+The per base depths and errors are both capped at 32,767. In all cases no-calls (Ns) and bases below the
+min-input-base-quality are not counted in tag value calculations.
+
+The `--min-reads` option can take 1-3 values, similar to `FilterConsensusReads`. For example:
+
+```
+CallDuplexConsensusReads ... --min-reads 10 5 3
+```
+
+If fewer than three values are supplied, the last value is repeated (i.e. `5 4` -> `5 4 4` and `1` -> `1 1 1`).  The
+first value applies to the final consensus read, the second value to one single-strand consensus, and the last
+value to the other single-strand consensus. It is required that if values two and three differ,
+the _more stringent value comes earlier_.
+
+## Arguments
+
+|Name|Flag|Type|Description|Required?|Max # of Values|Default Value(s)|
+|----|----|----|-----------|---------|---------------|----------------|
+|input|i|PathToBam|The input SAM or BAM file.|Required|1||
+|output|o|PathToBam|Output SAM or BAM file to write consensus reads.|Required|1||
+|read-name-prefix|p|String|The prefix for all consensus read names.|Optional|1||
+|read-group-id|R|String|The new read group ID for all the consensus reads.|Optional|1|A|
+|error-rate-pre-umi|1|PhredScore|The Phred-scaled error rate for an error prior to the UMIs being integrated.|Optional|1|45|
+|error-rate-post-umi|2|PhredScore|The Phred-scaled error rate for an error after the UMIs have been integrated.|Optional|1|40|
+|min-input-base-quality|m|PhredScore|Ignore bases in raw reads that have Q below this value.|Optional|1|10|
+|trim|t|Boolean|If true, quality trim input reads in addition to masking low Q bases.|Optional|1|false|
+|sort-order|S|SamOrder|The sort order of the output, the same as the input if not given.|Optional|1||
+|min-reads|M|Int|The minimum number of input reads to a consensus read.|Required|3|1|
+|max-reads-per-strand||Int|The maximum number of reads to use when building a single-strand consensus. If more than this many reads are present in a tag family, the family is randomly downsampled to exactly max-reads reads.|Optional|1||
+|threads||Int|The number of threads to use while consensus calling.|Optional|1|1|
+|consensus-call-overlapping-bases||Boolean|Consensus call overlapping bases in mapped paired end reads|Optional|1|true|
+
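The `--min-reads` expansion rule in the CallDuplexConsensusReads overview can be sketched in Python. The function name `expand_min_reads` is hypothetical, and treating "more stringent" as "larger" is an interpretation made for this sketch rather than a statement about fgbio's internals.

```python
def expand_min_reads(values):
    """Expand 1-3 min-reads values to (duplex, strand one, strand two).

    The last supplied value is repeated to fill missing positions, and the
    second value must be at least as stringent (here: as large) as the third.
    """
    values = list(values)
    if not 1 <= len(values) <= 3:
        raise ValueError("min-reads takes between one and three values")
    padded = values + [values[-1]] * (3 - len(values))
    if padded[1] < padded[2]:
        raise ValueError("the more stringent single-strand value must come earlier")
    return tuple(padded)
```

Under this reading, `--min-reads 10 5 3` asks for at least 10 reads supporting the duplex consensus, at least 5 on one strand, and at least 3 on the other.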