genetronhealth edited this page Feb 23, 2021 · 9 revisions

Welcome to the UVC wiki!

For toy example data and usage, please check https://github.com/genetronhealth/example-data/

A docker container for UVC is available at https://quay.io/repository/genetronhealth/gcc-6-3-0-uvc-0-6-0-441a694, where UVC is pre-compiled and pre-installed in the /uvc/ directory in the container.

UVC has been extensively evaluated against other variant callers. The repository containing the code and scripts used for this evaluation is at https://github.com/genetronhealth/uvc-eval

The following figure shows the variant-calling pipeline with UVC.

[Figure: Variant-calling pipeline with UVC]

By default, UVC uses 8 threads and consumes approximately 2 GB of memory per thread. The speedup should scale linearly with the number of physical CPU cores for up to 48 cores, although this depends heavily on the hardware architecture and OS details, so treat the figure with caution. The peak memory usage should scale linearly with the number of threads used.
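As a rough planning aid, the memory rule above can be written down as a small calculation. This is only a sketch: the function names are hypothetical, and the 2 GB per thread figure is the approximate value quoted above, not a measured guarantee.

```python
def estimate_peak_memory_gb(threads: int = 8, gb_per_thread: float = 2.0) -> float:
    """Estimate peak memory assuming linear scaling with the thread count.

    The defaults (8 threads, about 2 GB per thread) are the figures quoted
    in this wiki; actual usage depends on the data and the machine.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return threads * gb_per_thread


def max_threads_for_memory(available_gb: float, gb_per_thread: float = 2.0) -> int:
    """Largest thread count whose estimated peak memory fits in available_gb."""
    return int(available_gb // gb_per_thread)


if __name__ == "__main__":
    print(estimate_peak_memory_gb())        # default of 8 threads
    print(max_threads_for_memory(64.0))     # threads that fit in 64 GB
```

The second helper simply inverts the first, which can help when choosing a value for the thread-count parameter on a shared machine.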

Prerequisites

The input of UVC is a BAM file and the output of UVC is a VCF file. Therefore, please make sure that you are familiar with the BAM file format specifications at http://samtools.github.io/hts-specs/SAMv1.pdf and the VCF file format specifications at https://samtools.github.io/hts-specs/VCFv4.2.pdf.

Command-line parameters

If the optional parameter --help is provided to a UVC executable, then UVC prints a help message describing all command-line parameters available. Each parameter is followed by a citation with PMC ID if applicable. Here, we will list and describe some important and commonly used parameters.

  • -v print version
  • -f reference FASTA file
  • -o output VCF file
  • -R region BED file within which variants are called. Please make sure that each genomic interval in the BED file is not too big (less than 100,000 bp).
  • -t number of threads to use. Each thread uses about 2 GB of RAM, and the speedup gained from multi-threading may drop beyond 48 threads, depending on the OS and hardware architecture.
  • --outvar-flag bitwise flag specifying the type of variants to be generated. If the 0x1 bit is set, then UVC generates germline variants denoted by the GERMLINE flag in the INFO field. If the 0x2 bit is set, then UVC generates somatic variants. If the 0x4 bit is set, then UVC generates both germline and somatic variants without considering the germline-vs-somatic origin. Further bits exist; see the --help output for their meaning.
  • -q variant-quality threshold below which the variant is not reported in the VCF output.
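Two of the parameters above lend themselves to a small helper script. The following is a minimal sketch, not part of UVC: it splits BED intervals so that each one stays below the 100,000 bp limit mentioned for -R, and it decodes the three --outvar-flag bits documented here. The function names, the constant names, and the 50,000 bp chunk size are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, List

# Hypothetical names for the three --outvar-flag bits described above.
OUTVAR_GERMLINE = 0x1
OUTVAR_SOMATIC = 0x2
OUTVAR_ANY_ORIGIN = 0x4


def split_bed_lines(lines: Iterable[str], chunk_bp: int = 50_000) -> Iterator[str]:
    """Split each BED interval into pieces of at most chunk_bp bases.

    BED coordinates are 0-based and half-open. Extra columns are copied
    onto every piece; header and comment lines pass through unchanged.
    """
    if not 0 < chunk_bp < 100_000:
        raise ValueError("chunk_bp must be positive and below 100,000")
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(("#", "track", "browser")):
            yield line
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        for piece_start in range(start, end, chunk_bp):
            piece_end = min(piece_start + chunk_bp, end)
            yield "\t".join([chrom, str(piece_start), str(piece_end)] + fields[3:])


def describe_outvar_flag(flag: int) -> List[str]:
    """List which of the three documented bits are set (other bits are ignored)."""
    names = []
    if flag & OUTVAR_GERMLINE:
        names.append("germline")
    if flag & OUTVAR_SOMATIC:
        names.append("somatic")
    if flag & OUTVAR_ANY_ORIGIN:
        names.append("germline-or-somatic")
    return names


if __name__ == "__main__":
    for piece in split_bed_lines(["chr1\t0\t120000\tgeneA\n"]):
        print(piece)
    print(describe_outvar_flag(0x3))
```

Splitting is done purely on coordinates, so a piece boundary can fall anywhere inside a target region; whether that matters for calls near the boundary is not covered here.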

Please note that --assay-type and --dedup-flag may be set to override the default for advanced usage.

Tags in the output VCF

If the positional parameter /only-print-vcf-header/ is provided to a UVC executable, then UVC prints a dummy VCF header describing each tag. Here, we will list and describe some important and commonly used tags.

  • FORMAT/DP : the total deduplicated (with duplicates removed) fragment depth of all alleles.
  • FORMAT/AD : the deduplicated fragment depth of each allele shown on this line of VCF record.
  • FORMAT/bDP : the total non-deduplicated (with duplicates kept) fragment depth of all alleles.
  • FORMAT/bAD : the non-deduplicated fragment depth of each allele shown on this line of VCF record.
  • FORMAT/c2DP : the total consensus barcode-family depth of all alleles. Each barcode family requires at least 2 fragments, and the most frequently occurring allele must occur at least 80% of the time in the family of fragments.
  • FORMAT/c2AD : the consensus barcode-family depth of each allele shown on this line of VCF record.
  • FORMAT/aBQ : root-mean-square basecall quality of each allele using raw sequenced segments without merging R1 and R2 ends.
  • FORMAT/bMQ : root-mean-square mapping quality of each allele using deduplicated fragments with R1 and R2 ends merged.
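The depth tags above can be combined into per-allele fractions. The sketch below is illustrative only: the function names and the sample values are made up, and it assumes that the first comma-separated value of the total-depth tag is the depth of all alleles.

```python
from typing import Dict, List


def parse_sample(format_col: str, sample_col: str) -> Dict[str, str]:
    """Pair the colon-separated FORMAT keys with one sample's values."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))


def allele_fractions(format_col: str, sample_col: str,
                     depth_tag: str = "DP", allele_tag: str = "AD") -> List[float]:
    """Divide each allele depth by the total depth.

    Assumption: the first comma-separated value of depth_tag is the total
    depth of all alleles. Pass depth_tag="bDP", allele_tag="bAD" to use
    the non-deduplicated depths instead.
    """
    sample = parse_sample(format_col, sample_col)
    total = int(sample[depth_tag].split(",")[0])
    depths = [int(value) for value in sample[allele_tag].split(",")]
    if total == 0:
        return [0.0 for _ in depths]
    return [depth / total for depth in depths]


if __name__ == "__main__":
    # Made-up record: 1000 deduplicated fragments, 10 supporting the ALT allele.
    print(allele_fractions("GT:DP:AD", "0/1:1000:990,10"))
```

Comparing the fractions from DP/AD with those from bDP/bAD for the same record shows how much deduplication changes the apparent allele fraction.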

To make the FORMAT fields (tags) easier to use, the VCF FORMAT fields generated by UVC are organized into classes. There are four important classes.

  1. Class denoted by _A and _a: raw sequenced segments where R1 and R2 ends are not merged (a stands for basecall stage a).
  2. Class denoted by _B and _b: fragments where R1 and R2 are merged, but these fragments are not deduplicated (b stands for basecall stage b).
  3. Class denoted by _C and _c: deduplicated fragments or families of single-strand consensus sequences (c stands for consensus).
  4. Class denoted by _D and _d: duplex where the two single-strand consensus sequences coming from the same part of the original DNA double helix are merged together into one double-strand consensus sequence (d stands for duplex).

Each FORMAT field denoted by an uppercase letter (_A, _B, _C, and _D) has two integers denoting properties of all alleles and of the padded deletion allele.
Each FORMAT field denoted by a lowercase letter (_a, _b, _c, and _d) has R integers denoting properties of each allele on the VCF record line, where R is the number of alleles on that line.

Please note that the FORMAT fields bHap and cHap contain phasing information that can be used to construct haplotypes. For example, with such information, we can see whether EGFR T790M, EGFR exon19 DEL, and EGFR L858R mutations occurred in cis or trans configurations. Moreover, such information can be used to combine simple variants into complex variants if needed, because we know which variants occurred together.

Design philosophy

  1. Trust the aligner. UVC assumes that the alignment given by the aligner is very accurate. Therefore, UVC neither re-aligns nor performs any assembly. If the alignment is not accurate, then the user can run the aligner again using a different set of alignment parameters (for example, higher gap opening penalty and lower gap extension penalty) or run realignment software such as Abra2. We found that alignments tend to be very accurate if the alignment length is longer than 150 base pairs.
  2. Keep variants simple, but still provide sufficient information to generate complex variants. This principle extends the previous one. In the SAM/BAM format specifications, the CIGAR string contains only very simple operations for representing variants, namely substitution, insertion, and deletion. Therefore, UVC also generates only these three simple types of variants, which is consistent with how variants are visualized during manual review. Nonetheless, variants can still be combined with the information provided by the FORMAT tags bHap and cHap.
  3. Let QUAL conform to the VCF file format specifications. This is easy to say but hard to do. Suppose that a variant has a QUAL of 100. What does this mean? It means that we expect to see one false positive variant call that looks similar to this variant out of ten billion genomic positions. To be that confident, we need to perform a lot of simulation and evaluation to be sure that the empirical probability matches the QUAL of 100. Fortunately, it turns out that the empirical probability of a variant being false positive at high sequencing depth is inversely proportional to the cube (third power) of the variant allele fraction. Therefore, UVC combines a binomial model based on base qualities with a cubic-power-law model based on allele fractions to generate a QUAL that conforms to the VCF file format specifications.
  4. Infer from extreme corner cases. Corner cases are often ignored. For example, regions with extremely high sequencing depth are often ignored, regions with extremely low sequencing depth are often ignored, variants with extreme strand bias are often ignored, regions with poor mapping quality are often ignored, reads without unique molecular identifiers (UMIs) are often ignored in a UMI assay, etc. However, UVC takes a different perspective on these outlier corner cases: they are actually easier to work with because only one variable has a strong effect. For example, at extremely high sequencing depth, base quality and read depth are irrelevant, leaving us with only allele fraction and mapping quality; at low mapping quality, base quality and sequencing depth are both irrelevant, leaving us with allele fraction and mapping quality; etc. Then, we can combine the equations and formulas derived for these corner cases so that they are applicable to the general case.

The last design philosophy is especially important because it leads to two important principles in NGS data analysis:

  1. Universality that establishes a one-to-one correspondence between variant allele fraction and probability of the variant being false positive.
  2. Zero-inflated modeling of all NGS biases where zero denotes the frequentist null hypothesis of having no bias and nonzero denotes the Bayesian estimate of having some nonzero bias.
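To make the QUAL convention from design principle 3 concrete, the sketch below converts between a Phred-scaled QUAL and a false positive probability, and shows what a cubic power law in allele fraction implies on the Phred scale. The function names are hypothetical, and this is not UVC's actual model: the binomial component and the way UVC combines the two models are not reproduced here.

```python
import math


def qual_to_false_positive_prob(qual: float) -> float:
    """Phred scale: QUAL = -10 * log10(probability that the call is wrong)."""
    return 10.0 ** (-qual / 10.0)


def false_positive_prob_to_qual(prob: float) -> float:
    """Inverse of qual_to_false_positive_prob."""
    return -10.0 * math.log10(prob)


def qual_gain_from_vaf_ratio(vaf_ratio: float, exponent: float = 3.0) -> float:
    """Phred gain when the allele fraction is multiplied by vaf_ratio.

    If the false positive probability is proportional to VAF ** -exponent,
    multiplying VAF by vaf_ratio divides the probability by
    vaf_ratio ** exponent, which adds 10 * exponent * log10(vaf_ratio)
    to the Phred-scaled quality.
    """
    return 10.0 * exponent * math.log10(vaf_ratio)


if __name__ == "__main__":
    print(qual_to_false_positive_prob(100))   # about one in ten billion
    print(qual_gain_from_vaf_ratio(10.0))     # gain for a tenfold higher VAF
```

Under the cubic law, a tenfold increase in allele fraction corresponds to a gain of about 30 Phred units, which is why allele fraction alone can map onto a false positive probability at high depth.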