Skip to content

Understanding Read Correction

Rob Flickenger edited this page Aug 9, 2021 · 4 revisions

In an ideal world, a genetic sequencer would detect every base in each chromosome with complete certainty. It would start at the telomere of each p-arm, proceed through the centromere, and end at the tip of the q-arm. It would do this once for each chromatid, and output a FASTA file containing each base present. Variant calling would be a simple matter of comparing each base to a genetic reference, and structural variant detection would be trivial.

In the real world, all instruments introduce a non-zero amount of random error. Modern short read sequencers produce results with impressive accuracy, but still include a small but unavoidable amount of systematic and random error. These errors can be detected and corrected by taking advantage of oversampling and statistical analysis.

The process of examining the body of read data and correcting (or removing) erroneous bases is called Read Correction. It is particularly critical for creating graph structures of read data, because erroneous bases lead to complicated graph structures.

Consider a 50X WGS in FASTQ format. At 150 bases per read, the dataset will contain roughly 1 billion reads. With an error rate of 0.1%, we should expect an average of at least 1 error in 1 million reads, or roughly 150 million total erroneous bases. In a graph structure, each of these errors represents a path deviation that must be represented and considered for analysis. Since these errors are largely unique, this can significantly impact performance, wasting time and memory on paths with no real evidence to support them.

BioGraph uses a Read Correction method that is highly efficient and eliminates errors with virtually no impact on the ultimate set of variant calls. It works without the need to consider quality scores, and thus requires no base quality score recalibration step. The process is completely reference-free, eliminating the possibility of reference bias.

Read correction in BioGraph

Read errors are corrected using a novel implementation of the A-star (A*) path-finding algorithm to find globally optimal base substitutions. This step is divided into two parts: k-merizing the reads and then using "good" kmers to infer the base substitutions needed to correct the reads.

K-merizing the reads

K-mers are generated from the reads during the import process. The default k-mer size is 30. Below is an example given a k-mer size of 6:

images/image1.png

All kmers from all reads are counted using a high-performance multi-pass bloom filter:

images/image8.png

K-mers less abundant than a threshold (default: 4 or less) are discarded:

images/image2.png

The remaining k-mers are considered “true”. The counts are discarded:

images/image7.png

The k-mers are then counted, and a distribution of k-mer counts is considered.

If we plot the k-mer counts for a 30X WGS on a graph, we see a bimodal distribution:

images/kmers.png

A typical k-mer count distribution. The k-mers to the right of the cutoff (count ≥ 5) are shaded blue, and are used for read correction. The k-mers to the left of the cutoff (< 5) are discarded.

A cutoff is chosen near the minimum point between the two peaks. Any k-mers to the left of the cutoff are considered erroneous and are removed. The remaining kmers are then used to correct the reads.

Correcting the reads

A de Bruijn graph is constructed from the remaining set of true k-mers, as if for de novo assembly. This graph represents an error-free reconstruction of the true read sequences. It allows for computationally efficient identification of error-free reads and correction of erroneous bases in all other reads using the A* search algorithm.

images/image3.png

Reads where all k-mers map perfectly to the graph do not require correction.

images/image10.png

For reads that imperfectly map to the graph, one or more globally optimum base substitutions are located so that the read aligns to the graph with no mismatches and differs by the smallest possible number of changes from the original read.

images/image4.png

Read errors often occur at the end of the read. Bases may be automatically trimmed during read correction. Correction starts at the beginning of a read up to the maximum allowed number of base substitutions (by default 8). Once this number of bases substitutions is reached, the read is trimmed starting from the next erroneous base. An example using a default of a maximum of 3 corrections follows:

images/image6.png

images/image5.png

The total number of bases corrected and the fraction of reads that may be automatically trimmed can be selected as appropriate for your dataset. By default, up to eight bases may be corrected (requiring a minimum of two good bases between them), and reads may be trimmed automatically down to a minimum of 70% of the original length. Reads that cannot meet the read correction criteria are discarded.

images/image9.png

Processing then proceeds with the resulting error-corrected reads.

Benchmarking

We benchmarked the read correction step using data simulated with Mason (http://www.seqan.de/apps/mason/). We simulated reads for whole human genome sequencing experiments on human reference hg19 using an Illumina HiSeq machine with 30x, 45x and 60x coverage.

Read depth Total read count Reads w/o errors before correction Remaining after correction Reads w/errors before correction Error reads corrected
30x WGS 32166263 20522011 20521863 11644252 10579162
45x wGS 48252264 30785990 30785933 17466274 15957764
60x WGS 64330880 41034875 41034827 23296005 21388977

Approximately 91% of reads with an error are corrected. The remaining approximately 9% of reads with errors are discarded. This makes up approximately 3% of the total number of reads.


Next: BioGraph Command Reference

Clone this wiki locally