Confident regions

Moore, Ben edited this page Aug 14, 2017 · 1 revision

To use Platinum Genomes truthsets to characterise false positives, we also require confident regions: regions that cover only those positions whose state has been clearly defined by the truthset pipeline, including both variant (non-reference) sites and homozygous reference positions. During benchmarking, any novel variation called inside these confident regions can then be treated as a false positive, while novel variation seen outside them is not assessed.
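The benchmarking rule described above can be sketched in a few lines of Python. This is only an illustration, not the Platinum Genomes tooling: the function names (`build_index`, `in_confident_region`, `classify_call`) and the result labels are hypothetical, confident regions are assumed to be non-overlapping BED-style intervals (0-based, half-open), call positions are assumed to be 1-based as in VCF, and a call is matched to the truthset by exact chromosome, position, and alleles, which ignores genotype matching and differences in variant representation.

```python
from bisect import bisect_right


def build_index(regions):
    """Group (chrom, start, end) intervals by chromosome and sort them.

    Assumption: intervals are BED-style (0-based, half-open) and do not overlap.
    """
    index = {}
    for chrom, start, end in regions:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def in_confident_region(index, chrom, pos):
    """Return True if the 1-based position `pos` lies inside a confident region."""
    intervals = index.get(chrom, [])
    zero_based = pos - 1
    # Find the last interval whose start is <= zero_based.
    i = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]


def classify_call(call, truth_variants, index):
    """Classify a call given as (chrom, pos, ref, alt).

    Labels are hypothetical: calls present in the truthset are true positives,
    novel calls inside confident regions are false positives, and novel calls
    outside confident regions are not assessed.
    """
    if call in truth_variants:
        return "true_positive"
    chrom, pos, _ref, _alt = call
    if in_confident_region(index, chrom, pos):
        return "false_positive"
    return "not_assessed"


# Toy data, invented for illustration only.
confident = build_index([("chr1", 100, 200), ("chr1", 300, 400)])
truth = {("chr1", 150, "A", "G")}
```

With this toy data, a novel call at `chr1:101` falls inside the first region and counts as a false positive, while a novel call at `chr1:201` falls just outside it and is not assessed.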

In Platinum Genomes, these regions are built from the pipelines that generated genomic VCFs (gVCFs), which assign genotypes and/or genotype likelihoods to all accessible reference positions. The method for building confident regions is described in the PG manuscript for the v2016.1 hg19 truthset:

In addition to the variant positions, we also collated our high confidence invariant positions using the same rules that were applied to positions that were genotyped as homozygous alternative across the pedigree. In this case the position must be called homozygous reference using at least two different call sets based off of different sequence aligners. To further eliminate any possible missed variants in our confident homozygous reference positions, we removed all positions where variant calls were made in any of the samples by any of the sequencing pipelines including variant calls that did not pass the quality filters. In total, this analysis identified 2,737,246,156 bases that we defined as confident homozygous reference across the pedigree and these positions were used to assess false positive rates of variant calling pipelines.
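The two rules in the quoted passage, support from at least two aligners and exclusion of any position with a variant call, can be sketched per position as follows. This is one reading of the text under stated assumptions, not the manuscript's implementation: the record layout (`sample`, `aligner`, `genotype`), the function name `is_confident_hom_ref`, and the choice to require two-aligner support in every sample of the pedigree are all hypothetical. A genotype of `None` stands for a no-call, and any other non-reference genotype is treated as a variant call whether or not it passed quality filters.

```python
from collections import defaultdict

HOM_REF = "0/0"


def is_confident_hom_ref(calls, samples):
    """Decide whether one position is confident homozygous reference.

    `calls` is a list of dicts with keys "sample", "aligner", "genotype",
    one per call set at this position (hypothetical layout).
    """
    aligners_by_sample = defaultdict(set)
    for call in calls:
        genotype = call["genotype"]
        if genotype == HOM_REF:
            aligners_by_sample[call["sample"]].add(call["aligner"])
        elif genotype is not None:
            # Any variant call in any sample by any pipeline removes the
            # position, including calls that failed quality filters.
            return False
    # Assumption: every sample needs hom-ref calls from >= 2 distinct aligners.
    return all(len(aligners_by_sample[s]) >= 2 for s in samples)


# Toy records, invented for illustration only.
pedigree = ["S1", "S2"]
supported = [
    {"sample": "S1", "aligner": "alignerA", "genotype": "0/0"},
    {"sample": "S1", "aligner": "alignerB", "genotype": "0/0"},
    {"sample": "S2", "aligner": "alignerA", "genotype": "0/0"},
    {"sample": "S2", "aligner": "alignerB", "genotype": "0/0"},
]
one_variant = supported + [
    {"sample": "S2", "aligner": "alignerC", "genotype": "0/1"},
]
single_aligner = supported[:3]
```

Here `supported` passes, `one_variant` is rejected because a single variant call is enough to remove the position, and `single_aligner` is rejected because sample `S2` has hom-ref support from only one aligner.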