Normalized variant representation

Brad Chapman edited this page May 21, 2014 · 4 revisions

The flexible VCF format allows the same variant to be represented in multiple ways. When comparing calls between multiple callers or sequencing technologies, it is critical to ensure uniform variant representation, so that differences in representation alone do not produce discordant calls.

For additional background, Adrian Tan wrote up the approaches used for normalization within vt.

We resolve these issues via a normalization process which does the following:

  • Converts all naming and coordinates from the UCSC convention into the standard NCBI/Ensembl convention (chr1 to 1). In addition to flexibly remapping chromosome names, this reorders contigs to match the standard conventions used within GATK.

  • Reduces all MNPs and complex variants into individual phased variants. Multiple nucleotide polymorphisms (MNPs) place multiple phased variants together into a single call representation. For example, this MNP:

          MT	150	.	TCT	CCC	.	PASS	.	GT	1/1
    

    can be equivalently expressed as:

         MT	150	.	T	C	.	PASS	.	GT	1/1
         MT	152	.	T	C	.	PASS	.	GT	1|1
    

    The normalization process converts all MNPs into the latter form.

  • Indels next to variants or in repetitive regions have multiple correct representations. For instance, an AG -> C deletion can be:

        TAG
        TC-
    

    or:

        TAG
        T-C
    

    We follow the convention of left-aligning these variants, and convert all of these to the second case. Similar to MNP normalization, this treats the changes as two separate variants for comparison. In the example we have a TA -> T deletion and a G -> C change, instead of a combined AG -> C deletion and nucleotide change.

    This process also handles left-aligning indels in repetitive regions using GATK's LeftAlignVariants tool.

  • Trims extra reference base padding in indels. Some callers add extra padded reference bases to indels. We remove these extra bases and adjust the coordinates so that a single matching reference base remains. So a padded variant like:

         1	237528	.	CAAAAAAAAAAAAAAAA	CAAAAAAAAAAAAAAAAAAAAA
    

    becomes:

         1	237544	.	A	AAAAAA
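
As a rough illustration of three of the steps above (contig renaming, MNP decomposition and padding trimming), here is a minimal Python sketch. It is not the code used by the normalization process itself: the function names `normalize_chrom`, `split_mnp` and `trim_padding` are hypothetical, genotype and phasing fields are ignored, left-alignment is not attempted, and the chrM to MT mapping is an assumption based on the usual UCSC versus NCBI/Ensembl naming difference.

```python
from typing import List, Tuple

# Hypothetical record type for illustration: (chrom, 1-based pos, ref, alt).
Variant = Tuple[str, int, str, str]


def normalize_chrom(name: str) -> str:
    """Map a UCSC-style contig name (chr1, chrM) to NCBI/Ensembl style (1, MT)."""
    if name.startswith("chr"):
        name = name[3:]
    # Assumption: UCSC's mitochondrial contig "chrM" corresponds to "MT".
    return "MT" if name == "M" else name


def split_mnp(chrom: str, pos: int, ref: str, alt: str) -> List[Variant]:
    """Split an equal-length MNP into one single-base variant per mismatch."""
    if len(ref) != len(alt):
        raise ValueError("split_mnp expects ref and alt of equal length")
    return [
        (chrom, pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]


def trim_padding(chrom: str, pos: int, ref: str, alt: str) -> Variant:
    """Drop shared leading bases, keeping one matching reference base."""
    shared = 0
    for r, a in zip(ref, alt):
        if r != a:
            break
        shared += 1
    # Keep the last shared base as the anchor for the indel.
    strip = max(shared - 1, 0)
    return (chrom, pos + strip, ref[strip:], alt[strip:])


print(normalize_chrom("chr1"))
print(split_mnp("MT", 150, "TCT", "CCC"))
print(trim_padding("1", 237528, "C" + "A" * 16, "C" + "A" * 21))
```

The position shift in `trim_padding` equals the number of leading bases removed, which is why the coordinate moves forward when padding is stripped. A real implementation would also need to carry the genotype and phasing information through to each resulting variant.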
    