Add --high_confidence option for dual hybrid genomes #9

FelixKrueger · 2017-02-22T19:02:17Z

We have come across certain positions in the genome where different strains appear to have the same SNP (indicated by the GT/genotype field), but one of the strains failed the FI/FILTER criterion (1 is PASS, 0 is FAIL). Here is an example:

GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI
1/1:22:6:0.166667:152,22,0:137,18,0:2:36:6:0,0,6,0:0:-0.616816:.:1 (129)
1/1:15:4:0:79,15,0:67,12,0:2:24:4:0,0,4,0:0:-0.556411:.:0 (Cast)

For single hybrid genomes we would include this position in the 129 genome (1/1 homozygous SNP, first line), but would ignore the position for the Cast genome (also 1/1 homozygous SNP, but failed the high confidence FI filter, second line). This seems like a reasonable approach.
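As a rough illustration of the single-hybrid rule described above, the sketch below pairs the FORMAT keys with one strain column and keeps a position only if it is a homozygous 1/1 call with FI=1. It is not the SNPsplit code (which is written in Perl); the function names `parse_sample` and `is_high_confidence_hom_snp` are made up for this example.

```python
# Illustrative sketch only; not SNPsplit's actual implementation.
FORMAT = "GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI"


def parse_sample(format_string, sample_string):
    """Map each FORMAT key to the matching value of one strain column."""
    keys = format_string.split(":")
    values = sample_string.split(":")
    if len(keys) != len(values):
        raise ValueError("FORMAT and sample column have different field counts")
    return dict(zip(keys, values))


def is_high_confidence_hom_snp(sample):
    """True for a homozygous alternative call (1/1) that passed the filter (FI=1)."""
    return sample["GT"] == "1/1" and sample["FI"] == "1"


# The two strain columns from the example above.
strain_129 = parse_sample(
    FORMAT, "1/1:22:6:0.166667:152,22,0:137,18,0:2:36:6:0,0,6,0:0:-0.616816:.:1"
)
strain_cast = parse_sample(
    FORMAT, "1/1:15:4:0:79,15,0:67,12,0:2:24:4:0,0,4,0:0:-0.556411:.:0"
)

print(is_high_confidence_hom_snp(strain_129))   # 129: 1/1 with FI=1
print(is_high_confidence_hom_snp(strain_cast))  # Cast: 1/1 but FI=0
```

With these inputs the 129 call is accepted and the Cast call is rejected, even though both genotypes are 1/1.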

For dual hybrid genomes, however, such positions might be a problem: when the 129 and Cast SNP lists are compared with each other, it looks as if there is a SNP between 129 and Cast, even though there was evidence that the genotype was the same (1/1) in both 129 and Cast; it simply did not pass the threshold to count as a high confidence SNP in Cast.

As a solution, can we change the SNPsplit genome preparation to store the FI value as well as the GT genotype, and only use a position for the dual-hybrid SNP list if it was measured with high confidence (i.e. FI=1) in both strains? Thanks to @nservant for helpful discussions in this regard.
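To make the concern concrete, here is a small sketch of how comparing two per-strain SNP lists can produce a spurious strain-specific SNP, and how requiring FI=1 in both strains removes it. The positions and calls are invented, the function names are hypothetical, and the sketch assumes for simplicity that both strains share the same alternative allele at a position; it is not how SNPsplit is implemented.

```python
# Hypothetical per-position calls for each strain: position -> (GT, FI).
calls_129 = {
    ("chr1", 1000): ("1/1", "1"),
    ("chr1", 2000): ("1/1", "1"),
}
calls_cast = {
    ("chr1", 1000): ("1/1", "0"),  # same genotype as 129, but failed the filter
    ("chr1", 2000): ("0/0", "1"),  # confidently reference in Cast
}


def single_strain_snps(calls):
    """Positions a single-hybrid run would accept: 1/1 and FI=1."""
    return {pos for pos, (gt, fi) in calls.items() if gt == "1/1" and fi == "1"}


def naive_dual_hybrid(calls_a, calls_b):
    """Positions present in exactly one of the two per-strain SNP lists."""
    return single_strain_snps(calls_a) ^ single_strain_snps(calls_b)


def high_confidence_dual_hybrid(calls_a, calls_b):
    """As above, but keep a position only if FI=1 in both strains."""
    kept = set()
    for pos in naive_dual_hybrid(calls_a, calls_b):
        fi_a = calls_a.get(pos, (None, None))[1]
        fi_b = calls_b.get(pos, (None, None))[1]
        if fi_a == "1" and fi_b == "1":
            kept.add(pos)
    return kept


print(sorted(naive_dual_hybrid(calls_129, calls_cast)))
print(sorted(high_confidence_dual_hybrid(calls_129, calls_cast)))
```

In the naive comparison both positions look like 129-specific SNPs; with the FI requirement only `("chr1", 2000)` remains.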


FelixKrueger · 2017-02-23T11:33:28Z

I have now tried to add functionality for the --dual_hybrid mode to identify positions where both genomes had homozygous SNPs compared to the reference but where one strain did not pass the high confidence filters. Instead of making this a new option, this is now the default behaviour, since I believe this is the right thing to do. Addressed in 210af81 and 1ab9048.

FelixKrueger · 2017-02-28T10:53:07Z

In addition to high confidence homozygous SNP positions we also see some cases of low confidence no-SNP positions, such as this one:
GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI
1/1:21:12:0:152,21,0:128,12,0:2:55:9:3,0,7,2:0:-0.662043:.:1
0/0:.:5:0:.,.,.:.,.,.:2:47:4:1,0,4,0:0:-0.556411:.:0

In line with only including high-confidence positions for the allele-specific analysis, I have now added an additional check so that both FI fields need to have passed the filter (i.e. FI=1), irrespective of the genotype (which may e.g. be 0/0, 0/1 or ./.). This addition requires some additional memory compared to the original version but will make the genome preparation more robust.
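A minimal sketch of this genotype-independent check, applied to the example just above, could look as follows. The helper names and the idea of keeping GT and FI of the first strain in a dictionary are assumptions made for illustration; the real changes are in the commits referenced below.

```python
FORMAT_KEYS = "GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI".split(":")


def gt_and_fi(sample_column):
    """Extract the genotype (GT) and filter flag (FI) from one strain column."""
    fields = dict(zip(FORMAT_KEYS, sample_column.split(":")))
    return fields["GT"], fields["FI"]


def passes_dual_hybrid_filter(stored_call, sample_column):
    """Require FI=1 in both strains, whatever the two genotypes are."""
    _, fi_first = stored_call
    _, fi_second = gt_and_fi(sample_column)
    return fi_first == "1" and fi_second == "1"


# First pass (assumed design): remember GT and FI of strain 1 for each position.
# In this sketch, these stored entries are what costs the extra memory.
position = ("chr1", 3000)  # made-up coordinate
first_strain = {
    position: gt_and_fi("1/1:21:12:0:152,21,0:128,12,0:2:55:9:3,0,7,2:0:-0.662043:.:1")
}

# Second pass: compare against the other strain's column at the same position.
second_column = "0/0:.:5:0:.,.,.:.,.,.:2:47:4:1,0,4,0:0:-0.556411:.:0"
print(passes_dual_hybrid_filter(first_strain[position], second_column))
```

Here the first strain is a confident 1/1 call, but the second strain's 0/0 call has FI=0, so the position is not used.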

Addressed in c9688d9 and 481a460.

FelixKrueger self-assigned this Feb 22, 2017

FelixKrueger added the enhancement label Feb 22, 2017

FelixKrueger closed this as completed Mar 29, 2017
