Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CreateSomaticPanelOfNormals: multiallelic sites wrongly added to PON despite --min-sample-count set to total input samples #8916

Open
juliawiggeshoff opened this issue Jul 17, 2024 · 0 comments

Comments

@juliawiggeshoff
Copy link

juliawiggeshoff commented Jul 17, 2024

Bug Report

Affected tool(s) or class(es)

CreateSomaticPanelOfNormals, argument --min-sample-count

Affected version(s)

  • 4.4.0.0

Description

I followed all the guidelines listed here to build a PON. This was done with 50 FFPE normals, WES. After calling variants for my matched samples using this panel, I noticed an important hotspot mutation on KRAS was wrongly removed from the VCF file. The same mutation had been reported for my samples in previous reports with targeted panels. That mutation was removed because the following coordinate was listed in the PON:

chr12 25245348 . C A,G . . BETA=1.00,1.00;FRACTION=1.00

I've read in other discussions that setting the fraction for multiallelic sites to 1.00 is a known bug from the tool. (@davidbenjamin I'm tagging you here because I think this might be your comment, sorry if not). This initially led me to believe that it would default to 1.00 despite the multiallelic site being found by let's say only 60% of my samples. For example, If I set --min-sample-count 30, i.e. 60% of my samples, I'd expect that multiallelic site to show up with a buggy fraction of 1.00 if it had been found by let's say 39 of my samples. However, this is not the case.

If I concatenate the input mutect2 calls for my fifty samples, I see that this genomic position is only listed by four of my samples. And with a very low AF, all below 5%.

Here are the mutations:

chr12   25245348        .       C       A       .       .       AS_SB_TABLE=69,70|3,3;DP=148;ECNT=2;MBQ=20,20;MFRL=117,100;MMQ=60,60;MPOS=27;POPAF=7.3;TLOD=7.65        GT:AD:AF:DP:F1R2:F2R1:FAD:PGT:PID:PS:SB      0|1:139,6:0.044:145:40,0:42,3:85,3:0|1:25245348_C_A:25245348:69,70,3,3
chr12   25245348        .       C       G       .       .       AS_SB_TABLE=94,101|2,2;DP=199;ECNT=3;MBQ=20,20;MFRL=108,118;MMQ=60,60;MPOS=39;POPAF=7.3;TLOD=3.94       GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:195,4:0.025:199:64,1:37,1:115,2:94,101,2,2
chr12   25245348        .       C       A       .       .       AS_SB_TABLE=95,101|3,3;DP=207;ECNT=2;MBQ=20,20;MFRL=116,91;MMQ=60,60;MPOS=26;POPAF=7.3;TLOD=7.08        GT:AD:AF:DP:F1R2:F2R1:FAD:PGT:PID:PS:SB      0|1:196,6:0.032:202:69,0:41,3:119,3:0|1:25245348_C_A:25245348:95,101,3,3
chr12   25245348        .       C       A       .       .       AS_SB_TABLE=64,69|2,3;DP=139;ECNT=1;MBQ=20,20;MFRL=86,94;MMQ=60,60;MPOS=21;POPAF=7.3;TLOD=6.28  GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:133,5:0.053:138:40,0:29,2:70,3:64,69,2,3

The guidelines in the documentation could include filtering the calls, which I could have done independently, but this wouldn't matter here because 4 samples displaying a mutation should never have been added to a panel that was created when requesting a minimum of 30, 45, or even 50 samples displaying a variant at the same site.

Steps to reproduce

gatk --java-options "-Xmx30g" Mutect2 \
        -R /ref/Homo_sapiens_assembly38.fasta \
        -I /bams/input/WES_Normal/${infile} \
        -max-mnp-distance 0 \
        -O /bams/output/${outfile}

gatk --java-options "-Xmx100g" GenomicsDBImport \
       -R /ref/Homo_sapiens_assembly38.fasta -L /mydir/S33266340_hg38_Regions.bed \
       --tmp-dir /scratch/ --genomicsdb-workspace-path ${RAMDISK}/PON_db_50_samples \
       --merge-input-intervals true \
       -V  /bams/output/sample1.vcf.gz -V  /bams/output/sample2.vcf.gz [....]

gatk --java-options "-Xmx10g" CreateSomaticPanelOfNormals \
       -R /ref/Homo_sapiens_assembly38.fasta \
       -V gendb://${RAMDISK}/PON_db_50_samples \
       --germline-resource /gnomad/gnomAD.r2.1.1.GRCh38.PASS.AC.AF.only.vcf.gz \
       --min-sample-count 50 \
       -O /mydir/output/variants_100percent_samples_PON_50_samples.vcf.gz

Am I missing something? Has this been fixed in more recent releases?

Expected behavior

For this chr12:25245348 it should not have been included at all in the PON, or only if I had set --min-sample-count to 4 and the fraction should have been 0.08 (don't know about the beta).

chr12 25245348 . C A,G . . BETA=?,?;FRACTION=0.08

Alternative method to create the panel

After I noticed the problem, I ran bcftools isec to get for all the files a stripped VCF where 90% of them had a mutation at the same position and at least some of the alternative alleles were the same

bcftools isec -n+45 --collapse some -p /mydir/output/isec ${vcfs}

This shows me sites that are actually shared by 90% of my samples. Can't I just get chromosome, start, and end from one of these files, set INFO to "." and use this as the panel of normals with mutect2? Unfortunately, this seems more accurate than using CreateSomaticPanelOfNormals, though I would have liked to.

Importantly, it seems like this only affects multiallelic sites. For the others, I manually checked some sites from the PON built with -min-sample-count 50 and they really were found in 100% of the samples.

Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant