Normal filtering in TIDDIT #1127
Replies: 5 comments 2 replies
-
Very nice write-up! A question about the 4 different kinds of variants. Is the normal always from buffy coats? In that case, would one expect to see fewer of this kind when the tumor comes from a blood sample than when it comes from, say, a solid tumor? What are the types of tumors in these 3 cases?
-
Current state of the filters:
normal_variant set if:
high_normal_af set if:
high_normal_af_fraction set if:
in_normal set if:
-
Background
In issue #1118 it was brought up that the number of SV calls in the final VCF has increased significantly since version 10, when TIDDIT was introduced. The increase is primarily due to the lack of filtering of normal variants in the merged tumor + normal VCF.
PR #1120 attempts to solve this by implementing filters in bcftools that filter variants based on their presence in the normal.
Before implementing the filtering, I wanted to report its results and allow for feedback on them.
This filtering is only relevant for T/N WGS cases. I selected 3 at random to look at (A, B and C), each with about 13k variants in the merged VCF from TIDDIT without any filters applied. (See original caseIDs in private CG google drive doc: https://docs.google.com/document/d/1dP1mM-LkuvZUOA6cripXC5I089V-M8knEw26W5hBMEs/edit)
Results
Out of the 40117 variants examined, 2 had allele frequencies above 1 in the normal, which I would not expect to be possible. This gave me some concern about the viability of calculating allele frequencies from TIDDIT.
For downstream plotting and reporting these 2 variants are excluded.
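The exclusion step described above could be sketched roughly as follows. This is only an illustration: the `Variant` record, its field names and the toy values are hypothetical and are not taken from the PR or from TIDDIT output.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    # Hypothetical record; field names are assumptions for illustration.
    variant_id: str
    af_tumor: float
    af_normal: float


def split_by_valid_af(variants):
    """Separate variants whose allele frequencies lie in [0, 1] from the rest."""
    valid, suspicious = [], []
    for v in variants:
        if 0.0 <= v.af_tumor <= 1.0 and 0.0 <= v.af_normal <= 1.0:
            valid.append(v)
        else:
            suspicious.append(v)
    return valid, suspicious


# Toy data, not real results.
variants = [
    Variant("sv1", 0.40, 0.00),
    Variant("sv2", 0.35, 1.20),  # AF_N above 1: set aside
    Variant("sv3", 0.50, 0.45),
]
valid, suspicious = split_by_valid_af(variants)
```

Keeping the suspicious variants in a separate list, rather than dropping them silently, makes it easy to report how many were set aside.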
Below is a scatterplot of all variants in all 3 samples, based on allele frequency in normal (Y) and tumor (X).
After applying filters in the PR:
Reasoning for the filter set in this way:
I had preferred to implement a filter where the allowed frequency in the normal was based on a maximum tumor contamination level of 0.1. Such that:
Like this:
However, bcftools does not seem capable of implementing a filter that depends on multiple samples within the same VCF in this way, so I was forced to rely on a fixed value. I chose a space of allele frequencies that would allow for tumor contamination between 10% and 25%.
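For illustration, one possible reading of the contamination-based rule is that a variant counts as present in the normal only when AF_N is higher than the assumed maximum contamination times AF_T. The sketch below expresses that reading outside bcftools; the function name, the parameter names and the exact form of the inequality are assumptions, not the filter used in the PR.

```python
def exceeds_contamination(af_tumor: float, af_normal: float,
                          max_contamination: float = 0.1) -> bool:
    """Return True if AF_N is higher than contamination alone could explain.

    Assumed rule (hypothetical): the normal may carry at most
    max_contamination * AF_T of a truly somatic variant.
    """
    return af_normal > max_contamination * af_tumor


# Toy values, not real results.
low_normal = exceeds_contamination(af_tumor=0.5, af_normal=0.02)   # within the allowance
high_normal = exceeds_contamination(af_tumor=0.5, af_normal=0.20)  # above the allowance
```

Because the allowance scales with AF_T, a low-frequency tumor variant tolerates almost no signal in the normal, which a fixed threshold cannot reproduce.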
There are 4 groups of variants in this merged vcf based on AF_T and AF_N, and I interpret them like this:
Contig variants
The overlapping variants are a bit difficult to see in these plots, and there are some interesting ones left to mention.
While it appears from the plots above that all variants with AF_N = 0 have PASS, there are variants with the filter "in_normal" that have an allele frequency of 0 in the normal sample, yet they have been called in the normal.
In fact, 3311 of the 40115 variants fit that description. The same is true for the tumor, where 2499 variants have an AF_T of 0 and yet are still called as PASS.
This seems to be related to how TIDDIT reports certain variants that have been successfully assembled into a contig, where the stats are sometimes lost.
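Counts like the two above could be produced with a small tally such as the one below. It is a minimal sketch: the dictionary keys (`filters`, `af_tumor`, `af_normal`) and the toy records are hypothetical stand-ins for whatever is parsed from the merged VCF.

```python
def count_zero_af(records):
    """Count in_normal variants with AF_N == 0 and PASS variants with AF_T == 0."""
    in_normal_zero_af_n = sum(
        1 for r in records
        if "in_normal" in r["filters"] and r["af_normal"] == 0
    )
    pass_zero_af_t = sum(
        1 for r in records
        if r["filters"] == ["PASS"] and r["af_tumor"] == 0
    )
    return in_normal_zero_af_n, pass_zero_af_t


# Toy records, not real results.
records = [
    {"filters": ["in_normal"], "af_tumor": 0.3, "af_normal": 0.0},
    {"filters": ["in_normal"], "af_tumor": 0.4, "af_normal": 0.5},
    {"filters": ["PASS"], "af_tumor": 0.0, "af_normal": 0.0},
    {"filters": ["PASS"], "af_tumor": 0.6, "af_normal": 0.0},
]
n_in_normal_zero, n_pass_zero = count_zero_af(records)
```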
In the scatterplot below I have included only variants with the filter "in_normal", and the color marks whether they have a contig in the normal or not.
In addition I can confirm that:
I discussed these variants with Jesper, the creator of TIDDIT, last week. He said it was unlikely that a variant could have an assembled contig without significant read support, and that on that basis it would be safe to remove them as normal variants.
Some of these variants are still rescued in the contamination filtering. In the scatterplot below only variants with a contig in the normal are shown; the variants marked in orange will be rescued:
Discussion
Is there a significant risk that I'm filtering away true somatic variants that should be taken into account here?
Risk factors identified:
In general I think this is a safe filter to implement to reduce the amount of noise. At least in these cases there does not appear to be much tumor-in-normal contamination (TINC); if there were significant TINC, I would expect a less clear separation between the shared and unique variants.