CreateSomaticPanelOfNormals output PoN has much less variants in 4.1.8.0 than before #6744

Closed
yl-h opened this issue Aug 7, 2020 · 42 comments · Fixed by #6871

Comments

@yl-h

yl-h commented Aug 7, 2020

Bug Report

Affected tool

CreateSomaticPanelOfNormals

Affected version

Tested on version 4.1.8.0 (likely commit 3e921c6, GenomicsDB 1.3.0 #6654)

Description

A panel of normals generated with version 4.1.8.0 has ~28% fewer records (~52% fewer ALT alleles) than one created with 4.1.7.0 (tested at commit 9cc92e3), with all input data and arguments unchanged. The GenomicsDB version does not seem to matter: the PoN created by running CreateSomaticPanelOfNormals on 4.1.8.0 is about the same regardless of whether GenomicsDBImport was run on 4.1.7.0 or 4.1.8.0. CreateSomaticPanelOfNormals on 4.1.7.0 fails to run on the new GenomicsDBs.

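| Mutect2 | GenomicsDB | CreateSomaticPanelOfNormals | Output |
|---------|------------|-----------------------------|--------|
| 4.1.7.0 | 4.1.7.0 | 4.1.7.0 | 100% alleles (reference) |
| 4.1.8.0 | 4.1.8.0 | 4.1.7.0 | expected error |
| 4.1.7.0 | 4.1.7.0 | 4.1.8.0 | 48% alleles |
| 4.1.8.0 | 4.1.8.0 | 4.1.8.0 | 48% alleles |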

Steps to reproduce

The PoN was created with GRCh38, scattered over chromosomes. Mutect command:

$gatk/gatk --java-options "-Xmx4G" Mutect2 \
	-R $reference -L $chr \
	-I $bam --max-mnp-distance 0 \
	-O @out1@

GenomicsDBImport command:

$gatk/gatk --java-options "-Xms8G -Xmx8G" GenomicsDBImport \
	-R $reference -L $chr \
	--sample-name-map ${inputGenomicsDB.out1} \
	--genomicsdb-workspace-path @folder1@

CreateSomaticPanelOfNormals:

$gatk/gatk --java-options "-Xms8G" CreateSomaticPanelOfNormals \
	-R $reference -V gendb://@folder1@ -O @out1@ \
	--germline-resource $gnomad \
	--max-germline-probability 0.5

Expected behavior

Based on the description of the GenomicsDB 1.3.0 update, CreateSomaticPanelOfNormals is expected to behave in 4.1.8.0 as it did before, with the output PoN containing a similar number of variants.

Actual behavior

28% of PoN records (52% of ALT alleles) are missing in 4.1.8.0 compared to 4.1.7.0. Although all spanning deletions are dropped in the new version, they account for only a small portion of the loss (6.8% / 52% missing ALT alleles).

@droazen
Contributor

droazen commented Aug 7, 2020

@nalinigans @mlathara and/or @fleharty care to comment on this one?

@nalinigans
Collaborator

Most of the changes in GenomicsDB 1.3.0 were import and performance related. However, #6500 did add support for MNPs (not sure if that played a part here), and the default streaming of data from GenomicsDB is now VCFCodec, whereas it was BCFCodec before. @mlathara, can you think of anything else here?

@mlathara
Contributor

mlathara commented Aug 10, 2020

@yl-h One thing that might help narrow down where the issue is a bit...would it be possible to compare the size of the total vcfs between the 4.1.7.0 tool version and the versions where number of records+alleles decreases significantly?

Given the significant difference you see, I would expect that if the issue is with Mutect2, then the total size of the vcfs should also decrease significantly. If the issue is with GenomicsDBImport, the size of the vcfs should be roughly comparable.

edit: never mind, @nalinigans pointed out that you see this issue with either version of Mutect2/GenomicsDBImport, so the issue is likely on the query end after GenomicsDBImport.

@nalinigans
Collaborator

@yl-h I have created a new branch genomicsdb_6744 that exposes GenomicsDBArgumentCollection to CreateSomaticPanelOfNormals. Can you please run the following to help us narrow down the issue?

  1. The default for GenomicsDB exports/queries changed from BCFCodec streaming in 4.1.7.0 to VCFCodec in 4.1.8.0. Run gatk CreateSomaticPanelOfNormals with --genomicsdb-use-bcf-codec true to override this default. If the resulting PoN is still missing variants, can you also run (2)?
  2. Run gatk SelectVariants -O out.vcf -V gendb://... on a small region with this branch to verify that the number of variants is the same as in 4.1.7.0. If not, would you be able to distill and post any line that is now missing?

@yl-h
Author

yl-h commented Aug 12, 2020

@nalinigans I tested (1) using your branch, i.e. running CreateSomaticPanelOfNormals as above but with --genomicsdb-use-bcf-codec true on the existing 4.1.8.0 GenomicsDBs built from 4.1.8.0 Mutect2 VCFs.

I got ~100% records and alleles compared to 4.1.7.0 with some change probably due to minor changes in Mutect2 between versions (allele overlap is about 99.6%).

As for using 4.1.7.0 M2 + genomicsDB (still running) as input, the PoN output was identical (all records/variants retained) to one created using 4.1.7.0.

As you suggested, loss of PoN variants in 4.1.8.0 appears to be caused by the change to VCFCodec.

@nalinigans
Collaborator

Thanks @yl-h for following up. Would it be possible to include an example vcf line that is not showing up with VCFCodec? That will help us debug this issue further.

@yl-h
Author

yl-h commented Aug 13, 2020

Sure thing @nalinigans.

I compared the PoNs created with 4.1.7.0 and 4.1.8.0 using 4.1.7.0 M2 and genomicsDB with

bcftools isec -c none -n=1 -w1 pon.7.7.7.vcf.gz pon.7.7.8.vcf.gz |
	bcftools view -H |
	head -1

giving

chr1	10622	.	T	G,*	.	.	BETA=7,1;FRACTION=0.075

Actually, I just noticed that all of the missing records are multiallelic. Some of the retained records (0.8% of all 4.1.7.0 records) are also multiallelic, however.

@nalinigans
Collaborator

nalinigans commented Aug 14, 2020

@yl-h, thanks for the information.

I have been able to reproduce the dropped variant with VCFCodec using two separate gvcf samples with one overlapping multi-allelic interval e.g.
Sample 1:

20	38245276	.	A	G,GG	.	str_contraction	CONTQ=93;DP=20;ECNT=1;GERMQ=14,10;MBQ=29,29,30;MFRL=342,348,338;MMQ=60,60,60;MPOS=3,29;POPAF=7.30,7.30;RPA=11,10,12;RU=AC;SAAF=0.485,0.525,0.533;SAPP=0.031,0.020,0.948;STR;TLOD=4.48,21.16	GT:AD:AF:DP:F1R2:F2R1	0/1/2:7,2,8:0.148,0.450:17:3,0,4:4,2,4

and Sample 2:

20	38245275	.	TAC	T,TACAC	.	str_contraction	CONTQ=93;DP=20;ECNT=1;GERMQ=14,10;MBQ=29,29,30;MFRL=342,348,338;MMQ=60,60,60;MPOS=3,29;POPAF=7.30,7.30;RPA=11,10,12;RU=AC;SAAF=0.485,0.525,0.533;SAPP=0.031,0.020,0.948;STR;TLOD=4.48,21.16	GT:AD:AF:DP:F1R2:F2R1	0/1/2:7,2,8:0.148,0.450:17:3,0,4:4,2,4

Running SelectVariants after GenomicsDBImport with VCFCodec has these lines in the output. Notice that the AD format field is dropped for the start position 20:38245276.

20	38245275	.	TAC	T,TACAC	.	.	DP=20	GT:AD:AF:DP:F1R2:F2R1	.	././.:7,2,8:0.148,0.45:17:3,0,4:4,2,4
20	38245276	.	A	G,GG,*	.	.	DP=40	GT:AF:DP:F1R2:F2R1	././.:0.148,0.45,.:17:3,0,4,.:4,2,4,.	././.:.,.,0.148:17:3,.,.,0:4,.,.,2

That is different from what we get from running SelectVariants with BCFCodec. The GenomicsDB workspace is the same in both these cases.

20	38245275	.	TAC	T,TACAC	.	.	DP=20	GT:AD:AF:DP:F1R2:F2R1	.	././.:7,2,8:0.148,0.450:17:3,0,4:4,2,4
20	38245276	.	A	G,GG,*	.	.	DP=40	GT:AD:AF:DP:F1R2:F2R1	././.:7,2,8:0.148,0.450:17:3,0,4:4,2,4	././.:7:0.148:17:3,0:4,2

@lbergelson, @droazen, any idea why VCFCodec is behaving differently here?

@nalinigans
Collaborator

FWIW htslib seems to handle 20:38245276 correctly. The '.' values are correctly inserted since the samples do not contain the corresponding ALT allele in the merged ALT allele list. Note that neither sample contains a <NON_REF> allele.

20      38245276        .       A       G,GG,*  .       .       DP=40   GT:AD:AF:F1R2:F2R1:DP   0/1/2:7,2,8,.:0.148,0.45,.:3,0,4,.:4,2,4,.:17   0/3/.:7,.,.,2:.,.,0.148:3,.,.,0:4,.,.,2:17
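The padding shown in the htslib line can be illustrated with a small sketch. This is not GenomicsDB or htslib code: the class `AlleleRemap` and the method `remapR` are hypothetical names, and the per-sample allele lists are a simplification of the example site. The sketch remaps a sample's per-allele values (one per REF and ALT allele, as for AD) onto the merged allele list and writes '.' for every allele the sample never carried.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of GenomicsDB, htslib, or htsjdk.
final class AlleleRemap {
    static final String MISSING = ".";

    /**
     * Remaps one value per sample allele onto the merged allele order.
     * Alleles that the sample does not carry are filled with ".".
     */
    static String remapR(List<String> mergedAlleles,
                         List<String> sampleAlleles,
                         List<String> sampleValues) {
        if (sampleAlleles.size() != sampleValues.size()) {
            throw new IllegalArgumentException("one value per sample allele is required");
        }
        String[] out = new String[mergedAlleles.size()];
        Arrays.fill(out, MISSING);
        for (int i = 0; i < sampleAlleles.size(); i++) {
            int target = mergedAlleles.indexOf(sampleAlleles.get(i));
            if (target < 0) {
                throw new IllegalArgumentException("allele not in merged list: " + sampleAlleles.get(i));
            }
            out[target] = sampleValues.get(i);
        }
        return String.join(",", out);
    }
}
```

With the merged alleles A, G, GG, * from the site above, a sample carrying A, G, GG with values 7, 2, 8 maps to `7,2,8,.`, and a sample carrying only A and * with values 7, 2 maps to `7,.,.,2`, which are the AD strings in the htslib output.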

Discussing with @kgururaj, he mentions that the VCFCodec in htsjdk uses AbstractVCFCodec:decodeInts which doesn't like '.' and hence drops the full AD field. BCF2Codec on the other hand uses BCF2Decoder:decodeIntArray - it seems to drop elements after the first missing value is seen - htsjdk is not dealing with missing and vector end characters separately (in conflict with the BCF2 spec).
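To make the two failure modes concrete, here is a toy model of the behaviours described in this comment. It is not htsjdk code: `strictDecode` and `truncatingDecode` are hypothetical stand-ins for what the thread attributes to `AbstractVCFCodec.decodeInts` and `BCF2Decoder.decodeIntArray`, and they operate on plain comma-separated strings rather than on real VCF or BCF records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the htsjdk implementation.
final class MissingValueDecoding {

    /** Models the VCF path: any "." token invalidates the whole field (returns null). */
    static int[] strictDecode(String field) {
        String[] tokens = field.split(",");
        int[] values = new int[tokens.length];
        try {
            for (int i = 0; i < tokens.length; i++) {
                values[i] = Integer.parseInt(tokens[i]);
            }
        } catch (NumberFormatException e) {
            return null; // the entire field is dropped
        }
        return values;
    }

    /** Models the BCF path: values are kept only up to the first missing element. */
    static int[] truncatingDecode(String field) {
        List<Integer> kept = new ArrayList<>();
        for (String token : field.split(",")) {
            if (token.equals(".")) {
                break; // everything from the first "." onward is lost
            }
            kept.add(Integer.parseInt(token));
        }
        return kept.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

Applied to the AD strings `7,2,8,.` and `7,.,.,2`, the strict model returns nothing for either sample, while the truncating model returns 7,2,8 and 7. That is consistent with the SelectVariants outputs posted earlier for the VCFCodec and BCFCodec runs.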

@nalinigans
Collaborator

@lbergelson, should I open an issue with htsjdk/codecs?

@kgururaj
Collaborator

Looks like these issues are known - see samtools/htsjdk#340 , samtools/htsjdk#961 and samtools/hts-specs#232.

@kgururaj
Collaborator

kgururaj commented Aug 18, 2020

@yl-h just to confirm, your input VCFs (input to GenomicsDBImport) don't have the <NON_REF> allele (primarily for the multi allelic sites), right?

@yl-h
Author

yl-h commented Aug 18, 2020

@kgururaj Correct, there are no <NON_REF> alleles in the VCFs given to GenomicsDBImport in our PoN pipeline, as expected for Mutect2.

@droazen
Contributor

droazen commented Aug 18, 2020

@nalinigans After looking into this issue a bit more, I have some additional comments/questions:

  1. The problematic method AbstractVCFCodec.decodeInts() is only called for the AD and PL attributes, as far as I can tell.

  2. We don't expect AD or PL to contain '.'s in practice, and GATK can't handle this. For AD, 0 should be used instead.

  3. Where are the '.'s in the AD attribute actually coming from in the user's pipeline (Mutect2 -> GenomicsDB -> CreateSomaticPanelOfNormals)? Are they being added internally in GenomicsDB due to its reliance on htslib? If so, can GenomicsDB internally translate '.' into 0 for the AD field when interfacing with htslib?

@nalinigans
Collaborator

nalinigans commented Aug 18, 2020

We don't expect AD or PL to contain '.'s in practice, and GATK can't handle this. For AD, 0 should be used instead.

0 seems to be a valid value for both AD and PL attributes, whereas . is a missing value. How will that be distinguished by the Codecs in htsjdk?

Where are the '.'s in the AD attribute actually coming from in the user's pipeline (Mutect2 -> GenomicsDB -> CreateSomaticPanelOfNormals)? Are they being added internally in GenomicsDB due to its reliance on htslib?

They are from the GenomicsDB query stream to CreateSomaticPanelOfNormals. FWIW: both BCF2Codec (drops all elements after encountering a missing value) and VCFCodec (drops the entire format field when it encounters any missing value) are problematic in different ways.

@droazen
Contributor

droazen commented Aug 19, 2020

@nalinigans For AD, I don't think we care about the distinction between '.' and 0. If AD is the only problematic field, and we're not seeing any issues with PL or any other attribute, then I'd advocate for a simple '.' -> 0 translation (for AD only!) within GenomicsDB, if such a thing is possible. @ldgauthier do you agree?

@ldgauthier
Contributor

Sounds good to me.

@kgururaj
Collaborator

kgururaj commented Aug 19, 2020

@nalinigans For AD, I don't think we care about the distinction between '.' and 0. If AD is the only problematic field, and we're not seeing any issues with PL or any other attribute, then I'd advocate for a simple '.' -> 0 translation (for AD only!) within GenomicsDB, if such a thing is possible. @ldgauthier do you agree?

Missing values will appear in all fields whose lengths are of type A, R or G (so the PL field also). I'm assuming that missing values in allele-specific annotation fields are handled gracefully by GATK.
As you likely gathered from the previous comments, the primary issue is that for multi-allelic sites a given sample may have only one allele - so, the PL, AD values corresponding to the "missing" alleles for that sample are missing (no <NON_REF> alleles exist). I think I followed the convention that bcftools merge used.
We may be able to replace missing values with 'quiet' values (such as 0) - but need some guidance on what makes sense.

@droazen
Contributor

droazen commented Aug 19, 2020

Why do you say "no <NON_REF> alleles exist" @kgururaj ? In a GVCF, NON_REF should always be present, including at multi-allelic sites.

@droazen
Contributor

droazen commented Aug 19, 2020

And if the inputs here are VCFs and not GVCFs, then that is a problem. We've never supported combining regular VCFs using GenomicsDB, have we @ldgauthier ?

@mlathara
Contributor

So, the user is running the workflow as outlined here, which would mean they are using VCFs, not GVCFs, right?

Does that workflow need to be altered? It seems that Mutect2 does have a (beta) feature that would support emitting GVCFs - should that be used here?

@kgururaj
Collaborator

Why do you say "no <NON_REF> alleles exist" @kgururaj ? In a GVCF, NON_REF should always be present, including at multi-allelic sites.

David, can you take a quick look at the first comment in the issue by @yl-h ? Looks like the output of Mutect2 enters GenomicsDB.

@nalinigans
Collaborator

Yes, @yl-h seems to have followed what was described in CreateSomaticPanelOfNormals Tools Doc. Perhaps the documentation should be changed to include -ERC=GVCF as @mlathara suggests like in consolidating GVCFs for joint calling?

@droazen
Contributor

droazen commented Aug 19, 2020

@nalinigans @mlathara @kgururaj Yes, we're having an internal discussion now about whether this panel creation workflow needs to be altered. I think the crux of the problem is that we're trying to combine VCFs in GenomicsDB, which is not something we've ever claimed to support. Tagging @fleharty @davidbenjamin and @ldgauthier

@davidbenjamin
Contributor

The solution is definitely not to run Mutect2 in GVCF mode. It's too different from VCF mode and has a big performance cost.

@droazen
Contributor

droazen commented Aug 20, 2020

@davidbenjamin @fleharty In that case, I see two remaining options:

  • Use a different tool to combine the VCFs that is explicitly designed for VCF data, such as CombineVariants or bcftools merge

  • Modify the panel creation pipeline to explicitly use the BCF codec with GenomicsDB (this was the pre-4.1.8.0 behavior). This may have all kinds of hidden issues that we didn't know about before, however, due to the BCF codec truncating annotation values at the first '.', so if we go with this option we should investigate whether it was in fact broken all along at sites like the ones discussed in this ticket.

@lbergelson
Member

@droazen I think the BCF code is broken here too. The problem is fundamental to htsjdk.

CombineVariants almost certainly has the same or similar problems, because the issue is fundamental to combining VCFs and to the fact that htsjdk doesn't handle partially empty lists. Bcftools likely has similar issues, or loading the correct output from bcftools will recreate the issue.

What about fixing the combine operation so it can substitute default missing values with a per attribute configuration for what value to substitute?

@droazen
Contributor

droazen commented Aug 20, 2020

@lbergelson We're looking for a pragmatic fix that can go out in the next GATK release (this month!).

This is our only pipeline (that I know of...) that uses GenomicsDB in an unsupported way to combine VCFs rather than GVCFs. If this pipeline produced reasonable results with the flawed BCF codec behavior, then it might make sense to revert to that for now -- e.g., maybe it doesn't rely on the annotation values that get truncated by BCF in any meaningful way (@davidbenjamin and/or @fleharty can hopefully chime in on this point).

In the future we could consider making the more comprehensive changes needed to be able to claim that we support combination of VCFs in GenomicsDB, but this would have to be a project for a future quarter, as it's going to involve a significant amount of development work.

@ldgauthier
Contributor

I don't want to use CombineVariants. A) It's got some hinky combine behavior for certain attributes, B) I still don't want to port it to GATK4, and C) it's as slow as cold molasses. I was stapling together about 200 CNV VCFs, which are on the order of 100 variants each. CombineVariants took 11 hours and bcftools was so fast I couldn't push the button on my stopwatch (which is 17 minutes in the cloud with localizing and pulling Docker images and everything). 11 hours is unacceptable.

I'm in favor of specifying the BCF codec for the PoN workflow AND adding a small test to Travis -- it can honestly just be like 10 variants with two multi-allelics. We should test the PoN WDL anyway, which I don't think we do anywhere right now (i.e. not in Terra either.)

@droazen
Contributor

droazen commented Aug 24, 2020

@nalinigans pointed out that even if we do switch this pipeline back to the BCF codec, it's still likely to encounter errors with 64-bit values due to some recent changes in GenomicsDB, which now throws an exception for these values instead of silently truncating them as it did in the past. So just switching to the BCF codec in the WDL might not be enough.

A couple of other options we discussed today for a quick fix for this issue:

  • @ldgauthier suggested that we change the WDL to use an older (pre-4.1.8.0) version of GATK that is known to work well with this pipeline, which would be an easy change. The only problem is that users might run CreateSomaticPanelOfNormals in the latest GATK independently of the WDL anyway and continue to run into this issue.

  • @nalinigans suggested that we patch CreateSomaticPanelOfNormals so that it does something sensible when the AD annotation is completely missing. Since AD was already likely being truncated by the BCF codec in previous versions of this pipeline, its value was never particularly trustworthy to begin with.

Whatever option we go with, we'll need to add a good regression test for the PoN workflow that would have caught this issue. Longer term, we'll plan on developing a fix at the HTSJDK level for the way missing values are handled for the AD and PL annotations in the VCF codec (@lbergelson is currently looking into how this could be done).

@fleharty What are your thoughts? Is there someone on the M2 team who could work on a hotfix for the next GATK release?

@fleharty
Contributor

@droazen
The option suggested by @ldgauthier is straightforward, but I'm uncomfortable releasing a fix that requires two versions of GATK.

I am up for making the change suggested by @nalinigans, but do we believe this is sufficient to resolve this issue?
I'm not confident in my ability to make this change, given the other things I have on my plate, in the time frame required here. I will ask @davidbenjamin; he may be able to, but I need to double check with him.

@droazen
Contributor

droazen commented Aug 24, 2020

@fleharty If you can come up with a way to handle genotypes with missing AD in the tool (that does not involve dropping them completely) I believe that would solve the problem, yes. The following code snippet from the tool shows the current behavior, where genotypes get filtered out if AD is missing:

        final List<Genotype> variantGenotypes = vc.getGenotypes().stream()
                .filter(g -> hasArtifact(g, germlineAF)).collect(Collectors.toList());

    private final boolean hasArtifact(final Genotype g, final double populationAlleleFrequency) {
        final int altCount = altCount(g);
        if (altCount == 0) {
            return false;
        }
        final int totalCount = (int) MathUtils.sum(g.getAD());

        return germlineProbability(populationAlleleFrequency, altCount, totalCount) < maxGermlineProbability;
    }

    private static final int altCount(final Genotype g) {
        return g.hasAD() ? (int) MathUtils.sum(g.getAD()) - g.getAD()[0] : 0;
    }

AD will be missing completely if there were any missing values present after combination, due to the issues in HTSJDK discussed above. These missing values are there only because the pipeline is combining VCFs rather than GVCFs -- with GVCFs, you have the NON_REF allele and can fill in these missing values.

Back when GenomicsDB used the BCF codec by default, the AD value would get truncated at the first missing value, instead of completely removed as it does with the VCF codec. This suggests that the AD values the tool was seeing were never correct in the first place, and the tool should probably be relying on a different attribute. Only AD and PL are affected by this HTSJDK issue. Is there another attribute you could use instead of AD in CreateSomaticPanelOfNormals?

Let me know what @davidbenjamin says

@davidbenjamin
Contributor

AD is very important because it's what the tool uses to decide if an allele is germline (and should be excluded from the panel) or an artifact (and should be kept in the panel). If it's missing we could instead rely on the GERMQ annotation emitted by FilterMutectCalls or the AF from Mutect2, but would those be combined properly by GenomicsDB?

@nalinigans
Collaborator

nalinigans commented Aug 25, 2020

@davidbenjamin, looks like you may be able to use AF when AD is missing at least for VCFCodec. Check the output of SelectVariants in the comment above, looks like AF is being correctly handled by VCFCodec, whereas BCFCodec still suffers from all elements getting dropped after encountering the missing value .

As for using the GERMQ annotation, no combination is specified to GenomicsDB, so it currently gets dropped from the INFO fields. If you want to specify the combination (choose from none, sum, mean, median, element_wise_sum, concatenate or histogram_sum), see examples in GenomicsDBUtils.java to extend this to GERMQ. Also see outstanding PR #6514 for adding reasonable combination defaults for known info fields.

@fleharty
Contributor

@nalinigans
David @davidbenjamin and I have discussed the option of using AF when AD is missing. The change to the code should be relatively straightforward, so we are investigating whether or not this will work.
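One way the AF fallback under discussion could look is sketched here. This is an illustration under stated assumptions, not the change that was eventually merged: `AltCountFallback` is a hypothetical class, plain arrays stand in for htsjdk's `Genotype`, missing AF entries are represented as `NaN`, and estimating the alt count as the rounded product of summed AF and DP is an assumption made for this example.

```java
// Hypothetical illustration only; not the CreateSomaticPanelOfNormals implementation.
final class AltCountFallback {

    /**
     * Returns the alt read count from AD when it is available. Otherwise it
     * estimates the count from AF and DP (assumption: round(sum(AF) * DP)).
     *
     * @param ad allele depths with REF first, or null if the codec dropped the field
     * @param af per-ALT allele fractions, with NaN for alleles the sample lacks, or null
     * @param dp total read depth for the genotype
     */
    static int altCount(int[] ad, double[] af, int dp) {
        if (ad != null && ad.length > 0) {
            int total = 0;
            for (int count : ad) {
                total += count;
            }
            return total - ad[0]; // same arithmetic as the tool's AD-based altCount
        }
        if (af == null) {
            return 0; // nothing to estimate from
        }
        double altFraction = 0.0;
        for (double fraction : af) {
            if (!Double.isNaN(fraction)) {
                altFraction += fraction;
            }
        }
        return (int) Math.round(altFraction * dp);
    }
}
```

For the first sample in the VCFCodec output above (AF 0.148,0.45,. with DP 17), this estimate gives 10 alt reads, which equals the 2 + 8 alt reads in that sample's original AD of 7,2,8. Whether such an estimate is accurate enough for the germline test is exactly what the comment says is being investigated.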

@droazen
Contributor

droazen commented Aug 26, 2020

@fleharty Great, let us know whether it turns out to be viable!

As a fallback option, if AF doesn't work out, it might be possible to get a fix for just the AD behavior into HTSJDK within a few weeks. My reading of the code suggests that it could be relatively painless, especially if we don't care about the distinction between '.' and 0 for this annotation. A fix for PL, if it's needed, would be more difficult.

@fleharty
Contributor

fleharty commented Aug 27, 2020

@droazen
Ideally we would like to get AD to work properly in HTSJDK. Is it the code in https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/vcf/AbstractVCFCodec.java#L849 that needs to be modified?

Specifically:
gb.AD(decodeInts(genotypeValues.get(i)))

@droazen
Contributor

droazen commented Aug 27, 2020

@fleharty It's line 810 in that class (https://github.com/samtools/htsjdk/blob/f15bc9d2c0297a1bde6b89aa95cf2dc45dfc567f/src/main/java/htsjdk/variant/vcf/AbstractVCFCodec.java#L810). We need to switch from calling decodeInts() to calling a method that tolerates and preserves missing values. A decision will need to be made about whether, for AD specifically, missing values should be replaced with 0 (which @ldgauthier said she'd be ok with), or passed through to the caller as '.' or null. If we choose to propagate the missing values back to the caller, we may need to do downstream work in GATK/Picard to modify tools to handle them, and also modify the HTSJDK accessor for the AD field to return list of Integer instead of array of int. If we replace the missing values with 0, we likely wouldn't have to patch any downstream code at all.
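The two policies described here can be compared in a short sketch. Neither method exists in htsjdk: `decodeMissingAsZero` and `decodePreservingMissing` are hypothetical names, and both work on a plain comma-separated AD string rather than on htsjdk's internal representation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not an htsjdk API.
final class TolerantAdDecoding {

    /** Policy 1: replace each "." with 0, so callers still receive an int[]. */
    static int[] decodeMissingAsZero(String field) {
        String[] tokens = field.split(",");
        int[] values = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = tokens[i].equals(".") ? 0 : Integer.parseInt(tokens[i]);
        }
        return values;
    }

    /** Policy 2: keep each "." as null, which forces a List<Integer> return type. */
    static List<Integer> decodePreservingMissing(String field) {
        List<Integer> values = new ArrayList<>();
        for (String token : field.split(",")) {
            values.add(token.equals(".") ? null : Integer.valueOf(token));
        }
        return values;
    }
}
```

For the AD string `7,.,.,2`, the first policy yields 7,0,0,2 and keeps the existing array-of-int shape, while the second yields 7,null,null,2 and pushes null handling onto every caller. That difference is the downstream cost the comment points out.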

@fleharty
Contributor

fleharty commented Sep 9, 2020

@davidbenjamin Has there been any progress on this?

@yl-h
Author

yl-h commented Oct 13, 2020

@davidbenjamin Thanks for the workaround in the 4.1.9.0 release!

I tested the updated CreateSomaticPanelOfNormals with genomicsDBs computed in 4.1.7.0 as above and it seems that the workaround recovers a lot of multiallelic variants that were already missing in 4.1.7.0.

Using the record and variant counts in 4.1.7.0 as 100% reference, I'm getting 57% more records (all multiallelic) or 142% more variants. No sites from 4.1.7.0 are missing in 4.1.9.0.

As a side note, all of the new records have FRACTION=1 and most (90%) have BETA=1,1;FRACTION=1. Among shared records, all multiallelic sites also have FRACTION=1 and almost always different beta parameter estimates compared to 4.1.7.0. As expected, biallelic sites are unchanged. As far as I understand, these annotations are irrelevant in deciding whether a site should be output or not, so this is not a concern.

@davidbenjamin
Contributor

@yl-h Thank you for the update! Although we write tests for bug fixes, there is nothing so reassuring as hearing from users. As you noted, the FRACTION and BETA annotations will require a more involved solution to compute properly, but fortunately they are experimental and not used in filtering.

@yl-h
Author

yl-h commented Oct 14, 2020

@davidbenjamin Sorry, I have a follow-up comment regarding the new sites in 4.1.9.0.

Around 79% of the new sites can be emitted by disabling the germline filter in 4.1.7.0. I had a look at the rest of records in genomicsDB using SelectVariants specifically in chromosome 1. Among those, ~27% have only one real ALT allele alongside a spanning deletion, with only one sample with positive/non-empty AF for the real ALT and at least one supporting the span del. Could you tell me if spanning deletions are supposed to contribute to the number of samples at a site for passing the sample count threshold? Some sites (~1%) only had one contributing sample but with all of the ALT alleles. Should such sites not pass the sample count threshold?

The remaining ~39% and ~33% are respectively sites with ALT alleles each supported by one exclusive sample and the rest.
