-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New HaplotypeCaller priors for variants sites and homRef blocks #4793
Conversation
@@ -190,10 +193,14 @@ public AssemblyRegionEvaluator assemblyRegionEvaluator() { | |||
|
|||
@Override | |||
public void onTraversalStart() { | |||
if(hcArgs.genotypeArgs.applyPriors && hcArgs.genotypeArgs.samplePloidy != HomoSapiensConstants.DEFAULT_PLOIDY) { | |||
throw new UserException("Priors can only be applied for diploid samples."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check should be moved to HaplotypeCallerEngine.validateAndInitializeArgs()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this moveable? It's gross having it in the tool itself rather than the HaplotypeCallerEngine
runCommandLine(args); | ||
|
||
// Test for an exact match against past results | ||
IntegrationTestSpec.assertEqualTextFiles(output, expected); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you think of a way to construct a good test that doesn't require an exact match on the output file? I've been trying to move HaplotypeCallerIntegrationTest
away from that style of test. Eg., could you call VariantContextTestUtils.readEntireVCFIntoMemory()
and inspect the VCs themselves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I can refactor the integration test in the same way as in PosteriorProbabilitiesUtilsUnitTest.
Codecov Report
@@ Coverage Diff @@
## master #4793 +/- ##
===============================================
+ Coverage 80.332% 80.359% +0.027%
- Complexity 17625 17670 +45
===============================================
Files 1088 1088
Lines 63849 63958 +109
Branches 10276 10307 +31
===============================================
+ Hits 51291 51396 +105
- Misses 8558 8559 +1
- Partials 4000 4003 +3
|
Adding @davidbenjamin as a second reviewer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some blue-collar coding suggestions.
final GenotypeBuilder gb = useNewLikelihoods ? new GenotypeBuilder(g).PL(newLikelihoods).log10PError(newLog10GQ) : new GenotypeBuilder(g).noPL().noGQ(); | ||
|
||
final GenotypeBuilder gb; | ||
final Map<String, Object> attributes = new HashMap<>(g.getExtendedAttributes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attributes
should be local to the if
block.
@@ -79,8 +79,16 @@ public static GenotypesContext subsetAlleles(final GenotypesContext originalGs, | |||
} | |||
|
|||
final boolean useNewLikelihoods = newLikelihoods != null && (depth != 0 || GATKVariantContextUtils.isInformative(newLikelihoods)); | |||
final GenotypeBuilder gb = useNewLikelihoods ? new GenotypeBuilder(g).PL(newLikelihoods).log10PError(newLog10GQ) : new GenotypeBuilder(g).noPL().noGQ(); | |||
|
|||
final GenotypeBuilder gb; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not absorb the new GenotypeBuilder(g)
into this declaration?
/** | ||
* Apply population priors to genotype likelihoods. If specified, a population resource VCF must be supplied via the -supporting argument. | ||
*/ | ||
@Argument(fullName = "apply-priors", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this can be inferred from whether supportVariants
is null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to use the same arguments as CalculateGenotypePosteriors, in which case supporting-callsets
is not particularly informative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why not eliminate this argument entirely, because it looks like it is intended to be true if and only if supportVariants != null
?
/** | ||
* When a variant is not seen in a panel, this argument controls whether to infer (and with what effective strength) | ||
* that only reference alleles were observed at that site. E.g. "If not seen in 1000Genomes, treat it as AC=0, | ||
* AN=2000". This is applied across all external panels, so if numRefIsMissing = 10, and the variant is absent in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior for multiple panels doesn't seem like the most natural one. IMO a single number for the total imputed ref count is more intuitive. Maybe @droazen can break the tie.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love any of the solutions here, but I do think if a variant is absent from all the resources then it should get a higher prior than if it's absent from just one. I suppose we make the assumption that the sets of samples in the resource files are disjoint and it's certainly not great that we have to apply the same number of each callset. But we can't determine the number of samples from the header because it might be a sites-only file and we can't use AN without traversing the whole VCF. It would greatly simplify things to limit CGP to one supporting callset as well, but I'm always hesitant to reduce functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is, if there are multiple resources the user can take that into account when choosing this value. What worries me most is the case of a single big resource, like gnomAD, along with a few boutique add-ons for whatever reason. Then the prior comes almost entirely from the big one and should hardly change when the little ones are added.
final ReferenceSequenceFile referenceReader = getReferenceReader(referenceArguments); | ||
hcEngine = new HaplotypeCallerEngine(hcArgs, createOutputBamIndex, createOutputBamMD5, getHeaderForReads(), referenceReader); | ||
|
||
// The HC engine will make the right kind (VCF or GVCF) of writer for us | ||
// The HC engine will make the right kind (VCF or GVCF) of writer for us |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the old indentation here.
@@ -429,6 +429,12 @@ public static Boolean alleleSpecificAnnotationEquals(VariantContext actual, Vari | |||
return true; | |||
} | |||
|
|||
public static void assertGenotypeAttributeWasRemoved(final VariantContext actual, final VariantContext expected) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method's name doesn't suggest that it is hard-coded to check one particular attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. I initially had an attribute parameter before I realized the test expected a BiConsumer. Suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertGenotypePosteriorsAttributeWasRemoved
?
if( genotype.hasExtendedAttribute(GATKVCFConstants.PHRED_SCALED_POSTERIORS_KEY)) { | ||
if (minPPs == null ) { | ||
minPPs = PosteriorProbabilitiesUtils.parsePosteriorsIntoPhredSpace(genotype); | ||
} else { // otherwise take the min with the provided genotype's PLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation issues here.
" -O %s" + | ||
" -R " + b37_reference_20_21 + //NOTE: we need a reference for -L | ||
" -L 20:10,000,000-10,010,000" + | ||
" -supporting " + largeDir + "1000G.phase3.broad.withGenotypes.chr20.10100000.vcf" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a variable for -supporting
.
" -O %s" + | ||
" -R " + b37_reference_20_21 + //NOTE: we need a reference for -L | ||
" -L 20:10,000,000-10,010,000" + | ||
" -supporting " + largeDir + "1000G.phase3.broad.withGenotypes.chr20.10100000.vcf" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use variable for supporting
.
" -supporting " + largeDir + "1000G.phase3.broad.withGenotypes.chr20.10100000.vcf" + | ||
" --" + StandardArgumentDefinitions.ADD_OUTPUT_VCF_COMMANDLINE +" false" + | ||
" -V " + dir + "NA12878.Jan2013.haplotypeCaller.subset.indels.vcf" + | ||
" --num-reference-samples-if-no-call 2500", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use variable.
CalculateGenotypePosteriors now supports indels
3547c12
to
722afbb
Compare
@davidbenjamin back to you |
@ldgauthier Looks good to me. 👍 |
HC priors will let us reduce the WEx GVCF footprint. As a consequence, CalculateGenotypePosteriors now supports indels.
I fixed a bug and changed the args for CGP. We didn't have great tests, but CGP results will be fixed/improved in some cases.