Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New HaplotypeCaller priors for variants sites and homRef blocks #4793

Merged
merged 8 commits into from
May 30, 2018

Conversation

ldgauthier
Copy link
Contributor

HC priors will let us reduce the WEx GVCF footprint. As a consequence, CalculateGenotypePosteriors now supports indels.

I fixed a bug and changed the args for CGP. We didn't have great tests, but CGP results will be fixed/improved in some cases.

@@ -190,10 +193,14 @@ public AssemblyRegionEvaluator assemblyRegionEvaluator() {

@Override
public void onTraversalStart() {
if(hcArgs.genotypeArgs.applyPriors && hcArgs.genotypeArgs.samplePloidy != HomoSapiensConstants.DEFAULT_PLOIDY) {
throw new UserException("Priors can only be applied for diploid samples.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check should be moved to HaplotypeCallerEngine.validateAndInitializeArgs()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this moveable? It's gross having it in the tool itself rather than the HaplotypeCallerEngine

runCommandLine(args);

// Test for an exact match against past results
IntegrationTestSpec.assertEqualTextFiles(output, expected);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you think of a way to construct a good test that doesn't require an exact match on the output file? I've been trying to move HaplotypeCallerIntegrationTest away from that style of test. Eg., could you call VariantContextTestUtils.readEntireVCFIntoMemory() and inspect the VCs themselves?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I can refactor the integration test in the same way as in PosteriorProbabilitiesUtilsUnitTest.

@codecov-io
Copy link

codecov-io commented May 23, 2018

Codecov Report

Merging #4793 into master will increase coverage by 0.027%.
The diff coverage is 88.889%.

@@               Coverage Diff               @@
##              master     #4793       +/-   ##
===============================================
+ Coverage     80.332%   80.359%   +0.027%     
- Complexity     17625     17670       +45     
===============================================
  Files           1088      1088               
  Lines          63849     63958      +109     
  Branches       10276     10307       +31     
===============================================
+ Hits           51291     51396      +105     
- Misses          8558      8559        +1     
- Partials        4000      4003        +3
Impacted Files Coverage Δ Complexity Δ
...lkers/ReferenceConfidenceVariantContextMerger.java 95.257% <ø> (ø) 71 <0> (ø) ⬇️
...tools/walkers/haplotypecaller/HaplotypeCaller.java 93.939% <ø> (ø) 18 <0> (ø) ⬇️
...lbender/utils/variant/GATKVariantContextUtils.java 84.32% <ø> (ø) 216 <0> (ø) ⬇️
...plotypecaller/HaplotypeCallerGenotypingEngine.java 78.014% <100%> (+0.316%) 32 <0> (ø) ⬇️
...hellbender/utils/test/VariantContextTestUtils.java 83.544% <100%> (+0.283%) 66 <2> (+2) ⬆️
...notyper/GenotypeCalculationArgumentCollection.java 100% <100%> (ø) 2 <0> (ø) ⬇️
...kers/haplotypecaller/ReferenceConfidenceModel.java 93.532% <100%> (+0.339%) 68 <6> (+7) ⬆️
...tools/walkers/genotyper/AlleleSubsettingUtils.java 91.736% <100%> (+0.507%) 45 <0> (ø) ⬇️
...kers/variantutils/PosteriorProbabilitiesUtils.java 82.051% <84.211%> (+4.085%) 58 <3> (+15) ⬆️
...walkers/haplotypecaller/HaplotypeCallerEngine.java 79.643% <85.714%> (+0.676%) 71 <0> (+4) ⬆️
... and 5 more

@droazen
Copy link
Contributor

droazen commented May 23, 2018

Adding @davidbenjamin as a second reviewer

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some blue-collar coding suggestions.

final GenotypeBuilder gb = useNewLikelihoods ? new GenotypeBuilder(g).PL(newLikelihoods).log10PError(newLog10GQ) : new GenotypeBuilder(g).noPL().noGQ();

final GenotypeBuilder gb;
final Map<String, Object> attributes = new HashMap<>(g.getExtendedAttributes());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attributes should be local to the if block.

@@ -79,8 +79,16 @@ public static GenotypesContext subsetAlleles(final GenotypesContext originalGs,
}

final boolean useNewLikelihoods = newLikelihoods != null && (depth != 0 || GATKVariantContextUtils.isInformative(newLikelihoods));
final GenotypeBuilder gb = useNewLikelihoods ? new GenotypeBuilder(g).PL(newLikelihoods).log10PError(newLog10GQ) : new GenotypeBuilder(g).noPL().noGQ();

final GenotypeBuilder gb;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not absorb the new GenotypeBuilder(g) into this declaration?

/**
* Apply population priors to genotype likelihoods. If specified, a population resource VCF must be supplied via the -supporting argument.
*/
@Argument(fullName = "apply-priors", optional = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this can be inferred from whether supportVariants is null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to use the same arguments as CalculateGenotypePosteriors, in which case supporting-callsets is not particularly informative.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why not eliminate this argument entirely, because it looks like it is intended to be true if and only if supportVariants != null?

/**
* When a variant is not seen in a panel, this argument controls whether to infer (and with what effective strength)
* that only reference alleles were observed at that site. E.g. "If not seen in 1000Genomes, treat it as AC=0,
* AN=2000". This is applied across all external panels, so if numRefIsMissing = 10, and the variant is absent in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior for multiple panels doesn't seem like the most natural one. IMO a single number for the total imputed ref count is more intuitive. Maybe @droazen can break the tie.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love any of the solutions here, but I do think if a variant is absent from all the resources then it should get a higher prior than if it's absent from just one. I suppose we make the assumption that the sets of samples in the resource files are disjoint and it's certainly not great that we have to apply the same number of each callset. But we can't determine the number of samples from the header because it might be a sites-only file and we can't use AN without traversing the whole VCF. It would greatly simplify things to limit CGP to one supporting callset as well, but I'm always hesitant to reduce functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is, if there are multiple resources the user can take that into account when choosing this value. What worries me most is the case of a single big resource, like gnomAD, along with a few boutique add-ons for whatever reason. Then the prior comes almost entirely from the big one and should hardly change when the little ones are added.

final ReferenceSequenceFile referenceReader = getReferenceReader(referenceArguments);
hcEngine = new HaplotypeCallerEngine(hcArgs, createOutputBamIndex, createOutputBamMD5, getHeaderForReads(), referenceReader);

// The HC engine will make the right kind (VCF or GVCF) of writer for us
// The HC engine will make the right kind (VCF or GVCF) of writer for us
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the old indentation here.

@@ -429,6 +429,12 @@ public static Boolean alleleSpecificAnnotationEquals(VariantContext actual, Vari
return true;
}

public static void assertGenotypeAttributeWasRemoved(final VariantContext actual, final VariantContext expected) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method's name doesn't suggest that it is hard-coded to check one particular attribute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. I initially had an attribute parameter before I realized the test expected a BiConsumer. Suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertGenotypePosteriorsAttributeWasRemoved?

if( genotype.hasExtendedAttribute(GATKVCFConstants.PHRED_SCALED_POSTERIORS_KEY)) {
if (minPPs == null ) {
minPPs = PosteriorProbabilitiesUtils.parsePosteriorsIntoPhredSpace(genotype);
} else { // otherwise take the min with the provided genotype's PLs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation issues here.

" -O %s" +
" -R " + b37_reference_20_21 + //NOTE: we need a reference for -L
" -L 20:10,000,000-10,010,000" +
" -supporting " + largeDir + "1000G.phase3.broad.withGenotypes.chr20.10100000.vcf" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a variable for -supporting.

" -O %s" +
" -R " + b37_reference_20_21 + //NOTE: we need a reference for -L
" -L 20:10,000,000-10,010,000" +
" -supporting " + largeDir + "1000G.phase3.broad.withGenotypes.chr20.10100000.vcf" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use variable for supporting.

" -supporting " + largeDir + "1000G.phase3.broad.withGenotypes.chr20.10100000.vcf" +
" --" + StandardArgumentDefinitions.ADD_OUTPUT_VCF_COMMANDLINE +" false" +
" -V " + dir + "NA12878.Jan2013.haplotypeCaller.subset.indels.vcf" +
" --num-reference-samples-if-no-call 2500",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use variable.

@ldgauthier ldgauthier force-pushed the ldg_HCwithPopPriors branch from 3547c12 to 722afbb Compare May 29, 2018 17:12
@ldgauthier
Copy link
Contributor Author

@davidbenjamin back to you

@davidbenjamin
Copy link
Contributor

@ldgauthier Looks good to me. 👍

@ldgauthier ldgauthier merged commit ca2a4e3 into master May 30, 2018
@ldgauthier ldgauthier deleted the ldg_HCwithPopPriors branch May 30, 2018 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants