New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Fix two phasing bugs #7019

Merged

cwhelan merged 5 commits into master from cw_inconsistent_phase_sets

Jan 26, 2021

Member

cwhelan commented Jan 5, 2021

This PR addresses two phasing bugs, #6463 and #6845.

#6463 identified a bug in the phasing algorithm which caused the wrong phase information to be output for scenarios where the first variant in a phase set is homozygous variant and it is followed by het variants in opposite phase. Without this change the het variants were incorrectly placed on the same phase strand because the phase set was tied to the hom var variant, and the algorithm assumed that each het variant could be put in the same phase strand as it because it was on all haplotypes. I've modified the algorithm to keep track, for variants that occur on all haplotypes, of which of the haplotypes have already been used for phasing an upstream "comp" variant so that further downstream variants can be checked against the remaining set.

#6845 showed an example of phase sets being disrupted by the presence of an alternate haplotype that supported an additional, uncalled, variant in the region. In this case there was an alternate haplotype supported by two reads that supported a SNP downstream of two pairs of SNPs in alternate phase. The presence of an additional haplotype causes the phasing algorithm to break the phase sets in the region. I've modified the algorithm to only use haplotypes that support the alternate alleles present in called variants in phasing by modifying the number that we pass as AssemblyBasedCallerUtils.constructPhaseSetMapping()'s totalAvailableHaplotypes parameter. In my opinion this
fix produces output that is still correct and is much easier to understand (since it only depends on sites that are visible in the output VCF), but if anyone objects to this change please let me know.

Non-test code changes for this PR are in two different commits to try to make it easier to understand the scope of the two changes.

cwhelan added 2 commits

January 5, 2021 08:55


          ignore haplotypes that don't map to any called variants

8006f93


          fix phasing bug when first variant is hom var followed by multiple he…

6107d32

…ts; fix up test cases in AssemblyBaseCallerUnitTest including numbering comments

cwhelan requested a review from davidbenjamin

January 6, 2021 17:13

davidbenjamin requested changes

View reviewed changes

Contributor

davidbenjamin left a comment

This looks highly beneficial. I have a couple of suggestions that might be too laborious. Disregard them if they would be painful.

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtils.java Outdated

+                      // produce a set of the haplotypes that support called alt alleles, use it to set totalAvailableHaplotypes in the
+                      // call to constructPhaseSetMapping() below so that only haplotypes that supported called variants are used for phasing
+                      final Set<Haplotype> haplotypesWithCalledVariants = new HashSet<>(calledHaplotypes.size());
+                      haplotypeMap.values().forEach(haplotypesWithCalledVariants::addAll);

Contributor

davidbenjamin Jan 8, 2021

This can be implemented as something like final Set<Haplotype> haplotypesWithCalledVariants = haplotypeMap.values().stream().flatMap(Set::stream).collect(Collectors.toSet()).

But, even better, it looks like this variable is only used for its size, so how about cutting out the intermediate set and just declaring final int numHaplotypesWithCalledVariants = haplotypeMap.values().stream().mapToInt(Set::size).sum()?

Member Author

cwhelan Jan 8, 2021

I think that I need the size of the union of these sets (and one haplotype might appear in multiple haplotypeMap value sets), so I don't think the sum would work. I'll implement your first suggestion, though.

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtils.java Outdated

    
                      // construct a mapping from call to phase set ID

                      final Map<VariantContext, Pair<Integer, PhaseGroup>> phaseSetMapping = new HashMap<>();

                      final int uniqueCounterEndValue = constructPhaseSetMapping(calls, haplotypeMap, calledHaplotypes.size() - 1, phaseSetMapping);

                      final int uniqueCounterEndValue = constructPhaseSetMapping(calls, haplotypeMap, haplotypesWithCalledVariants.size(), phaseSetMapping);

Contributor

davidbenjamin Jan 8, 2021

I agree with you on this change.

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtils.java

                           }
                           final boolean callIsOnAllHaps = haplotypesWithCall.size() == totalAvailableHaplotypes;
+                          // if the call is on all haplotypes but we only use some of them to phase it with another variant

Contributor

davidbenjamin Jan 8, 2021

[This comment is for the method, not for this particular line]. While you're at it, could you clean up the following: this method only gets passed an empty Map for the phaseSetMappingArgument, so basically it has to be written as if it's mutating an existing object when in fact it's essentially creating a new object and returning it.

As for the fact that the method currently returns the count of unique phase IDs, that can easily be obtained from the phaseSetMapping.

I wouldn't mind going one step further in simplifying the signature by computing totalAvailableHaplotypes from the other arguments.

Member Author

cwhelan Jan 14, 2021

Refactored the method according to these suggestions, which indeed makes it a bit easier to understand.

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtils.java

+                                      // if the call was on all haps but the comp isn't, we need to narrow down the set of haps we'll consider
+                                      // for further phasing with the call
+                                      callHaplotypesAvailableForPhasing.retainAll(haplotypesWithComp);

Contributor

davidbenjamin Jan 8, 2021

If I understand correctly, you're avoiding the bug by arbitrarily deciding to phase a hom alt with the first downstream het. This is fine since phasing a hom alt is itself an arbitrary thing we already do, but could you document it?

Member Author

cwhelan Jan 14, 2021

As I try to explain in detail below, we don't actually know if variant genotypes are hom alt or not in this method. I've tried to do a blanket update of the comments in this method to make this clearer.

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtils.java

+                                      // if the call was on all haps but the comp isn't, we need to narrow down the set of haps we'll consider
+                                      // for further phasing with the call
+                                      callHaplotypesAvailableForPhasing.retainAll(haplotypesWithComp);

Contributor

davidbenjamin Jan 8, 2021

Also, if there is an elegant way to do it, the most logical thing IMO would be to simply ignore hom alts (ie if callIsOnAllHaps { continue;} and if compIsOnAllHaps { continue;} entirely in the nested loops and then after the fact slap whatever arbitrary label we want on them.

For example, could we phase all the hets and then just say every hom alt is 0|1 with the first counter value?

Member Author

cwhelan Jan 8, 2021

I thought about doing something like this myself but got a bit scared off of making major changes to the algorithm due to its complexity -- I got some unexpected results in the integration tests when I made an initial attempt at it. However, I think that implementing the tests you suggest below and maybe the other refactorings you've suggested might make the algorithm easier to change and reason about. I'll give it another shot.

Member Author

cwhelan Jan 14, 2021

After thinking about this some more and experimenting with it, I don't think this approach can work. A very confusing thing about this method, which has tripped me up multiple times when I've looked at issues in this code, is that its logic is not actually considering genotypes: it only considers the set of haplotypes that support alternate alleles, and doesn't get to look at the reference haplotype. Therefore it can't make the distinction between a hom var and a het which is supported by all of the alternate-allele supporting haplotypes. The comments and variable names in the method, and in the unit test, have not made this easy to understand, although the comment directly above this one obliquely refers to situations in which the set of haplotypes considered here may not be indicative of genotype.

I've been tempted to try to modify this method to also consider the reference haplotype. The problem is that apart from the genotypes on the variant context we don't know if there was actual support in the read evidence for the ref haplotype existing. I think that considering genotypes (by peeking in to the genotypes attached to the variant contexts in the haplotype map) could be very complicated since 1) as the comment above notes, genotypes can be changed by downstream processing and 2) it looks possible to me that this method could be run in multi-sample mode, in which case you have to build a VC -> haplotype map for each sample and phase independently. Making sure this method works in multisample mode (or preventing it from being run) should probably be its own ticket, although I'd imagine it would be low priority.

Instead, I've done my best to rename variables and make the code comments more explicit about the distinction between haplotype sets and genotypes. Let me know if you have additional suggestions.

...roadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtilsUnitTest.java

                       spandelCalls.add(spannedSnpVC);
                       tests.add(new Object[]{spandelCalls, new HashMap<>(spanDelHapMap), 2, 2, 1, 1, 1});
+                      // test 11: a hom followed by two opposite-phase hets

Contributor

davidbenjamin Jan 8, 2021

I understand that by virtue of the for (i = 0; i < numVariants; i++) { . . . for (j = i + 1; j < numVariants; j++ . . . nesting the case where the hom precedes the hets is the thorniest. Nonetheless, in order to decouple the tests from the implementation could you add a case where the hom follows the hets and a case where the het is in the middle?

Finally, could you add a test with three hets, where the middle one is out of phase with the outer two? This would make sure that the iteration doesn't give up somehow once it finds a phase switch.

Member Author

cwhelan Jan 14, 2021

Added these tests, and an additional one for the scenario in which we create two phase groups from a set of haplotypes (for two pairs of in-phase variants separated by an unphaseable SNP.

davidbenjamin assigned cwhelan

cwhelan added 3 commits

January 13, 2021 14:36


          add additional unit tests suggested by PR review

45605c0


          refactor the constructPhaseSetMapping method

50be303


          complete refactorings; additional unit test; documentation updates

a9cc26f

cwhelan requested a review from davidbenjamin

January 14, 2021 19:11

cwhelan assigned davidbenjamin

Member Author

cwhelan commented Jan 22, 2021

@davidbenjamin How does this look to you now?

davidbenjamin approved these changes

View reviewed changes

Contributor

davidbenjamin left a comment

Ready to merge, @cwhelan!

cwhelan merged commit 862a3f4 into master

cwhelan deleted the cw_inconsistent_phase_sets branch

January 26, 2021 15:51

This was referenced Jan 26, 2021

HaplotypeCaller emitting incorrect phasing when genotyping hom-het-het #6463

Closed

Phased variants do not have the same phase set identifier #6845

Closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet