Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix two phasing bugs #7019

Merged
merged 5 commits into from
Jan 26, 2021
Merged

Fix two phasing bugs #7019

merged 5 commits into from
Jan 26, 2021

Conversation

cwhelan
Copy link
Member

@cwhelan cwhelan commented Jan 5, 2021

This PR addresses two phasing bugs, #6463 and #6845.

#6463 identified a bug in the phasing algorithm which caused the wrong phase information to be output for scenarios where the first variant in a phase set is homozygous variant and it is followed by het variants in opposite phase. Without this change the het variants were incorrectly placed on the same phase strand because the phase set was tied to the hom var variant, and the algorithm assumed that each het variant could be put in the same phase strand as it because it was on all haplotypes. I've modified the algorithm to keep track, for variants that occur on all haplotypes, of which of the haplotypes have already been used for phasing an upstream "comp" variant so that further downstream variants can be checked against the remaining set.

#6845 showed an example of phase sets being disrupted by the presence of an alternate haplotype that supported an additional, uncalled, variant in the region. In this case there was an alternate haplotype supported by two reads that supported a SNP downstream of two pairs of SNPs in alternate phase. The presence of an additional haplotype causes the phasing algorithm to break the phase sets in the region. I've modified the algorithm to only use haplotypes that support the alternate alleles present in called variants in phasing by modifying the number that we pass as AssemblyBasedCallerUtils.constructPhaseSetMapping()'s totalAvailableHaplotypes parameter. In my opinion this
fix produces output that is still correct and is much easier to understand (since it only depends on sites that are visible in the output VCF), but if anyone objects to this change please let me know.

Non-test code changes for this PR are in two different commits to try to make it easier to understand the scope of the two changes.

@cwhelan cwhelan requested a review from davidbenjamin January 6, 2021 17:13
Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks highly beneficial. I have a couple of suggestions that might be too laborious. Disregard them if they would be painful.

// produce a set of the haplotypes that support called alt alleles, use it to set totalAvailableHaplotypes in the
// call to constructPhaseSetMapping() below so that only haplotypes that supported called variants are used for phasing
final Set<Haplotype> haplotypesWithCalledVariants = new HashSet<>(calledHaplotypes.size());
haplotypeMap.values().forEach(haplotypesWithCalledVariants::addAll);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be implemented as something like final Set<Haplotype> haplotypesWithCalledVariants = haplotypeMap.values().stream().flatMap(Set::stream).collect(Collectors.toSet()).

But, even better, it looks like this variable is only used for its size, so how about cutting out the intermediate set and just declaring final int numHaplotypesWithCalledVariants = haplotypeMap.values().stream().mapToInt(Set::size).sum()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that I need the size of the union of these sets (and one haplotype might appear in multiple haplotypeMap value sets), so I don't think the sum would work. I'll implement your first suggestion, though.

// construct a mapping from call to phase set ID
final Map<VariantContext, Pair<Integer, PhaseGroup>> phaseSetMapping = new HashMap<>();
final int uniqueCounterEndValue = constructPhaseSetMapping(calls, haplotypeMap, calledHaplotypes.size() - 1, phaseSetMapping);
final int uniqueCounterEndValue = constructPhaseSetMapping(calls, haplotypeMap, haplotypesWithCalledVariants.size(), phaseSetMapping);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you on this change.

@@ -714,7 +719,12 @@ static int constructPhaseSetMapping(final List<VariantContext> originalCalls,
}

final boolean callIsOnAllHaps = haplotypesWithCall.size() == totalAvailableHaplotypes;

// if the call is on all haplotypes but we only use some of them to phase it with another variant
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[This comment is for the method, not for this particular line]. While you're at it, could you clean up the following: this method only gets passed an empty Map for the phaseSetMappingArgument, so basically it has to be written as if it's mutating an existing object when in fact it's essentially creating a new object and returning it.

As for the fact that the method currently returns the count of unique phase IDs, that can easily be obtained from the phaseSetMapping.

I wouldn't mind going one step further in simplifying the signature by computing totalAvailableHaplotypes from the other arguments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored the method according to these suggestions, which indeed makes it a bit easier to understand.


// if the call was on all haps but the comp isn't, we need to narrow down the set of haps we'll consider
// for further phasing with the call
callHaplotypesAvailableForPhasing.retainAll(haplotypesWithComp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, you're avoiding the bug by arbitrarily deciding to phase a hom alt with the first downstream het. This is fine since phasing a hom alt is itself an arbitrary thing we already do, but could you document it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I try to explain in detail below, we don't actually know if variant genotypes are hom alt or not in this method. I've tried to do a blanket update of the comments in this method to make this clearer.

// if the call was on all haps but the comp isn't, we need to narrow down the set of haps we'll consider
// for further phasing with the call
callHaplotypesAvailableForPhasing.retainAll(haplotypesWithComp);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if there is an elegant way to do it, the most logical thing IMO would be to simply ignore hom alts (ie if callIsOnAllHaps { continue;} and if compIsOnAllHaps { continue;} entirely in the nested loops and then after the fact slap whatever arbitrary label we want on them.

For example, could we phase all the hets and then just say every hom alt is 0|1 with the first counter value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about doing something like this myself but got a bit scared off of making major changes to the algorithm due to its complexity -- I got some unexpected results in the integration tests when I made an initial attempt at it. However, I think that implementing the tests you suggest below and maybe the other refactorings you've suggested might make the algorithm easier to change and reason about. I'll give it another shot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this some more and experimenting with it, I don't think this approach can work. A very confusing thing about this method, which has tripped me up multiple times when I've looked at issues in this code, is that its logic is not actually considering genotypes: it only considers the set of haplotypes that support alternate alleles, and doesn't get to look at the reference haplotype. Therefore it can't make the distinction between a hom var and a het which is supported by all of the alternate-allele supporting haplotypes. The comments and variable names in the method, and in the unit test, have not made this easy to understand, although the comment directly above this one obliquely refers to situations in which the set of haplotypes considered here may not be indicative of genotype.

I've been tempted to try to modify this method to also consider the reference haplotype. The problem is that apart from the genotypes on the variant context we don't know if there was actual support in the read evidence for the ref haplotype existing. I think that considering genotypes (by peeking in to the genotypes attached to the variant contexts in the haplotype map) could be very complicated since 1) as the comment above notes, genotypes can be changed by downstream processing and 2) it looks possible to me that this method could be run in multi-sample mode, in which case you have to build a VC -> haplotype map for each sample and phase independently. Making sure this method works in multisample mode (or preventing it from being run) should probably be its own ticket, although I'd imagine it would be low priority.

Instead, I've done my best to rename variables and make the code comments more explicit about the distinction between haplotype sets and genotypes. Let me know if you have additional suggestions.

@@ -970,9 +974,31 @@ public void testConstructPhaseGroups(final List<VariantContext> calls,
spandelCalls.add(spannedSnpVC);

tests.add(new Object[]{spandelCalls, new HashMap<>(spanDelHapMap), 2, 2, 1, 1, 1});

// test 11: a hom followed by two opposite-phase hets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that by virtue of the for (i = 0; i < numVariants; i++) { . . . for (j = i + 1; j < numVariants; j++ . . . nesting the case where the hom precedes the hets is the thorniest. Nonetheless, in order to decouple the tests from the implementation could you add a case where the hom follows the hets and a case where the het is in the middle?

Finally, could you add a test with three hets, where the middle one is out of phase with the outer two? This would make sure that the iteration doesn't give up somehow once it finds a phase switch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added these tests, and an additional one for the scenario in which we create two phase groups from a set of haplotypes (for two pairs of in-phase variants separated by an unphaseable SNP.

@cwhelan
Copy link
Member Author

cwhelan commented Jan 22, 2021

@davidbenjamin How does this look to you now?

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready to merge, @cwhelan!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants