-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrote leftAlignIndels code -- it now always works, even for multiple indels #6427
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha, what could be more exciting than some left aligning!
Some initial comments. I generally like where this is going. The changes to AlignmentUtils
are great; being able to left align cigars with multiple indels or mutliallelic sites is a major improvement.
I do think the treatment of multiple nearby variants in LeftAlignAndTrimVariants
needs to be handled more carefully. Simply left aligning each variant separately and then placing them wherever they fall will introduce problems. For example, left aligning a deletion past a snp, without adjusting the snp's location, will change the represented haplotype. And I can construct scenarios where left aligning an insertion past a snp leaves it impossible to recover the initially represented haplotype no matter how I attempt to adjust the downstream snp.
While left aligning a single variant record on its own is well defined, I don't think the same is true for a set of variants in each others presence. In the end, I'm not convinced that simply refusing to left align past earlier variants isn't the best solution. Probably merging abutting variants at that point is reasonable. (I think this is effectively what AlignmentUtils.LeftAlignIndels
is now doing). But I think attempting to left align past an upstream variant raises too many pitfalls and potential representation problems make it a good option.
@@ -69,6 +70,9 @@ public static OptionalDouble getReadPosition(final GATKRead read, final int refL | |||
final int leadingHardClips = firstElement.getOperator() == CigarOperator.HARD_CLIP ? firstElement.getLength() : 0; | |||
final int trailingHardClips = lastElement.getOperator() == CigarOperator.HARD_CLIP ? lastElement.getLength() : 0; | |||
|
|||
if (offset >= cigar.getReadLength()) { | |||
return OptionalDouble.empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like an unrelated change, did you intend to include in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's intentional. A test was failing because correct left-alignment exposed an edge case. I'm about to issue a PR related to ReadPosRankSumTest
that will re-do this code.
src/main/java/org/broadinstitute/hellbender/utils/read/AlignmentUtils.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/read/AlignmentUtils.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/read/AlignmentUtils.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/read/AlignmentUtils.java
Outdated
Show resolved
Hide resolved
.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java
Outdated
Show resolved
Hide resolved
final Pair<Integer, Integer> shifts = leftAlignAlleles(Arrays.asList(ref, read), Arrays.asList(refIndelRange, readIndelRange), maxShift); | ||
|
||
// account for new match alignments on the right due to left-alignment | ||
resultRightToLeft.add(new CigarElement(shifts.getRight(), CigarOperator.MATCH_OR_MISMATCH)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use element.getOperator()
(like you do below) instead of CigarOperator.MATCH_OR_MISMATCH
to avoid throwing out information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's different here because below we're re-emitting a shorter version of a cigar element that already existed. Here we're creating something new resulting from the left-alignment.
src/main/java/org/broadinstitute/hellbender/tools/LeftAlignIndels.java
Outdated
Show resolved
Hide resolved
.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java
Outdated
Show resolved
Hide resolved
2d25c22
to
13c8db5
Compare
@kachulis back to you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I think there is just a small bug in LeftAlignAndTrimVariants::leftAlignAndTrim
. I think it's an easy fix, and it would be nice to have a test of the bug-related case it as well.
return vc; | ||
} | ||
|
||
// get the indel length | ||
final int indelLength = Math.abs(vc.getIndelLengths().get(0)); | ||
for(int leadingBases = Math.min(maxLeadingBases, 10); leadingBases <= maxLeadingBases; leadingBases *= 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there is a bug here (quite possibly originating prior to this PR). If maxLeadingBases
is greater than 10, but not a multiple of 10, there can be problems. Suppose maxLeadingBases=15
, and the indel should be left aligned that full amount. leadingBases
will initialize to 10, and the indel will left align the full 10 bases. leadingBases
will then double to 20, the for loop condition will be false, and the variant won't be shifted left at all. I think this is easy to fix, just needs a slight adjustment to the control flow logic.
Also would be nice to have a test of this type of case, to ensure it is handled correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@kachulis Back to you. I fixed the bug and wrote a test for it. I also put in a commit replacing some trimming code with the new normalization method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidbenjamin Looks good! One very minor suggestion you can accept or ignore as you prefer.
@@ -279,7 +279,8 @@ static VariantContext leftAlignAndTrim(final VariantContext vc, final ReferenceC | |||
return vc; | |||
} | |||
|
|||
for(int leadingBases = Math.min(maxLeadingBases, 10); leadingBases <= maxLeadingBases; leadingBases *= 2) { | |||
|
|||
for(int leadingBases = Math.min(maxLeadingBases, 10); leadingBases <= maxLeadingBases; leadingBases = Math.min(2*leadingBases, maxLeadingBases)) { | |||
final int refStart = Math.max(vc.getStart() - Math.min(leadingBases, maxLeadingBases), 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need the Math.min here, since it's the condition in the for loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
31f7ccf
to
9fae3e5
Compare
9fae3e5
to
1a8ef5d
Compare
1a8ef5d
to
ba09f3a
Compare
DRAGstr first working approximation to matlab scripts fist go at rebasing everything correctly, now have to deal with the buggy refactor code Revert "Rewrote leftAlignIndels code -- it now always works, even for multiple indels (#6427)" This reverts commit 8612288. Which brakes the softclip aware cigar building code and has a horrible bug. bringing modernity to this branch added a consistency test for DRAGEN mode a bunch of code cleanup more cleanup and tests? Added the matlab files as test Strengthened trust in SoftclipBases Behavior for overlap base quality adjustments Fixing a crash in alleleliklehoods and adjusting bq annotation after the quality for overlapping bases change minor fix and expanded debug output removing excess bases before hmm scoring seems to improve the resutlts at complex sites... the FRD issue is still relevant... Fixed genotype posterior calculator to not honor the unphase permutation number factor more bugfixes to read edge cases at sites with many low quality ends and low mapping quality combined including reads that only overlap in hardclips (despite the obvious concerns with that) fixing an unexpected error flipping hardclippingsoftclippedBases creating a frankenbranch with a very intersting change involving realignment of reads and apparent DRAGEN behavior fixing a bug in 'getAllVariationEvents()' code flipping the values so it actually works Changed BQD feather end calculation to account for indels in the read relative to the reference reverted the non-realignment code fixing the horrible regression I just wrought on the BQD calculations... disabling bad debug squack Parameterizing the option to use original base qualities threading the truly original alignment through for BQD/FRD to match the bad dragen behavior we are now aware of cleaning up the branch if you like me then you should have put a pin in it responding to reveiw comments, (still havent fixed the debug streams) deleting a test that was for local files fixing a silly bug fixing the logic yet one more time fixing an issue with the disable-spanning-event-genotyping code that was causing false positivles (or perhaps reasonable behavior) jumping the gun on takutos fix because it was causing me trouble for this test refactored the debug output to be something very simple that should/could/maybe is better handled by making a debug level in the debugger. Anyway this is fine for now and probably could be expanded to handle the other debug streams in HC adding a scary error message to deter people from using contamination downsampling with BQD/FRD (it should be an exception but somebody would complain and it really isn't a sound thing to do unless the disqualified reads are included but they can't be inherently because they don't have scores) partial val review response work checkpoint : responded to most comments for my code in val PR review Fixed a couple of rare bugs in DRAGEN port code. Now we can reads dragstr-params values out of gcloud storage directly. taking the changes from valentin with the removed OQ tag pulling out the --use-original-alignments-for-genotyping-overlap as if it were so many weeds in my garden. Revision changes thus far Fix issues with how we assign STR period and repeat lenghts to positions on a read. Added more robust testing using Illumina's Matlab script to generate use cases. Impose a maximum of 40 for the GOP in DRAGstr pairhmm as seem to be the case in DRAGEN. Changes to meet recommendation except for the param estimation part that is on a separate branch reverting a mistake where i replaced every every instance of the word 'exists' with 'isEnabled' Unifying changes with the refactoring of the ReadsDataSource class A few updates on the revision: * Solved a couple of self-evaluation javadoc PR comments. * Added a integration test for ComposeSTRTableFile based on the DRAGEN output on some of the mini references under src/test/resources/large * Fix an issue that made ComposeSTRTableFile quite slow with reference with a very large number of contigs (eg. funcotator gencode trascripts ~100_000). * Found and fixed a bug introduced during refactoring making EstimateDragstrParameters produce "garbage" if run in parallel. * Silience a tedious debug message in SAMReaderQueryingIterator that comes out too often when the traversal intervals are just a few. Now it requires like a 1000+ intervals to produce that debug, might be still too low but it works in practice. Solves james' comment on DragstrParam toString deficiency fixed a few dangling issues with the rebase added the argument to end all arguments (and also completely remove the users ability to interact with our toolkit as they please) Fixes bug reported by Cwhelan Addresses some of James' comments and fixes a bug in a unit test for DragstrLocus class Fixes Dragstr Read Str analyzer failing unit test allowing nulls in the genotyper because its actually okay
…uggy refactor code Revert "Rewrote leftAlignIndels code -- it now always works, even for multiple indels (#6427)" This reverts commit 8612288. Which brakes the softclip aware cigar building code and has a horrible bug. bringing modernity to this branch added a consistency test for DRAGEN mode a bunch of code cleanup more cleanup and tests? Added the matlab files as test Strengthened trust in SoftclipBases Behavior for overlap base quality adjustments Fixing a crash in alleleliklehoods and adjusting bq annotation after the quality for overlapping bases change minor fix and expanded debug output removing excess bases before hmm scoring seems to improve the resutlts at complex sites... the FRD issue is still relevant... Fixed genotype posterior calculator to not honor the unphase permutation number factor more bugfixes to read edge cases at sites with many low quality ends and low mapping quality combined including reads that only overlap in hardclips (despite the obvious concerns with that) fixing an unexpected error flipping hardclippingsoftclippedBases creating a frankenbranch with a very intersting change involving realignment of reads and apparent DRAGEN behavior fixing a bug in 'getAllVariationEvents()' code flipping the values so it actually works Changed BQD feather end calculation to account for indels in the read relative to the reference reverted the non-realignment code fixing the horrible regression I just wrought on the BQD calculations... disabling bad debug squack Parameterizing the option to use original base qualities threading the truly original alignment through for BQD/FRD to match the bad dragen behavior we are now aware of cleaning up the branch if you like me then you should have put a pin in it responding to reveiw comments, (still havent fixed the debug streams) deleting a test that was for local files fixing a silly bug fixing the logic yet one more time fixing an issue with the disable-spanning-event-genotyping code that was causing false positivles (or perhaps reasonable behavior) jumping the gun on takutos fix because it was causing me trouble for this test refactored the debug output to be something very simple that should/could/maybe is better handled by making a debug level in the debugger. Anyway this is fine for now and probably could be expanded to handle the other debug streams in HC adding a scary error message to deter people from using contamination downsampling with BQD/FRD (it should be an exception but somebody would complain and it really isn't a sound thing to do unless the disqualified reads are included but they can't be inherently because they don't have scores) partial val review response work checkpoint : responded to most comments for my code in val PR review Fixed a couple of rare bugs in DRAGEN port code. Now we can reads dragstr-params values out of gcloud storage directly. taking the changes from valentin with the removed OQ tag pulling out the --use-original-alignments-for-genotyping-overlap as if it were so many weeds in my garden. Revision changes thus far Fix issues with how we assign STR period and repeat lenghts to positions on a read. Added more robust testing using Illumina's Matlab script to generate use cases. Impose a maximum of 40 for the GOP in DRAGstr pairhmm as seem to be the case in DRAGEN. Changes to meet recommendation except for the param estimation part that is on a separate branch reverting a mistake where i replaced every every instance of the word 'exists' with 'isEnabled' Unifying changes with the refactoring of the ReadsDataSource class A few updates on the revision: * Solved a couple of self-evaluation javadoc PR comments. * Added a integration test for ComposeSTRTableFile based on the DRAGEN output on some of the mini references under src/test/resources/large * Fix an issue that made ComposeSTRTableFile quite slow with reference with a very large number of contigs (eg. funcotator gencode trascripts ~100_000). * Found and fixed a bug introduced during refactoring making EstimateDragstrParameters produce "garbage" if run in parallel. * Silience a tedious debug message in SAMReaderQueryingIterator that comes out too often when the traversal intervals are just a few. Now it requires like a 1000+ intervals to produce that debug, might be still too low but it works in practice. Solves james' comment on DragstrParam toString deficiency fixed a few dangling issues with the rebase added the argument to end all arguments (and also completely remove the users ability to interact with our toolkit as they please) Fixes bug reported by Cwhelan Addresses some of James' comments and fixes a bug in a unit test for DragstrLocus class Fixes Dragstr Read Str analyzer failing unit test allowing nulls in the genotyper because its actually okay Fixed a bug causing the wrong priors to be taken (indel's instead of snp's in snp sites) Reverted a couple of changes in DragstrPairHMMInputScoreImputator to make it match better the corresponding HC integration-tests fixing a test i broke adding a test for the easy dragen calling mode Roll forward some of the rolled back changes Dragstr PairHMM code: * Limit GOP to 40 max. * byte-round rather than byte-truncate gop, gcp scores I Address most outstanding review comments Fix treatment of star alleles in use genotype posteriors for qual mode (#6856) * use genotype posteriors for qual spanning deletion fix
…uggy refactor code Revert "Rewrote leftAlignIndels code -- it now always works, even for multiple indels (#6427)" This reverts commit 8612288. Which brakes the softclip aware cigar building code and has a horrible bug. bringing modernity to this branch added a consistency test for DRAGEN mode a bunch of code cleanup more cleanup and tests? Added the matlab files as test Strengthened trust in SoftclipBases Behavior for overlap base quality adjustments Fixing a crash in alleleliklehoods and adjusting bq annotation after the quality for overlapping bases change minor fix and expanded debug output removing excess bases before hmm scoring seems to improve the resutlts at complex sites... the FRD issue is still relevant... Fixed genotype posterior calculator to not honor the unphase permutation number factor more bugfixes to read edge cases at sites with many low quality ends and low mapping quality combined including reads that only overlap in hardclips (despite the obvious concerns with that) fixing an unexpected error flipping hardclippingsoftclippedBases creating a frankenbranch with a very intersting change involving realignment of reads and apparent DRAGEN behavior fixing a bug in 'getAllVariationEvents()' code flipping the values so it actually works Changed BQD feather end calculation to account for indels in the read relative to the reference reverted the non-realignment code fixing the horrible regression I just wrought on the BQD calculations... disabling bad debug squack Parameterizing the option to use original base qualities threading the truly original alignment through for BQD/FRD to match the bad dragen behavior we are now aware of cleaning up the branch if you like me then you should have put a pin in it responding to reveiw comments, (still havent fixed the debug streams) deleting a test that was for local files fixing a silly bug fixing the logic yet one more time fixing an issue with the disable-spanning-event-genotyping code that was causing false positivles (or perhaps reasonable behavior) jumping the gun on takutos fix because it was causing me trouble for this test refactored the debug output to be something very simple that should/could/maybe is better handled by making a debug level in the debugger. Anyway this is fine for now and probably could be expanded to handle the other debug streams in HC adding a scary error message to deter people from using contamination downsampling with BQD/FRD (it should be an exception but somebody would complain and it really isn't a sound thing to do unless the disqualified reads are included but they can't be inherently because they don't have scores) partial val review response work checkpoint : responded to most comments for my code in val PR review Fixed a couple of rare bugs in DRAGEN port code. Now we can reads dragstr-params values out of gcloud storage directly. taking the changes from valentin with the removed OQ tag pulling out the --use-original-alignments-for-genotyping-overlap as if it were so many weeds in my garden. Revision changes thus far Fix issues with how we assign STR period and repeat lenghts to positions on a read. Added more robust testing using Illumina's Matlab script to generate use cases. Impose a maximum of 40 for the GOP in DRAGstr pairhmm as seem to be the case in DRAGEN. Changes to meet recommendation except for the param estimation part that is on a separate branch reverting a mistake where i replaced every every instance of the word 'exists' with 'isEnabled' Unifying changes with the refactoring of the ReadsDataSource class A few updates on the revision: * Solved a couple of self-evaluation javadoc PR comments. * Added a integration test for ComposeSTRTableFile based on the DRAGEN output on some of the mini references under src/test/resources/large * Fix an issue that made ComposeSTRTableFile quite slow with reference with a very large number of contigs (eg. funcotator gencode trascripts ~100_000). * Found and fixed a bug introduced during refactoring making EstimateDragstrParameters produce "garbage" if run in parallel. * Silience a tedious debug message in SAMReaderQueryingIterator that comes out too often when the traversal intervals are just a few. Now it requires like a 1000+ intervals to produce that debug, might be still too low but it works in practice. Solves james' comment on DragstrParam toString deficiency fixed a few dangling issues with the rebase added the argument to end all arguments (and also completely remove the users ability to interact with our toolkit as they please) Fixes bug reported by Cwhelan Addresses some of James' comments and fixes a bug in a unit test for DragstrLocus class Fixes Dragstr Read Str analyzer failing unit test allowing nulls in the genotyper because its actually okay Fixed a bug causing the wrong priors to be taken (indel's instead of snp's in snp sites) Reverted a couple of changes in DragstrPairHMMInputScoreImputator to make it match better the corresponding HC integration-tests fixing a test i broke adding a test for the easy dragen calling mode Roll forward some of the rolled back changes Dragstr PairHMM code: * Limit GOP to 40 max. * byte-round rather than byte-truncate gop, gcp scores I Address most outstanding review comments Fix treatment of star alleles in use genotype posteriors for qual mode (#6856) * use genotype posteriors for qual spanning deletion fix
…uggy refactor code Revert "Rewrote leftAlignIndels code -- it now always works, even for multiple indels (#6427)" This reverts commit 8612288. Which brakes the softclip aware cigar building code and has a horrible bug. bringing modernity to this branch added a consistency test for DRAGEN mode a bunch of code cleanup more cleanup and tests? Added the matlab files as test Strengthened trust in SoftclipBases Behavior for overlap base quality adjustments Fixing a crash in alleleliklehoods and adjusting bq annotation after the quality for overlapping bases change minor fix and expanded debug output removing excess bases before hmm scoring seems to improve the resutlts at complex sites... the FRD issue is still relevant... Fixed genotype posterior calculator to not honor the unphase permutation number factor more bugfixes to read edge cases at sites with many low quality ends and low mapping quality combined including reads that only overlap in hardclips (despite the obvious concerns with that) fixing an unexpected error flipping hardclippingsoftclippedBases creating a frankenbranch with a very intersting change involving realignment of reads and apparent DRAGEN behavior fixing a bug in 'getAllVariationEvents()' code flipping the values so it actually works Changed BQD feather end calculation to account for indels in the read relative to the reference reverted the non-realignment code fixing the horrible regression I just wrought on the BQD calculations... disabling bad debug squack Parameterizing the option to use original base qualities threading the truly original alignment through for BQD/FRD to match the bad dragen behavior we are now aware of cleaning up the branch if you like me then you should have put a pin in it responding to reveiw comments, (still havent fixed the debug streams) deleting a test that was for local files fixing a silly bug fixing the logic yet one more time fixing an issue with the disable-spanning-event-genotyping code that was causing false positivles (or perhaps reasonable behavior) jumping the gun on takutos fix because it was causing me trouble for this test refactored the debug output to be something very simple that should/could/maybe is better handled by making a debug level in the debugger. Anyway this is fine for now and probably could be expanded to handle the other debug streams in HC adding a scary error message to deter people from using contamination downsampling with BQD/FRD (it should be an exception but somebody would complain and it really isn't a sound thing to do unless the disqualified reads are included but they can't be inherently because they don't have scores) partial val review response work checkpoint : responded to most comments for my code in val PR review Fixed a couple of rare bugs in DRAGEN port code. Now we can reads dragstr-params values out of gcloud storage directly. taking the changes from valentin with the removed OQ tag pulling out the --use-original-alignments-for-genotyping-overlap as if it were so many weeds in my garden. Revision changes thus far Fix issues with how we assign STR period and repeat lenghts to positions on a read. Added more robust testing using Illumina's Matlab script to generate use cases. Impose a maximum of 40 for the GOP in DRAGstr pairhmm as seem to be the case in DRAGEN. Changes to meet recommendation except for the param estimation part that is on a separate branch reverting a mistake where i replaced every every instance of the word 'exists' with 'isEnabled' Unifying changes with the refactoring of the ReadsDataSource class A few updates on the revision: * Solved a couple of self-evaluation javadoc PR comments. * Added a integration test for ComposeSTRTableFile based on the DRAGEN output on some of the mini references under src/test/resources/large * Fix an issue that made ComposeSTRTableFile quite slow with reference with a very large number of contigs (eg. funcotator gencode trascripts ~100_000). * Found and fixed a bug introduced during refactoring making EstimateDragstrParameters produce "garbage" if run in parallel. * Silience a tedious debug message in SAMReaderQueryingIterator that comes out too often when the traversal intervals are just a few. Now it requires like a 1000+ intervals to produce that debug, might be still too low but it works in practice. Solves james' comment on DragstrParam toString deficiency fixed a few dangling issues with the rebase added the argument to end all arguments (and also completely remove the users ability to interact with our toolkit as they please) Fixes bug reported by Cwhelan Addresses some of James' comments and fixes a bug in a unit test for DragstrLocus class Fixes Dragstr Read Str analyzer failing unit test allowing nulls in the genotyper because its actually okay Fixed a bug causing the wrong priors to be taken (indel's instead of snp's in snp sites) Reverted a couple of changes in DragstrPairHMMInputScoreImputator to make it match better the corresponding HC integration-tests fixing a test i broke adding a test for the easy dragen calling mode Roll forward some of the rolled back changes Dragstr PairHMM code: * Limit GOP to 40 max. * byte-round rather than byte-truncate gop, gcp scores I Address most outstanding review comments Fix treatment of star alleles in use genotype posteriors for qual mode (#6856) * use genotype posteriors for qual spanning deletion fix
@kachulis I hope you find this entertaining.
Closes #150.
Closes #4919.
Also fixes a bug we didn't know about where
LeftAlignAndTrimVariants
didn't do anything when--dont-trim
was set.Also fixes a bug where
LeftAlignAndTrimVariants
couldn't left align an indel past a SNP.Also fixes a bug where
LeftAlignAndTrimVariants
didn't left align multiallelic variants when splitting was turned off.Probably fixes a few other bugs since the old code failed a bunch of my unit tests.