-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DepthOfCoverage port #5913
DepthOfCoverage port #5913
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5913 +/- ##
==========================================
Coverage ? 78.979%
Complexity ? 30649
==========================================
Files ? 2003
Lines ? 150459
Branches ? 16657
==========================================
Hits ? 118831
Misses ? 25832
Partials ? 5796
|
*/ | ||
@Advanced | ||
@Argument(fullName = "include-deletions", shortName = "dels", doc = "Include information on deletions", optional = true) | ||
private boolean includeDeletions = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the default be switched to include deletions as contributing to coverage as opposed to not?
note to self, this happens if no interval is provided... this should be fixed:
|
a7ed4fc
to
2228393
Compare
@droazen I have rebased this branch and cleaned up a few things that I noticed in the process. I think it could use your review now. |
2228393
to
4dc1ff1
Compare
@droazen I have just rebased on top of the interval merging code change. I would love to get a review of this to chew on. |
@jamesemery shared with you offline some data that seems to give (incorrectly) output of 0 coverage for all genes. Hopefully won't be too difficult to figure out what the issue was with that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery I've finished reviewing the non-test classes. I'll post a review of the test classes separately later this week.
...va/org/broadinstitute/hellbender/cmdline/argumentcollections/IntervalArgumentCollection.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/engine/LocusWalkerByInterval.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/engine/LocusWalkerByInterval.java
Outdated
Show resolved
Hide resolved
public abstract class LocusWalkerByInterval extends LocusWalker { | ||
|
||
private OverlapDetector<Locatable> intervalsToTrack = null; | ||
private Set<Locatable> previousIntervals = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any potential ordering issues with using HashSet
instead of LinkedHashSet
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So as it is currently written the consequence here is very low. Auditing this one the worst case here is that the order in which one writes out genes could be shuffled somewhat (the other cases were order independent). I'm going to switch this to a linked hash set to avoid the trouble though.
src/main/java/org/broadinstitute/hellbender/engine/LocusWalkerByInterval.java
Outdated
Show resolved
Hide resolved
* Returns a count of the total number of reference bases spanned by gene summary. Will total the length of the exons or | ||
* if absent, the lengthOnReference for the gene itself. | ||
* | ||
* NOTE: This currently makes the assumption that genes do not ever have overlapping exons. I do not know if this is a fair |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to my sources, genes can have overlapping exons, though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I say lets burn this bridge when we get to it. It is a fairly edge case functionality that GATK3 used and ideally we will replace this class entirely with something more general
|
||
/** Returns true if the specified interval 'that' overlaps with any of the exons actually spliced into this transcript */ | ||
@Override | ||
public boolean contains(Locatable that) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The definition of contains()
is that this Locatable completely contains the region represented by the Locatable passed in. Your implementation here, however, is merely checking for overlap with any of the exons, which seems wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an important functionality. This is what enforces that we are only counting exon coverage for genes. Introns are note "really" part of the transcript for coverage in this case.
src/main/java/org/broadinstitute/hellbender/utils/codecs/refseq/Transcript.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/codecs/refseq/Transcript.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/codecs/refseq/Transcript.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery And here are my comments on the test classes
src/test/java/org/broadinstitute/hellbender/engine/LocusWalkerByIntervalUnitTest.java
Outdated
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/engine/LocusWalkerByIntervalUnitTest.java
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/engine/LocusWalkerByIntervalUnitTest.java
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/engine/LocusWalkerByIntervalUnitTest.java
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/engine/LocusWalkerByIntervalUnitTest.java
Show resolved
Hide resolved
Assert.assertEquals(lines.get(0)[0], "SLC35E2"); | ||
Assert.assertEquals(lines.get(1)[0], "SLC35E2"); | ||
Assert.assertNotEquals(lines.get(0), lines.get(1)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here too you're not calling compareOutputDirectories() to fully check the output files?
Iterator<String[]> expectedLines = StreamSupport.stream(expectedParser.spliterator(),false).sorted(Comparator.comparing(l -> l[0]+","+l[1])).iterator(); | ||
// Special case to handle scrambled output header for the base coverage in gatk3 | ||
if (getDoCExtensionFromFile(actualFile, actualFileBaseName).equals("")) { | ||
// Just assert all the correct counts are present for each line because the order is wrong in the expected files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the sort above compensate for ordering differences? I don't understand this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, gatk3 test files included bugs that I fixed in terms of column ordering.
src/test/java/org/broadinstitute/hellbender/utils/IntervalUtilsUnitTest.java
Outdated
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/utils/codecs/refseq/RefSeqCodecUnitTest.java
Outdated
Show resolved
Hide resolved
Assert.assertTrue(codec.canDecode("filename" + EXTRA_CHAR + "." + RefSeqCodec.FILE_EXT)); | ||
Assert.assertFalse(codec.canDecode("filename." + RefSeqCodec.FILE_EXT + EXTRA_CHAR)); | ||
Assert.assertFalse(codec.canDecode("filename" + RefSeqCodec.FILE_EXT)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weren't there any RefSeq tests from GATK3 that we could port over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah here they all are:
public class RefSeqCodecUnitTest {
@Test
public void testCanDecode() {
final String EXTRA_CHAR = "1";
RefSeqCodec codec = new RefSeqCodec();
Assert.assertTrue(codec.canDecode("filename." + RefSeqCodec.FILE_EXT));
Assert.assertTrue(codec.canDecode("filename" + EXTRA_CHAR + "." + RefSeqCodec.FILE_EXT));
Assert.assertFalse(codec.canDecode("filename." + RefSeqCodec.FILE_EXT + EXTRA_CHAR));
Assert.assertFalse(codec.canDecode("filename" + RefSeqCodec.FILE_EXT));
}
}
@jamesemery How close is this to being merged? Once you've addressed all review comments to your satisfaction and have rebased to resolve the conflicts, let me know and I can give final approval after tests are green. |
…a few things for re-review
1271f32
to
74f8505
Compare
@jamesemery There appear to be some lingering test failures here after the rebase (mismatches between actual and expected VCF files). |
…tion of deletion sensitive locus code
@droazen I fixed the tests (it was an errant find and replace error). I have spot checked the changes and it looks like nothing obviously off-target in there. I need another approval in order to finally merge this branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery If you've addressed all comments, go ahead and merge once tests pass
* | ||
* <p> | ||
* Instructions for generating a RefSeq file for use with the RefSeq codec can be found on the documentation guide here | ||
* <a href="https://gatkforums.broadinstitute.org/gatk/discussion/1329/where-can-i-get-a-gene-list-in-refseq-format">https://gatkforums.broadinstitute.org/gatk/discussion/1329/where-can-i-get-a-gene-list-in-refseq-format</a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
url is broken. Should be : https://gatk.broadinstitute.org/hc/en-us/articles/360035532032-RefSeq-gene-list-format
Also user should be notified that file extension must be .refseq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oguzhankalyon Could you please file a new ticket about these issues instead of commenting on this merged pull request? Thanks!
Currently this implementation could probably use more substantial testing, especially of the core functionality. It is also missing the following features from gatk3:
The following important differences exist from gatk3:
The following (new) features have been pushed into future branches:
.refseq
format for gene lists.Depends on #5887
Resolves #19