Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DepthOfCoverage port #5913

Merged
merged 12 commits into from
Mar 23, 2020
Merged

DepthOfCoverage port #5913

merged 12 commits into from
Mar 23, 2020

Conversation

jamesemery
Copy link
Collaborator

Currently this implementation could probably use more substantial testing, especially of the core functionality. It is also missing the following features from gatk3:

  • Depth of coverage by fragments
  • Rtable output format

The following important differences exist from gatk3:

  • Fixed numerous bugs with output line ordering, now we necessarily associate the right sample with the right data in the output tables. We also sort the output samples lexicographically consistently so the order of the output columns is deterministic.
  • Now gene outputs no longer rely on having exactly matching interval overlaps. Instead they are treated as intervals themselves with coverage information being collected for every exon base covered by the gene. There is an argument to demand that a gene must be completely covered by provided intervals to be in the result table. (Perhaps excising introns from coverage could be toggled?)
  • Intervals are no longer merged when reported in to the interval tables.

The following (new) features have been pushed into future branches:

  • Diagnose targets evaluation field (need to decide what partitioning to attach it to, perhaps sample partitioning info?)
  • Replacement for .refseq format for gene lists.
  • Optional output partitioning based a provided list of read filters.

Depends on #5887
Resolves #19

@droazen droazen self-requested a review May 3, 2019 20:53
@droazen droazen self-assigned this May 3, 2019
@codecov
Copy link

codecov bot commented May 3, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@00f1e43). Click here to learn what that means.
The diff coverage is 73.997%.

@@            Coverage Diff             @@
##             master     #5913   +/-   ##
==========================================
  Coverage          ?   78.979%           
  Complexity        ?     30649           
==========================================
  Files             ?      2003           
  Lines             ?    150459           
  Branches          ?     16657           
==========================================
  Hits              ?    118831           
  Misses            ?     25832           
  Partials          ?      5796
Impacted Files Coverage Δ Complexity Δ
...lkers/coverage/DepthOfCoverageIntegrationTest.java 0.758% <0.758%> (ø) 1 <1> (?)
...nstitute/hellbender/utils/IntervalMergingRule.java 100% <100%> (ø) 1 <0> (?)
...org/broadinstitute/hellbender/utils/BaseUtils.java 88.462% <100%> (ø) 59 <3> (?)
...itute/hellbender/engine/LocusWalkerByInterval.java 100% <100%> (ø) 7 <7> (?)
...llbender/engine/LocusWalkerByIntervalUnitTest.java 100% <100%> (ø) 5 <5> (?)
...ender/utils/codecs/refseq/RefSeqCodecUnitTest.java 100% <100%> (ø) 2 <2> (?)
...titute/hellbender/utils/IntervalUtilsUnitTest.java 91.906% <100%> (ø) 146 <0> (?)
...broadinstitute/hellbender/utils/IntervalUtils.java 92.11% <100%> (ø) 193 <4> (?)
...nder/tools/walkers/annotator/CoverageUnitTest.java 95.833% <100%> (ø) 12 <3> (?)
.../hellbender/utils/codecs/refseq/RefSeqFeature.java 39.706% <39.706%> (ø) 16 <16> (?)
... and 9 more

*/
@Advanced
@Argument(fullName = "include-deletions", shortName = "dels", doc = "Include information on deletions", optional = true)
private boolean includeDeletions = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the default be switched to include deletions as contributing to coverage as opposed to not?

@jamesemery jamesemery requested a review from jonn-smith July 29, 2019 16:46
@jamesemery
Copy link
Collaborator Author

note to self, this happens if no interval is provided... this should be fixed:

org.broadinstitute.hellbender.exceptions.GATKException: Cannot call getTraversalParameters() without specifying either intervals to include or exclude.
	at org.broadinstitute.hellbender.cmdline.argumentcollections.IntervalArgumentCollection.getSpecifiedIntervalsWithoutMerging(IntervalArgumentCollection.java:154)
	at org.broadinstitute.hellbender.tools.walkers.coverage.DepthOfCoverage.getIntervalObjectsToQueryOver(DepthOfCoverage.java:443)
	at org.broadinstitute.hellbender.engine.LocusWalkerByInterval.traverse(LocusWalkerByInterval.java:55)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1039)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
	at org.broadinstitute.hellbender.Main.main(Main.java:291)```

@jamesemery jamesemery force-pushed the je_portDepthOfCoverge branch from a7ed4fc to 2228393 Compare December 17, 2019 21:59
@jamesemery
Copy link
Collaborator Author

@droazen I have rebased this branch and cleaned up a few things that I noticed in the process. I think it could use your review now.

@gatk-bot
Copy link

gatk-bot commented Dec 17, 2019

Travis reported job failures from build 28390
Failures in the following jobs:

Test Type JDK Job ID Logs
python openjdk8 28390.5 logs
integration oraclejdk8 28390.11 logs
integration openjdk11 28390.12 logs
integration openjdk8 28390.2 logs

@jamesemery jamesemery force-pushed the je_portDepthOfCoverge branch from 2228393 to 4dc1ff1 Compare December 20, 2019 17:10
@jamesemery
Copy link
Collaborator Author

@droazen I have just rebased on top of the interval merging code change. I would love to get a review of this to chew on.

@gatk-bot
Copy link

gatk-bot commented Dec 20, 2019

Travis reported job failures from build 28445
Failures in the following jobs:

Test Type JDK Job ID Logs
integration oraclejdk8 28445.11 logs
integration openjdk11 28445.12 logs
integration openjdk8 28445.2 logs

@droazen droazen changed the title DepthOfCoverage port checkpoint DepthOfCoverage port Feb 5, 2020
@kachulis
Copy link
Contributor

@jamesemery shared with you offline some data that seems to give (incorrectly) output of 0 coverage for all genes. Hopefully won't be too difficult to figure out what the issue was with that.

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery I've finished reviewing the non-test classes. I'll post a review of the test classes separately later this week.

public abstract class LocusWalkerByInterval extends LocusWalker {

private OverlapDetector<Locatable> intervalsToTrack = null;
private Set<Locatable> previousIntervals = new HashSet<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any potential ordering issues with using HashSet instead of LinkedHashSet here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as it is currently written the consequence here is very low. Auditing this one the worst case here is that the order in which one writes out genes could be shuffled somewhat (the other cases were order independent). I'm going to switch this to a linked hash set to avoid the trouble though.

* Returns a count of the total number of reference bases spanned by gene summary. Will total the length of the exons or
* if absent, the lengthOnReference for the gene itself.
*
* NOTE: This currently makes the assumption that genes do not ever have overlapping exons. I do not know if this is a fair
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to my sources, genes can have overlapping exons, though...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I say lets burn this bridge when we get to it. It is a fairly edge case functionality that GATK3 used and ideally we will replace this class entirely with something more general


/** Returns true if the specified interval 'that' overlaps with any of the exons actually spliced into this transcript */
@Override
public boolean contains(Locatable that) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition of contains() is that this Locatable completely contains the region represented by the Locatable passed in. Your implementation here, however, is merely checking for overlap with any of the exons, which seems wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an important functionality. This is what enforces that we are only counting exon coverage for genes. Introns are note "really" part of the transcript for coverage in this case.

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery And here are my comments on the test classes

Assert.assertEquals(lines.get(0)[0], "SLC35E2");
Assert.assertEquals(lines.get(1)[0], "SLC35E2");
Assert.assertNotEquals(lines.get(0), lines.get(1));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too you're not calling compareOutputDirectories() to fully check the output files?

Iterator<String[]> expectedLines = StreamSupport.stream(expectedParser.spliterator(),false).sorted(Comparator.comparing(l -> l[0]+","+l[1])).iterator();
// Special case to handle scrambled output header for the base coverage in gatk3
if (getDoCExtensionFromFile(actualFile, actualFileBaseName).equals("")) {
// Just assert all the correct counts are present for each line because the order is wrong in the expected files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the sort above compensate for ordering differences? I don't understand this comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, gatk3 test files included bugs that I fixed in terms of column ordering.

Assert.assertTrue(codec.canDecode("filename" + EXTRA_CHAR + "." + RefSeqCodec.FILE_EXT));
Assert.assertFalse(codec.canDecode("filename." + RefSeqCodec.FILE_EXT + EXTRA_CHAR));
Assert.assertFalse(codec.canDecode("filename" + RefSeqCodec.FILE_EXT));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weren't there any RefSeq tests from GATK3 that we could port over?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah here they all are:

public class RefSeqCodecUnitTest {

    @Test
    public void testCanDecode() {
        final String EXTRA_CHAR = "1";
        RefSeqCodec codec = new RefSeqCodec();
        Assert.assertTrue(codec.canDecode("filename." + RefSeqCodec.FILE_EXT));
        Assert.assertTrue(codec.canDecode("filename" + EXTRA_CHAR + "." + RefSeqCodec.FILE_EXT));
        Assert.assertFalse(codec.canDecode("filename." + RefSeqCodec.FILE_EXT + EXTRA_CHAR));
        Assert.assertFalse(codec.canDecode("filename" + RefSeqCodec.FILE_EXT));
    }
}

@droazen
Copy link
Contributor

droazen commented Mar 19, 2020

@jamesemery How close is this to being merged? Once you've addressed all review comments to your satisfaction and have rebased to resolve the conflicts, let me know and I can give final approval after tests are green.

@jamesemery jamesemery force-pushed the je_portDepthOfCoverge branch from 1271f32 to 74f8505 Compare March 19, 2020 18:16
@gatk-bot
Copy link

gatk-bot commented Mar 19, 2020

Travis reported job failures from build 29604
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 29604.14 logs
integration oraclejdk8 29604.12 logs
integration openjdk11 29604.13 logs
integration openjdk8 29604.2 logs

@droazen
Copy link
Contributor

droazen commented Mar 20, 2020

@jamesemery There appear to be some lingering test failures here after the rebase (mismatches between actual and expected VCF files).

@gatk-bot
Copy link

Travis reported job failures from build 29680
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 29680.14 logs

@jamesemery
Copy link
Collaborator Author

@droazen I fixed the tests (it was an errant find and replace error). I have spot checked the changes and it looks like nothing obviously off-target in there. I need another approval in order to finally merge this branch.

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery If you've addressed all comments, go ahead and merge once tests pass

@jamesemery jamesemery merged commit b459b72 into master Mar 23, 2020
*
* <p>
* Instructions for generating a RefSeq file for use with the RefSeq codec can be found on the documentation guide here
* <a href="https://gatkforums.broadinstitute.org/gatk/discussion/1329/where-can-i-get-a-gene-list-in-refseq-format">https://gatkforums.broadinstitute.org/gatk/discussion/1329/where-can-i-get-a-gene-list-in-refseq-format</a>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

url is broken. Should be : https://gatk.broadinstitute.org/hc/en-us/articles/360035532032-RefSeq-gene-list-format
Also user should be notified that file extension must be .refseq

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oguzhankalyon Could you please file a new ticket about these issues instead of commenting on this merged pull request? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

make a flexible coverage tool (DepthOfCoverage, DiagnoseTargets, ...)
6 participants