Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize some Mutect-related tools #5073

Merged
merged 4 commits into from
Aug 2, 2018
Merged

Conversation

davidbenjamin
Copy link
Contributor

@LeeTL1220 The first commit makes all concordance tools and MC3/M2 vcf merge much faster in Firecloud -- tasks that have been taking an hour will take a few minutes. The second commit makes GetPileupSummaries, hence the contamination task much faster. Previously that tool has cached all reads for the 100,000 bases around each site, so basically it had to do a whole bam's worth of I/O just to get ~60,000 pileups.

@@ -77,7 +77,7 @@ private void initializeTruthVariantsIfNecessary() {
}

if (truthVariants == null) {
truthVariants = new FeatureDataSource<>(new FeatureInput<>(truthVariantsFile, "truth"), CACHE_LOOKAHEAD, VariantContext.class);
truthVariants = new FeatureDataSource<>(new FeatureInput<>(truthVariantsFile, "truth"), CACHE_LOOKAHEAD, VariantContext.class, cloudPrefetchBuffer, cloudIndexPrefetchBuffer);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidbenjamin You should make this same change to the evalVariants FeatureDataSource.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cmnbroad, did it.

@codecov-io
Copy link

codecov-io commented Jul 31, 2018

Codecov Report

Merging #5073 into master will increase coverage by 0.034%.
The diff coverage is 92.308%.

@@               Coverage Diff               @@
##              master     #5073       +/-   ##
===============================================
+ Coverage     86.385%   86.419%   +0.034%     
- Complexity     28822     29084      +262     
===============================================
  Files           1791      1791               
  Lines         133561    134336      +775     
  Branches       14902     15138      +236     
===============================================
+ Hits          115377    116092      +715     
- Misses         12791     12808       +17     
- Partials        5393      5436       +43
Impacted Files Coverage Δ Complexity Δ
...ntamination/GetPileupSummariesIntegrationTest.java 95.455% <100%> (+0.216%) 4 <0> (ø) ⬇️
...e/hellbender/engine/AbstractConcordanceWalker.java 87.059% <100%> (ø) 13 <0> (ø) ⬇️
...ools/walkers/contamination/GetPileupSummaries.java 81.633% <88.889%> (+0.782%) 19 <9> (+1) ⬆️
...walkers/genotyper/GenotypingGivenAllelesUtils.java 40.541% <0%> (-20.998%) 2% <0%> (ø)
...utils/variant/GATKVariantContextUtilsUnitTest.java 88.811% <0%> (+2.209%) 186% <0%> (+43%) ⬆️
...pecaller/readthreading/ReadThreadingAssembler.java 70.769% <0%> (+2.271%) 79% <0%> (+27%) ⬆️
...lbender/utils/variant/GATKVariantContextUtils.java 88.601% <0%> (+3.306%) 409% <0%> (+188%) ⬆️
...org/broadinstitute/hellbender/utils/BaseUtils.java 90.265% <0%> (+3.54%) 48% <0%> (+2%) ⬆️
...utils/smithwaterman/SmithWatermanIntelAligner.java 90% <0%> (+10%) 3% <0%> (ø) ⬇️
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java 90% <0%> (+30%) 2% <0%> (ø) ⬇️

Copy link
Contributor

@LeeTL1220 LeeTL1220 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidbenjamin One minor comment.

@@ -55,14 +53,15 @@
* gatk GetPileupSummaries \
* -I tumor.bam \
* -V common_biallelic.vcf.gz \
* -V common_biallelic.vcf.gz \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why -V twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops -- one is -V and one should be -L. I have to admit this is kind of clumsy but it's not utterly redundant, at least. For example, one might want to use some custom interval list (-L) of common sites but grab the allele frequencies from gnomAD (-V).

Fixed it and expanded the javadoc a bit.

@davidbenjamin
Copy link
Contributor Author

@LeeTL1220 back to you.

Copy link
Contributor

@LeeTL1220 LeeTL1220 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidbenjamin One more doc comment, then feel free to merge.

@@ -65,6 +65,17 @@
* -O pileups.table
* </pre>
*
* Although the sites (-L) and variants (-V) resources will often be identical, this need not be the case. For example,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically these files would only be different if you were trying to use a subset of the variants? I.e. -L was a subset of -V? If so, can you add that to the doc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@davidbenjamin davidbenjamin merged commit 27716dc into master Aug 2, 2018
@davidbenjamin davidbenjamin deleted the db_m2_optimizations branch August 2, 2018 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants