(SV) Misc. improvements to the new location inference and type interpretation tool #4562

Merged
merged 1 commit into from
Mar 27, 2018

Conversation

SHuang-Broad
Contributor

  • Output bam instead of sam for assembly alignments
  • Instead of creating directory, new interpretation tool writes files (behavior consistent with current interpretation tool)
  • Prefix with sample name for output files' names
  • Add INSLEN annotation when there's INSSEQ
  • Clarify the boundary between AlignedContig and AssemblyContigWithFineTunedAlignments
  • Increase test coverage for AssemblyContigAlignmentsConfigPicker

Up to date plans for more cleanups and improvements posted in #4111
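
As an illustration of the INSLEN item above, here is a minimal sketch of adding INSLEN whenever INSSEQ is present. It assumes INSLEN is simply the length of the INSSEQ string; the class, the method, and the plain attribute map are hypothetical stand-ins and not the actual GATK code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; a plain map stands in for a variant's INFO attributes.
final class InsLenSketch {

    /** Returns a copy of the attributes with INSLEN added when a non-empty INSSEQ exists. */
    static Map<String, Object> withInsLen(final Map<String, Object> attributes) {
        final Map<String, Object> annotated = new LinkedHashMap<>(attributes);
        final Object insSeq = annotated.get("INSSEQ");
        // Assumption: INSLEN is the number of bases in the inserted sequence.
        if (insSeq instanceof String && !((String) insSeq).isEmpty()) {
            annotated.put("INSLEN", ((String) insSeq).length());
        }
        return annotated;
    }
}
```

Records without INSSEQ pass through unchanged, so the two annotations always appear together.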

@SHuang-Broad SHuang-Broad force-pushed the sh-misc-improvements branch 2 times, most recently from 8998c91 to 7e37a74 Compare March 23, 2018 19:55
@codecov-io

codecov-io commented Mar 23, 2018

Codecov Report

Merging #4562 into master will decrease coverage by 18.083%.
The diff coverage is 42.373%.

@@               Coverage Diff                @@
##              master     #4562        +/-   ##
================================================
- Coverage     79.819%   61.737%   -18.083%     
+ Complexity     16999     13288      -3711     
================================================
  Files           1066      1062         -4     
  Lines          61876     61725       -151     
  Branches       10007      9993        -14     
================================================
- Hits           49389     38107     -11282     
- Misses          8578     20318     +11740     
+ Partials        3909      3300       -609

Impacted Files                                         Coverage Δ                 Complexity Δ
...ender/tools/spark/sv/utils/GATKSVVCFConstants.java  75% <ø> (ø)                1 <0> (ø) ⬇️
.../discovery/alignment/ContigAlignmentsModifier.java  85.535% <ø> (ø)            41 <0> (ø) ⬇️
...ce/AssemblyContigAlignmentSignatureClassifier.java  12.366% <0%> (-46.553%)    17 <0> (-13)
.../DiscoverVariantsFromContigAlignmentsSAMSpark.java  27.957% <0%> (-65.521%)    8 <0> (-14)
...v/discovery/inference/BreakpointComplications.java  60.311% <100%> (-0.118%)   20 <0> (ø)
...der/tools/spark/sv/utils/GATKSVVCFHeaderLines.java  82.54% <100%> (+0.282%)    7 <0> (ø) ⬇️
.../sv/discovery/inference/CpxVariantInterpreter.java  64.925% <100%> (-5.97%)    24 <0> (-1)
...ls/spark/sv/discovery/alignment/AlignedContig.java  92.453% <100%> (+2.798%)   22 <8> (+1) ⬆️
...ry/inference/CpxVariantInducingAssemblyContig.java  83.178% <100%> (+0.467%)   24 <0> (ø) ⬇️
.../sv/StructuralVariationDiscoveryPipelineSpark.java  32.394% <15%> (-56.412%)   0 <0> (-10)
... and 455 more

Member

@cwhelan cwhelan left a comment

Looks fine, just some minor comments. My biggest point is that it'd be nice to treat the input directory as a first class object and handle it programmatically.

@@ -110,14 +111,14 @@
= new DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection();
@Argument(doc = "sam file for aligned contigs", fullName = "contig-sam-file")
private String outputAssemblyAlignments;
@Argument(doc = "filename for output vcf", shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME,
@Argument(doc = "prefix for output vcf; sample name will be appended after the provided argument",
Member

Is this assuming that if you want files to appear in a directory then the argument should end with a trailing slash? I'd like to avoid that if possible. Can we handle this with, e.g., the Path API?

I.e., let's make this actually call for a directory. Then we should validate that the directory actually exists and create it if it's not there. Finally, we can create the files in that directory using an API call like Paths.get(outputDirectory).resolve(filename) for each filename we want.
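
A minimal sketch of what this suggestion could look like with java.nio.file, assuming a local filesystem path. The class and method names are hypothetical and are not taken from the PR; the tool itself may need to go through GATK's own path utilities for non-local destinations.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of GATK: illustrates the directory-based approach only.
final class OutputDirectorySketch {

    /** Creates the directory (and any missing parents) if absent and returns it as a Path. */
    static Path ensureDirectory(final String outputDirectory) throws IOException {
        // createDirectories does nothing if the directory already exists,
        // and throws if the path exists but is not a directory.
        return Files.createDirectories(Paths.get(outputDirectory));
    }

    /** Resolves "<sampleId>_<suffix>" inside the directory, trailing slash or not. */
    static Path outputFile(final Path directory, final String sampleId, final String suffix) {
        return directory.resolve(sampleId + "_" + suffix);
    }
}
```

Because the file name is resolved against a Path rather than concatenated onto a string, "out" and "out/" yield the same result.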

Contributor Author

is the commit 2fd080d similar to what you have in mind?

" the directory contains multiple VCF's for different types and record-generating SAM files of assembly contigs,",
fullName = "exp-variants-out-dir", optional = true)
private String expVariantsOutDir;
@Argument(doc = "prefix to output files of our prototyping breakpoint and type inference tool in addition to the master VCF;",
Member

remove "in addition to the master VCF" from this doc line

Contributor Author

done

fullName = "exp-variants-out-dir", optional = true)
private String expVariantsOutDir;
@Argument(doc = "prefix to output files of our prototyping breakpoint and type inference tool in addition to the master VCF;",
fullName = "exp-variants-out-prefix", optional = true)
Member

let's keep this called directory according to my comment above

@@ -163,8 +164,9 @@ protected void runTool( final JavaSparkContext ctx ) {
if(parsedAlignments.isEmpty()) return;

final Broadcast<SVIntervalTree<VariantContext>> cnvCallsBroadcast = broadcastCNVCalls(ctx, headerForReads, discoverStageArgs.cnvCallsFile);
final String outputPrefixWithSampleName = outputPrefix + (outputPrefix.endsWith("/") ? "" : "_") + SVUtils.getSampleId(headerForReads);
Member

If the user accidentally leaves off the trailing slash, the files will not go into the requested directory but into its parent, with a confusing name, I think.
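
To make the concern concrete, the sketch below reuses the shape of the expression under review, with the sample ID passed in as a plain string instead of coming from SVUtils.getSampleId. The class name and the example paths are made up for illustration.

```java
// Hypothetical demonstration of the string-concatenation pitfall; not GATK code.
final class PrefixPitfallSketch {

    // Same shape as the reviewed line: append "_" only when there is no trailing slash.
    static String withSampleName(final String outputPrefix, final String sampleId) {
        return outputPrefix + (outputPrefix.endsWith("/") ? "" : "_") + sampleId;
    }

    public static void main(final String[] args) {
        // With the slash, the name lands inside results/sv/
        System.out.println(withSampleName("results/sv/", "sampleA")); // results/sv/sampleA
        // Without it, the name lands in results/ and is prefixed with "sv_"
        System.out.println(withSampleName("results/sv", "sampleA"));  // results/sv_sampleA
    }
}
```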

Contributor Author

right. thanks for catching that. corrected

SVFileUtils.writeSAMFile(outputDir+"/"+rawTypeString+".sam", splitLongReads.collect().iterator(),
headerBroadcast.getValue(), false);
.map(read -> read.convertToSAMRecord(header));
header.setSortOrder(SAMFileHeader.SortOrder.queryname);
Member

Why sort these by queryname and not by position? I could be convinced either way but this requires an extra step to load the file into IGV.

Contributor Author

that was because I basically used Ribbon for reviewing those alignments,
which is definitely easier with SAM compared to with BAM.
I've now added a parameter to the utility:

  • if BAM, sort by coordinate order.
  • if SAM, sort by query name.
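
A rough sketch of that rule follows. The actual utility signature is not shown in this thread, so picking the order from the file extension is an assumption, and the small enum is a stand-in for htsjdk's SAMFileHeader.SortOrder so the example compiles without htsjdk.

```java
import java.util.Locale;

// Hypothetical sketch; the real utility takes a parameter whose exact form is not shown here.
final class SortOrderSketch {

    // Stand-in for htsjdk's SAMFileHeader.SortOrder (only the two values used here).
    enum SortOrder { coordinate, queryname }

    /** BAM output is coordinate-sorted (IGV-friendly); SAM output is queryname-sorted. */
    static SortOrder sortOrderFor(final String outputName) {
        final String lower = outputName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".bam")) {
            return SortOrder.coordinate;
        }
        if (lower.endsWith(".sam")) {
            return SortOrder.queryname;
        }
        throw new IllegalArgumentException("Expected a .bam or .sam output name: " + outputName);
    }
}
```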

@@ -391,7 +391,7 @@ private void experimentalInterpretation(final JavaSparkContext ctx,
final String contigName = AlignedAssemblyOrExcuse.formatContigName(alignedAssembly.getAssemblyId(), contigIdx);
final List<AlignmentInterval> arOfAContig
= getAlignmentsForOneContig(contigName, contigSequence, allAlignments.get(contigIdx), refNames, header);
return new AlignedContig(contigName, contigSequence, arOfAContig, false);
return new AlignedContig(contigName, contigSequence, arOfAContig);
Member

Can we change this variable name? arOfAContig is not obvious any more. Maybe alignmentsForContig?

Contributor Author

done

* or less summed mismatches if still tie
* first
* implement ordering that
* prefer the configuration with less alignments, then
Member

"prefers"

Contributor Author

done

* first
* implement ordering that
* prefer the configuration with less alignments, then
* prefer the configuration less summed mismatches if still tie
Member

"and prefers the configuration with a lower number of summed mismatches in case of a tie"

Contributor Author

done
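
The ordering described in that Javadoc (fewer alignments first, then fewer summed mismatches on a tie) can be sketched with a chained Comparator. The Configuration class below is a hypothetical stand-in with only the two fields the ordering needs; it is not the type used in AssemblyContigAlignmentsConfigPicker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the two-level ordering; not the GATK implementation.
final class ConfigurationOrderingSketch {

    static final class Configuration {
        final String label;
        final int alignmentCount;
        final int summedMismatches;

        Configuration(final String label, final int alignmentCount, final int summedMismatches) {
            this.label = label;
            this.alignmentCount = alignmentCount;
            this.summedMismatches = summedMismatches;
        }
    }

    // Prefer fewer alignments; break ties by the lower number of summed mismatches.
    static final Comparator<Configuration> ORDER =
            Comparator.comparingInt((Configuration c) -> c.alignmentCount)
                      .thenComparingInt(c -> c.summedMismatches);

    /** Returns the labels of the configurations in preferred order, leaving the input untouched. */
    static List<String> sortedLabels(final List<Configuration> configurations) {
        final List<Configuration> copy = new ArrayList<>(configurations);
        copy.sort(ORDER);
        final List<String> labels = new ArrayList<>();
        for (final Configuration c : copy) {
            labels.add(c.label);
        }
        return labels;
    }

    public static void main(final String[] args) {
        System.out.println(sortedLabels(Arrays.asList(
                new Configuration("a", 3, 0),
                new Configuration("b", 2, 5),
                new Configuration("c", 2, 1)))); // [c, b, a]
    }
}
```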

@@ -19,7 +19,7 @@
"Italiam, fato profugus, Laviniaque venit " +
"litora, multum ille et terris iactatus et alto " +
"vi superum saevae memorem Iunonis ob iram; " +
"multa quoque et bello passus, dum conderet urbem, " +
"multa quoque et bello passus, testConfigurationSorting conderet urbem, " +
Member

Was this maybe a global search/replace change? My high school Latin is pretty rusty but I don't think testConfigurationSorting is a word that can replace a conjunction like 'dum'. :)

Contributor Author

definitely better than mine, who initially named a function "dum" 😰

     bug-fix and improvement commit

BUG-FIX:
  * fix a bug in ContigAlignmentsModifier.splitGappedAlignment() that was keeping the aligner score for the split child alignments while using NO_NM; it now uses NO_AS

IMPROVEMENT 1:
  * new INSLEN annotation to accompany INSSEQ
  * assembly alignments are now output as BAM instead of SAM
  * sample name included in output VCF files
  * CLI now asks user to provide directory for VCF output and a boolean for optionally running experimental interpretation tool
  * experimental interpretation only writes BAM for ambiguous, incomplete, and misassembly suspects

IMPROVEMENT 2:
clarify the boundary between the two classes AlignedContig and AssemblyContigWithFineTunedAlignments
  * AlignedContig now represents assembly contig, possibly unmapped, whose alignments are given by aligner as-is
  * AssemblyContigWithFineTunedAlignments represents contig whose alignments underwent selection and gap-split

IMPROVEMENT 3:
increase test coverage for AssemblyContigAlignmentsConfigPicker
  * add test for speedUpWhenTooManyMappings()
  * add test for sortConfigurations()
  * add test for splitGaps() and gappedAlignmentOffersBetterCoverage()
@SHuang-Broad SHuang-Broad force-pushed the sh-misc-improvements branch from a6cb572 to c202eb3 Compare March 27, 2018 21:05
@SHuang-Broad SHuang-Broad merged commit 30669c3 into master Mar 27, 2018
@SHuang-Broad SHuang-Broad deleted the sh-misc-improvements branch March 27, 2018 22:07
@droazen
Contributor

droazen commented Mar 28, 2018

@SHuang-Broad The testSAMWriter_chr20.bam file that you added here should probably have gone into git lfs (src/test/resources/large), rather than directly into the repo. It's too late to fix (once it's in master, it's in the git history forever), but in the future please try to store test data bigger than ~1-2 MB or so in git lfs.

cwhelan pushed a commit to cwhelan/gatk-linked-reads that referenced this pull request May 25, 2018
4 participants