-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(SV) Misc. improvements to the new location inference and type interpretation tool #4562
Conversation
8998c91
to
7e37a74
Compare
Codecov Report
@@ Coverage Diff @@
## master #4562 +/- ##
================================================
- Coverage 79.819% 61.737% -18.083%
+ Complexity 16999 13288 -3711
================================================
Files 1066 1062 -4
Lines 61876 61725 -151
Branches 10007 9993 -14
================================================
- Hits 49389 38107 -11282
- Misses 8578 20318 +11740
+ Partials 3909 3300 -609
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine, just some minor comments. My biggest point is that it'd be nice to treat the input directory as a first class object and handle it programmatically.
@@ -110,14 +111,14 @@ | |||
= new DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection(); | |||
@Argument(doc = "sam file for aligned contigs", fullName = "contig-sam-file") | |||
private String outputAssemblyAlignments; | |||
@Argument(doc = "filename for output vcf", shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, | |||
@Argument(doc = "prefix for output vcf; sample name will be appended after the provided argument", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this assuming that if you want files to appear in a directory then the argument should end with a trailing slash? I'd like to avoid that if possible. Can we handle this with eg the Path API?
I.e. Let's make this actually call for a directory. Then we should validate that the directory actually exists and create it if it's not there. Finally we can create the files in that directory using an API call, like Paths.resolve(outputDirectory, filename)
for each filename we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the commit 2fd080d similar to what you have in mind?
" the directory contains multiple VCF's for different types and record-generating SAM files of assembly contigs,", | ||
fullName = "exp-variants-out-dir", optional = true) | ||
private String expVariantsOutDir; | ||
@Argument(doc = "prefix to output files of our prototyping breakpoint and type inference tool in addition to the master VCF;", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove "in addition to the master VCF" from this doc line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
fullName = "exp-variants-out-dir", optional = true) | ||
private String expVariantsOutDir; | ||
@Argument(doc = "prefix to output files of our prototyping breakpoint and type inference tool in addition to the master VCF;", | ||
fullName = "exp-variants-out-prefix", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep this called directory
according to my comment above
@@ -163,8 +164,9 @@ protected void runTool( final JavaSparkContext ctx ) { | |||
if(parsedAlignments.isEmpty()) return; | |||
|
|||
final Broadcast<SVIntervalTree<VariantContext>> cnvCallsBroadcast = broadcastCNVCalls(ctx, headerForReads, discoverStageArgs.cnvCallsFile); | |||
final String outputPrefixWithSampleName = outputPrefix + (outputPrefix.endsWith("/") ? "" : "_") + SVUtils.getSampleId(headerForReads); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the user accidentally leaves off the trailing slash the files will not go into the directory requested but will be in its parent with a confusing name, i think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right. thanks for catching that. corrected
SVFileUtils.writeSAMFile(outputDir+"/"+rawTypeString+".sam", splitLongReads.collect().iterator(), | ||
headerBroadcast.getValue(), false); | ||
.map(read -> read.convertToSAMRecord(header)); | ||
header.setSortOrder(SAMFileHeader.SortOrder.queryname); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why sort these by queryname and not by position? I could be convinced either way but this requires an extra step to load the file into IGV.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was because I basically used Ribbon for reviewing those alignments,
which is definitely easier with SAM compared to with BAM.
Now changed to adding an parameter to the utility
- if BAM, sort by coordinate order.
- if SAM, sort by query name.
@@ -391,7 +391,7 @@ private void experimentalInterpretation(final JavaSparkContext ctx, | |||
final String contigName = AlignedAssemblyOrExcuse.formatContigName(alignedAssembly.getAssemblyId(), contigIdx); | |||
final List<AlignmentInterval> arOfAContig | |||
= getAlignmentsForOneContig(contigName, contigSequence, allAlignments.get(contigIdx), refNames, header); | |||
return new AlignedContig(contigName, contigSequence, arOfAContig, false); | |||
return new AlignedContig(contigName, contigSequence, arOfAContig); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change this variable name? arOfAContig
is not obvious any more. Maybe alignmentsForContig
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* or less summed mismatches if still tie | ||
* first | ||
* implement ordering that | ||
* prefer the configuration with less alignments, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"prefers"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* first | ||
* implement ordering that | ||
* prefer the configuration with less alignments, then | ||
* prefer the configuration less summed mismatches if still tie |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"and prefers the configuration with a lower number of summed mismatches in case of a tie"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -19,7 +19,7 @@ | |||
"Italiam, fato profugus, Laviniaque venit " + | |||
"litora, multum ille et terris iactatus et alto " + | |||
"vi superum saevae memorem Iunonis ob iram; " + | |||
"multa quoque et bello passus, dum conderet urbem, " + | |||
"multa quoque et bello passus, testConfigurationSorting conderet urbem, " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this maybe a global search/replace change? My high school latin is pretty rusty but I don't think testConfigurationSorting
is a word that can replace a conjunction like 'dum'. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
definitely better than mine, who initially named a function "dum" 😰
bug-fix and improvement commit BUG-FIX: * fix a bug in ContigAlignmentsModifer.splitGappedAlignment() that was keeping the aligner score for the split child alignments while using NO_NM, it now uses NO_AS IMPROVEMENT 1: * new INSLEN annotation to accompany INSSEQ * assembly alignments are now output as BAM instead of SAM * sample name included in output VCF files * CLI now asks user to provide directory for VCF output and a boolean for optionally running experimental interpretation tool * experimental interpretation only writes BAM for ambiguous, incomplete, and misasseblysuspects IMPROVEMENT 2: clarify the boundary between the two classes AlignedContig and AssemblyContigWithFineTunedAlignments * AlignedContig now represents assembly contig, possibly unmapped, whose alignments are given by aligner as-is * AssemblyContigWithFineTunedAlignments represents contig whose alignments underwent selection and gap-split IMPROVEMENT 3: increase test coverage for AssemblyContigAlignmentsConfigPicker * add test for speedUpWhenTooManyMappings() * add test for sortConfigurations() * add test for splitGaps() and gappedAlignmentOffersBetterCoverage()
a6cb572
to
c202eb3
Compare
@SHuang-Broad The |
bug-fix and improvement commit BUG-FIX: * fix a bug in ContigAlignmentsModifer.splitGappedAlignment() that was keeping the aligner score for the split child alignments while using NO_NM, it now uses NO_AS IMPROVEMENT 1: * new INSLEN annotation to accompany INSSEQ * assembly alignments are now output as BAM instead of SAM * sample name included in output VCF files * CLI now asks user to provide directory for VCF output and a boolean for optionally running experimental interpretation tool * experimental interpretation only writes BAM for ambiguous, incomplete, and misasseblysuspects IMPROVEMENT 2: clarify the boundary between the two classes AlignedContig and AssemblyContigWithFineTunedAlignments * AlignedContig now represents assembly contig, possibly unmapped, whose alignments are given by aligner as-is * AssemblyContigWithFineTunedAlignments represents contig whose alignments underwent selection and gap-split IMPROVEMENT 3: increase test coverage for AssemblyContigAlignmentsConfigPicker * add test for speedUpWhenTooManyMappings() * add test for sortConfigurations() * add test for splitGaps() and gappedAlignmentOffersBetterCoverage()
INSLEN
annotation when there'sINSSEQ
AlignedContig
andAssemblyContigWithFineTunedAlignments
AssemblyContigAlignmentsConfigPicker
Up to date plans for more cleanups and improvements posted in #4111