
(SV) re-interpreting CPX records by experimental interpretation tool #4602

Merged (4 commits), May 24, 2018

Conversation

@SHuang-Broad (Contributor) commented Mar 27, 2018

A not-so-elegant way to tackle #4323, and part of #4111.

UPDATE: fixes #4323


Brief explanation:

The <CPX> variants we currently output have an annotation SEGMENTS, which can contain:

  • 0 entries (the annotation is simply omitted): the head and tail alignments stitch together seamlessly on the reference, and all middle alignments are taken as inserted sequence
  • 1 entry: the head and tail alignments overlap on the reference over the region specified in the entry, hence we have a deletion or duplication of that region (depending on whether the segment is present in another annotation, ALT_ARRANGEMENT, and, if present, whether it is inverted), plus an insertion of more sequence
  • multiple entries: these are the truly complex ones

While the first two cases are easy to deal with, the last one is very difficult to parse into simple variants, and carries the inherent evil of ambiguity in representation (to demonstrate, a not-very-complicated one could look like this)

chr6	166997615	CPX_chr6:166997615-166997944	ACCCACAGACAGAAACACAGAGACATGTTTGGAAGCCAGTGTGGATGCCCTGTGATCTGTGTGTACACATGACAAGTGCATACACACGCACATAAAGGAACCCAGAGACGTGTTTGGAAGCCAGTGTGGACACCCTGTGATCTGTGCGTACACATTTGACACCTGCGTACACACTCACAGACAGAAACACAGAGATGTGTTTGGAAGCCAGTGTGGACATCCTGTGGTCTGCGCGTACACATGTGACAGGTACGTGCACGCCCACATACAGGAACACACAGAGGCCTTTGGAAGCCAGCATGGGCAGACAGGCCCTATCCCAAAGCGGCC	<CPX>	.	.	ALIGN_LENGTHS=309;ALT_ARRANGEMENT=1,2,3,4,5,UINS-733,2,UINS-94,-6,-5,UINS-41,4,5,UINS-40,1,2,3,4,5,6;CTG_GOOD_NONCANONICAL_MAPPING=chrUn_JTFH01000473v1_decoy,1,-,51H1640M204H,60,0,1640;CTG_NAMES=asm011602:tig00001;END=166997944;HQ_MAPPINGS=1;MAPPING_QUALITIES=60;MAX_ALIGN_LENGTH=309;SEGMENTS=chr6:166997615-166997617,chr6:166997617-166997679,chr6:166997679-166997727,chr6:166997727-166997787,chr6:166997787-166997831,chr6:166997832-166997944;SEQ_ALT_HAPLOTYPE=ACCCACAGACAGAAACACAGAGACATGTTTGGAAGCCAGTGTGGATGCCCTGTGATCTGTGTGTACACATGACAAGTGCATACACACGCACATAAAGGAACCCAGAGACGTGTTTGGAAGCCAGTGTGGACACCCTGTGATCTGTGCGTACACATTTGACACCTGCGTACACACTCACAGACAGAAACACAGAGATGTGTTTGGAAGCCAGCGTGGATGCCCTGTGATCTCTGCATACACGTGACACATGCATGCACAGGCCCATACAGGAGCAGAGAGACACATTTGGAAGCCGATGTACGCCCTGTGATCTGTGCGTACACGTGACACATGCGTACACACCCACTGACAAGAACACAGAGACGTGTTTGGAAGCCAGTGTGGACGCCCTGTAATCTGTGTGTACACACGTGACACATGCGTGCACACCCACTGACAAGAACAGAGACCCATTTGGAAGCCAGTGTGGGTGCCCTGTGATCTGATCTGTGTGTACACATGTGACACGTGCATCCACACCCACTGACAAGCACACAAGAGACACATTTGAAAGCCAGTGTGGATGCCTTGTGATCTGTGTGTACACATGTGACATGGGCATATGCACCTACAGACAGAAACGCAGAGATGCATTTGGAAGTCACTGTGGATACCTTGTCATCTGTGTGTACACATGAGACACTTGCATACACACCCACATACAGGAACACAGAGACACGTTTGGAAGCCAGTGTGGATGTCCTGTGATCTGTGTGCACACGTTACACGTGTACACAACCACTGACAAGAACATGGAGACACATTTGGAAGCTAGTGTGGACGCCCTGTAATCTGTGCATACACATGTGATACGTGTGTGCACACCCACTGACAAGAACATGGAGACCCATTTGGAAGGCAGTGTGGATGCCCTGTGATCTGTGTGCACACATGTGACACGTGCATGCACATCCACAGACAGAAACACAGAGACACGTTTGGAAGGCAGTGTGGATGCCCTGTGATCTGTGTGTATACGTGACACATGCATGCAAACCCACTGACAAGAACACATAGATGCATTTGGAAGCCAGTGTGGACGCCATGTGATCTGTGCCCACATATCACATGGCCGCTTTGGGATAGGGCCTGTCTGCCCATACTGGCTTCCAAAC
GCCTCTGTGTGTTCCTGTATGTGGGTGTGCACGTACCTGTCACATGTGTATGCACAGACCACAGGATGTCCACACTGGCTTCCAAATGCGTCTCTGTGTTCCTGTCTGTGAGTTCCAAATGTGTGCACACCTACAGACAGGAACATGGAAACACATTTGGAAGCCAGTGTGGACACCCTGTGATCTGTGCGTACACATGTGACACGTGCATGCACACCCACAGACAGGAACACAGAGACACATTTGGAAGCCAGTGTGGACGCCCTGTGATCTGTGCCCACACACATCACACGTGCATACACACCCACAGACAGGAACACAGAGACACATTTGGAAGCCAGTGTGGATGCCCTGTGATCTGTGTGTACACGTGACACGTGCGTACACACCCACATACAGGAACACAGCCACATTTGGAAGCCAGTGCAGACGCCCTGTGATCTGTGTGTACACATGTGACACGTGCGTGCACACTCACAGACAGGAACACAGAGACGCATTTGGAAGCCAGTGTGGACATCCTGTGGTCTGCGCGTACACATGTGACAGGTACGTGCACGCCCACATACAGGAACACACAGAGGCCTTTGGAAGCCAGCATGGGCAGACAGGCCCTATCCCAAAGCGGCC;SVLEN=1454;SVTYPE=CPX;TOTAL_MAPPINGS=1
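To make the one-entry case in the list above concrete, here is a toy sketch of deciding what a lone SEGMENTS entry implies from how often, and in what orientation, it appears in ALT_ARRANGEMENT. This is an illustration only, not the actual GATK-SV logic; the enum and token handling are assumptions.

```java
// Toy sketch (NOT the actual GATK code): if the lone segment never appears in
// ALT_ARRANGEMENT it was deleted; an inverted token ("-1") or multiple forward
// copies suggest an inverted or tandem duplication; a single forward copy means
// only inserted sequence remains to be reported.
import java.util.List;

public final class OneSegmentSketch {
    public enum SimpleType { DEL, INV_DUP, DUP, NONE }

    // altArrangement tokens as in the VCF, e.g. "1", "-1", "UINS-733"
    public static SimpleType interpret(final List<String> altArrangement) {
        final long forwardCopies = altArrangement.stream().filter("1"::equals).count();
        final boolean inverted = altArrangement.contains("-1");
        if (forwardCopies == 0 && !inverted) return SimpleType.DEL;    // segment dropped
        if (inverted)                        return SimpleType.INV_DUP; // inverted copy present
        if (forwardCopies > 1)               return SimpleType.DUP;     // extra forward copies
        return SimpleType.NONE;                                         // single copy: insertion only
    }

    public static void main(final String[] args) {
        System.out.println(interpret(List.of("UINS-52")));          // DEL
        System.out.println(interpret(List.of("1", "UINS-7", "1"))); // DUP
    }
}
```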

So the strategy taken in this branch is:

  • for the first two cases, re-interpretation is easy and done in this "post-processing" tool; bare-bone annotated simple variants are emitted, annotated with EVENT to link them back to the complex variant
  • for the last case,
    • re-collect the contigs that induced the CPX call, preprocess their alignments, then send each contig to the current pair-iteration algorithm for re-interpretation; the returned simple variants are checked for consistency with the CPX variant induced by the same contig, and dropped if inconsistent (two variant types, <DEL> and <INV>, are the main concerns, as they could easily stem from mis-interpretations of small dispersed duplications); then,
    • the CPX variants whose re-interpreted simple variants were rejected are analyzed one last time, to extract <DEL> and <INV>;
    • these variants are also annotated with EVENT to link back to the CPX variants.
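The bookkeeping in the steps above can be sketched minimally with plain maps standing in for htsjdk VariantContexts. The containment-based consistency check is a simplified stand-in for the tool's actual checks, and all names here are illustrative.

```java
// Sketch: drop re-interpreted simple variants inconsistent with their source
// CPX record, then stamp the survivors with an EVENT attribute that points
// back at the CPX record's ID.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class ReinterpretationLinkSketch {

    // Stand-in consistency test: keep a simple variant only if its interval is
    // contained in the CPX span (the real tool's checks on <DEL>/<INV> are richer).
    public static boolean consistentWithCpx(final int simpleStart, final int simpleEnd,
                                            final int cpxStart, final int cpxEnd) {
        return simpleStart >= cpxStart && simpleEnd <= cpxEnd;
    }

    public static List<Map<String, String>> linkConsistent(final List<Map<String, String>> simples,
                                                           final String cpxId,
                                                           final int cpxStart, final int cpxEnd) {
        final List<Map<String, String>> kept = new ArrayList<>();
        for (final Map<String, String> simple : simples) {
            final int start = Integer.parseInt(simple.get("POS"));
            final int end = Integer.parseInt(simple.get("END"));
            if (consistentWithCpx(start, end, cpxStart, cpxEnd)) {
                final Map<String, String> annotated = new HashMap<>(simple);
                annotated.put("EVENT", cpxId); // back-link to the complex variant
                kept.add(annotated);
            }
        }
        return kept;
    }
}
```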

Based on manual review, this salvages ~600 variants that would otherwise be dropped by evaluation scripts that simply ignore the CPX variants.


Tests will be added if this strategy is given the green light (so no merging yet).

@codecov-io commented Apr 7, 2018

Codecov Report

Merging #4602 into master will decrease coverage by 0.063%.
The diff coverage is 86.085%.

@@              Coverage Diff               @@
##             master     #4602       +/-   ##
==============================================
- Coverage      80.2%   80.137%   -0.063%     
+ Complexity    17502     17464       -38     
==============================================
  Files          1085      1082        -3     
  Lines         63248     63566      +318     
  Branches      10197     10241       +44     
==============================================
+ Hits          50725     50940      +215     
- Misses         8538      8616       +78     
- Partials       3985      4010       +25
Impacted Files Coverage Δ Complexity Δ
...e/hellbender/tools/spark/sv/utils/SVVCFWriter.java 87.755% <ø> (ø) 11 <0> (ø) ⬇️
...covery/inference/CpxVariantReInterpreterSpark.java 0% <0%> (ø) 0 <0> (?)
.../DiscoverVariantsFromContigAlignmentsSAMSpark.java 83.333% <100%> (+0.14%) 30 <0> (ø) ⬇️
...s/spark/sv/discovery/SvDiscoveryInputMetaData.java 100% <100%> (ø) 7 <5> (+5) ⬆️
...ery/inference/SimpleNovelAdjacencyInterpreter.java 74.667% <100%> (+0.694%) 11 <1> (ø) ⬇️
.../sv/discovery/inference/CpxVariantInterpreter.java 68.382% <50%> (ø) 25 <1> (ø) ⬇️
...iscoverFromLocalAssemblyContigAlignmentsSpark.java 77.982% <86.364%> (+1.14%) 2 <2> (ø) ⬇️
...nce/SegmentedCpxVariantSimpleVariantExtractor.java 89.038% <89.038%> (ø) 58 <58> (?)
.../sv/StructuralVariationDiscoveryPipelineSpark.java 88.652% <92.308%> (+0.081%) 13 <1> (ø) ⬇️
...adinstitute/hellbender/utils/spark/SparkUtils.java 72.727% <0%> (-11.797%) 12% <0%> (-9%)
... and 60 more

@SHuang-Broad force-pushed the sh-sv-cpx-improve branch 3 times, most recently from f6b6cd1 to 0f85fb6 (April 19, 2018 22:32)
@SHuang-Broad force-pushed the sh-sv-cpx-improve branch 3 times, most recently from ec7f516 to 13bf859 (April 26, 2018 22:34)
@mwalker174 self-assigned this Apr 30, 2018

@mwalker174 (Contributor) left a comment

I have some mostly minor technical comments about the code. I tried not to be too picky because this is experimental. The overall motivation and approach for this tool are sound. The code looks functional, but I am not familiar enough with our SV VCF spec to know that it handles possible corner cases correctly.

public final DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection discoverStageArgs;

public final JavaRDD<GATKRead> assemblyRawAlignments;
public static final class InputMetaData {
mwalker174 (Contributor):

I am just beginning to review, and it is not clear why you've done this, but it is making the code rather hard to read. If it does not conflict with any of your other branches under PR review, could you make InputMetaData a separate class?

SHuang-Broad (Author):

Sorry for the confusion. Basically, the class grew out of having two parallel tools for interpreting variants, one standard and one experimental, so I wanted to limit the number of parameters of both tools. As the number of fields of this class grows, I feel I should group them into smaller structs as well, but a different (and IMO slightly better) approach is produced in PR 4663, so I'll leave this entire class unchanged here.

public final JavaRDD<GATKRead> assemblyRawAlignments;
public static final class InputMetaData {

public final String sampleId;
mwalker174 (Contributor):

Even though these fields are final, I think that you should use private variables and supply getter methods.

SHuang-Broad (Author):

yep, will do that in PR 4663.

public final List<SVInterval> assembledIntervals;
public final PairedStrandedIntervalTree<EvidenceTargetLink> evidenceTargetLinks;
public final ReadMetadata metadata;
public final InputMetaData inputMetaData;
mwalker174 (Contributor):

Again make private and supply getter/setter methods.

SHuang-Broad (Author):

ditto

@@ -151,7 +151,7 @@ private static PreprocessedAndAnalysisReadyContigWithExpectedResults buildForCon
.id("CPX_chr2:4452298-4452406")
.attribute(VCFConstants.END_KEY, 4452406)
.attribute(SVTYPE, GATKSVVCFConstants.CPX_SV_SYB_ALT_ALLELE_STR)
.attribute(SVLEN, 109)
.attribute(SVLEN, 783)
mwalker174 (Contributor):

What's the reason for changing the lengths? It would be good to provide a comment explaining this change.

SHuang-Broad (Author):

This will be gone in this PR, as explained in the commit message. Basically, I'm changing the SVLEN field to follow its technical definition in the VCF spec, i.e. for precise variants, the difference in length between the reference and alt alleles.
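As a minimal illustration of that definition (plain strings standing in for htsjdk Alleles; this is a sketch, not the PR's code):

```java
// SVLEN for precise variants per the VCF spec: the difference in length
// between the ALT and REF alleles, so deletions are negative and insertions
// positive.
public final class SvLenSketch {
    public static int svLen(final String ref, final String alt) {
        return alt.length() - ref.length();
    }

    public static void main(final String[] args) {
        System.out.println(svLen("ACGTACGT", "A")); // 7 bp deletion: -7
        System.out.println(svLen("A", "ACGTACGT")); // 7 bp insertion: +7
    }
}
```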

* (Internal) Tries to extract simple variants from a provided GATK-SV CPX.vcf
*/
@DocumentedFeature
@BetaFeature
mwalker174 (Contributor):

Also add @Experimental annotation here.

SHuang-Broad (Author):

Nice to know, thanks!

try {
fatInsertionRefAllele = Allele.create(reference.getReferenceBases(refSegment).getBases(), true);
} catch (final IOException ioex) {
throw new GATKException("", ioex);
mwalker174 (Contributor):

Add a message here.

SHuang-Broad (Author):

done


final List<VariantContext> result = new ArrayList<>();
final Allele anchorBaseRefAllele = Allele.create(vc.getAlleles().get(0).getBases()[0], true);
final Allele fatInsertionRefAllele;
mwalker174 (Contributor):

It would be helpful to define "anchor" and "fat" in a comment somewhere. Are these standard members of the vernacular? I would maybe use "expanded" instead of "fat."

SHuang-Broad (Author):

Copying the comments here:

  • the anchor base is defined per VCF spec (see section 1.4.1 #REF of version 4.2): for DEL and INS variants it is the reference base at the position pointed to by POS; for DEL, the reference bases immediately following POS (up to and including the END base) are deleted, and for INS, the inserted sequence goes immediately after POS; variants like INV are not required to be reported this way, so there's some ambiguity
  • "fat" insertions exist because sometimes we have micro-deletions surrounding the insertion breakpoint, so the strategy here is to report them as "fat", i.e. the anchor base and the micro-deleted bases are all reported in REF.

.attribute(VCFConstants.END_KEY, refSegment.getEnd())
.attribute(EVENT_KEY, refSegment.getEnd())
.attribute(GATKSVVCFConstants.CONTIG_NAMES, evidenceContigs);
final VariantContext del = vcBuilderForDel.make();
mwalker174 (Contributor):

For the sake of readability, can you make a single function that will let you build any of these VC types e.g.

private static VariantContext buildVariant(final SimpleInterval refSegment, final int contig, final int start, final int stop, etc...) { return new VariantContextBuilder().chr(contig). ... .make(); }

SHuang-Broad (Author):

yep, extracted a static utility method

}

//==================================================================================================================
private static VariantContext getFrontIns(final VariantContext vc,
mwalker174 (Contributor):

Lots of shared code with getRearIns() - could you pull out the guts and make something like

private static VariantContext getInsOnSublist(final int altArrangementStart, final int altArrangementEnd, ...) {...}

SHuang-Broad (Author):

done

final Set<Integer> presentSegments = presentAndInvertedSegments._1;
final Set<Integer> invertedSegments = new HashSet<>( presentAndInvertedSegments._2 );

final String typeString = (String) simple.getAttribute(GATKSVVCFConstants.SVTYPE, "");
mwalker174 (Contributor):

Why not simple.getAttributeAsString() ?

SHuang-Broad (Author):

not sure why I did that... done

@SHuang-Broad force-pushed the sh-sv-cpx-improve branch 5 times, most recently from 9f23939 to 05d8bc3 (May 4, 2018 22:36)
@SHuang-Broad (Author):

@mwalker174 Hi Mark, I've refactored the code significantly; would you take a look again? Thanks!

@mwalker174 (Contributor):

@SHuang-Broad Thanks, the code looks much better. Next step is tests, then?

@SHuang-Broad (Author):

@mwalker174
Absolutely, I'll begin the testing process. Thanks!

@SHuang-Broad force-pushed the sh-sv-cpx-improve branch 7 times, most recently from b93d584 to 98caf53 (May 15, 2018 22:26)
@SHuang-Broad (Author):

@mwalker174
Hi Mark, I've finished writing the tests; would you please take another look?
Here's the log from running the whole pipeline (the number of simple variants extracted is approximately 1.5x the number of complex variants):

.... below is output for complex variants only
23:09:25.288 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 1334 variants.
23:09:25.288 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 1334
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 0
..... below is output from this tool
23:09:48.167 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 688 variants.
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 1
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 125
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 562
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 0
23:09:48.215 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 1555 variants.
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 21
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 500
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 189
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 845
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0

@SHuang-Broad force-pushed the sh-sv-cpx-improve branch 2 times, most recently from f44a177 to 4cc4ac7 (May 19, 2018 18:05)
@mwalker174 (Contributor) left a comment

Looks great, @SHuang-Broad

One final question (you might have already explained it in a comment somewhere): what was the reasoning for reporting some variants as a single locus (length 1)? For example, an insertion of size N could also be represented as an interval starting at the insertion point and extending for N bases. This is usually the way I have seen it done in other tools, but if your evaluation scripts are consistent I don't have any issue.

@SHuang-Broad (Author):

@mwalker174
Thanks!

To answer your question about the END == POS insertion variants:

First, to quote the spec:

For precise variants, END is POS + length of REF allele - 1,

Now, for simple insertions, the REF allele is a single-base allele, which by the above definition forces END to be equal to POS.

Second, if you look at the 4th variant (an insertion) on page 11 of spec version 4.2, END == POS.

So I'm following the VCF spec.
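The quoted rule in one line (an illustration, not code from this PR):

```java
// Per the VCF spec, for precise variants END = POS + len(REF) - 1, so a
// simple insertion with a single anchor base as REF necessarily has END == POS.
public final class EndPosSketch {
    public static int end(final int pos, final String refAllele) {
        return pos + refAllele.length() - 1;
    }

    public static void main(final String[] args) {
        System.out.println(end(100, "A"));    // insertion with 1 bp REF: 100
        System.out.println(end(100, "ACGT")); // 4 bp REF (e.g. a deletion): 103
    }
}
```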

It's a little ambiguous, as the spec doesn't give any example for replacements, i.e. cases where some ref bases are replaced by other bases, so:

  • when the ref sequence being replaced is <50 bp, I emit a "fat insertion", as documented here;
  • when the replaced region is ≥50 bp, a DEL call is emitted.
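The two bullets above as a toy rule (the 50 bp threshold is taken from the description; the method name is hypothetical):

```java
// Replacement reporting: a short replaced reference span folds into one "fat
// insertion" record; a long one is emitted as a DEL (with the insertion
// handled separately).
public final class ReplacementCallSketch {
    public static String callType(final int replacedRefSpan) {
        return replacedRefSpan < 50 ? "fat INS" : "DEL";
    }

    public static void main(final String[] args) {
        System.out.println(callType(12)); // fat INS
        System.out.println(callType(75)); // DEL
    }
}
```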

  * SvDiscoveryInputMetaData fields made private and replaced with getters
  * refactor test utils and data provider for the sv discovery subpackage
  * make changes in StructuralVariationDiscoveryPipelineSparkIntegrationTest to accommodate later integration tests
  * add a new tool named CpxVariantReInterpreterSpark to extract bare-bone-annotated simple variants from a GATK-SV discovery pipeline VCF containing complex variants (hence changes to StructuralVariationDiscoveryPipelineSpark as well)
  * unit test code
  * integration test and associated input & output files
@SHuang-Broad merged commit f005644 into master May 24, 2018
@SHuang-Broad deleted the sh-sv-cpx-improve branch May 25, 2018 19:53

Successfully merging this pull request may close these issues.

Tool to parse CPX variants into basic types