
(SV) re-interpreting CPX records by experimental interpretation tool #4602

Merged (4 commits), May 24, 2018

Conversation

@SHuang-Broad (Contributor) commented Mar 27, 2018

A not-so-elegant way to tackle #4323, and part of #4111.

UPDATE: fixes #4323


Brief explanation:

The <CPX> variants we currently output have an annotation SEGMENTS, which can contain:

  • 0 entries (the annotation is simply omitted): the head and tail alignments stitch together seamlessly on the reference, and all middle alignments are taken as inserted sequence
  • 1 entry: the head and tail alignments overlap on the reference over the region specified in the entry, hence we have a deletion or duplication of that region (depending on whether the segment is present in another annotation, ALT_ARRANGEMENT, and, if present, whether it is inverted), plus an insertion of more sequence
  • multiple entries: these are the truly complex ones

While the first two cases are easy to deal with, the last one is very difficult to parse into simple variants, and carries the inherent evil of ambiguity in representation (to demonstrate, a not-very-complicated one could look like this)

chr6	166997615	CPX_chr6:166997615-166997944	ACCCACAGACAGAAACACAGAGACATGTTTGGAAGCCAGTGTGGATGCCCTGTGATCTGTGTGTACACATGACAAGTGCATACACACGCACATAAAGGAACCCAGAGACGTGTTTGGAAGCCAGTGTGGACACCCTGTGATCTGTGCGTACACATTTGACACCTGCGTACACACTCACAGACAGAAACACAGAGATGTGTTTGGAAGCCAGTGTGGACATCCTGTGGTCTGCGCGTACACATGTGACAGGTACGTGCACGCCCACATACAGGAACACACAGAGGCCTTTGGAAGCCAGCATGGGCAGACAGGCCCTATCCCAAAGCGGCC	<CPX>	.	.	ALIGN_LENGTHS=309;ALT_ARRANGEMENT=1,2,3,4,5,UINS-733,2,UINS-94,-6,-5,UINS-41,4,5,UINS-40,1,2,3,4,5,6;CTG_GOOD_NONCANONICAL_MAPPING=chrUn_JTFH01000473v1_decoy,1,-,51H1640M204H,60,0,1640;CTG_NAMES=asm011602:tig00001;END=166997944;HQ_MAPPINGS=1;MAPPING_QUALITIES=60;MAX_ALIGN_LENGTH=309;SEGMENTS=chr6:166997615-166997617,chr6:166997617-166997679,chr6:166997679-166997727,chr6:166997727-166997787,chr6:166997787-166997831,chr6:166997832-166997944;SEQ_ALT_HAPLOTYPE=ACCCACAGACAGAAACACAGAGACATGTTTGGAAGCCAGTGTGGATGCCCTGTGATCTGTGTGTACACATGACAAGTGCATACACACGCACATAAAGGAACCCAGAGACGTGTTTGGAAGCCAGTGTGGACACCCTGTGATCTGTGCGTACACATTTGACACCTGCGTACACACTCACAGACAGAAACACAGAGATGTGTTTGGAAGCCAGCGTGGATGCCCTGTGATCTCTGCATACACGTGACACATGCATGCACAGGCCCATACAGGAGCAGAGAGACACATTTGGAAGCCGATGTACGCCCTGTGATCTGTGCGTACACGTGACACATGCGTACACACCCACTGACAAGAACACAGAGACGTGTTTGGAAGCCAGTGTGGACGCCCTGTAATCTGTGTGTACACACGTGACACATGCGTGCACACCCACTGACAAGAACAGAGACCCATTTGGAAGCCAGTGTGGGTGCCCTGTGATCTGATCTGTGTGTACACATGTGACACGTGCATCCACACCCACTGACAAGCACACAAGAGACACATTTGAAAGCCAGTGTGGATGCCTTGTGATCTGTGTGTACACATGTGACATGGGCATATGCACCTACAGACAGAAACGCAGAGATGCATTTGGAAGTCACTGTGGATACCTTGTCATCTGTGTGTACACATGAGACACTTGCATACACACCCACATACAGGAACACAGAGACACGTTTGGAAGCCAGTGTGGATGTCCTGTGATCTGTGTGCACACGTTACACGTGTACACAACCACTGACAAGAACATGGAGACACATTTGGAAGCTAGTGTGGACGCCCTGTAATCTGTGCATACACATGTGATACGTGTGTGCACACCCACTGACAAGAACATGGAGACCCATTTGGAAGGCAGTGTGGATGCCCTGTGATCTGTGTGCACACATGTGACACGTGCATGCACATCCACAGACAGAAACACAGAGACACGTTTGGAAGGCAGTGTGGATGCCCTGTGATCTGTGTGTATACGTGACACATGCATGCAAACCCACTGACAAGAACACATAGATGCATTTGGAAGCCAGTGTGGACGCCATGTGATCTGTGCCCACATATCACATGGCCGCTTTGGGATAGGGCCTGTCTGCCCATACTGGCTTCCAAAC
GCCTCTGTGTGTTCCTGTATGTGGGTGTGCACGTACCTGTCACATGTGTATGCACAGACCACAGGATGTCCACACTGGCTTCCAAATGCGTCTCTGTGTTCCTGTCTGTGAGTTCCAAATGTGTGCACACCTACAGACAGGAACATGGAAACACATTTGGAAGCCAGTGTGGACACCCTGTGATCTGTGCGTACACATGTGACACGTGCATGCACACCCACAGACAGGAACACAGAGACACATTTGGAAGCCAGTGTGGACGCCCTGTGATCTGTGCCCACACACATCACACGTGCATACACACCCACAGACAGGAACACAGAGACACATTTGGAAGCCAGTGTGGATGCCCTGTGATCTGTGTGTACACGTGACACGTGCGTACACACCCACATACAGGAACACAGCCACATTTGGAAGCCAGTGCAGACGCCCTGTGATCTGTGTGTACACATGTGACACGTGCGTGCACACTCACAGACAGGAACACAGAGACGCATTTGGAAGCCAGTGTGGACATCCTGTGGTCTGCGCGTACACATGTGACAGGTACGTGCACGCCCACATACAGGAACACACAGAGGCCTTTGGAAGCCAGCATGGGCAGACAGGCCCTATCCCAAAGCGGCC;SVLEN=1454;SVTYPE=CPX;TOTAL_MAPPINGS=1
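To make the one-entry case in the list above concrete, here is a toy sketch of deciding what a lone SEGMENTS entry implies from how often, and in what orientation, it appears in ALT_ARRANGEMENT. This is an illustration only, not the actual GATK-SV logic; the enum and token handling are assumptions.

```java
// Toy sketch (NOT the actual GATK code): if the lone segment never appears in
// ALT_ARRANGEMENT it was deleted; an inverted token ("-1") or multiple forward
// copies suggest an inverted or tandem duplication; a single forward copy means
// only inserted sequence remains to be reported.
import java.util.List;

public final class OneSegmentSketch {
    public enum SimpleType { DEL, INV_DUP, DUP, NONE }

    // altArrangement tokens as in the VCF, e.g. "1", "-1", "UINS-733"
    public static SimpleType interpret(final List<String> altArrangement) {
        final long forwardCopies = altArrangement.stream().filter("1"::equals).count();
        final boolean inverted = altArrangement.contains("-1");
        if (forwardCopies == 0 && !inverted) return SimpleType.DEL;    // segment dropped
        if (inverted)                        return SimpleType.INV_DUP; // inverted copy present
        if (forwardCopies > 1)               return SimpleType.DUP;     // extra forward copies
        return SimpleType.NONE;                                         // single copy: insertion only
    }

    public static void main(final String[] args) {
        System.out.println(interpret(List.of("UINS-52")));          // DEL
        System.out.println(interpret(List.of("1", "UINS-7", "1"))); // DUP
    }
}
```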

So the strategy taken in this branch is:

  • for the first two cases, re-interpretation is easy and done in this "post-processing" tool; bare-bone annotated simple variants are emitted, annotated with EVENT to link them back to the complex variant
  • for the last case,
    • re-collect the contigs that induced the CPX call, preprocess their alignments, then send each contig to the current pair-iteration algorithm for re-interpretation; the returned simple variants are checked for consistency with the CPX variant induced by the same contig, and dropped if inconsistent (two variant types, <DEL> and <INV>, are the main concerns, as they could easily stem from mis-interpretations of small dispersed duplications); then,
    • the CPX variants whose re-interpreted simple variants were rejected are analyzed one last time, to extract <DEL> and <INV>;
    • these variants are also annotated with EVENT to link back to the CPX variants.
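The bookkeeping in the steps above can be sketched minimally with plain maps standing in for htsjdk VariantContexts. The containment-based consistency check is a simplified stand-in for the tool's actual checks, and all names here are illustrative.

```java
// Sketch: drop re-interpreted simple variants inconsistent with their source
// CPX record, then stamp the survivors with an EVENT attribute that points
// back at the CPX record's ID.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class ReinterpretationLinkSketch {

    // Stand-in consistency test: keep a simple variant only if its interval is
    // contained in the CPX span (the real tool's checks on <DEL>/<INV> are richer).
    public static boolean consistentWithCpx(final int simpleStart, final int simpleEnd,
                                            final int cpxStart, final int cpxEnd) {
        return simpleStart >= cpxStart && simpleEnd <= cpxEnd;
    }

    public static List<Map<String, String>> linkConsistent(final List<Map<String, String>> simples,
                                                           final String cpxId,
                                                           final int cpxStart, final int cpxEnd) {
        final List<Map<String, String>> kept = new ArrayList<>();
        for (final Map<String, String> simple : simples) {
            final int start = Integer.parseInt(simple.get("POS"));
            final int end = Integer.parseInt(simple.get("END"));
            if (consistentWithCpx(start, end, cpxStart, cpxEnd)) {
                final Map<String, String> annotated = new HashMap<>(simple);
                annotated.put("EVENT", cpxId); // back-link to the complex variant
                kept.add(annotated);
            }
        }
        return kept;
    }
}
```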

Based on manual review, this salvages ~600 variants that would otherwise be dropped by evaluation scripts that simply ignore the CPX variants.


Tests will be added if this strategy is given the green light (so no merging yet).

@codecov-io commented Apr 7, 2018

Codecov Report

Merging #4602 into master will decrease coverage by 0.063%.
The diff coverage is 86.085%.

@@              Coverage Diff               @@
##             master     #4602       +/-   ##
==============================================
- Coverage      80.2%   80.137%   -0.063%     
+ Complexity    17502     17464       -38     
==============================================
  Files          1085      1082        -3     
  Lines         63248     63566      +318     
  Branches      10197     10241       +44     
==============================================
+ Hits          50725     50940      +215     
- Misses         8538      8616       +78     
- Partials       3985      4010       +25
Impacted Files Coverage Δ Complexity Δ
...e/hellbender/tools/spark/sv/utils/SVVCFWriter.java 87.755% <ø> (ø) 11 <0> (ø) ⬇️
...covery/inference/CpxVariantReInterpreterSpark.java 0% <0%> (ø) 0 <0> (?)
.../DiscoverVariantsFromContigAlignmentsSAMSpark.java 83.333% <100%> (+0.14%) 30 <0> (ø) ⬇️
...s/spark/sv/discovery/SvDiscoveryInputMetaData.java 100% <100%> (ø) 7 <5> (+5) ⬆️
...ery/inference/SimpleNovelAdjacencyInterpreter.java 74.667% <100%> (+0.694%) 11 <1> (ø) ⬇️
.../sv/discovery/inference/CpxVariantInterpreter.java 68.382% <50%> (ø) 25 <1> (ø) ⬇️
...iscoverFromLocalAssemblyContigAlignmentsSpark.java 77.982% <86.364%> (+1.14%) 2 <2> (ø) ⬇️
...nce/SegmentedCpxVariantSimpleVariantExtractor.java 89.038% <89.038%> (ø) 58 <58> (?)
.../sv/StructuralVariationDiscoveryPipelineSpark.java 88.652% <92.308%> (+0.081%) 13 <1> (ø) ⬇️
...adinstitute/hellbender/utils/spark/SparkUtils.java 72.727% <0%> (-11.797%) 12% <0%> (-9%)
... and 60 more

@SHuang-Broad force-pushed the sh-sv-cpx-improve branch 3 times, most recently from f6b6cd1 to 0f85fb6 (April 19, 2018 22:32)
@SHuang-Broad force-pushed the sh-sv-cpx-improve branch 3 times, most recently from ec7f516 to 13bf859 (April 26, 2018 22:34)
@mwalker174 self-assigned this Apr 30, 2018

@mwalker174 (Contributor) left a comment

I have some mostly minor technical comments about the code. I tried not to be too picky because this is experimental. The overall motivation and approach for this tool are sound. The code looks functional, but I am not familiar enough with our SV VCF spec to know that it handles possible corner cases correctly.

public final DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection discoverStageArgs;

public final JavaRDD<GATKRead> assemblyRawAlignments;
public static final class InputMetaData {
mwalker174 (Contributor):

I am just beginning to review, and it is not clear why you've done this, but it is making the code rather hard to read. If it does not conflict with any of your other branches under PR review, could you make InputMetaData a separate class?

SHuang-Broad (Author):

Sorry for the confusion. Basically, the class grew out of having two parallel tools for interpreting variants, one standard and one experimental, so I wanted to limit the number of parameters of both tools. As the number of fields of this class grows, I feel I should group them into smaller structs as well, but a different (and IMO slightly better) approach is produced in PR 4663, so I'll leave this entire class unchanged here.

public final JavaRDD<GATKRead> assemblyRawAlignments;
public static final class InputMetaData {

public final String sampleId;
mwalker174 (Contributor):

Even though these fields are final, I think that you should use private variables and supply getter methods.

SHuang-Broad (Author):

yep, will do that in PR 4663.

public final List<SVInterval> assembledIntervals;
public final PairedStrandedIntervalTree<EvidenceTargetLink> evidenceTargetLinks;
public final ReadMetadata metadata;
public final InputMetaData inputMetaData;
mwalker174 (Contributor):

Again make private and supply getter/setter methods.

SHuang-Broad (Author):

ditto

@@ -151,7 +151,7 @@ private static PreprocessedAndAnalysisReadyContigWithExpectedResults buildForCon
.id("CPX_chr2:4452298-4452406")
.attribute(VCFConstants.END_KEY, 4452406)
.attribute(SVTYPE, GATKSVVCFConstants.CPX_SV_SYB_ALT_ALLELE_STR)
.attribute(SVLEN, 109)
.attribute(SVLEN, 783)
mwalker174 (Contributor):

What's the reason for changing the lengths? It would be good to provide a comment explaining this change.

SHuang-Broad (Author):

This will be gone in this PR, as explained in the commit message. Basically, I'm changing the SVLEN field to follow its technical definition in the VCF spec, i.e. for precise variants, the difference in length between the reference and alt alleles.
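As a minimal illustration of that definition (plain strings standing in for htsjdk Alleles; this is a sketch, not the PR's code):

```java
// SVLEN for precise variants per the VCF spec: the difference in length
// between the ALT and REF alleles, so deletions are negative and insertions
// positive.
public final class SvLenSketch {
    public static int svLen(final String ref, final String alt) {
        return alt.length() - ref.length();
    }

    public static void main(final String[] args) {
        System.out.println(svLen("ACGTACGT", "A")); // 7 bp deletion: -7
        System.out.println(svLen("A", "ACGTACGT")); // 7 bp insertion: +7
    }
}
```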

* (Internal) Tries to extract simple variants from a provided GATK-SV CPX.vcf
*/
@DocumentedFeature
@BetaFeature
mwalker174 (Contributor):

Also add @Experimental annotation here.

SHuang-Broad (Author):

Nice to know, thanks!

try {
fatInsertionRefAllele = Allele.create(reference.getReferenceBases(refSegment).getBases(), true);
} catch (final IOException ioex) {
throw new GATKException("", ioex);
mwalker174 (Contributor):

Add a message here.

SHuang-Broad (Author):

done


final List<VariantContext> result = new ArrayList<>();
final Allele anchorBaseRefAllele = Allele.create(vc.getAlleles().get(0).getBases()[0], true);
final Allele fatInsertionRefAllele;
mwalker174 (Contributor):

It would be helpful to define "anchor" and "fat" in a comment somewhere. Are these standard members of the vernacular? I would maybe use "expanded" instead of "fat."

SHuang-Broad (Author):

Copying the comments here:

  • the anchor base is defined per VCF spec (see section 1.4.1 #REF of version 4.2): for DEL and INS variants it is the reference base at the position pointed to by POS; for DEL, the reference bases immediately following POS (up to and including the END base) are deleted, and for INS, the inserted sequence goes immediately after POS; variants like INV are not required to be reported this way, so there's some ambiguity
  • "fat" insertions exist because sometimes we have micro-deletions surrounding the insertion breakpoint, so the strategy here is to report them as "fat", i.e. the anchor base and the micro-deleted bases are all reported in REF.

.attribute(VCFConstants.END_KEY, refSegment.getEnd())
.attribute(EVENT_KEY, refSegment.getEnd())
.attribute(GATKSVVCFConstants.CONTIG_NAMES, evidenceContigs);
final VariantContext del = vcBuilderForDel.make();
mwalker174 (Contributor):

For the sake of readability, can you make a single function that will let you build any of these VC types e.g.

private static VariantContext buildVariant(final SimpleInterval refSegment, final int contig, final int start, final int stop, etc...) { return new VariantContextBuilder().chr(contig). ... .make(); }

SHuang-Broad (Author):

yep, extracted a static utility method

}

//==================================================================================================================
private static VariantContext getFrontIns(final VariantContext vc,
mwalker174 (Contributor):

Lots of shared code with getRearIns() - could you pull out the guts and make something like

private static VariantContext getInsOnSublist(final int altArrangementStart, final int altArrangementEnd, ...) {...}

SHuang-Broad (Author):

done

final Set<Integer> presentSegments = presentAndInvertedSegments._1;
final Set<Integer> invertedSegments = new HashSet<>( presentAndInvertedSegments._2 );

final String typeString = (String) simple.getAttribute(GATKSVVCFConstants.SVTYPE, "");
mwalker174 (Contributor):

Why not simple.getAttributeAsString() ?

SHuang-Broad (Author):

not sure why I did that... done

@SHuang-Broad force-pushed the sh-sv-cpx-improve branch 5 times, most recently from 9f23939 to 05d8bc3 (May 4, 2018 22:36)
@SHuang-Broad (Author):

@mwalker174 Hi Mark, I've refactored the code significantly; would you take a look again? Thanks!

@mwalker174 (Contributor):

@SHuang-Broad Thanks, the code looks much better. Next step is tests, then?

@SHuang-Broad (Author):

@mwalker174
Absolutely, I'll begin the testing process. Thanks!

@SHuang-Broad force-pushed the sh-sv-cpx-improve branch 7 times, most recently from b93d584 to 98caf53 (May 15, 2018 22:26)
@SHuang-Broad (Author):

@mwalker174
Hi Mark, I've finished writing the tests; would you please take another look?
Here's the log from running the whole pipeline (the number of simple variants extracted is approximately 1.5x the number of complex variants):

.... below is output for complex variants only
23:09:25.288 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 1334 variants.
23:09:25.288 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 1334
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 0
23:09:25.289 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 0
..... below is output from this tool
23:09:48.167 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 688 variants.
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 1
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 125
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 562
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0
23:09:48.168 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 0
23:09:48.215 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 1555 variants.
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 21
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 500
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 189
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 845
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
23:09:48.216 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0

@SHuang-Broad force-pushed the sh-sv-cpx-improve branch 2 times, most recently from f44a177 to 4cc4ac7 (May 19, 2018 18:05)
@mwalker174 (Contributor) left a comment

Looks great, @SHuang-Broad

One final question (you might have already explained it in a comment somewhere): what was the reasoning for reporting some variants as a single locus (length 1)? For example, an insertion of size N could also be represented as an interval starting at the insertion point and extending for N bases. This is usually the way I have seen it done in other tools, but if your evaluation scripts are consistent I don't have any issue.

@SHuang-Broad (Author):

@mwalker174
Thanks!

To answer your question about the END == POS insertion variants:

First, to quote the spec:

For precise variants, END is POS + length of REF allele - 1,

Now, for simple insertions, the REF allele is a single-base allele, which by the above definition forces END to be equal to POS.

Second, if you look at the 4th variant (an insertion) on page 11 of spec version 4.2, END == POS.

So I'm following the VCF spec.
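The quoted rule in one line (an illustration, not code from this PR):

```java
// Per the VCF spec, for precise variants END = POS + len(REF) - 1, so a
// simple insertion with a single anchor base as REF necessarily has END == POS.
public final class EndPosSketch {
    public static int end(final int pos, final String refAllele) {
        return pos + refAllele.length() - 1;
    }

    public static void main(final String[] args) {
        System.out.println(end(100, "A"));    // insertion with 1 bp REF: 100
        System.out.println(end(100, "ACGT")); // 4 bp REF (e.g. a deletion): 103
    }
}
```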

It's a little ambiguous, as the spec doesn't give any example for replacements, i.e. cases where some ref bases are replaced by other bases, so:

  • when the ref sequence being replaced is <50 bp, I emit a "fat insertion", as documented here;
  • when the replaced region is ≥50 bp, a DEL call is emitted.
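The two bullets above as a toy rule (the 50 bp threshold is taken from the description; the method name is hypothetical):

```java
// Replacement reporting: a short replaced reference span folds into one "fat
// insertion" record; a long one is emitted as a DEL (with the insertion
// handled separately).
public final class ReplacementCallSketch {
    public static String callType(final int replacedRefSpan) {
        return replacedRefSpan < 50 ? "fat INS" : "DEL";
    }

    public static void main(final String[] args) {
        System.out.println(callType(12)); // fat INS
        System.out.println(callType(75)); // DEL
    }
}
```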

  * SvDiscoveryInputMetaData fields made private and replaced with getters
  * refactor test utils and data provider for the sv discovery subpackage
  * make changes in StructuralVariationDiscoveryPipelineSparkIntegrationTest to accommodate later integration tests
  * add a new tool named CpxVariantReInterpreterSpark to extract bare-bone-annotated simple variants from a GATK-SV discovery pipeline VCF containing complex variants (hence changes to StructuralVariationDiscoveryPipelineSpark as well)
  * unit test code
  * integration test and associated input & output files
@SHuang-Broad merged commit f005644 into master May 24, 2018
@SHuang-Broad deleted the sh-sv-cpx-improve branch May 25, 2018 19:53

Successfully merging this pull request may close these issues.

Tool to parse CPX variants into basic types