-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(SV) re-interpreting CPX records by experimental interpretation tool #4602
Conversation
51eb3d3
to
936b0f2
Compare
Codecov Report
@@ Coverage Diff @@
## master #4602 +/- ##
==============================================
- Coverage 80.2% 80.137% -0.063%
+ Complexity 17502 17464 -38
==============================================
Files 1085 1082 -3
Lines 63248 63566 +318
Branches 10197 10241 +44
==============================================
+ Hits 50725 50940 +215
- Misses 8538 8616 +78
- Partials 3985 4010 +25
|
f6b6cd1
to
0f85fb6
Compare
ec7f516
to
13bf859
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some mostly minor technical comments about the code. I tried not to be too picky because this is experimental. The overall motivation and approach for this tool are sound. The code looks functional, but I am not familiar enough with our SV VCF spec to know that it handles possible corner cases correctly.
public final DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection discoverStageArgs; | ||
|
||
public final JavaRDD<GATKRead> assemblyRawAlignments; | ||
public static final class InputMetaData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just beginning to review and it is not clear why you've done this, but it is making the code rather hard to read. If it does not conflict with any of your other branches in PR, could you make InputMetaData a separate class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion.
Basically the class grew out of having two parallel tools for interpreting variants,
one standard one experimental, so I wanted to limit the number of parameters of both tools,
and as the number of fields of this class grow,
I feel like I should group them into smaller structs as well, but a different (and IMO slightly better)
way is produced in PR 4663, so I'll leave this entire class unchanged.
public final JavaRDD<GATKRead> assemblyRawAlignments; | ||
public static final class InputMetaData { | ||
|
||
public final String sampleId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though these fields are final, I think that you should use private variables and supply getter methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, will do that in PR 4663.
public final List<SVInterval> assembledIntervals; | ||
public final PairedStrandedIntervalTree<EvidenceTargetLink> evidenceTargetLinks; | ||
public final ReadMetadata metadata; | ||
public final InputMetaData inputMetaData; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again make private and supply getter/setter methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
@@ -151,7 +151,7 @@ private static PreprocessedAndAnalysisReadyContigWithExpectedResults buildForCon | |||
.id("CPX_chr2:4452298-4452406") | |||
.attribute(VCFConstants.END_KEY, 4452406) | |||
.attribute(SVTYPE, GATKSVVCFConstants.CPX_SV_SYB_ALT_ALLELE_STR) | |||
.attribute(SVLEN, 109) | |||
.attribute(SVLEN, 783) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for changing the lengths? It would be good to provide a comment explaining this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would be gone in this PR, as explained in the commit message.
Basically, I'm changing the SVLEN field to follow its technical definition
in VCF spec, i.e. for precise variants the difference between reference and alt alleles.
* (Internal) Tries to extract simple variants from a provided GATK-SV CPX.vcf | ||
*/ | ||
@DocumentedFeature | ||
@BetaFeature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add @Experimental
annotation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to know, thanks!
try { | ||
fatInsertionRefAllele = Allele.create(reference.getReferenceBases(refSegment).getBases(), true); | ||
} catch (final IOException ioex) { | ||
throw new GATKException("", ioex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a message here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
final List<VariantContext> result = new ArrayList<>(); | ||
final Allele anchorBaseRefAllele = Allele.create(vc.getAlleles().get(0).getBases()[0], true); | ||
final Allele fatInsertionRefAllele; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to define "anchor" and "fat" in a comment somewhere. Are these standard members of the vernacular? I would maybe use "expanded" instead of "fat."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copying the comments here:
- anchor base is defined per-VCF spec, that is, for DEL and INS variants, (see 1.4.1#REF version 4.2), the reference base at the position pointed to by POS, basically, for DEL, the reference bases immediately following POS are deleted (up to and including the END base), for INS, the INSSEQ are inserted immediately after POS; variants like INV are not required to be reported this way, so there's some ambiguity
- "fat" insertions exist because sometimes we have micro deletions surrounding the insertion breakpoint, so here the strategy is to report them as "fat", i.e. the anchor base and micro-deleted bases are reported in REF.
.attribute(VCFConstants.END_KEY, refSegment.getEnd()) | ||
.attribute(EVENT_KEY, refSegment.getEnd()) | ||
.attribute(GATKSVVCFConstants.CONTIG_NAMES, evidenceContigs); | ||
final VariantContext del = vcBuilderForDel.make(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the sake of readability, can you make a single function that will let you build any of these VC types e.g.
private static VariantContext buildVariant(final SimpleInterval refSegment, final int contig, final int start, final int stop, etc...) { return new VariantContextBuilder().chr(contig). ... .make(); }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, extracted a static utility method
} | ||
|
||
//================================================================================================================== | ||
private static VariantContext getFrontIns(final VariantContext vc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of shared code with getRearIns() - could you pull out the guts and make something like
private static VariantContext getInsOnSublist(final int altArrangementStart, final int altArrangementEnd, ...) {...}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
final Set<Integer> presentSegments = presentAndInvertedSegments._1; | ||
final Set<Integer> invertedSegments = new HashSet<>( presentAndInvertedSegments._2 ); | ||
|
||
final String typeString = (String) simple.getAttribute(GATKSVVCFConstants.SVTYPE, ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not simple.getAttributeAsString()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure why I did that... done
9f23939
to
05d8bc3
Compare
@mwalker174 Hi Mark, I've refactored the code significantly, would you take a look again? Thanks! |
@SHuang-Broad Thanks the code looks much better. Next step is tests then? |
@mwalker174 |
b93d584
to
98caf53
Compare
@mwalker174
|
f44a177
to
4cc4ac7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, @SHuang-Broad
One final question (you might have already explained in a comment somewhere) is what was the reasoning for reporting some variants as a single locus (length 1). For example, an insertion of size N could also be represented as an interval starting at the insertion point and extending for N bases. This is usually the way I have seen it done with other tools, but if your evaluation scripts are consistent I don't have any issue.
@mwalker174 To answer you question about the END==POS insertion variants, First quote the spec
Now for simple insertions, the REF allele is a single base allele, which by the above definition forces be equal to POS. Second, if you look at the 4th variant (insertion) on page 11 of the spec version 4.2, END == POS. So I'm following the VCF spec. It's a little ambiguous as the spec doesn't give any example for replacements, i.e. some ref bases are replaced by other bases, so
|
* SvDiscoveryInputMetaData fields made private and replaced with getters * refactor test utils and data provider for sv discovery subpackage * make changes in StructuralVariationDiscoveryPipelineSparkIntegrationTest to accomodate later integration tests
adds a new tool named CpxVariantReInterprepterSpark to extract barebone-annotated simple variants from an GATK-SV discovery pipeline produced VCF containing complex variants
(hence StructuralVariationDiscoveryPipelineSpark as well)
* unit test code * integration test and associated intput & output files
4cc4ac7
to
49de851
Compare
A not-so-elegant way to tackle #4323, and part of #4111 .
UPDATE: fixes #4323
Brief explanation:
The
<CPX>
variants we currently output has an annotationSEGMENTS
, which could containALT_ARRANGEMENT
, or if present, whether it is inverted), and insertion of more sequenceswhile the first two cases are easy to deal with, the last one is very difficult to parse into simple variants, and has the inherent evil of ambiguity in representations (to demonstrate, a not very complicated one could be like this)
So the strategy taken in this branch is
EVENT
that links the simple variants back to the complex variant<DEL>
and<INV>
, are main concerns as they could easily stem from mis-interpretations of small dispersed duplications); then,<DEL>
and<INV>
;EVENT
to link back to the CPX variants.Based on manual review, this salvages ~600 variants that would be dropped by evaluation scripts that would simply ignore the CPX variants.
Tests will be added if this strategy is given the green light (so no merging yet).