PrintSVEvidence tool #7026
Conversation
Not sure if this is already under discussion, but it would be great to output raw allele counts rather than BAF at some point. (I see @tedsharpe has a branch, maybe this is meant to cover that---in which case, carry on!)
Yes, we are working on changing the output format. We may actually move to a binary format for evidence in the future, as the text parsing is rather slow, prohibitively so as we look to scale to cohorts of 1M+. But I agree, it may be beneficial to have counts and depth instead of fractions.
Travis reported job failures from build 32512
```diff
 } else {
-    throw new UserException("File " + drivingPath.getRawInputString() + " contains features of the wrong type.");
+    throw new UserException("File " + drivingFeaturesPath.toPath() + " contains features of the wrong type.");
```
@mwalker174 Some of these FeatureWalker changes appear to revert changes that we made recently. I'm wondering if these are really necessary. Can they be removed (other than the header caching, anyway)? Does something fail without these?
@cmnbroad Thanks for catching this, I think these are errors from merge conflicts. I've reverted most of the changes, but I don't see another way to retrieve the header.
This looks good apart from a few minor changes. I also took a crack at refactoring the evidence types and codecs to use wildcard types and generics so that PrintSVEvidence could use a single output stream and not have so many cascading if clauses depending on the type of evidence it's working with. My attempt is in the branch cw_print_sv_evidence_refactor. Let me know what you think of it. I'll make a PR against your branch that you can use if you like it. We should probably have an engine team member review the classes in hellbender.utils.io closely if they haven't done so already.
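For reference, the single-output-stream idea can be sketched with a sink parameterized on the evidence type, so the type-specific logic lives in one encoder function rather than a cascade of if clauses. `EvidenceSink` and its encoder are hypothetical names for illustration, not the actual API in the cw_print_sv_evidence_refactor branch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical generic sink: one instance per run, parameterized on the
// evidence record type F; the encoder turns a record into one text line.
final class EvidenceSink<F> {
    private final Function<F, String> encoder;
    private final List<String> lines = new ArrayList<>();

    EvidenceSink(final Function<F, String> encoder) {
        this.encoder = encoder;
    }

    void add(final F feature) {
        lines.add(encoder.apply(feature));
    }

    List<String> lines() {
        return lines;
    }

    public static void main(String[] args) {
        // One generic sink replaces per-type instanceof branches; here the
        // "record" is just an int[] of {start, count} for the sketch.
        final EvidenceSink<int[]> sink = new EvidenceSink<>(r -> r[0] + "\t" + r[1]);
        sink.add(new int[]{1, 100});
        assert sink.lines().get(0).equals("1\t100");
    }
}
```

The caller picks the concrete evidence type once (via the codec), and everything downstream is type-agnostic.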
```java
@Override
public int getEnd() {
    return position + 1;
}
```
It seems a bit counter-intuitive to add one to the position in the getter method; it would lead to this:

```java
BafEvidence e = new BafEvidence("Sample", "chr1", 10, 5.0); // set position to 10
int p = e.getPosition(); // returns 11
```

IMO it would be clearer to put all of the +/- 1 adjustments in the encode and decode methods, and to document the actual position field in this object as 1-based or 0-based.
The +1 is for getEnd() only. But I now realize this is actually incorrect, since GATK intervals are closed.
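Under GATK's 1-based closed-interval convention, a single-base feature ends at its own start, so no +1 is needed in the getter. A minimal sketch of the corrected accessors (class and field names here are illustrative, not the actual BafEvidence API):

```java
// Hypothetical single-base evidence record under a 1-based closed
// coordinate system: start == end == position.
public final class BafEvidenceSketch {
    private final int position; // 1-based, documented on the field

    public BafEvidenceSketch(final int position) {
        this.position = position;
    }

    public int getStart() {
        return position;
    }

    // Closed interval: a one-base feature ends where it starts, not at position + 1.
    public int getEnd() {
        return position;
    }

    public static void main(String[] args) {
        final BafEvidenceSketch e = new BafEvidenceSketch(10);
        assert e.getStart() == 10;
        assert e.getEnd() == 10; // not 11
    }
}
```

Any 0-based shifts (e.g. for BED-like text output) would then live only in the encode/decode paths, as suggested above.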
```java
 *
 * <ul>
 *     <li>
 *         Coordinate-sorted evidence file, indexed if block compressed
```
I'd make this a little more explicit, i.e. something like "... indexed if block-compressed output is requested by specifying an output path ending in .gz".
Done
```java
} catch (IOException e) {
    throw new GATKException("Could not write to PE file", e);
}
// subtract 1 from positions to match pysam output
```
This comment doesn't look applicable here anymore, since we're no longer subtracting one on the line below.
Done
```java
if (IOUtil.hasBlockCompressedExtension(path.toPath())) {
    final FeatureCodec<? extends Feature, ?> codec = FeatureManager.getCodecForFile(path.toPath());
    return new TabixIndexedFeatureOutputStream<>(path, codec, encoder, dictionary, compressionLevel);
}
```
I'd add an `else {` here to be explicit that those are the two choices.
Done
```java
 *
 * @param header header text (without final newline character), cannot be null
 */
public void writeHeader(final String header) {
```
This method looks identical to the one in UncompressedFeatureOutputStream, suggesting that it could maybe be pulled up to the superclass (FeatureOutputStream would have to be made into a class instead of an interface). The other two methods are pretty close as well -- i.e. they could be the same if you check `if (indexCreator != null)` in both. Did you consider consolidating them into one class?
Done - this is indeed much better consolidated into just a FeatureOutputStream class.
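To illustrate the consolidation idea discussed above: the indexed and plain-text variants differ only in whether an index is maintained, so a single class can branch on a nullable index creator. This is a minimal sketch with stand-in names; `IndexCreator` here is hypothetical, not htsjdk's actual index-creation API:

```java
// Stand-in for the index-building dependency; null when output is uncompressed.
interface IndexCreator {
    void addFeature(String contig, int start, long filePosition);
    void finish();
}

// One class instead of two: shared writeHeader, and index calls guarded by
// an "if (indexCreator != null)" check as suggested in the review.
final class FeatureOutputStreamSketch {
    private final StringBuilder out = new StringBuilder(); // stand-in for the real byte stream
    private final IndexCreator indexCreator;

    FeatureOutputStreamSketch(final IndexCreator indexCreator) {
        this.indexCreator = indexCreator;
    }

    // Identical in both original classes, so it lives here once.
    void writeHeader(final String header) {
        out.append(header).append('\n');
    }

    void write(final String contig, final int start, final String encodedLine) {
        if (indexCreator != null) { // only the indexed variant records positions
            indexCreator.addFeature(contig, start, out.length());
        }
        out.append(encodedLine).append('\n');
    }

    void close() {
        if (indexCreator != null) {
            indexCreator.finish();
        }
    }

    String contents() {
        return out.toString();
    }

    public static void main(String[] args) {
        final FeatureOutputStreamSketch s = new FeatureOutputStreamSketch(null);
        s.writeHeader("#header");
        s.write("chr1", 10, "chr1\t10\t0.5");
        s.close();
        assert s.contents().equals("#header\nchr1\t10\t0.5\n");
    }
}
```

Passing a real index creator (or null) at construction is the only difference between the two use cases, which is what makes the merge clean.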
Force-pushed from 4eb5bbc to d4235ff
Travis reported job failures from build 32607
Commits:
- Start adding block compression
- Implement FeatureOutputStream classes
- Clean up feature streaming, integrate with PairedEndAndSplitReadEvidenceCollection
- PrintSVEvidence integration tests
- FeatureOutputStream unit tests
- Document feature output stream classes
- Some more polishing
- Fix up tests
Force-pushed from d4235ff to 013c5c1
Tests are passing now, back to you @cwhelan.
Looks good!
Adds a new tool that prints any of the SV evidence file types: read count (RD), discordant pair (PE), split-read (SR), or B-allele frequency (BAF). This tool is used frequently in the gatk-sv pipeline for retrieving subsets of evidence records from a bucket over specific intervals. Evidence file formats comply with the current specifications in the existing gatk-sv pipeline.

The tool is implemented as a FeatureWalker, which needed to be modified slightly to retrieve the feature file header. Each evidence type therefore has its own classes implementing Feature and a codec. There are also new OutputStream classes for conveniently writing Features in compressed (and indexed) or plain-text formats. The existing PairedEndAndSplitReadEvidenceCollection tool has been modified to use these OutputStream classes.

The IntegrationSpec class can now also check for the existence of expected index file output.