
PrintSVEvidence tool #7026

Merged: 11 commits merged into master from mw_print_sv_evidence_pr on Jan 28, 2021
Conversation

mwalker174 (Contributor):

Adds a new tool that prints any of the SV evidence file types: read count (RD), discordant pair (PE), split-read (SR), or B-allele frequency (BAF). This tool is used frequently in the gatk-sv pipeline for retrieving subsets of evidence records from a bucket over specific intervals. Evidence file formats comply with the current specifications in the existing gatk-sv pipeline.

The tool is implemented as a FeatureWalker, which needed to be modified slightly to retrieve the Feature file header. Each evidence type has its own classes implementing a Feature and a codec. There are also new OutputStream classes for conveniently writing Features in compressed (and indexed) or plain-text formats. The existing PairedEndAndSplitReadEvidenceCollection tool has been modified to use these OutputStream classes.
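To make the design concrete, here is a minimal sketch of what one of these evidence Feature classes might look like. The class name and fields are illustrative assumptions modeled loosely on the BAF type discussed in this PR, not the merged code:

```java
import htsjdk.tribble.Feature;

// Illustrative only: a minimal point-evidence Feature in the spirit of the
// BAF type discussed in this PR (the name and fields are assumptions).
public final class ExampleBafEvidence implements Feature {
    private final String sample;
    private final String contig;
    private final int position; // 1-based, following htsjdk conventions
    private final double value; // B-allele frequency

    public ExampleBafEvidence(final String sample, final String contig,
                              final int position, final double value) {
        this.sample = sample;
        this.contig = contig;
        this.position = position;
        this.value = value;
    }

    @Override public String getContig() { return contig; }
    @Override public int getStart() { return position; }
    // htsjdk/GATK intervals are 1-based and closed, so a point feature ends
    // where it starts (see the getEnd() review thread below).
    @Override public int getEnd() { return position; }

    public String getSample() { return sample; }
    public double getValue() { return value; }
}
```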

The IntegrationSpec class can now also check for the existence of expected index file output.

mwalker174 requested a review from cwhelan on January 6, 2021.
samuelklee (Contributor):

Not sure if this is already under discussion, but it would be great to output raw allele counts rather than BAF at some point. (I see @tedsharpe has a branch, maybe this is meant to cover that---in which case, carry on!)

mwalker174 (Contributor, Author):

> it would be great to output raw allele counts rather than BAF at some point. (I see @tedsharpe has a branch, maybe this is meant to cover that---in which case, carry on!)

Yes, we are working on changing the output format. We may actually move to a binary format for evidence in the future, as the text parsing is rather slow, prohibitively so as we look to scale to cohorts of 1M+ samples. But I agree, it may be beneficial to have counts and depth instead of fractions.

broadinstitute deleted a comment from gatk-bot on Jan 11, 2021.
gatk-bot commented on Jan 11, 2021:

Travis reported job failures from build 32512
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk8 32512.3 logs
integration openjdk8 32512.2 logs
cloud openjdk11 32512.14 logs
cloud openjdk8 32512.1 logs
unit openjdk11 32512.13 logs
integration openjdk11 32512.12 logs

  } else {
-     throw new UserException("File " + drivingPath.getRawInputString() + " contains features of the wrong type.");
+     throw new UserException("File " + drivingFeaturesPath.toPath() + " contains features of the wrong type.");
cmnbroad (Collaborator) commented on Jan 12, 2021:

@mwalker174 Some of these FeatureWalker changes appear to revert changes that we made recently. I'm wondering if these are really necessary. Can they be removed (other than the header caching, anyway)? Does something fail without these?

mwalker174 (Contributor, Author) replied:

@cmnbroad Thanks for catching this, I think these are errors from merge conflicts. I've reverted most of the changes, but I don't see another way to retrieve the header.
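For context, a rough sketch of the header-caching idea under discussion. The class and method names here are illustrative assumptions, not the merged engine code; only FeatureDataSource.getHeader() is the existing API:

```java
import htsjdk.tribble.Feature;
import org.broadinstitute.hellbender.engine.FeatureDataSource;

// Rough sketch (names are assumptions): read the header once from the
// driving FeatureDataSource and cache it for subclasses to query.
abstract class HeaderCachingWalkerSketch<F extends Feature> {
    private Object drivingFeaturesHeader;

    protected void cacheHeader(final FeatureDataSource<F> drivingFeatures) {
        // FeatureDataSource exposes the codec-parsed header as an Object
        drivingFeaturesHeader = drivingFeatures.getHeader();
    }

    /** The header of the driving features file, or null if none was present. */
    protected Object getDrivingFeaturesHeader() {
        return drivingFeaturesHeader;
    }
}
```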

cwhelan (Member) left a comment:

This looks good apart from a few minor changes. I also took a crack at refactoring the evidence types and codecs to use wildcard types and generics, so that PrintSVEvidence could use a single output stream and not have so many cascading if clauses depending on the type of evidence it's working with. My attempt is in the branch cw_print_sv_evidence_refactor. Let me know what you think of it; I'll make a PR against your branch that you can use if you like it. We should probably have an engine team member review the classes in hellbender.utils.io closely if they haven't done so already.
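A rough sketch of the generics idea described here. The sink interface and all names are assumptions for illustration, not the actual refactor branch:

```java
import htsjdk.tribble.Feature;
import java.util.List;

// Sketch (names are assumptions): parameterizing on the evidence type lets
// a single code path handle RD, PE, SR, and BAF records, instead of
// cascading if clauses keyed on the concrete evidence class.
final class GenericEvidenceCopier {
    interface FeatureSink<T extends Feature> {
        void write(T feature);
    }

    // One generic method replaces a per-type branch for each evidence class.
    static <T extends Feature> void copyAll(final List<T> features,
                                            final FeatureSink<T> sink) {
        for (final T feature : features) {
            sink.write(feature);
        }
    }
}
```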


@Override
public int getEnd() {
return position + 1;
cwhelan (Member) commented:

It seems a bit counter-intuitive to add one to the position in the getter method; it would lead to this:

BafEvidence e = new BafEvidence("Sample", "chr1", 10, 5.0); // set position to 10
int p = e.getPosition(); // returns 11

IMO it would be clearer to put all of the +/- 1 stuff in encode and decode methods, and have the actual position field in this object be documented as 1-based or 0-based.

mwalker174 (Contributor, Author) replied on Jan 22, 2021:

The +1 is for getEnd() only. But I now realize this is actually incorrect since GATK intervals are closed.
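A sketch of what the reviewer's suggestion might look like, reusing the hypothetical ExampleBafEvidence class from the description above. The tab-delimited column order, and the assumption that on-disk positions are 0-based, are purely illustrative:

```java
// Hypothetical codec decode method: all +/-1 arithmetic happens at the file
// boundary, so the in-memory position is unambiguously 1-based and the
// closed-interval getters need no adjustment.
public ExampleBafEvidence decode(final String line) {
    final String[] tokens = line.split("\t");
    return new ExampleBafEvidence(
            tokens[3],                        // sample name
            tokens[0],                        // contig
            Integer.parseInt(tokens[1]) + 1,  // 0-based on disk -> 1-based in memory
            Double.parseDouble(tokens[2]));   // B-allele frequency
}
```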

*
* <ul>
* <li>
* Coordinate-sorted evidence file, indexed if block compressed
cwhelan (Member) commented:

I'd make this a little more explicit, i.e. something like "... indexed if block compressed output requested by specifying an output path ending in .gz".

mwalker174 (Contributor, Author) replied:

Done

} catch (IOException e) {
throw new GATKException("Could not write to PE file", e);
}
// subtract 1 from positions to match pysam output
cwhelan (Member) commented:

This comment doesn't look applicable here anymore (since we're not subtracting one on the line below)

mwalker174 (Contributor, Author) replied:

Done

if (IOUtil.hasBlockCompressedExtension(path.toPath())) {
final FeatureCodec<? extends Feature, ?> codec = FeatureManager.getCodecForFile(path.toPath());
return new TabixIndexedFeatureOutputStream<>(path, codec, encoder, dictionary, compressionLevel);
}
cwhelan (Member) commented:

I'd add an else { here to be explicit that those are the two choices.

mwalker174 (Contributor, Author) replied:

Done
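A sketch of the explicit either/or suggested above. The method name is made up, and this only handles local files; it just illustrates picking a stream by extension:

```java
import htsjdk.samtools.util.BlockCompressedOutputStream;
import htsjdk.samtools.util.IOUtil;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class EvidenceOutputSketch {
    // Hypothetical factory: block-compressed output when the extension calls
    // for it, plain text otherwise, with both choices made explicit.
    static OutputStream openEvidenceOutput(final Path path) throws IOException {
        if (IOUtil.hasBlockCompressedExtension(path)) {
            // bgzf-compressed stream, suitable for tabix indexing
            return new BlockCompressedOutputStream(path.toFile());
        } else {
            // plain text stream
            return Files.newOutputStream(path);
        }
    }
}
```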

*
* @param header header text (without final newline character), cannot be null
*/
public void writeHeader(final String header) {
cwhelan (Member) commented:

This method looks identical to the one in UncompressedFeatureOutputStream, suggesting that it could maybe be pulled up to the superclass (FeatureOutputStream would have to be made into a class instead of an interface). The other two methods are pretty close as well; they could be the same if you check if (indexCreator != null) in both. Did you consider consolidating them into one class?

mwalker174 (Contributor, Author) replied:

Done. This is indeed much better consolidated into a single FeatureOutputStream class.
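A sketch of the shape such a consolidation could take. All names, including the indexer interface, are assumptions rather than the merged code; the point is the single class with a null-guarded indexing path:

```java
import htsjdk.tribble.Feature;
import java.io.IOException;
import java.io.Writer;
import java.util.function.Function;

// Sketch (names are assumptions): one stream class whose indexing path is
// guarded by a null check, replacing separate compressed and uncompressed
// subclasses.
public final class FeatureOutputStreamSketch<F extends Feature> {
    interface Indexer<T extends Feature> {
        void addFeature(T feature, long filePosition);
    }

    private final Writer writer;
    private final Function<F, String> encoder;
    private final Indexer<F> indexer; // null when no index was requested

    FeatureOutputStreamSketch(final Writer writer,
                              final Function<F, String> encoder,
                              final Indexer<F> indexer) {
        this.writer = writer;
        this.encoder = encoder;
        this.indexer = indexer;
    }

    /** Writes the header text (without a trailing newline) followed by one. */
    public void writeHeader(final String header) throws IOException {
        writer.write(header);
        writer.write('\n');
    }

    /** Writes one record, registering it with the indexer if one exists. */
    public void write(final F feature, final long filePosition) throws IOException {
        if (indexer != null) {
            indexer.addFeature(feature, filePosition);
        }
        writer.write(encoder.apply(feature));
        writer.write('\n');
    }
}
```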

mwalker174 force-pushed the mw_print_sv_evidence_pr branch from 4eb5bbc to d4235ff on January 22, 2021.
gatk-bot commented on Jan 22, 2021:

Travis reported job failures from build 32607
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 32607.1 logs
cloud openjdk11 32607.14 logs
integration openjdk11 32607.12 logs
integration openjdk8 32607.2 logs

Start adding block compression

Implement FeatureOutputStream classes

Clean up feature streaming, integrate with PairedEndAndSplitReadEvidenceCollection

PrintSVEvidence integration tests

FeatureOutputStream unit tests

Document feature output stream classes

Some more polishing

Fix up tests
mwalker174 force-pushed the mw_print_sv_evidence_pr branch from d4235ff to 013c5c1 on January 25, 2021.
gatk-bot commented on Jan 25, 2021:

Travis reported job failures from build 32622
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 32622.1 logs
cloud openjdk11 32622.14 logs
integration openjdk11 32622.12 logs
integration openjdk8 32622.2 logs

gatk-bot commented on Jan 25, 2021:

Travis reported job failures from build 32624
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 32624.12 logs
integration openjdk8 32624.2 logs

mwalker174 (Contributor, Author):

Tests are passing now, back to you @cwhelan.

cwhelan (Member) left a comment:

Looks good!

mwalker174 merged commit b68eb87 into master on Jan 28, 2021.
mwalker174 deleted the mw_print_sv_evidence_pr branch on January 28, 2021.