
PrintSVEvidence tool #7026

Merged: 11 commits merged into master from mw_print_sv_evidence_pr on Jan 28, 2021
Conversation

mwalker174 (Contributor):

Adds a new tool that prints any of the SV evidence file types: read count (RD), discordant pair (PE), split-read (SR), or B-allele frequency (BAF). This tool is used frequently in the gatk-sv pipeline for retrieving subsets of evidence records from a bucket over specific intervals. Evidence file formats comply with the current specifications in the existing gatk-sv pipeline.

The tool is implemented as a FeatureWalker, which needed to be modified slightly to retrieve the Feature file header. Each evidence type has its own classes implementing a Feature and a codec. There are also new OutputStream classes for conveniently writing Features in compressed (and indexed) or plain-text formats. The existing PairedEndAndSplitReadEvidenceCollection tool has been modified to use these OutputStream classes.
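To make the design concrete, here is a minimal sketch of what one of these evidence Feature classes might look like. The class name and fields are illustrative assumptions modeled loosely on the BAF type discussed in this PR, not the merged code:

```java
import htsjdk.tribble.Feature;

// Illustrative only: a minimal point-evidence Feature in the spirit of the
// BAF type discussed in this PR (the name and fields are assumptions).
public final class ExampleBafEvidence implements Feature {
    private final String sample;
    private final String contig;
    private final int position; // 1-based, following htsjdk conventions
    private final double value; // B-allele frequency

    public ExampleBafEvidence(final String sample, final String contig,
                              final int position, final double value) {
        this.sample = sample;
        this.contig = contig;
        this.position = position;
        this.value = value;
    }

    @Override public String getContig() { return contig; }
    @Override public int getStart() { return position; }
    // htsjdk/GATK intervals are 1-based and closed, so a point feature ends
    // where it starts (see the getEnd() review thread below).
    @Override public int getEnd() { return position; }

    public String getSample() { return sample; }
    public double getValue() { return value; }
}
```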

The IntegrationSpec class can now also check for the existence of expected index file output.

mwalker174 requested a review from cwhelan on January 6, 2021.
samuelklee (Contributor):

Not sure if this is already under discussion, but it would be great to output raw allele counts rather than BAF at some point. (I see @tedsharpe has a branch, maybe this is meant to cover that---in which case, carry on!)

mwalker174 (Contributor, Author):

> it would be great to output raw allele counts rather than BAF at some point. (I see @tedsharpe has a branch, maybe this is meant to cover that---in which case, carry on!)

Yes, we are working on changing the output format. We may actually move to a binary format for evidence in the future, as the text parsing is rather slow, prohibitively so as we look to scale to cohorts of 1M+ samples. But I agree, it may be beneficial to have counts and depth instead of fractions.

broadinstitute deleted a comment from gatk-bot on Jan 11, 2021.
gatk-bot commented on Jan 11, 2021:

Travis reported job failures from build 32512
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk8 32512.3 logs
integration openjdk8 32512.2 logs
cloud openjdk11 32512.14 logs
cloud openjdk8 32512.1 logs
unit openjdk11 32512.13 logs
integration openjdk11 32512.12 logs

  } else {
-     throw new UserException("File " + drivingPath.getRawInputString() + " contains features of the wrong type.");
+     throw new UserException("File " + drivingFeaturesPath.toPath() + " contains features of the wrong type.");
cmnbroad (Collaborator) commented on Jan 12, 2021:

@mwalker174 Some of these FeatureWalker changes appear to revert changes that we made recently. I'm wondering if these are really necessary. Can they be removed (other than the header caching, anyway)? Does something fail without these?

mwalker174 (Contributor, Author) replied:

@cmnbroad Thanks for catching this, I think these are errors from merge conflicts. I've reverted most of the changes, but I don't see another way to retrieve the header.
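For context, a rough sketch of the header-caching idea under discussion. The class and method names here are illustrative assumptions, not the merged engine code; only FeatureDataSource.getHeader() is the existing API:

```java
import htsjdk.tribble.Feature;
import org.broadinstitute.hellbender.engine.FeatureDataSource;

// Rough sketch (names are assumptions): read the header once from the
// driving FeatureDataSource and cache it for subclasses to query.
abstract class HeaderCachingWalkerSketch<F extends Feature> {
    private Object drivingFeaturesHeader;

    protected void cacheHeader(final FeatureDataSource<F> drivingFeatures) {
        // FeatureDataSource exposes the codec-parsed header as an Object
        drivingFeaturesHeader = drivingFeatures.getHeader();
    }

    /** The header of the driving features file, or null if none was present. */
    protected Object getDrivingFeaturesHeader() {
        return drivingFeaturesHeader;
    }
}
```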

cwhelan (Member) left a comment:

This looks good apart from a few minor changes. I also took a crack at refactoring the evidence types and codecs to use wildcard types and generics, so that PrintSVEvidence could use a single output stream and not have so many cascading if clauses depending on the type of evidence it's working with. My attempt is in the branch cw_print_sv_evidence_refactor. Let me know what you think of it; I'll make a PR against your branch that you can use if you like it. We should probably have an engine team member review the classes in hellbender.utils.io closely if they haven't done so already.
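A rough sketch of the generics idea described here. The sink interface and all names are assumptions for illustration, not the actual refactor branch:

```java
import htsjdk.tribble.Feature;
import java.util.List;

// Sketch (names are assumptions): parameterizing on the evidence type lets
// a single code path handle RD, PE, SR, and BAF records, instead of
// cascading if clauses keyed on the concrete evidence class.
final class GenericEvidenceCopier {
    interface FeatureSink<T extends Feature> {
        void write(T feature);
    }

    // One generic method replaces a per-type branch for each evidence class.
    static <T extends Feature> void copyAll(final List<T> features,
                                            final FeatureSink<T> sink) {
        for (final T feature : features) {
            sink.write(feature);
        }
    }
}
```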


@Override
public int getEnd() {
return position + 1;
cwhelan (Member) commented:

It seems a bit counter-intuitive to add one to the position in the getter method; it would lead to this:

BafEvidence e = new BafEvidence("Sample", "chr1", 10, 5.0); // set position to 10
int p = e.getPosition(); // returns 11

IMO it would be clearer to put all of the +/- 1 stuff in encode and decode methods, and have the actual position field in this object be documented as 1-based or 0-based.

mwalker174 (Contributor, Author) replied on Jan 22, 2021:

The +1 is for getEnd() only. But I now realize this is actually incorrect since GATK intervals are closed.
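A sketch of what the reviewer's suggestion might look like, reusing the hypothetical ExampleBafEvidence class from the description above. The tab-delimited column order, and the assumption that on-disk positions are 0-based, are purely illustrative:

```java
// Hypothetical codec decode method: all +/-1 arithmetic happens at the file
// boundary, so the in-memory position is unambiguously 1-based and the
// closed-interval getters need no adjustment.
public ExampleBafEvidence decode(final String line) {
    final String[] tokens = line.split("\t");
    return new ExampleBafEvidence(
            tokens[3],                        // sample name
            tokens[0],                        // contig
            Integer.parseInt(tokens[1]) + 1,  // 0-based on disk -> 1-based in memory
            Double.parseDouble(tokens[2]));   // B-allele frequency
}
```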

*
* <ul>
* <li>
* Coordinate-sorted evidence file, indexed if block compressed
cwhelan (Member) commented:

I'd make this a little more explicit, i.e. something like "... indexed if block compressed output requested by specifying an output path ending in .gz".

mwalker174 (Contributor, Author) replied:

Done

} catch (IOException e) {
throw new GATKException("Could not write to PE file", e);
}
// subtract 1 from positions to match pysam output
cwhelan (Member) commented:

This comment doesn't look applicable here anymore (since we're not subtracting one on the line below)

mwalker174 (Contributor, Author) replied:

Done

if (IOUtil.hasBlockCompressedExtension(path.toPath())) {
final FeatureCodec<? extends Feature, ?> codec = FeatureManager.getCodecForFile(path.toPath());
return new TabixIndexedFeatureOutputStream<>(path, codec, encoder, dictionary, compressionLevel);
}
cwhelan (Member) commented:

I'd add an else { here to be explicit that those are the two choices.

mwalker174 (Contributor, Author) replied:

Done
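A sketch of the explicit either/or suggested above. The method name is made up, and this only handles local files; it just illustrates picking a stream by extension:

```java
import htsjdk.samtools.util.BlockCompressedOutputStream;
import htsjdk.samtools.util.IOUtil;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class EvidenceOutputSketch {
    // Hypothetical factory: block-compressed output when the extension calls
    // for it, plain text otherwise, with both choices made explicit.
    static OutputStream openEvidenceOutput(final Path path) throws IOException {
        if (IOUtil.hasBlockCompressedExtension(path)) {
            // bgzf-compressed stream, suitable for tabix indexing
            return new BlockCompressedOutputStream(path.toFile());
        } else {
            // plain text stream
            return Files.newOutputStream(path);
        }
    }
}
```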

*
* @param header header text (without final newline character), cannot be null
*/
public void writeHeader(final String header) {
cwhelan (Member) commented:

This method looks identical to the one in UncompressedFeatureOutputStream, suggesting that it could maybe be pulled up to the superclass (FeatureOutputStream would have to be made into a class instead of an interface). The other two methods are pretty close as well; they could be the same if you check if (indexCreator != null) in both. Did you consider consolidating them into one class?

mwalker174 (Contributor, Author) replied:

Done. This is indeed much better consolidated into a single FeatureOutputStream class.
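A sketch of the shape such a consolidation could take. All names, including the indexer interface, are assumptions rather than the merged code; the point is the single class with a null-guarded indexing path:

```java
import htsjdk.tribble.Feature;
import java.io.IOException;
import java.io.Writer;
import java.util.function.Function;

// Sketch (names are assumptions): one stream class whose indexing path is
// guarded by a null check, replacing separate compressed and uncompressed
// subclasses.
public final class FeatureOutputStreamSketch<F extends Feature> {
    interface Indexer<T extends Feature> {
        void addFeature(T feature, long filePosition);
    }

    private final Writer writer;
    private final Function<F, String> encoder;
    private final Indexer<F> indexer; // null when no index was requested

    FeatureOutputStreamSketch(final Writer writer,
                              final Function<F, String> encoder,
                              final Indexer<F> indexer) {
        this.writer = writer;
        this.encoder = encoder;
        this.indexer = indexer;
    }

    /** Writes the header text (without a trailing newline) followed by one. */
    public void writeHeader(final String header) throws IOException {
        writer.write(header);
        writer.write('\n');
    }

    /** Writes one record, registering it with the indexer if one exists. */
    public void write(final F feature, final long filePosition) throws IOException {
        if (indexer != null) {
            indexer.addFeature(feature, filePosition);
        }
        writer.write(encoder.apply(feature));
        writer.write('\n');
    }
}
```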

mwalker174 force-pushed the mw_print_sv_evidence_pr branch from 4eb5bbc to d4235ff on January 22, 2021.
gatk-bot commented on Jan 22, 2021:

Travis reported job failures from build 32607
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 32607.1 logs
cloud openjdk11 32607.14 logs
integration openjdk11 32607.12 logs
integration openjdk8 32607.2 logs

Start adding block compression

Implement FeatureOutputStream classes

Clean up feature streaming, integrate with PairedEndAndSplitReadEvidenceCollection

PrintSVEvidence integration tests

FeatureOutputStream unit tests

Document feature output stream classes

Some more polishing

Fix up tests
mwalker174 force-pushed the mw_print_sv_evidence_pr branch from d4235ff to 013c5c1 on January 25, 2021.
gatk-bot commented on Jan 25, 2021:

Travis reported job failures from build 32622
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 32622.1 logs
cloud openjdk11 32622.14 logs
integration openjdk11 32622.12 logs
integration openjdk8 32622.2 logs

gatk-bot commented on Jan 25, 2021:

Travis reported job failures from build 32624
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 32624.12 logs
integration openjdk8 32624.2 logs

mwalker174 (Contributor, Author):

Tests are passing now, back to you @cwhelan.

cwhelan (Member) left a comment:

Looks good!

mwalker174 merged commit b68eb87 into master on Jan 28, 2021.
mwalker174 deleted the mw_print_sv_evidence_pr branch on January 28, 2021.