Collect split read and paired end evidence files for GATK-SV pipeline #6356

cwhelan · 2020-01-07T21:09:24Z

This PR creates a tool for generating split read and paired end SV evidence files from an input WGS CRAM or BAM file for use in the GATK-SV pipeline.

This tool emulates the behavior of svtk collect-pesr, which is the tool used in the current version of the pipeline.

Briefly, it creates two tab-delimited, tabix-able output files. The first stores information about discordant read pairs -- the positions and orientations of a read and its mate, for each read pair marked "not properly paired" in the input file. Records are reported only for the upstream read in the pair. The second file contains the locations of all soft clips in the input file: the coordinate, the "direction" (right or left clipping), and the number of reads clipped at that position and in that direction.
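To make the "upstream read only" rule concrete, here is a minimal sketch of one way such a filter could work. It is not the code in this PR: the class, method, and parameter names are hypothetical, contigs are represented by their index in the sequence dictionary, and the tie-break for mates that start at the same coordinate (report whichever is seen first, tracked by read name) is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: decide whether a discordant read is the upstream mate.
final class UpstreamReadFilter {
    // Names of pairs already reported when both mates start at the same coordinate.
    private final Set<String> seenAtSamePosition = new HashSet<>();

    /** Returns true if the pair should be reported when this read is visited. */
    boolean shouldReport(final String readName, final int contigIndex, final int start,
                         final int mateContigIndex, final int mateStart) {
        if (contigIndex != mateContigIndex) {
            // Different contigs: the mate on the earlier contig is upstream.
            return contigIndex < mateContigIndex;
        }
        if (start != mateStart) {
            // Same contig: the mate with the smaller start is upstream.
            return start < mateStart;
        }
        // Tie (assumption): report only the first mate seen with this name.
        return seenAtSamePosition.add(readName);
    }
}
```

With this rule each discordant pair yields exactly one record, regardless of which mate the walker encounters first.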

The integration test expected results file was generated using svtk collect-pesr to help ensure that the results are identical. We hope to eventually replace this component of the SV pipeline with this GATK tool.

mwalker174

Looks great, I have a few minor comments. Some of the classes (like DiscordantPair and SplitPos) I'd like to use for other tools such as the clusterer, but we can move them up to separate classes at a later time.

mwalker174 · 2020-01-10T17:45:48Z

.../org/broadinstitute/hellbender/tools/walkers/sv/PairedEndAndSplitReadEvidenceCollection.java

+ * <ul>
+ *     <li>contig</li>
+ *     <li>clipping position</li>
+ *     <li>direction: either LEFT or RIGHT depending on whether the reads were clipped on the left or right side</li>


Could phrase it this way too: direction: side of read that is clipped ("left" or "right"). I initially found the use of left and right confusing. If we end up sticking with this data format it might make more sense to use strand labels (left = -, right = +) which also use fewer characters.

Done, I agree that it would be better to eventually convert to using +/- or some other one-character indicator in the file format.

mwalker174 · 2020-01-10T17:54:25Z

.../org/broadinstitute/hellbender/tools/walkers/sv/PairedEndAndSplitReadEvidenceCollection.java

+)
+public class PairedEndAndSplitReadEvidenceCollection extends ReadWalker {
+
+    @Argument(shortName = "p", fullName = "pe-file", doc = "Output file for paired end evidence", optional=false)


When we migrated GATK4 tools, Geraldine discouraged the use of any "non-standard" short arguments (i.e. anything beyond the standard ones such as -I, -O, -R) because they are hard to interpret at a glance and could end up conflicting across different tools. I tend to agree, although the PE and SR files may become common for us. I'd suggest using -PE and -SR, as those shouldn't cause any conflicts and follow the upper-case convention.

Also do we have a place for defining SV arguments? I think at the very least you should have them as public constants.

Changed to -PE and -SR, and moved argument strings to public static constants.
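For readers following along, the change could look roughly like the sketch below. The constant names, the field names, the second doc string, and the "sr-file" long name are assumptions (only "pe-file", -PE and -SR appear in this thread), and the small `@Argument` annotation declared here is a stand-in for the Barclay one so that the snippet compiles on its own.

```java
// Stand-in for Barclay's @Argument annotation, declared locally so this sketch is self-contained.
@interface Argument {
    String shortName() default "";
    String fullName();
    String doc() default "";
    boolean optional() default true;
}

// Hypothetical holder class; in the PR the constants live on the tool itself.
final class EvidenceCollectionArgs {
    public static final String PAIRED_END_FILE_SHORT_NAME = "PE";
    public static final String PAIRED_END_FILE_LONG_NAME = "pe-file";
    public static final String SPLIT_READ_FILE_SHORT_NAME = "SR";
    public static final String SPLIT_READ_FILE_LONG_NAME = "sr-file"; // assumed name

    @Argument(shortName = PAIRED_END_FILE_SHORT_NAME, fullName = PAIRED_END_FILE_LONG_NAME,
              doc = "Output file for paired end evidence", optional = false)
    String peFile;

    @Argument(shortName = SPLIT_READ_FILE_SHORT_NAME, fullName = SPLIT_READ_FILE_LONG_NAME,
              doc = "Output file for split read evidence", optional = false)
    String srFile;
}
```

Public constants let other SV tools and tests refer to the argument names without repeating string literals.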

mwalker174 · 2020-01-10T17:55:45Z

.../org/broadinstitute/hellbender/tools/walkers/sv/PairedEndAndSplitReadEvidenceCollection.java

+    @Argument(fullName = "sample-name", doc = "Sample name")
+    String sampleName = null;
+
+    Set<String> observedDiscordantNames = new HashSet<>();


I think this can be final

Done, and made splitPosBuffer and discordantPairs final as well by moving their initialization out of onTraversalStart.

mwalker174 · 2020-01-10T20:27:03Z

.../org/broadinstitute/hellbender/tools/walkers/sv/PairedEndAndSplitReadEvidenceCollection.java

+
+    /**
+     * Adds read information to the counts in splitCounts.
+     * @return the new prevClippedReadEndPos after counting this read, which is the rightmost aligned position of the read


Should read "Adds to split read counts list the new prevClippedReadEndPos..." (nothing is actually returned).
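The counting that this Javadoc describes, tallying reads per clipping position and side, can be sketched as follows. This is an illustration only, not the PR's splitCounts implementation: the class and method names are hypothetical, and how the clipping position is derived from a read's alignment is deliberately left out.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: per-position tallies of left- and right-clipped reads.
final class SplitCountSketch {
    // position -> {number of reads clipped on the left, number clipped on the right}
    private final Map<Integer, int[]> counts = new TreeMap<>();

    /** Records one clipped read at the given position; returns nothing, like the method under review. */
    void addClip(final int position, final boolean leftClip) {
        final int[] tally = counts.computeIfAbsent(position, p -> new int[2]);
        tally[leftClip ? 0 : 1]++;
    }

    /** Number of reads recorded so far at this position and side. */
    int count(final int position, final boolean leftClip) {
        final int[] tally = counts.get(position);
        return tally == null ? 0 : tally[leftClip ? 0 : 1];
    }
}
```

A TreeMap keeps positions in coordinate order, which is convenient when the tallies are later flushed to a sorted output file.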

mwalker174 · 2020-01-10T20:41:17Z

.../org/broadinstitute/hellbender/tools/walkers/sv/PairedEndAndSplitReadEvidenceCollection.java

+
+    private void flushDiscordantReadPairs() {
+        final Comparator<DiscordantRead> discReadComparator =
+                Comparator.comparing((DiscordantRead r) -> getBestAvailableSequenceDictionary().getSequenceIndex(r.getContig()))


Could replace getBestAvailableSequenceDictionary() with sequenceDictionary

mwalker174 · 2020-01-10T20:41:47Z

.../org/broadinstitute/hellbender/tools/walkers/sv/PairedEndAndSplitReadEvidenceCollection.java

+
+    @VisibleForTesting
+    public DiscordantRead getReportableDiscordantReadPair(final GATKRead read, final Set<String> observedDiscordantNamesAtThisLocus,
+                                                          final SAMSequenceDictionary samSequenceDictionary) {


Could replace samSequenceDictionary with sequenceDictionary?

It is written this way for ease of testing, so I might leave it as is.

mwalker174 · 2020-01-10T21:00:33Z

.../org/broadinstitute/hellbender/tools/walkers/sv/PairedEndAndSplitReadEvidenceCollection.java

+        private String description;
+
+        POSITION(final String description) {
+            this.description = description;
+        }
+
+        public String getDescription() {
+            return description;
+        }


What's this for?

Just to store the string that will actually make it into the file, as opposed to the name of the enum element. In this case it's just "left" vs. LEFT, etc.
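A pared-down sketch of that pattern is below. The enum name `ClipSide` is hypothetical (the snippet under review calls it POSITION), and the field is marked final here, which the reviewed snippet does not do.

```java
// Hypothetical sketch: enum constants carrying the string written to the output file.
enum ClipSide {
    LEFT("left"),
    RIGHT("right");

    // Text that ends up in the evidence file, as opposed to the constant's name.
    private final String description;

    ClipSide(final String description) {
        this.description = description;
    }

    public String getDescription() {
        return description;
    }
}
```

Here `ClipSide.LEFT.name()` is "LEFT" while `ClipSide.LEFT.getDescription()` is "left", so the file format can change (for example to a one-character code) without renaming the constants.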

mwalker174 · 2020-01-10T21:08:30Z

...adinstitute/hellbender/tools/walkers/sv/PairedEndAndSplitReadEvidenceCollectionUnitTest.java

+        Mockito.verify(mockSrWriter).write("1" + "\t" + 1099 + "\t" + "left" + "\t" + 1 + "\t" + "sample" + "\n");
+        Mockito.verify(mockSrWriter).write("1" + "\t" + 1099 + "\t" + "right" + "\t" + 2 + "\t" + "sample" + "\n");


This is so cool

I love Mockito
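For anyone unfamiliar with what the mock is verifying, the same record layout can be checked without Mockito by capturing output in a StringWriter. This is only a sketch: the class and method names are hypothetical, and the column order (contig, position, side, count, sample) is read off the two verify calls above.

```java
import java.io.StringWriter;

// Hypothetical sketch: format split read records as in the verified write() calls.
final class SplitReadLineSketch {

    /** Builds one tab-delimited split read record terminated by a newline. */
    static String formatSplitCount(final String contig, final int position, final String side,
                                   final int count, final String sample) {
        return contig + "\t" + position + "\t" + side + "\t" + count + "\t" + sample + "\n";
    }

    public static void main(final String[] args) {
        // StringWriter plays the role of the mocked writer: it records everything written.
        final StringWriter captured = new StringWriter();
        captured.write(formatSplitCount("1", 1099, "left", 1, "sample"));
        captured.write(formatSplitCount("1", 1099, "right", 2, "sample"));
        System.out.print(captured);
    }
}
```

The Mockito version has the advantage of checking each write call individually rather than the concatenated output.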

mwalker174 · 2020-01-10T21:10:24Z

.../org/broadinstitute/hellbender/tools/walkers/sv/PairedEndAndSplitReadEvidenceCollection.java

+    }
+
+    private void flushDiscordantReadPairs() {
+        final Comparator<DiscordantRead> discReadComparator =


Can you make a class out of this? Could be useful in the future

Wrapped this up in a static inner class for now.
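A minimal sketch of what such a static inner comparator class might look like follows. It is not the PR's code: the `Read` class is a pared-down stand-in for DiscordantRead, a plain `Map<String, Integer>` stands in for the sequence dictionary lookup, and the secondary sort keys (start position, then read name) are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DiscordantReadSortSketch {

    /** Hypothetical, pared-down stand-in for the PR's DiscordantRead. */
    static final class Read {
        final String contig;
        final int start;
        final String name;

        Read(final String contig, final int start, final String name) {
            this.contig = contig;
            this.start = start;
            this.name = name;
        }
    }

    /** Orders reads by contig index in the dictionary, then start, then name. */
    static final class ReadComparator implements Comparator<Read> {
        private final Comparator<Read> delegate;

        // contigIndex must contain every contig that appears in the reads being sorted.
        ReadComparator(final Map<String, Integer> contigIndex) {
            this.delegate = Comparator
                    .comparing((Read r) -> contigIndex.get(r.contig))
                    .thenComparingInt(r -> r.start)
                    .thenComparing(r -> r.name);
        }

        @Override
        public int compare(final Read a, final Read b) {
            return delegate.compare(a, b);
        }
    }

    public static void main(final String[] args) {
        final Map<String, Integer> contigIndex = new HashMap<>();
        contigIndex.put("chr2", 1);
        contigIndex.put("chr10", 9);

        final List<Read> reads = new ArrayList<>();
        reads.add(new Read("chr10", 5, "a"));
        reads.add(new Read("chr2", 900, "b"));
        reads.add(new Read("chr2", 100, "c"));
        reads.sort(new ReadComparator(contigIndex));

        // Dictionary order puts chr2 before chr10, unlike a plain string sort.
        for (final Read r : reads) {
            System.out.println(r.contig + "\t" + r.start + "\t" + r.name);
        }
    }
}
```

Sorting by dictionary index rather than contig name keeps the output in the same order as the input file's header, which tabix indexing expects.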

mwalker174 approved these changes Jan 10, 2020

cwhelan force-pushed the cw_svpipe_pesr_collection branch from c6e2ea5 to 0b36d11 Compare January 15, 2020 18:57

cwhelan added 18 commits January 15, 2020 13:57

prototype port of PESR collection

bd3b396

fix discordant read comparator

7ee8ef3

minor bug fix to pesr collection

401a5fb

use an intermediate discordant read object to save memory

27c02b5

fix sorting and flushing bugs

4ef7a58

memory improvements

8e38b8c

switch to list + sort

8783bec

fix some bugs

6d91bb1

make some instance variables private

215e14b

fix bad bai index

c3eb895

stubs for tests

e076c05

fix test stub

39f5133

start adding unit tests

31f7080

add unit test and integration test

b22cc71

refactor and add test cases

886e28e

clean up and add some documentation

5c8f7de

address PR comments

0b36d11

add the beta tool marker

333081d

cwhelan merged commit 9bca511 into master Jan 16, 2020

cwhelan deleted the cw_svpipe_pesr_collection branch October 8, 2020 12:59
