-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collect split read and paired end evidence files for GATK-SV pipeline #6356
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, I have a few minor comments. Some of the classes (like DiscordantPair
and SplitPos
) I'd like to use for other tools such as the clusterer, but we can move them up to separate classes at a later time.
* <ul> | ||
* <li>contig</li> | ||
* <li>clipping position</li> | ||
* <li>direction: either LEFT or RIGHT depending on whether the reads were clipped on the left or right side</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could phrase it this way too: direction: side of read that is clipped ("left" or "right")
. I initially found the use of left
and right
confusing. If we end up sticking with this data format it might make more sense to use strand labels (left
= -
, right
= +
) which also use fewer characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, I agree that it would be better to eventually convert to using +/-
or some other one-character indicator in the file format.
) | ||
public class PairedEndAndSplitReadEvidenceCollection extends ReadWalker { | ||
|
||
@Argument(shortName = "p", fullName = "pe-file", doc = "Output file for paired end evidence", optional=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we migrated GATK4 tools, Geraldine discouraged the use of any "non-standard" short arguments (eg -I
, -O
, -R
) because they are hard to interpret at a glance and could end up conflicting across different tools. I tend to agree, although the PE and SR files may become common for us. I'd suggest using -PE
and -SR
as those shouldn't cause any conflicts and follow the upper-case convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also do we have a place for defining SV arguments? I think at the very least you should have them as public constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to -PE
and -SR
, and moved argument strings to public static constants.
@Argument(fullName = "sample-name", doc = "Sample name") | ||
String sampleName = null; | ||
|
||
Set<String> observedDiscordantNames = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and made splitPosBuffer
and discordantPairs
final as well by move initialization from onTraversalStart
.
.../org/broadinstitute/hellbender/tools/walkers/sv/PairedEndAndSplitReadEvidenceCollection.java
Outdated
Show resolved
Hide resolved
|
||
/** | ||
* Adds read information to the counts in splitCounts. | ||
* @return the new prevClippedReadEndPos after counting this read, which is the rightmost aligned position of the read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adds to split read counts list the new prevClippedReadEndPos...
(nothing is actually returned)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
private void flushDiscordantReadPairs() { | ||
final Comparator<DiscordantRead> discReadComparator = | ||
Comparator.comparing((DiscordantRead r) -> getBestAvailableSequenceDictionary().getSequenceIndex(r.getContig())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could replace getBestAvailableSequenceDictionary ()
with sequenceDictionary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
@VisibleForTesting | ||
public DiscordantRead getReportableDiscordantReadPair(final GATKRead read, final Set<String> observedDiscordantNamesAtThisLocus, | ||
final SAMSequenceDictionary samSequenceDictionary) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could replace samSequenceDictionary
with sequenceDictionary
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is this way for ease of testing so I might leave it as is..
private String description; | ||
|
||
POSITION(final String description) { | ||
this.description = description; | ||
} | ||
|
||
public String getDescription() { | ||
return description; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to store the string that will actually make it into the file VS the name of the enum element. In this case it's just "left"
vs LEFT
, etc..
Mockito.verify(mockSrWriter).write("1" + "\t" + 1099 + "\t" + "left" + "\t" + 1 + "\t" + "sample" + "\n"); | ||
Mockito.verify(mockSrWriter).write("1" + "\t" + 1099 + "\t" + "right" + "\t" + 2 + "\t" + "sample" + "\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so cool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love Mockito
} | ||
|
||
private void flushDiscordantReadPairs() { | ||
final Comparator<DiscordantRead> discReadComparator = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make a class out of this? Could be useful in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrapped this up in a static inner class for now.
c6e2ea5
to
0b36d11
Compare
This PR creates a tool for generating split read and paired end SV evidence files from an input WGS CRAM or BAM file for use in the GATK-SV pipeline.
This tool emulates the behavior of
svtk collect-pesr
, which is the tool used in the current version of the pipeline.Briefly, it creates two tab-delimited, tabix-able output files. The first stores information about discordant read pairs -- the positions and orientations of a read and its mate, for each read pair marked "not properly paired" in the input file. Records are reported only for the upstream read in the pair. The second file contains the locations of all soft clips in the input file, including the coordinate and "direction" (right or left clipping) and the count of the number of reads clipped at that position and direction.
The integration test expected results file was generated using
svtk collect-pesr
to help ensure that the results are identical. We hope to eventually replace this component of the SV pipeline with this GATK tool.