Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect split read and paired end evidence files for GATK-SV pipeline #6356

Merged
merged 18 commits into from
Jan 16, 2020

Conversation

cwhelan
Copy link
Member

@cwhelan cwhelan commented Jan 7, 2020

This PR creates a tool for generating split read and paired end SV evidence files from an input WGS CRAM or BAM file for use in the GATK-SV pipeline.

This tool emulates the behavior of svtk collect-pesr, which is the tool used in the current version of the pipeline.

Briefly, it creates two tab-delimited, tabix-able output files. The first stores information about discordant read pairs -- the positions and orientations of a read and its mate, for each read pair marked "not properly paired" in the input file. Records are reported only for the upstream read in the pair. The second file contains the locations of all soft clips in the input file, including the coordinate and "direction" (right or left clipping) and the count of the number of reads clipped at that position and direction.

The integration test expected results file was generated using svtk collect-pesr to help ensure that the results are identical. We hope to eventually replace this component of the SV pipeline with this GATK tool.

Copy link
Contributor

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, I have a few minor comments. Some of the classes (like DiscordantPair and SplitPos) I'd like to use for other tools such as the clusterer, but we can move them up to separate classes at a later time.

* <ul>
* <li>contig</li>
* <li>clipping position</li>
* <li>direction: either LEFT or RIGHT depending on whether the reads were clipped on the left or right side</li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could phrase it this way too: direction: side of read that is clipped ("left" or "right"). I initially found the use of left and right confusing. If we end up sticking with this data format it might make more sense to use strand labels (left = -, right = +) which also use fewer characters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I agree that it would be better to eventually convert to using +/- or some other one-character indicator in the file format.

)
public class PairedEndAndSplitReadEvidenceCollection extends ReadWalker {

@Argument(shortName = "p", fullName = "pe-file", doc = "Output file for paired end evidence", optional=false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we migrated GATK4 tools, Geraldine discouraged the use of any "non-standard" short arguments (eg -I, -O, -R) because they are hard to interpret at a glance and could end up conflicting across different tools. I tend to agree, although the PE and SR files may become common for us. I'd suggest using -PE and -SR as those shouldn't cause any conflicts and follow the upper-case convention.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also do we have a place for defining SV arguments? I think at the very least you should have them as public constants.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to -PE and -SR, and moved argument strings to public static constants.

@Argument(fullName = "sample-name", doc = "Sample name")
String sampleName = null;

Set<String> observedDiscordantNames = new HashSet<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be final

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and made splitPosBuffer and discordantPairs final as well by move initialization from onTraversalStart.


/**
* Adds read information to the counts in splitCounts.
* @return the new prevClippedReadEndPos after counting this read, which is the rightmost aligned position of the read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adds to split read counts list the new prevClippedReadEndPos... (nothing is actually returned)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


private void flushDiscordantReadPairs() {
final Comparator<DiscordantRead> discReadComparator =
Comparator.comparing((DiscordantRead r) -> getBestAvailableSequenceDictionary().getSequenceIndex(r.getContig()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could replace getBestAvailableSequenceDictionary () with sequenceDictionary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


@VisibleForTesting
public DiscordantRead getReportableDiscordantReadPair(final GATKRead read, final Set<String> observedDiscordantNamesAtThisLocus,
final SAMSequenceDictionary samSequenceDictionary) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could replace samSequenceDictionary with sequenceDictionary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is this way for ease of testing so I might leave it as is..

Comment on lines +281 to +290
private String description;

POSITION(final String description) {
this.description = description;
}

public String getDescription() {
return description;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to store the string that will actually make it into the file VS the name of the enum element. In this case it's just "left" vs LEFT, etc..

Comment on lines +122 to +123
Mockito.verify(mockSrWriter).write("1" + "\t" + 1099 + "\t" + "left" + "\t" + 1 + "\t" + "sample" + "\n");
Mockito.verify(mockSrWriter).write("1" + "\t" + 1099 + "\t" + "right" + "\t" + 2 + "\t" + "sample" + "\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so cool

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love Mockito

}

private void flushDiscordantReadPairs() {
final Comparator<DiscordantRead> discReadComparator =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make a class out of this? Could be useful in the future

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapped this up in a static inner class for now.

@cwhelan cwhelan force-pushed the cw_svpipe_pesr_collection branch from c6e2ea5 to 0b36d11 Compare January 15, 2020 18:57
@cwhelan cwhelan merged commit 9bca511 into master Jan 16, 2020
@cwhelan cwhelan deleted the cw_svpipe_pesr_collection branch October 8, 2020 12:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants