-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BWA-MEM as an option to ReadsPipelineSpark #3666
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3666 +/- ##
==============================================
+ Coverage 79.481% 79.491% +0.01%
- Complexity 17287 17291 +4
==============================================
Files 1131 1132 +1
Lines 61514 61529 +15
Branches 9738 9740 +2
==============================================
+ Hits 48892 48910 +18
+ Misses 8850 8848 -2
+ Partials 3772 3771 -1
|
a6c8d5c
to
46f725d
Compare
I think that there's too much flexibility to choose arguments that will result in a corrupt product in ReadsPipelineSpark: Whether or not to align, and whether to align in single-ended or paired mode ought to be dictated by the input data, not by some data-jockey's choice of arguments. I'm prepared to be corrected on this, but I think that paired alignment will only work on queryname sorted, paired input reads. If the data has already been aligned and coordinate sorted, I'd expect things to blow up because the pairs won't be adjacent, and I don't think we do any input sorting. I think we ought to be inferring whether to align, and whether to align singly or paired, from the input. If we want to force realignment of an already-aligned input, we should just have the user run a tool that strips alignment info and produces unaligned, queryname-sorted input to this tool. (Or we could incorporate that functionality into this tool, I suppose.) Second crazy obscure issue: if the input file hasn't been aligned, and therefore doesn't contain a dictionary, we're going to try to borrow one from the reference when we do alignment. That's going to lead to insanity if the reference happens to be a 2bit file in which the order of the contigs is scrambled. We can use such a source for the contig metadata, but the order of the contigs must be prescribed by the reference image file. (The list of reference contig names in correct order is available from the bwa image.) Third issue maybe isn't really an issue: the BwaSparkEngine code doesn't copy tags from input to output, except for read group. (That's because in the case that you're realigning, you'd want to strip all tags associated with the previous alignment, but there's no comprehensive list of tags that are associated with alignment vs. those that have to do with read itself, and there can't be due to the possibility of custom tags that might go into either category.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks fine, but I think we need to think about the design of ReadsPipelineSpark a little more.
Sorry for the delay in reviewing this PR. |
@tedsharpe thanks for your comments! Adding @droazen and @lbergelson for thoughts too.
|
I defer to @droazen and @lbergelson on what the design should actually be, but my recommendation would be as follows:
|
Oh, and on point 3 about attribute copying: I don't know what we want to do, either. One possibility would be to allow the user to specify a list of tag names that should be preserved. Maybe it could default to RG and OQ? |
Would be nice to have also the |
@lbergelson & I think @tedsharpe 's proposal above re: attributes sounds reasonable (including a whitelist of attributes to copy, optionally configurable via the command line). The rest of the proposal sounds reasonable as well, modulo one or two comments which @lbergelson will add below. |
@@ -116,42 +118,66 @@ protected void runTool(final JavaSparkContext ctx) { | |||
//TOOO: should this use getUnfilteredReads? getReads will apply default and command line filters | |||
final JavaRDD<GATKRead> initialReads = getReads(); | |||
|
|||
final JavaRDD<GATKRead> markedReadsWithOD = MarkDuplicatesSpark.mark(initialReads, getHeaderForReads(), duplicatesScoringStrategy, new OpticalDuplicateFinder(), getRecommendedNumReducers()); | |||
List<GATKRead> firstReads = initialReads.take(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a kind of sketchy check. It's very plausible that you'd just randomly have 2 unmapped reads at the start of your file.
I think that if someone specifies an image file, then they are saying they want to align, and if they don't pass it in they don't want to align.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbergelson I agree that this isn't reliable: I added it as a demonstration really. The problem is that specifying an image file is optional - it's inferred from the name of the reference since #3643.
So perhaps we do need an --align
flag, although @tedsharpe doesn't like that... Is there a reliable way to determine if the reads have been aligned by looking at the contents?
Similarly if unmapped reads can pop up anywhere, then determining single or paired alignment is unreliable.
Any suggestions on how to move forward on this?
ee452e5
to
12af3de
Compare
I don’t have strong feelings about an —align flag. |
Thanks @tedsharpe. Perhaps the best thing to do at this point is have the existing flags (plus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After #3475, conflicts are expected in the test classes, because they are now extending GATKBaseTest - a rebase should solve most of the problems without need of change of the code (except comments below)
@@ -70,6 +70,8 @@ public void setTestVerbosity(){ | |||
|
|||
public static final String b37_2bit_reference_20_21 = largeFileTestDir + "human_g1k_v37.20.21.2bit"; | |||
|
|||
public static final String b37_reference_20_21_img = largeFileTestDir + "human_g1k_v37.20.21.fasta.img"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be moved to GATKBaseTest
, because BaseTest
should be clean from any repo-specific test resource...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed - thanks for letting me know
12af3de
to
64c3990
Compare
I've rebased the branch and re-instated the |
final BwaSparkEngine engine; | ||
if (align) { | ||
engine = new BwaSparkEngine(ctx, referenceArguments.getReferenceFileName(), bwaArgs.indexImageFile, getHeaderForReads(), getReferenceSequenceDictionary()); | ||
alignedReads = !bwaArgs.singleEndAlignment ? engine.alignPaired(getReads()) : engine.alignUnpaired(getReads()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're calling getReads()
again here -- can you just use initialReads
above? Is this expensive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, getReads()
does filtering -- did you want to pass in the unfiltered reads (getUnfilteredReads()
) to BWA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It doesn't make a difference, since it is lazy. But I've changed the code to use
initialReads
to make it less confusing. - Good question! What do you think is best, @droazen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think that you'd want to pass BWA the unfiltered reads, and then do filtering after alignment. Do you agree, @tedsharpe ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the one hand, it seems wasteful to spend time aligning reads that you're about to discard. On the other hand, if you're doing pairwise alignment and your filters don't either pass or reject both members of the pair consistently, you'll have a royal mess. Since filters are designed to operate on reads, not on read pairs, it would be hard (but not impossible) to guarantee correct filter behavior in this context.
All of which is a long-winded way of saying that you probably want to align unfiltered reads like @droazen said.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments @tedsharpe and @droazen. I've updated the code to align on unfiltered reads then do the alignment. I've updated BwaSpark
and BwaAndMarkDuplicatesPipelineSpark
too. (BTW do we still need the latter do you think?)
final JavaRDD<GATKRead> markedReadsWithOD = MarkDuplicatesSpark.mark(initialReads, getHeaderForReads(), duplicatesScoringStrategy, new OpticalDuplicateFinder(), getRecommendedNumReducers()); | ||
final JavaRDD<GATKRead> alignedReads; | ||
final SAMFileHeader header; | ||
final BwaSparkEngine engine; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename to bwaEngine
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of comments to address before we can merge -- back to @tomwhite
e7ba3b3
to
8b25cb1
Compare
8b25cb1
to
450d1ff
Compare
@droazen @lbergelson are you ok for this to go in now? |
@tomwhite This looks good to me. |
looks good to me, feel free to review after merge if you want
No description provided.