Add BWA-MEM as an option to ReadsPipelineSpark #3666

tomwhite · 2017-10-05T16:18:22Z

No description provided.

codecov-io · 2017-10-05T17:10:21Z

Codecov Report

Merging #3666 into master will increase coverage by 0.01%.
The diff coverage is 88.372%.

@@              Coverage Diff               @@
##              master     #3666      +/-   ##
==============================================
+ Coverage     79.481%   79.491%   +0.01%     
- Complexity     17287     17291       +4     
==============================================
  Files           1131      1132       +1     
  Lines          61514     61529      +15     
  Branches        9738      9740       +2     
==============================================
+ Hits           48892     48910      +18     
+ Misses          8850      8848       -2     
+ Partials        3772      3771       -1

Impacted Files	Coverage Δ	Complexity Δ
...bender/tools/spark/pathseq/PathSeqFilterSpark.java	`82.609% <ø> (+5.942%)`	`4 <0> (-1)`	⬇️
...lbender/tools/spark/bwa/BwaArgumentCollection.java	`100% <100%> (ø)`	`1 <1> (?)`
...stitute/hellbender/engine/spark/GATKSparkTool.java	`85.437% <100%> (+0.437%)`	`57 <2> (+2)`	⬆️
...k/pipelines/BwaAndMarkDuplicatesPipelineSpark.java	`78.947% <83.333%> (+2.477%)`	`4 <1> (ø)`	⬇️
...nder/tools/spark/pipelines/ReadsPipelineSpark.java	`89.362% <86.957%> (-2.067%)`	`12 <0> (+2)`
...institute/hellbender/tools/spark/bwa/BwaSpark.java	`75% <87.5%> (+5.769%)`	`5 <2> (ø)`	⬇️
...bender/tools/walkers/annotator/StrandBiasTest.java	`85% <0%> (+1.25%)`	`32% <0%> (+1%)`	⬆️
...utils/smithwaterman/SmithWatermanIntelAligner.java	`90% <0%> (+10%)`	`3% <0%> (ø)`	⬇️

tedsharpe · 2017-10-10T18:52:47Z

I think that there's too much flexibility to choose arguments that will result in a corrupt product in ReadsPipelineSpark:

Whether or not to align, and whether to align in single-ended or paired mode ought to be dictated by the input data, not by some data-jockey's choice of arguments. I'm prepared to be corrected on this, but I think that paired alignment will only work on queryname sorted, paired input reads. If the data has already been aligned and coordinate sorted, I'd expect things to blow up because the pairs won't be adjacent, and I don't think we do any input sorting. I think we ought to be inferring whether to align, and whether to align singly or paired, from the input. If we want to force realignment of an already-aligned input, we should just have the user run a tool that strips alignment info and produces unaligned, queryname-sorted input to this tool. (Or we could incorporate that functionality into this tool, I suppose.)

Second crazy obscure issue: if the input file hasn't been aligned, and therefore doesn't contain a dictionary, we're going to try to borrow one from the reference when we do alignment. That's going to lead to insanity if the reference happens to be a 2bit file in which the order of the contigs is scrambled. We can use such a source for the contig metadata, but the order of the contigs must be prescribed by the reference image file. (The list of reference contig names in correct order is available from the bwa image.)

Third issue maybe isn't really an issue: the BwaSparkEngine code doesn't copy tags from input to output, except for read group. (That's because in the case that you're realigning, you'd want to strip all tags associated with the previous alignment, but there's no comprehensive list of tags that are associated with alignment vs. those that have to do with read itself, and there can't be due to the possibility of custom tags that might go into either category.)

tedsharpe

Code looks fine, but I think we need to think about the design of ReadsPipelineSpark a little more.

tedsharpe · 2017-10-10T18:54:35Z

Sorry for the delay in reviewing this PR.

tomwhite · 2017-10-13T11:43:47Z

@tedsharpe thanks for your comments! Adding @droazen and @lbergelson for thoughts too.

Is it sufficient to look at the first two reads in the BAM file to determine whether the file is aligned or not? Scanning the whole file to see if it's aligned seems like overkill. I've updated this PR to look at the first two reads. Also, should we have a way of overriding these settings, to force alignment or single/paired mode? (BTW the single/paired read thing is an existing argument in BwaSpark.)
Is this an intrinsic problem with 2bit, or just the implementation? [ADAM-1502] Preserve contig ordering in TwoBitFile sequence dictionary. bigdatagenomics/adam#1508 will fix the implementation (when ADAM 0.23 is out). Perhaps the best thing is to use the BWA image to sort the contig names correctly?
Not sure what is needed here.

tedsharpe · 2017-10-13T16:53:51Z

I defer to @droazen and @lbergelson on what the design should actually be, but my recommendation would be as follows:
Have an argument that forces single-ended mode -- that's always a valid possibility regardless of whether the input data is single-ended or paired, and regardless of how it's sorted.
If that flag is set, then just pass the reads to the BwaEngine and align single-ended without further ado.
If that flag is not set, then check the first read to see if it's paired or not (GATKRead isPaired method). We'll assume all the rest of the reads are similarly either paired or unpaired.
If the first read is unpaired, just pass the reads to the BwaEngine and align single-ended.
If the first read is paired:

check the sort order in the header to make sure it's queryname ordered.
check that the first read is the first read of the pair (read.isFirstOfPair).
get the 2nd read and check that it's also paired, that it's the 2nd read of the pair, and that it has the same name as the first read.
If any of these checks fail, whine and bail. Otherwise, pass the reads to the BwaEngine and align in paired mode.

tedsharpe · 2017-10-13T16:57:10Z

Oh, and on point 3 about attribute copying: I don't know what we want to do, either. One possibility would be to allow the user to specify a list of tag names that should be preserved. Maybe it could default to RG and OQ?
But I'd definitely check with @lbergelson and or @droazen before implementing this.

magicDGS · 2017-10-15T08:07:49Z

Would be nice to have also the BC tag included by default in point 3, to keep barcode information if present.

droazen · 2017-10-16T19:26:00Z

@lbergelson & I think @tedsharpe 's proposal above re: attributes sounds reasonable (including a whitelist of attributes to copy, optionally configurable via the command line).

The rest of the proposal sounds reasonable as well, modulo one or two comments which @lbergelson will add below.

lbergelson · 2017-10-16T19:31:57Z

src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/ReadsPipelineSpark.java

@@ -116,42 +118,66 @@ protected void runTool(final JavaSparkContext ctx) {
        //TOOO: should this use getUnfilteredReads? getReads will apply default and command line filters
        final JavaRDD<GATKRead> initialReads = getReads();

-        final JavaRDD<GATKRead> markedReadsWithOD = MarkDuplicatesSpark.mark(initialReads, getHeaderForReads(), duplicatesScoringStrategy, new OpticalDuplicateFinder(), getRecommendedNumReducers());
+        List<GATKRead> firstReads = initialReads.take(2);


This seems like a kind of sketchy check. It's very plausible that you'd just randomly have 2 unmapped reads at the start of your file.

I think that if someone specifies an image file, then they are saying they want to align, and if they don't pass it in they don't want to align.

@lbergelson I agree that this isn't reliable: I added it as a demonstration really. The problem is that specifying an image file is optional - it's inferred from the name of the reference since #3643.

So perhaps we do need an --align flag, although @tedsharpe doesn't like that... Is there a reliable way to determine if the reads have been aligned by looking at the contents?

Similarly if unmapped reads can pop up anywhere, then determining single or paired alignment is unreliable.

Any suggestions on how to move forward on this?

tedsharpe · 2017-10-23T15:21:10Z

I don’t have strong feelings about an —align flag.
If this is supposed to be a re-alignment tool as well as an alignment tool, I think we’ll need it.
The “paired read” flag is a property of the read, and not of the alignment. It should be present even in an unaligned BAM. The only assumption I think we need to make is that the pairedness of all reads is consistent with that of the first read.

tomwhite · 2017-10-23T16:38:02Z

Thanks @tedsharpe. Perhaps the best thing to do at this point is have the existing flags (plus --align) and add some checks that look for inconsistencies and fail or warn as appropriate.

magicDGS

After #3475, conflicts are expected in the test classes, because they are now extending GATKBaseTest - a rebase should solve most of the problems without need of change of the code (except comments below)

magicDGS · 2017-11-02T09:57:21Z

src/main/java/org/broadinstitute/hellbender/utils/test/BaseTest.java

@@ -70,6 +70,8 @@ public void setTestVerbosity(){

    public static final String b37_2bit_reference_20_21 = largeFileTestDir + "human_g1k_v37.20.21.2bit";

+    public static final String b37_reference_20_21_img = largeFileTestDir + "human_g1k_v37.20.21.fasta.img";


This should be moved to GATKBaseTest, because BaseTest should be clean from any repo-specific test resource...

Fixed - thanks for letting me know

tomwhite · 2017-11-02T10:24:00Z

I've rebased the branch and re-instated the --align flag.

droazen · 2017-11-20T20:34:23Z

src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/ReadsPipelineSpark.java

+        final BwaSparkEngine engine;
+        if (align) {
+                engine = new BwaSparkEngine(ctx, referenceArguments.getReferenceFileName(), bwaArgs.indexImageFile, getHeaderForReads(), getReferenceSequenceDictionary());
+                alignedReads = !bwaArgs.singleEndAlignment ? engine.alignPaired(getReads()) : engine.alignUnpaired(getReads());


You're calling getReads() again here -- can you just use initialReads above? Is this expensive?

Also, getReads() does filtering -- did you want to pass in the unfiltered reads (getUnfilteredReads()) to BWA?

It doesn't make a difference, since it is lazy. But I've changed the code to use initialReads to make it less confusing.

Good question! What do you think is best, @droazen?

I would think that you'd want to pass BWA the unfiltered reads, and then do filtering after alignment. Do you agree, @tedsharpe ?

On the one hand, it seems wasteful to spend time aligning reads that you're about to discard. On the other hand, if you're doing pairwise alignment and your filters don't either pass or reject both members of the pair consistently, you'll have a royal mess. Since filters are designed to operate on reads, not on read pairs, it would be hard (but not impossible) to guarantee correct filter behavior in this context.
All of which is a long-winded way of saying that you probably want to align unfiltered reads like @droazen said.

Thanks for the comments @tedsharpe and @droazen. I've updated the code to align on unfiltered reads then do the alignment. I've updated BwaSpark and BwaAndMarkDuplicatesPipelineSpark too. (BTW do we still need the latter do you think?)

droazen · 2017-11-20T20:35:38Z

src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/ReadsPipelineSpark.java

-        final JavaRDD<GATKRead> markedReadsWithOD = MarkDuplicatesSpark.mark(initialReads, getHeaderForReads(), duplicatesScoringStrategy, new OpticalDuplicateFinder(), getRecommendedNumReducers());
+        final JavaRDD<GATKRead> alignedReads;
+        final SAMFileHeader header;
+        final BwaSparkEngine engine;


Can you rename to bwaEngine?

droazen

A couple of comments to address before we can merge -- back to @tomwhite

tomwhite · 2017-12-14T09:25:47Z

@droazen @lbergelson are you ok for this to go in now?

lbergelson · 2017-12-21T22:59:38Z

@tomwhite This looks good to me.

looks good to me, feel free to review after merge if you want

tomwhite force-pushed the tw_bwa_reads_pipeline branch from a6c8d5c to 46f725d Compare October 6, 2017 09:01

tomwhite requested review from tedsharpe and lbergelson October 6, 2017 10:07

tedsharpe approved these changes Oct 10, 2017

View reviewed changes

droazen assigned tomwhite Oct 16, 2017

lbergelson reviewed Oct 16, 2017

View reviewed changes

tomwhite force-pushed the tw_bwa_reads_pipeline branch from ee452e5 to 12af3de Compare October 23, 2017 14:48

magicDGS reviewed Nov 2, 2017

View reviewed changes

tomwhite force-pushed the tw_bwa_reads_pipeline branch from 12af3de to 64c3990 Compare November 2, 2017 10:22

tomwhite mentioned this pull request Nov 6, 2017

hook up BWA to ReadsPipelineSpark #1904

Closed

lbergelson assigned lbergelson and unassigned tomwhite Nov 20, 2017

droazen reviewed Nov 20, 2017

View reviewed changes

droazen previously requested changes Nov 20, 2017

View reviewed changes

droazen assigned tomwhite and unassigned lbergelson Nov 20, 2017

tomwhite assigned droazen Nov 21, 2017

Add BWA-MEM as an option to ReadsPipelineSpark

4a6421e

Address review feedback.

f6a838f

tomwhite force-pushed the tw_bwa_reads_pipeline branch from e7ba3b3 to 8b25cb1 Compare December 13, 2017 16:17

Apply filtering after alignment for paired reads.

450d1ff

tomwhite force-pushed the tw_bwa_reads_pipeline branch from 8b25cb1 to 450d1ff Compare December 13, 2017 17:19

lbergelson merged commit 0017b7a into master Dec 21, 2017

lbergelson deleted the tw_bwa_reads_pipeline branch December 21, 2017 23:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add BWA-MEM as an option to ReadsPipelineSpark #3666

Add BWA-MEM as an option to ReadsPipelineSpark #3666

tomwhite commented Oct 5, 2017

codecov-io commented Oct 5, 2017 •

edited

Loading

tedsharpe commented Oct 10, 2017

tedsharpe left a comment

tedsharpe commented Oct 10, 2017

tomwhite commented Oct 13, 2017

tedsharpe commented Oct 13, 2017

tedsharpe commented Oct 13, 2017

magicDGS commented Oct 15, 2017

droazen commented Oct 16, 2017 •

edited

Loading

lbergelson Oct 16, 2017

tomwhite Oct 23, 2017

tedsharpe commented Oct 23, 2017

tomwhite commented Oct 23, 2017

magicDGS left a comment

magicDGS Nov 2, 2017

tomwhite Nov 2, 2017

tomwhite commented Nov 2, 2017

droazen Nov 20, 2017

droazen Nov 20, 2017

tomwhite Nov 21, 2017

droazen Nov 21, 2017

tedsharpe Nov 21, 2017

tomwhite Nov 22, 2017

droazen Nov 20, 2017

tomwhite Nov 22, 2017

droazen left a comment

tomwhite commented Dec 14, 2017

lbergelson commented Dec 21, 2017

		@@ -70,6 +70,8 @@ public void setTestVerbosity(){

		public static final String b37_2bit_reference_20_21 = largeFileTestDir + "human_g1k_v37.20.21.2bit";

		public static final String b37_reference_20_21_img = largeFileTestDir + "human_g1k_v37.20.21.fasta.img";

Add BWA-MEM as an option to ReadsPipelineSpark #3666

Add BWA-MEM as an option to ReadsPipelineSpark #3666

Conversation

tomwhite commented Oct 5, 2017

codecov-io commented Oct 5, 2017 • edited Loading

Codecov Report

tedsharpe commented Oct 10, 2017

tedsharpe left a comment

Choose a reason for hiding this comment

tedsharpe commented Oct 10, 2017

tomwhite commented Oct 13, 2017

tedsharpe commented Oct 13, 2017

tedsharpe commented Oct 13, 2017

magicDGS commented Oct 15, 2017

droazen commented Oct 16, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tedsharpe commented Oct 23, 2017

tomwhite commented Oct 23, 2017

magicDGS left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomwhite commented Nov 2, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

droazen left a comment

Choose a reason for hiding this comment

tomwhite commented Dec 14, 2017

lbergelson commented Dec 21, 2017

codecov-io commented Oct 5, 2017 •

edited

Loading

droazen commented Oct 16, 2017 •

edited

Loading