Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BWA-MEM as an option to ReadsPipelineSpark #3666

Merged
merged 3 commits into from
Dec 21, 2017

Conversation

tomwhite
Copy link
Contributor

@tomwhite tomwhite commented Oct 5, 2017

No description provided.

@codecov-io
Copy link

codecov-io commented Oct 5, 2017

Codecov Report

Merging #3666 into master will increase coverage by 0.01%.
The diff coverage is 88.372%.

@@              Coverage Diff               @@
##              master     #3666      +/-   ##
==============================================
+ Coverage     79.481%   79.491%   +0.01%     
- Complexity     17287     17291       +4     
==============================================
  Files           1131      1132       +1     
  Lines          61514     61529      +15     
  Branches        9738      9740       +2     
==============================================
+ Hits           48892     48910      +18     
+ Misses          8850      8848       -2     
+ Partials        3772      3771       -1
Impacted Files Coverage Δ Complexity Δ
...bender/tools/spark/pathseq/PathSeqFilterSpark.java 82.609% <ø> (+5.942%) 4 <0> (-1) ⬇️
...lbender/tools/spark/bwa/BwaArgumentCollection.java 100% <100%> (ø) 1 <1> (?)
...stitute/hellbender/engine/spark/GATKSparkTool.java 85.437% <100%> (+0.437%) 57 <2> (+2) ⬆️
...k/pipelines/BwaAndMarkDuplicatesPipelineSpark.java 78.947% <83.333%> (+2.477%) 4 <1> (ø) ⬇️
...nder/tools/spark/pipelines/ReadsPipelineSpark.java 89.362% <86.957%> (-2.067%) 12 <0> (+2)
...institute/hellbender/tools/spark/bwa/BwaSpark.java 75% <87.5%> (+5.769%) 5 <2> (ø) ⬇️
...bender/tools/walkers/annotator/StrandBiasTest.java 85% <0%> (+1.25%) 32% <0%> (+1%) ⬆️
...utils/smithwaterman/SmithWatermanIntelAligner.java 90% <0%> (+10%) 3% <0%> (ø) ⬇️

@tedsharpe
Copy link
Contributor

I think that there's too much flexibility to choose arguments that will result in a corrupt product in ReadsPipelineSpark:

Whether or not to align, and whether to align in single-ended or paired mode ought to be dictated by the input data, not by some data-jockey's choice of arguments. I'm prepared to be corrected on this, but I think that paired alignment will only work on queryname sorted, paired input reads. If the data has already been aligned and coordinate sorted, I'd expect things to blow up because the pairs won't be adjacent, and I don't think we do any input sorting. I think we ought to be inferring whether to align, and whether to align singly or paired, from the input. If we want to force realignment of an already-aligned input, we should just have the user run a tool that strips alignment info and produces unaligned, queryname-sorted input to this tool. (Or we could incorporate that functionality into this tool, I suppose.)

Second crazy obscure issue: if the input file hasn't been aligned, and therefore doesn't contain a dictionary, we're going to try to borrow one from the reference when we do alignment. That's going to lead to insanity if the reference happens to be a 2bit file in which the order of the contigs is scrambled. We can use such a source for the contig metadata, but the order of the contigs must be prescribed by the reference image file. (The list of reference contig names in correct order is available from the bwa image.)

Third issue maybe isn't really an issue: the BwaSparkEngine code doesn't copy tags from input to output, except for read group. (That's because in the case that you're realigning, you'd want to strip all tags associated with the previous alignment, but there's no comprehensive list of tags that are associated with alignment vs. those that have to do with read itself, and there can't be due to the possibility of custom tags that might go into either category.)

Copy link
Contributor

@tedsharpe tedsharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks fine, but I think we need to think about the design of ReadsPipelineSpark a little more.

@tedsharpe
Copy link
Contributor

Sorry for the delay in reviewing this PR.

@tomwhite
Copy link
Contributor Author

@tedsharpe thanks for your comments! Adding @droazen and @lbergelson for thoughts too.

  1. Is it sufficient to look at the first two reads in the BAM file to determine whether the file is aligned or not? Scanning the whole file to see if it's aligned seems like overkill. I've updated this PR to look at the first two reads. Also, should we have a way of overriding these settings, to force alignment or single/paired mode? (BTW the single/paired read thing is an existing argument in BwaSpark.)

  2. Is this an intrinsic problem with 2bit, or just the implementation? [ADAM-1502] Preserve contig ordering in TwoBitFile sequence dictionary. bigdatagenomics/adam#1508 will fix the implementation (when ADAM 0.23 is out). Perhaps the best thing is to use the BWA image to sort the contig names correctly?

  3. Not sure what is needed here.

@tedsharpe
Copy link
Contributor

I defer to @droazen and @lbergelson on what the design should actually be, but my recommendation would be as follows:
Have an argument that forces single-ended mode -- that's always a valid possibility regardless of whether the input data is single-ended or paired, and regardless of how it's sorted.
If that flag is set, then just pass the reads to the BwaEngine and align single-ended without further ado.
If that flag is not set, then check the first read to see if it's paired or not (GATKRead isPaired method). We'll assume all the rest of the reads are similarly either paired or unpaired.
If the first read is unpaired, just pass the reads to the BwaEngine and align single-ended.
If the first read is paired:

  1. check the sort order in the header to make sure it's queryname ordered.
  2. check that the first read is the first read of the pair (read.isFirstOfPair).
  3. get the 2nd read and check that it's also paired, that it's the 2nd read of the pair, and that it has the same name as the first read.
    If any of these checks fail, whine and bail. Otherwise, pass the reads to the BwaEngine and align in paired mode.

@tedsharpe
Copy link
Contributor

Oh, and on point 3 about attribute copying: I don't know what we want to do, either. One possibility would be to allow the user to specify a list of tag names that should be preserved. Maybe it could default to RG and OQ?
But I'd definitely check with @lbergelson and or @droazen before implementing this.

@magicDGS
Copy link
Contributor

Would be nice to have also the BC tag included by default in point 3, to keep barcode information if present.

@droazen
Copy link
Contributor

droazen commented Oct 16, 2017

@lbergelson & I think @tedsharpe 's proposal above re: attributes sounds reasonable (including a whitelist of attributes to copy, optionally configurable via the command line).

The rest of the proposal sounds reasonable as well, modulo one or two comments which @lbergelson will add below.

@@ -116,42 +118,66 @@ protected void runTool(final JavaSparkContext ctx) {
//TOOO: should this use getUnfilteredReads? getReads will apply default and command line filters
final JavaRDD<GATKRead> initialReads = getReads();

final JavaRDD<GATKRead> markedReadsWithOD = MarkDuplicatesSpark.mark(initialReads, getHeaderForReads(), duplicatesScoringStrategy, new OpticalDuplicateFinder(), getRecommendedNumReducers());
List<GATKRead> firstReads = initialReads.take(2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a kind of sketchy check. It's very plausible that you'd just randomly have 2 unmapped reads at the start of your file.

I think that if someone specifies an image file, then they are saying they want to align, and if they don't pass it in they don't want to align.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergelson I agree that this isn't reliable: I added it as a demonstration really. The problem is that specifying an image file is optional - it's inferred from the name of the reference since #3643.

So perhaps we do need an --align flag, although @tedsharpe doesn't like that... Is there a reliable way to determine if the reads have been aligned by looking at the contents?

Similarly if unmapped reads can pop up anywhere, then determining single or paired alignment is unreliable.

Any suggestions on how to move forward on this?

@tedsharpe
Copy link
Contributor

I don’t have strong feelings about an —align flag.
If this is supposed to be a re-alignment tool as well as an alignment tool, I think we’ll need it.
The “paired read” flag is a property of the read, and not of the alignment. It should be present even in an unaligned BAM. The only assumption I think we need to make is that the pairedness of all reads is consistent with that of the first read.

@tomwhite
Copy link
Contributor Author

Thanks @tedsharpe. Perhaps the best thing to do at this point is have the existing flags (plus --align) and add some checks that look for inconsistencies and fail or warn as appropriate.

Copy link
Contributor

@magicDGS magicDGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After #3475, conflicts are expected in the test classes, because they are now extending GATKBaseTest - a rebase should solve most of the problems without need of change of the code (except comments below)

@@ -70,6 +70,8 @@ public void setTestVerbosity(){

public static final String b37_2bit_reference_20_21 = largeFileTestDir + "human_g1k_v37.20.21.2bit";

public static final String b37_reference_20_21_img = largeFileTestDir + "human_g1k_v37.20.21.fasta.img";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be moved to GATKBaseTest, because BaseTest should be clean from any repo-specific test resource...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - thanks for letting me know

@tomwhite
Copy link
Contributor Author

tomwhite commented Nov 2, 2017

I've rebased the branch and re-instated the --align flag.

final BwaSparkEngine engine;
if (align) {
engine = new BwaSparkEngine(ctx, referenceArguments.getReferenceFileName(), bwaArgs.indexImageFile, getHeaderForReads(), getReferenceSequenceDictionary());
alignedReads = !bwaArgs.singleEndAlignment ? engine.alignPaired(getReads()) : engine.alignUnpaired(getReads());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're calling getReads() again here -- can you just use initialReads above? Is this expensive?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, getReads() does filtering -- did you want to pass in the unfiltered reads (getUnfilteredReads()) to BWA?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It doesn't make a difference, since it is lazy. But I've changed the code to use initialReads to make it less confusing.
  2. Good question! What do you think is best, @droazen?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that you'd want to pass BWA the unfiltered reads, and then do filtering after alignment. Do you agree, @tedsharpe ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the one hand, it seems wasteful to spend time aligning reads that you're about to discard. On the other hand, if you're doing pairwise alignment and your filters don't either pass or reject both members of the pair consistently, you'll have a royal mess. Since filters are designed to operate on reads, not on read pairs, it would be hard (but not impossible) to guarantee correct filter behavior in this context.
All of which is a long-winded way of saying that you probably want to align unfiltered reads like @droazen said.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments @tedsharpe and @droazen. I've updated the code to align on unfiltered reads then do the alignment. I've updated BwaSpark and BwaAndMarkDuplicatesPipelineSpark too. (BTW do we still need the latter do you think?)

final JavaRDD<GATKRead> markedReadsWithOD = MarkDuplicatesSpark.mark(initialReads, getHeaderForReads(), duplicatesScoringStrategy, new OpticalDuplicateFinder(), getRecommendedNumReducers());
final JavaRDD<GATKRead> alignedReads;
final SAMFileHeader header;
final BwaSparkEngine engine;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename to bwaEngine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments to address before we can merge -- back to @tomwhite

@tomwhite
Copy link
Contributor Author

@droazen @lbergelson are you ok for this to go in now?

@lbergelson
Copy link
Member

@tomwhite This looks good to me.

@lbergelson lbergelson dismissed droazen’s stale review December 21, 2017 23:00

looks good to me, feel free to review after merge if you want

@lbergelson lbergelson merged commit 0017b7a into master Dec 21, 2017
@lbergelson lbergelson deleted the tw_bwa_reads_pipeline branch December 21, 2017 23:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants