Adding support for a Sequence Dictionary from BAM files #9
Conversation
Can one of the admins verify this patch?
Jenkins, add to whitelist and test this please.
Tim, this looks great! However, we like new features to come into the repository as a single commit. Can you squash the commits down to a single commit? There are instructions here: http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html
Sure, I can rebase it all down.
Also, we try to avoid checking in data files to the repository. Is there a site where the *.bam can be downloaded for testing? If so, then we can script the tests that use this *.bam file so that they run and convert it before executing the tests.
No, I created these small files by hand -- there's not an easy place to grab them, since they are synthetic and aligned to only two chromosomes. Would you like me to take them out altogether? They're not used by any automated tests at the moment, although I think they should be.
/**
 * Copyright (c) 2013 Genome Bridge LLC
 */
package edu.berkeley.cs.amplab.adam.predicates
Tim, it's not clear that this file is in the correct place, as it isn't a predicate. I believe it should be moved over to the projections bin.
Moving it over...
I saw that after I commented. I think the ADAMRecordField is also in the wrong place and have submitted a new PR to fix that.
For the test files, I think we have an area on our side where we can place them so that we can "grab" them for testing. It'd be best to remove them for now; perhaps you can get them to us offline?
I've removed the BAM and parquet files from adam-commands/src/test/resources
I've moved ADAMSequenceRecordField to the projections package, as you suggested Frank.
Thanks Tim!
I know I'll need to rebase (again) off of master, but I'll let your ADAMRecordField move request go through first, before I do that.
Tim, I've just merged in the ADAMRecordField request. Can you rebase?
Yeah, I will. There are also some changes that Matt suggested in the phone call today that should go in as well.
Frank, I've rebased onto the latest master. I've also included the changes that Matt had suggested in the phone meeting yesterday. Mostly this centers around removing the extra ".dict" file that represents a sequence dictionary, and instead flattening the fields of the sequence dictionary into the ADAMRecords themselves. |
val dictReads1 : RDD[ADAMRecord] = sc.adamLoad(args.input1Path, projection=Some(dictProjection))
val dictReads2 : RDD[ADAMRecord] = sc.adamLoad(args.input2Path, projection=Some(dictProjection))

val dict1 = SequenceDictionary(dictReads1.collect().map(SequenceRecord(_)) : _*)
Is there a way we can keep from using collect() here?
What if we combine the projections into a single projection and load each file once (instead of twice)? Then we can use the "aggregate" method on the input1 and input2 rdds to create a SequenceRecord. This will push the processing to the cluster and reduce the amount of data transferred to the driver (only the SequenceDictionaries will be returned).
Once you have the two SequenceDictionary objects, the rest of the code proceeds as you have it.
There are examples of using the aggregate method in a few places in the ADAM code for reference (e.g. FlagStat). One nice property of the aggregate method is that you can reuse the "left" object as you fold in the right which saves on object creation and improves performance (the FlagStat command doesn't use the feature right now).
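The combine-then-aggregate pattern Matt describes can be sketched with plain Scala collections standing in for RDD partitions. This is only an illustration of the seqOp/combOp semantics of Spark's `aggregate` -- the `SeqRec` case class and the two-partition split are hypothetical, not ADAM code:

```scala
// Sketch of the aggregate pattern: fold records into a partition-local
// set (seqOp), then merge the per-partition results (combOp). On a real
// RDD this would be rdd.aggregate(zero)(seqOp, combOp); here plain Scala
// collections stand in for the partitions.
case class SeqRec(name: String, length: Long)

val partition1 = Seq(SeqRec("chr1", 1000L), SeqRec("chr2", 2000L))
val partition2 = Seq(SeqRec("chr1", 1000L)) // duplicate of a record in partition1

val seqOp: (Set[SeqRec], SeqRec) => Set[SeqRec] = (dict, rec) => dict + rec
val combOp: (Set[SeqRec], Set[SeqRec]) => Set[SeqRec] = (a, b) => a ++ b

// Each "partition" folds locally, then the partial results are merged;
// only the small dictionaries, not the records, would cross the wire.
val dict = Seq(partition1, partition2)
  .map(_.foldLeft(Set.empty[SeqRec])(seqOp))
  .reduce(combOp)
```

On a cluster, only the merged dictionary reaches the driver, rather than every record as with `collect()`.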
I really like how clean this patch is. Easy to read and well-formatted. Thanks for that! On the phone we discussed converting your test BAMs to SAMs to drop into the git repo. I don't see them in the commit now; are you still planning on adding them later?
throw new IllegalArgumentException("If you're reading a BAM/SAM file, the record type must be ADAMRecord")
} else {
val projected : RDD[T] = adamParquetLoad(filePath, predicate, projection=Some(projection))
SequenceDictionary(projected.distinct().map(SequenceRecord.fromSpecificRecord(_)).collect() : _*)
I left a comment on an old commit of yours (that I think github dropped?). In any case, you can ignore it now and I'll repeat it here.
It would be best to avoid calling "collect()" if possible, since this pulls all the data into the driver for processing. It's better to push the processing out to the executors. I would recommend using "aggregate" here, similarly to how we use "aggregate" in the FlagStat object. The aggregate call would return a SequenceDictionary. The aggregate command allows you to reuse the object for better performance (which the current FlagStat doesn't -- we can process the flags from a whole genome in less than a minute, so I haven't taken the time to rewrite it yet).
/**
* Aggregate the elements of each partition, and then the results for all the partitions, using
* given combine functions and a neutral "zero value". This function can return a different result
* type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
* and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
* allowed to modify and return their first argument instead of creating a new U to avoid memory
* allocation.
*/
It would be nice to test this functionality on a whole genome if you have a few lying around. If not, I might be able to.
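The "allowed to modify and return their first argument" property from the scaladoc above is the reuse trick mentioned earlier. A minimal sketch, assuming nothing about ADAM's types -- the mutable set of sequence lengths is purely illustrative:

```scala
import scala.collection.mutable

// seqOp mutates and returns its first argument, so no new accumulator
// is allocated per record -- exactly what the aggregate contract permits.
val seqOpMut: (mutable.Set[Long], Long) => mutable.Set[Long] =
  (acc, len) => { acc += len; acc }   // in-place update, then return acc
val combOpMut: (mutable.Set[Long], mutable.Set[Long]) => mutable.Set[Long] =
  (a, b) => { a ++= b; a }            // fold the right side into the left

// Simulate two partitions folding locally, then merging:
val left  = Seq(100L, 200L).foldLeft(mutable.Set.empty[Long])(seqOpMut)
val right = Seq(300L).foldLeft(mutable.Set.empty[Long])(seqOpMut)
val merged = combOpMut(left, right)   // same object as `left`, now holding all three
```

Because `combOpMut` returns `left` itself, the merge allocates nothing new, which is the performance win described above.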
Jenkins, test this please.
Merged build triggered.
Merged build started.
Merged build finished.
All automated tests passed.
Matt, I think I've made the changes you're suggesting. Definitely could use another pair of eyes on all those changes. Also, (1) still suggests that there should be a test which uses those SAM files, and I haven't written that test yet. I'll work on that today.
... rebased off the latest master. I'm going to add a test for CompareAdam, and then I think this might almost be ready to go?
Okay, my list of changes that I think still need to be made is empty. Did I miss anything? |
I fixed some of the problems I had noted above -- but in a second commit, so that those comments can still be read in their original setting. When we're ready to merge, let me know and I'll rebase this all down to one commit again first.
read1.getStart != read2.getStart
}.foreach {
case (readName, (read1, read2)) =>
println("%s => %s:%d:%d (%s) and %s:%d:%d (%s)".format(readName,
Should we make printing this optional? If someone runs this against a whole genome, they'll get nearly a billion lines of output.
Two options for making it optional: (1) add a command-line option, or (2) if this is useful for debugging only, extend the "logging" class and use log.info("...") or log.debug("...").
This is your feature so you know it better than I. If users need this output then please ignore this suggestion.
Right now I'm the only user -- and I should make it clear, I'm planning on expanding the way that CompareAdam works in future pull requests. So my requirements are, for the moment, the only ones that I'm trying to satisfy; and I think you're right, this logging should definitely be either taken out or moved into a proper logging call. (I originally was wondering whether CompareAdam should actually just be a test.)
Makes sense. I'm ok with leaving it as is. I just wanted to make sure that it was a debugging statement.
I should add that I think we're really close to being able to commit this. Looking very good.
val countSamePosition = named1.join(named2).filter {
case (readName, (read1, read2)) =>
read1.getReadMapped == read2.getReadMapped &&
I believe there's actually a bug in this predicate -- I'm missing a paren and getting the operator precedence/ordering wrong. Ugh.
Glad to hear that you've tracked down the issue. I'll keep an eye out for the fix. Once you have it, this work is ready to merge.
The fix should be coming within the next hour.
There's a deeper issue here, with the way I'm (not) handling paired reads. I need to think about this more, before I have a fix.
Take all the time you need. If you need to bounce ideas off anyone, feel free to reach out.
For what it's worth, there's an ADAM RDD method, "adamSingleReadBuckets", which returns an RDD of "SingleReadBucket" objects. A SingleReadBucket has all the Single/PE reads, primary and secondary alignments for a single read, e.g. a PE mapped read would have two ADAMRecords in the primaryMapped array. Basically, all the information for a single read in a single object. Btw, I'd love any feedback you have on a better name for it if you have it.
Ah, thanks for pointing that out. I had seen it, but hadn't connected it to the need I was experiencing here -- but it's exactly what I want. (In re: naming, the only thing I could think of was s/bucket/cluster/, but that's a matter of taste perhaps.)
Matt, after talking about it for a while, Carl and I decided that there's still an unaddressed issue in the code as it currently stands. We've flattened the sequence dictionary into the reference{Name, Id, Length, Url} set of fields in ADAMRecord, but (we think) we will also need to flatten it into mateReference{Name, Id, Length, Url} as well. Otherwise, you could have a BAM with a sequence dictionary with two sequences -- and if there were a sequence to which only the mate of every paired read ever aligned, then you'd lose sight of that sequence when converting from BAM to ADAM. So I'm going to change the request to flatten the dictionary into both the reference and mateReference fields of ADAMRecord, and then reconstruct it from these pairs downstream. Let me know if that doesn't make sense, or if you think there's a better way to do it.
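The reconstruction step can be sketched as follows. The `Rec` case class is a stand-in for ADAMRecord, carrying only the flattened fields under discussion (field names come from the comment above; this is illustrative, not the real schema):

```scala
// Stand-in for ADAMRecord with the dictionary fields flattened in for
// both the read's reference and its mate's reference.
case class Rec(referenceName: String, referenceLength: Long,
               mateReferenceName: String, mateReferenceLength: Long)

val reads = Seq(
  Rec("chr1", 1000L, "chr2", 2000L),  // the mate aligned to chr2...
  Rec("chr1", 1000L, "chr1", 1000L)   // ...and no read aligned there directly
)

// Reconstruct the dictionary from BOTH sets of fields, so a sequence
// that only ever appears as a mate reference (chr2 here) is not lost:
val reconstructed: Set[(String, Long)] =
  reads.flatMap(r => Seq((r.referenceName, r.referenceLength),
                         (r.mateReferenceName, r.mateReferenceLength))).toSet
```

Taking only the `reference*` fields would drop chr2 entirely, which is the failure mode described above.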
(1) Adding referenceLength and referenceUrl to the ADAMRecord schema. ADAMRecord now contains a (flattened) version of the sequence dictionary element corresponding to each read. An entire sequence dictionary can be read out of the records associated with a single BAM or SAM file.
(2) Added a SequenceDictionary class, which aggregates a consistent set of SequenceRecord objects into a single dictionary. SequenceRecords are derived from the reference{Id, Name, Length, Url} fields of the ADAMRecord. Most of the work here goes into providing methods for combining and mapping between sequence dictionaries, which will be used for cross-BAM or cross-ADAM comparisons.
(3) Extended Bam2Adam to create a SequenceDictionary .dict parquet file. When bam2adam is run on a BAM, we extract the sequence dictionary from the BAM's header and insert its values into each ADAMRecord. Also updated the adamBamLoad method in ADAMContext to do something similar, even without the explicit parquet file.
(4) Added the ListDict command, which reads a SequenceDictionary out of a set of ADAMRecords and prints out all the component SequenceRecord objects contained in it.
(5) Added the CompareAdam command, which takes two ADAM files as arguments. It assumes that the same read names are present in *both* input ADAM files -- for example, if the two files were produced by different alignment tools run on the same set of reads. It calculates the number of reads that are unique to each ADAM file, as well as the number of shared reads which have the same alignment position across the two BAMs. This position is computed irrespective of the ordering of the sequences in the sequence dictionaries of the two ADAM files, and is therefore a useful end-to-end test of the SequenceDictionary code, as well as the basis for more useful pairwise comparisons of ADAM files in the future.
(6) Extended AdamContext with methods to automate the loading of SequenceDictionary objects from any conformant Avro schema.
Matt, the latest commit should fix the issue I noted above -- it should also handle the comparisons of your mouse example BAM, as well as the example SAMs I had posted before.
…quenceRecord in comments.
Once again, I'll rebase this down to a single commit before merging.
Thanks, Timothy and Carl. I've manually merged this to master now (rebasing to a single commit).
merge upstream, upgrade hammerlab deps
One of the things that's missing from the current ADAM format and tools is the ability to represent and manipulate the "sequence dictionaries" (from the @sq lines in the SAM / BAM header) which determine the ordering and indexing of the referenceName fields within the records of a BAM file.
We're going to need to represent and handle these dictionaries for anything that works with more than one ADAM file at a time -- two examples would be ADAM comparison tools, and multi-sample methods for analyzing read data.
This request includes an extension to the ADAMRecord schema to include the referenceLength and referenceUrl fields (from the BAM Sequence dictionary), and the definition of a class (SequenceDictionary) which handles multiple sequence records, and mapping between different dictionaries. I've extended the bam2adam command to extract the sequence dictionary from the BAM header and insert its fields into the ADAMRecords (through a new argument to SamRecordConverter), and I've included two new commands: ListDict (for listing the contents of one of these files) and CompareAdam, for counting the number of identically-aligned reads between two ADAM files even if they have different sequence dictionaries.
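The CompareAdam counting logic described above can be sketched with plain Maps standing in for the joined RDDs. This is a hypothetical illustration -- `Read` and its fields are stand-ins, and comparing by reference *name* rather than dictionary index is what makes the count independent of sequence-dictionary ordering:

```scala
// Stand-in for the per-file read data: read name -> alignment position.
case class Read(referenceName: String, start: Long)

val file1 = Map("read1" -> Read("chr1", 100L),
                "read2" -> Read("chr2", 50L),
                "read3" -> Read("chr1", 7L))
val file2 = Map("read1" -> Read("chr1", 100L),
                "read2" -> Read("chr2", 60L))

// Reads unique to each file:
val uniqueTo1 = file1.keySet -- file2.keySet
val uniqueTo2 = file2.keySet -- file1.keySet

// Shared reads with identical alignment position, compared by reference
// name (not by referenceId, which may differ between dictionaries):
val shared = file1.keySet & file2.keySet
val samePosition = shared.count { name =>
  file1(name).referenceName == file2(name).referenceName &&
  file1(name).start == file2(name).start
}
```

In the real command the join would run on the cluster; the Map-based version only shows the counting semantics.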
Places that I know of that probably need review and editing: