
Adding support for a Sequence Dictionary from BAM files #9

Closed

Conversation

tdanford
Contributor

One of the things that's missing from the current ADAM format and tools is the ability to represent and manipulate the "sequence dictionaries" (from the @SQ lines in the SAM/BAM header) that determine the ordering and indexing of the referenceName fields within the records of a BAM file.

We're going to need to represent and handle these dictionaries for anything that works with more than one ADAM file at a time -- two examples would be ADAM comparison tools, and multi-sample methods for analyzing read data.

This request extends the ADAMRecord schema with referenceLength and referenceUrl fields (from the BAM sequence dictionary) and adds a SequenceDictionary class that handles multiple sequence records and mapping between different dictionaries. I've extended the bam2adam command to extract the sequence dictionary from the BAM header and insert its fields into the ADAMRecords (through a new argument to SamRecordConverter), and I've included two new commands: ListDict, for listing the contents of one of these files, and CompareAdam, for counting the number of identically-aligned reads between two ADAM files even if they have different sequence dictionaries.
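To give a rough idea of the shape of the new class -- this is just a sketch, not the actual code in the pull request (the real class also handles remapping reference ids between dictionaries):

case class SequenceRecord(id: Int, name: String, length: Long, url: String)

class SequenceDictionary(val records: Seq[SequenceRecord]) {
  // look up a sequence by reference name
  def apply(name: String): Option[SequenceRecord] = records.find(_.name == name)

  // merge two dictionaries; assumes records that share a name agree on their other fields
  def ++(other: SequenceDictionary): SequenceDictionary =
    new SequenceDictionary((records ++ other.records).distinct)
}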

Places that I know of that probably need review and editing:

  1. CompareAdam doesn't handle paired end reads, and is a little underpowered at the moment. We (GenomeBridge) are planning on building a fuller set of tools, using ADAM, to compare and validate the output of multiple pipelines on the same data, but that's not part of CompareAdam (yet). In the meantime, perhaps CompareAdam would be better as a test?
  2. We're leaving undetermined here whether there should be a "master" SequenceDictionary that is maintained and saved across all ADAM files, and to which each ADAM file's dictionary can be mapped, or whether every analysis should also take the sequence dictionary for each component ADAM file as an argument (what we call the "local dictionaries only" approach).

@AmplabJenkins

Can one of the admins verify this patch?

@fnothaft
Member

Jenkins, add to whitelist and test this please.

@fnothaft
Member

Tim, this looks great! However, we like new features to come into the repository as a single commit. Can you squash the commits down to a single commit? There are instructions here: http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html

@tdanford
Contributor Author

Sure, I can rebase it all down.

@fnothaft
Member

Also, we try to avoid checking data files into the repository. Is there a site where the *.bam can be downloaded for testing? If so, we can script the tests that use this *.bam file to download and convert it before the tests run.

@tdanford
Contributor Author

No, I created these small files by hand -- there's not an easy place to grab them, since they are synthetic and aligned to only two chromosomes.

Would you like me to take them out altogether? They're not used by any automated tests at the moment, although I think they should be.

/**
* Copyright (c) 2013 Genome Bridge LLC
*/
package edu.berkeley.cs.amplab.adam.predicates
Member

Tim, it's not clear that this file is in the correct place, as it isn't a predicate. I believe it should be moved over to the projections bin.

Contributor Author

Moving it over...

Member

I saw that after I commented. I think the ADAMRecordField is also in the wrong place and have submitted a new PR to fix that.

@fnothaft
Member

For the test files, I think we have an area on our side where we can place them so that we can "grab" them for testing. It'd be best to remove them for now; perhaps you can get them to us offline?

@tdanford
Contributor Author

I've removed the BAM and parquet files from adam-commands/src/test/resources

@tdanford
Contributor Author

I've moved ADAMSequenceRecordField to the projections package, as you suggested Frank.

@fnothaft
Member

Thanks Tim!

@tdanford
Contributor Author

I know I'll need to rebase (again) off of master, but I'll let your ADAMRecordField move request go through first, before I do that.

@fnothaft
Member

Tim, I've just merged in the ADAMRecordField request. Can you rebase?

@tdanford
Contributor Author

Yeah, I will. There are also some changes that Matt suggested in the phone call today that should go in as well.

@tdanford
Contributor Author

Frank, I've rebased onto the latest master. I've also included the changes that Matt suggested in the phone meeting yesterday. Mostly this centers on removing the extra ".dict" file that represented a sequence dictionary and instead flattening the fields of the sequence dictionary into the ADAMRecords themselves.

val dictReads1 : RDD[ADAMRecord] = sc.adamLoad(args.input1Path, projection=Some(dictProjection))
val dictReads2 : RDD[ADAMRecord] = sc.adamLoad(args.input2Path, projection=Some(dictProjection))

val dict1 = SequenceDictionary(dictReads1.collect().map(SequenceRecord(_)) : _*)
Member

Is there a way we can keep from using collect() here?

What if we combine the projections into a single projection and load each file once (instead of twice)? Then we can use the "aggregate" method on the input1 and input2 RDDs to create a SequenceDictionary. This will push the processing to the cluster and reduce the amount of data transferred to the driver (only the SequenceDictionaries will be returned).

Once you have the two SequenceDictionary objects, the rest of the code proceeds as you have it.

There are examples of using the aggregate method in a few places in the ADAM code for reference (e.g. FlagStat). One nice property of the aggregate method is that you can reuse the "left" object as you fold in the right, which saves on object creation and improves performance (the FlagStat command doesn't use this feature right now).
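To sketch what I mean (SequenceDictionary.empty and the + / ++ operations here are my assumptions -- adapt to whatever your class actually provides, and combinedProjection is the merged projection described above):

val reads1: RDD[ADAMRecord] = sc.adamLoad(args.input1Path, projection = Some(combinedProjection))

// aggregate builds a dictionary per partition on the executors, so only the small
// SequenceDictionary objects travel back to the driver rather than the reads themselves
val dict1: SequenceDictionary = reads1.aggregate(SequenceDictionary.empty)(
  (dict, read) => dict + SequenceRecord(read), // fold one record into the running dictionary
  (left, right) => left ++ right               // merge the per-partition dictionaries
)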

@massie
Member

massie commented Nov 29, 2013

I really like how clean this patch is. Easy to read and well-formatted. Thanks for that!

On the phone we discussed converting your test BAMs to SAMs to drop into the git repo. I don't see them in the commit now, are you still planning on adding them later?

throw new IllegalArgumentException("If you're reading a BAM/SAM file, the record type must be ADAMRecord")
} else {
val projected : RDD[T] = adamParquetLoad(filePath, predicate, projection=Some(projection))
SequenceDictionary(projected.distinct().map(SequenceRecord.fromSpecificRecord(_)).collect() : _*)
Member

I left a comment on an old commit of yours (that I think github dropped?). In any case, you can ignore it now and I'll repeat it here.

It would be best to avoid calling "collect()" if possible, since it pulls all the data into the driver for processing. It's better to push the processing out to the executors. I would recommend using "aggregate" here, similarly to how we use "aggregate" in the FlagStat object. The aggregate call would return a SequenceDictionary. The aggregate method also allows you to reuse an object for better performance (which the current FlagStat doesn't -- we can process the flags from a whole genome in less than a minute, so I haven't taken the time to rewrite it yet).

  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using
   * given combine functions and a neutral "zero value". This function can return a different result
   * type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
   * and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
   * allowed to modify and return their first argument instead of creating a new U to avoid memory
   * allocation.
   */

It would be nice to test this functionality on a whole genome if you have a few lying around. If not, I might be able to.
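To make the reuse point concrete, a sketch only -- it assumes a mutating add method that returns this, and a mergeIn that folds the right dictionary into the left, neither of which exists yet under those names:

val dict: SequenceDictionary = projected.aggregate(new SequenceDictionary())(
  (dict, rec) => dict.add(SequenceRecord.fromSpecificRecord(rec)), // mutate and return the same dictionary
  (left, right) => left.mergeIn(right)                             // fold right into left, reusing left
)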

@massie
Member

massie commented Nov 29, 2013

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM/44/

@tdanford
Contributor Author

Matt, I think I've made the changes you're suggesting, including
(1) adding in those test SAM files (thanks for the reminder!)
(2) removing that extraneous import,
(3) re-ordering the referenceUrl and referenceLength fields to the end of the ADAMRecord schema, and
(4) rewriting the SequenceDictionary loading code to use aggregate rather than collect.

Definitely could use another pair of eyes on all those changes.

Also, item (1) still implies that there should be a test which uses those SAM files, and I haven't written that test yet. I'll work on that today.

@tdanford
Contributor Author

... rebased off the latest master. I'm going to add a test for CompareAdam, and then I think this might almost be ready to go?

@tdanford
Contributor Author

Okay, my list of changes that I think still need to be made is empty. Did I miss anything?

@tdanford
Contributor Author

tdanford commented Dec 1, 2013

I fixed some of the problems I noted above -- but in a second commit, so that those comments can still be read in their original setting. When we're ready to merge, let me know and I'll rebase this all down to one commit again first.

read1.getStart != read2.getStart
}.foreach {
case (readName, (read1, read2)) =>
println("%s => %s:%d:%d (%s) and %s:%d:%d (%s)".format(readName,
Member

Should we make printing this optional? If someone runs this against a whole genome, they'll get nearly a billion lines of output.

Two options for making it optional: (1) add a command-line option, or (2) if this is useful for debugging only, extend the "logging" class and use log.info("...") or log.debug("...").
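For example, if you go the logging route -- a sketch only, where mismatchedReads stands in for whatever RDD you end up iterating over:

// assumes the command mixes in the same Logging trait used elsewhere in ADAM
mismatchedReads.foreach {
  case (readName, (read1, read2)) =>
    log.debug("%s => %s:%d and %s:%d".format(readName,
      read1.getReferenceName, read1.getStart,
      read2.getReferenceName, read2.getStart))
}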

This is your feature so you know it better than I. If users need this output then please ignore this suggestion.

Contributor Author

Right now I'm the only user -- and I should make it clear, I'm planning on expanding the way that CompareAdam works in future pull requests. So my requirements are, for the moment, the only ones that I'm trying to satisfy; and I think you're right, this logging should definitely be either taken out or moved into a proper logging call. (I originally was wondering whether CompareAdam should actually just be a test.)

Member

Makes sense. I'm ok with leaving it as is. I just wanted to make sure that it was a debugging statement.

@massie
Member

massie commented Dec 2, 2013

I should add that I think we're really close to being able to commit this. Looking very good.


val countSamePosition = named1.join(named2).filter {
case (readName, (read1, read2)) =>
read1.getReadMapped == read2.getReadMapped &&
Contributor Author

I believe there's actually a bug in this predicate -- I'm missing a paren and getting the operator precedence/ordering wrong. Ugh.
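For reference, the grouping I intended is roughly the following -- not necessarily the final fix, and the reference comparison is a stand-in for however the remapped ids end up being compared:

val countSamePosition = named1.join(named2).filter {
  case (readName, (read1, read2)) =>
    // mapped flags must agree, and if both reads are mapped their positions must match
    (read1.getReadMapped == read2.getReadMapped) &&
      (!read1.getReadMapped ||
        (read1.getReferenceName == read2.getReferenceName &&
          read1.getStart == read2.getStart))
}.count()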

Member

Glad to hear that you've tracked down the issue. I'll keep an eye out for the fix. Once you have it, this work is ready to merge.

Contributor Author

The fix should be coming within the next hour.

Contributor Author

There's a deeper issue here, with the way I'm (not) handling paired reads. I need to think about this more, before I have a fix.

Member

Take all the time you need. If you need to bounce ideas off anyone, feel free to reach out.

For what it's worth, there's an ADAM RDD method, "adamSingleReadBuckets", which returns an RDD of "SingleReadBucket" objects. A SingleReadBucket holds all the single-end/paired-end reads and the primary and secondary alignments for a single read name; e.g. a mapped PE read would have two ADAMRecords in the primaryMapped array. Basically, it's all the information for a single read in a single object. Btw, I'd love any feedback on a better name for it.
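A rough usage sketch (method and field names from memory, so double-check them against the class; keying on the first primary alignment's read name is just one way to line the buckets up):

val buckets1 = rdd1.adamSingleReadBuckets().keyBy(_.primaryMapped.head.getReadName)
val buckets2 = rdd2.adamSingleReadBuckets().keyBy(_.primaryMapped.head.getReadName)

// after the join, each value holds every alignment for one read name in each file,
// so first-of-pair and second-of-pair reads can be compared without losing track of mates
val pairedBuckets = buckets1.join(buckets2)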

Contributor Author

Ah, thanks for pointing that out. I had seen it, but hadn't connected it to the need I was experiencing here -- but it's exactly what I want. (In re: naming, the only thing I could think of was s/bucket/cluster/, but that's a matter of taste perhaps.)

@tdanford
Contributor Author

tdanford commented Dec 2, 2013

Matt, after talking about it for a while, Carl and I decided that there's still an unaddressed issue in the code as it currently stands. We've flattened the sequence dictionary into the reference{Name, Id, Length, Url} set of fields in ADAMRecord, but (we think) we will need to flatten it into the mateReference{Name, Id, Length, Url} fields as well.

Does that make sense? Otherwise, you could have a BAM with a sequence dictionary containing two sequences -- and if there were a sequence to which only the mate of every paired read aligned, you'd lose sight of that sequence when converting from BAM to ADAM.

So I'm going to change the request to flatten the dictionary into both the reference and mateReference fields of ADAMRecord, and then reconstruct it from these pairs downstream. Let me know if that doesn't make sense, or if you think there's a better way to do it.
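Concretely, the downstream reconstruction would look roughly like this (a sketch only -- fromReference and fromMateReference are illustrative names, and unmapped mates are ignored for brevity):

val dict: SequenceDictionary = reads.aggregate(SequenceDictionary.empty)(
  // fold both the read's own reference fields and its mate's reference fields into the
  // dictionary, so a contig that only mates align to isn't lost in the conversion
  (d, read) => d + SequenceRecord.fromReference(read) + SequenceRecord.fromMateReference(read),
  (left, right) => left ++ right
)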

(1) Adding referenceLength and referenceUrl to the ADAMRecord schema.

ADAMRecord now contains a (flattened) version of the sequence dictionary element
corresponding to each read.

An entire sequence dictionary can be reconstructed from the records associated with a
single BAM or SAM file.

(2) Adding the SequenceDictionary class

Aggregates a consistent set of SequenceRecord objects into a single dictionary.

SequenceRecords are derived from the reference{Id, Name, Length, Url} fields of the
ADAMRecord.

Most of the work here goes into providing methods for combining and mapping between
sequence dictionaries, which will be used for cross-BAM or cross-ADAM comparisons.

(3) Extending Bam2Adam to flatten the sequence dictionary into each ADAMRecord

When bam2adam is run on a BAM, we extract the sequence dictionary from the BAM's header
and insert its values into each ADAMRecord.

Also updated the adamBamLoad method in ADAMContext to do something similar, even without
the explicit parquet file.

(4) Adding the ListDict command.

ListDict reads a SequenceDictionary out of a set of ADAMRecords and prints out all the
component SequenceRecord objects contained in it.

(5) Adding the CompareAdam command

CompareAdam is a command that takes two ADAM files as arguments.

It assumes that the same read names are present in *both* input ADAM files --
for example, if the two files were produced by different alignment tools run
on the same set of reads.

It calculates the number of reads that are unique to each ADAM file, as well as
the number of shared reads which have the same alignment position across the two
ADAM files.

This position is computed irrespective of the ordering of the Sequences in the
sequence dictionaries of the two ADAM files, and therefore is a useful end-to-end
test of the SequenceDictionary code, as well as being the basis for more useful
pairwise comparisons of ADAM files in the future.

(6) Extending ADAMContext with methods to automate the loading of SequenceDictionary
objects out of any conformant Avro schema.
@tdanford
Contributor Author

tdanford commented Dec 2, 2013

Matt, the latest commit should fix the issue I noted above -- it should also handle the comparison of your mouse example BAM, as well as the example SAMs I had posted before.

@tdanford
Contributor Author

tdanford commented Dec 3, 2013

Once again, I'll rebase this down to a single commit before merging.

@massie
Member

massie commented Dec 3, 2013

Thanks, Timothy and Carl. I've manually merged this to master now (rebasing to a single commit).

@massie closed this Dec 3, 2013
@tdanford deleted the sequence-dictionary branch December 3, 2013 15:25
ryan-williams added a commit to ryan-williams/adam that referenced this pull request Feb 5, 2017
merge upstream, upgrade hammerlab deps