
Adding support for a Sequence Dictionary from BAM files #9

Closed

Conversation

tdanford
Contributor

One of the things that's missing from the current ADAM format and tools is the ability to represent and manipulate the "sequence dictionaries" (from the @SQ lines in the SAM/BAM header) that determine the ordering and indexing of the referenceName fields within the records of a BAM file.

We're going to need to represent and handle these dictionaries for anything that works with more than one ADAM file at a time -- two examples would be ADAM comparison tools, and multi-sample methods for analyzing read data.

This request extends the ADAMRecord schema with referenceLength and referenceUrl fields (from the BAM sequence dictionary) and adds a SequenceDictionary class that handles multiple sequence records and mapping between different dictionaries. I've extended the bam2adam command to extract the sequence dictionary from the BAM header and insert its fields into the ADAMRecords (through a new argument to SamRecordConverter), and I've included two new commands: ListDict, for listing the contents of one of these files, and CompareAdam, for counting the number of identically-aligned reads between two ADAM files even if they have different sequence dictionaries.
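To give a rough idea of the shape of the new class -- this is just a sketch, not the actual code in the pull request (the real class also handles remapping reference ids between dictionaries):

case class SequenceRecord(id: Int, name: String, length: Long, url: String)

class SequenceDictionary(val records: Seq[SequenceRecord]) {
  // look up a sequence by reference name
  def apply(name: String): Option[SequenceRecord] = records.find(_.name == name)

  // merge two dictionaries; assumes records that share a name agree on their other fields
  def ++(other: SequenceDictionary): SequenceDictionary =
    new SequenceDictionary((records ++ other.records).distinct)
}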

Places that I know of that probably need review and editing:

  1. CompareAdam doesn't handle paired end reads, and is a little underpowered at the moment. We (GenomeBridge) are planning on building a fuller set of tools, using ADAM, to compare and validate the output of multiple pipelines on the same data, but that's not part of CompareAdam (yet). In the meantime, perhaps CompareAdam would be better as a test?
  2. We're leaving undetermined here whether there should be a "master" SequenceDictionary that is maintained and saved across all ADAM files, and to which each ADAM file's dictionary can be mapped, or whether every analysis should also take the sequence dictionary for each component ADAM file as an argument (what we call the "local dictionaries only" approach).

@AmplabJenkins

Can one of the admins verify this patch?

@fnothaft
Member

Jenkins, add to whitelist and test this please.

@fnothaft
Member

Tim, this looks great! However, we like new features to come into the repository as a single commit. Can you squash the commits down to a single commit? There are instructions here: http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html

@tdanford
Contributor Author

Sure, I can rebase it all down.

@fnothaft
Member

Also, we try to avoid checking data files into the repository. Is there a site where the *.bam can be downloaded for testing? If so, we can script the tests that use this *.bam file to download and convert it before the tests run.

@tdanford
Contributor Author

No, I created these small files by hand -- there's not an easy place to grab them, since they are synthetic and aligned to only two chromosomes.

Would you like me to take them out altogether? They're not used by any automated tests at the moment, although I think they should be.

/**
* Copyright (c) 2013 Genome Bridge LLC
*/
package edu.berkeley.cs.amplab.adam.predicates
Member

Tim, it's not clear that this file is in the correct place, as it isn't a predicate. I believe it should be moved over to the projections bin.

Contributor Author

Moving it over...

Member

I saw that after I commented. I think the ADAMRecordField is also in the wrong place and have submitted a new PR to fix that.

@fnothaft
Member

For the test files, I think we have an area on our side where we can place them so that we can "grab" them for testing. It'd be best to remove them for now; perhaps you can get them to us offline?

@tdanford
Contributor Author

I've removed the BAM and parquet files from adam-commands/src/test/resources

@tdanford
Contributor Author

I've moved ADAMSequenceRecordField to the projections package, as you suggested Frank.

@fnothaft
Member

Thanks Tim!

@tdanford
Contributor Author

I know I'll need to rebase (again) off of master, but I'll let your ADAMRecordField move request go through first, before I do that.

@fnothaft
Member

Tim, I've just merged in the ADAMRecordField request. Can you rebase?

@tdanford
Contributor Author

Yeah, I will. There are also some changes that Matt suggested in the phone call today that should go in as well.

@tdanford
Contributor Author

Frank, I've rebased onto the latest master. I've also included the changes that Matt suggested in the phone meeting yesterday. Mostly this centers on removing the extra ".dict" file that represented a sequence dictionary and instead flattening the fields of the sequence dictionary into the ADAMRecords themselves.

val dictReads1 : RDD[ADAMRecord] = sc.adamLoad(args.input1Path, projection=Some(dictProjection))
val dictReads2 : RDD[ADAMRecord] = sc.adamLoad(args.input2Path, projection=Some(dictProjection))

val dict1 = SequenceDictionary(dictReads1.collect().map(SequenceRecord(_)) : _*)
Member

Is there a way we can keep from using collect() here?

What if we combine the projections into a single projection and load each file once (instead of twice)? Then we can use the "aggregate" method on the input1 and input2 RDDs to create a SequenceDictionary. This will push the processing to the cluster and reduce the amount of data transferred to the driver (only the SequenceDictionaries will be returned).

Once you have the two SequenceDictionary objects, the rest of the code proceeds as you have it.

There are examples of using the aggregate method in a few places in the ADAM code for reference (e.g. FlagStat). One nice property of the aggregate method is that you can reuse the "left" object as you fold in the right, which saves on object creation and improves performance (the FlagStat command doesn't use this feature right now).
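To sketch what I mean (SequenceDictionary.empty and the + / ++ operations here are my assumptions -- adapt to whatever your class actually provides, and combinedProjection is the merged projection described above):

val reads1: RDD[ADAMRecord] = sc.adamLoad(args.input1Path, projection = Some(combinedProjection))

// aggregate builds a dictionary per partition on the executors, so only the small
// SequenceDictionary objects travel back to the driver rather than the reads themselves
val dict1: SequenceDictionary = reads1.aggregate(SequenceDictionary.empty)(
  (dict, read) => dict + SequenceRecord(read), // fold one record into the running dictionary
  (left, right) => left ++ right               // merge the per-partition dictionaries
)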

@massie
Member

massie commented Nov 29, 2013

I really like how clean this patch is. Easy to read and well-formatted. Thanks for that!

On the phone we discussed converting your test BAMs to SAMs to drop into the git repo. I don't see them in the commit now, are you still planning on adding them later?

throw new IllegalArgumentException("If you're reading a BAM/SAM file, the record type must be ADAMRecord")
} else {
val projected : RDD[T] = adamParquetLoad(filePath, predicate, projection=Some(projection))
SequenceDictionary(projected.distinct().map(SequenceRecord.fromSpecificRecord(_)).collect() : _*)
Member

I left a comment on an old commit of yours (that I think github dropped?). In any case, you can ignore it now and I'll repeat it here.

It would be best to avoid calling "collect()" if possible, since it pulls all the data into the driver for processing. It's better to push the processing out to the executors. I would recommend using "aggregate" here, similarly to how we use "aggregate" in the FlagStat object. The aggregate call would return a SequenceDictionary. The aggregate method also allows you to reuse an object for better performance (which the current FlagStat doesn't -- we can process the flags from a whole genome in less than a minute, so I haven't taken the time to rewrite it yet).

  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using
   * given combine functions and a neutral "zero value". This function can return a different result
   * type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
   * and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
   * allowed to modify and return their first argument instead of creating a new U to avoid memory
   * allocation.
   */

It would be nice to test this functionality on a whole genome if you have a few lying around. If not, I might be able to.
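To make the reuse point concrete, a sketch only -- it assumes a mutating add method that returns this, and a mergeIn that folds the right dictionary into the left, neither of which exists yet under those names:

val dict: SequenceDictionary = projected.aggregate(new SequenceDictionary())(
  (dict, rec) => dict.add(SequenceRecord.fromSpecificRecord(rec)), // mutate and return the same dictionary
  (left, right) => left.mergeIn(right)                             // fold right into left, reusing left
)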

@massie
Member

massie commented Nov 29, 2013

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM/44/

@tdanford
Contributor Author

Matt, I think I've made the changes you're suggesting, including
(1) adding in those test SAM files (thanks for the reminder!)
(2) removing that extraneous import,
(3) re-ordering the referenceUrl and referenceLength fields to the end of the ADAMRecord schema, and
(4) rewriting the SequenceDictionary loading code to use aggregate rather than collect.

Definitely could use another pair of eyes on all those changes.

Also, item (1) still implies that there should be a test which uses those SAM files, and I haven't written that test yet. I'll work on that today.

@tdanford
Contributor Author

... rebased off the latest master. I'm going to add a test for CompareAdam, and then I think this might almost be ready to go?

@tdanford
Contributor Author

Okay, my list of changes that I think still need to be made is empty. Did I miss anything?

@tdanford
Contributor Author

tdanford commented Dec 1, 2013

I fixed some of the problems I noted above -- but in a second commit, so that those comments can still be read in their original setting. When we're ready to merge, let me know and I'll rebase this all down to one commit again first.

read1.getStart != read2.getStart
}.foreach {
case (readName, (read1, read2)) =>
println("%s => %s:%d:%d (%s) and %s:%d:%d (%s)".format(readName,
Member

Should we make printing this optional? If someone runs this against a whole genome, they'll get nearly a billion lines of output.

Two options for making it optional: (1) add a command-line option, or (2) if this is useful for debugging only, extend the "logging" class and use log.info("...") or log.debug("...").
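For example, if you go the logging route -- a sketch only, where mismatchedReads stands in for whatever RDD you end up iterating over:

// assumes the command mixes in the same Logging trait used elsewhere in ADAM
mismatchedReads.foreach {
  case (readName, (read1, read2)) =>
    log.debug("%s => %s:%d and %s:%d".format(readName,
      read1.getReferenceName, read1.getStart,
      read2.getReferenceName, read2.getStart))
}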

This is your feature so you know it better than I. If users need this output then please ignore this suggestion.

Contributor Author

Right now I'm the only user -- and I should make it clear, I'm planning on expanding the way that CompareAdam works in future pull requests. So my requirements are, for the moment, the only ones that I'm trying to satisfy; and I think you're right, this logging should definitely be either taken out or moved into a proper logging call. (I originally was wondering whether CompareAdam should actually just be a test.)

Member

Makes sense. I'm ok with leaving it as is. I just wanted to make sure that it was a debugging statement.

@massie
Member

massie commented Dec 2, 2013

I should add that I think we're really close to being able to commit this. Looking very good.


val countSamePosition = named1.join(named2).filter {
case (readName, (read1, read2)) =>
read1.getReadMapped == read2.getReadMapped &&
Contributor Author

I believe there's actually a bug in this predicate -- I'm missing a paren and getting the operator precedence/ordering wrong. Ugh.
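For reference, the grouping I intended is roughly the following -- not necessarily the final fix, and the reference comparison is a stand-in for however the remapped ids end up being compared:

val countSamePosition = named1.join(named2).filter {
  case (readName, (read1, read2)) =>
    // mapped flags must agree, and if both reads are mapped their positions must match
    (read1.getReadMapped == read2.getReadMapped) &&
      (!read1.getReadMapped ||
        (read1.getReferenceName == read2.getReferenceName &&
          read1.getStart == read2.getStart))
}.count()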

Member

Glad to hear that you've tracked down the issue. I'll keep an eye out for the fix. Once you have it, this work is ready to merge.

Contributor Author

The fix should be coming within the next hour.

Contributor Author

There's a deeper issue here, with the way I'm (not) handling paired reads. I need to think about this more, before I have a fix.

Member

Take all the time you need. If you need to bounce ideas off anyone, feel free to reach out.

For what it's worth, there's an ADAM RDD method, "adamSingleReadBuckets", which returns an RDD of "SingleReadBucket" objects. A SingleReadBucket holds all the single-end/paired-end reads and the primary and secondary alignments for a single read name; e.g. a mapped PE read would have two ADAMRecords in the primaryMapped array. Basically, it's all the information for a single read in a single object. Btw, I'd love any feedback on a better name for it.
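A rough usage sketch (method and field names from memory, so double-check them against the class; keying on the first primary alignment's read name is just one way to line the buckets up):

val buckets1 = rdd1.adamSingleReadBuckets().keyBy(_.primaryMapped.head.getReadName)
val buckets2 = rdd2.adamSingleReadBuckets().keyBy(_.primaryMapped.head.getReadName)

// after the join, each value holds every alignment for one read name in each file,
// so first-of-pair and second-of-pair reads can be compared without losing track of mates
val pairedBuckets = buckets1.join(buckets2)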

Contributor Author

Ah, thanks for pointing that out. I had seen it, but hadn't connected it to the need I was experiencing here -- but it's exactly what I want. (In re: naming, the only thing I could think of was s/bucket/cluster/, but that's a matter of taste perhaps.)

@tdanford
Contributor Author

tdanford commented Dec 2, 2013

Matt, after talking about it for a while, Carl and I decided that there's still an unaddressed issue in the code as it currently stands. We've flattened the sequence dictionary into the reference{Name, Id, Length, Url} set of fields in ADAMRecord, but (we think) we will need to flatten it into the mateReference{Name, Id, Length, Url} fields as well.

Does that make sense? Otherwise, you could have a BAM with a sequence dictionary containing two sequences -- and if there were a sequence to which only the mate of every paired read aligned, you'd lose sight of that sequence when converting from BAM to ADAM.

So I'm going to change the request to flatten the dictionary into both the reference and mateReference fields of ADAMRecord, and then reconstruct it from these pairs downstream. Let me know if that doesn't make sense, or if you think there's a better way to do it.
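Concretely, the downstream reconstruction would look roughly like this (a sketch only -- fromReference and fromMateReference are illustrative names, and unmapped mates are ignored for brevity):

val dict: SequenceDictionary = reads.aggregate(SequenceDictionary.empty)(
  // fold both the read's own reference fields and its mate's reference fields into the
  // dictionary, so a contig that only mates align to isn't lost in the conversion
  (d, read) => d + SequenceRecord.fromReference(read) + SequenceRecord.fromMateReference(read),
  (left, right) => left ++ right
)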

(1) Adding referenceLength and referenceUrl to the ADAMRecord schema.

ADAMRecord now contains a (flattened) version of the sequence dictionary element
corresponding to each read.

An entire sequence dictionary can be reconstructed from the records associated with a
single BAM or SAM file.

(2) Adding the SequenceDictionary class

Aggregates a consistent set of SequenceRecord objects into a single dictionary.

SequenceRecords are derived from the reference{Id, Name, Length, Url} fields of the
ADAMRecord.

Most of the work here goes into providing methods for combining and mapping between
sequence dictionaries, which will be used for cross-BAM or cross-ADAM comparisons.

(3) Extending Bam2Adam to flatten the sequence dictionary into each ADAMRecord

When bam2adam is run on a BAM, we extract the sequence dictionary from the BAM's header
and insert its values into each ADAMRecord.

Also updated the adamBamLoad method in ADAMContext to do something similar, even without
the explicit parquet file.

(4) Adding the ListDict command.

ListDict reads a SequenceDictionary out of a set of ADAMRecords and prints out all the
component SequenceRecord objects contained in it.

(5) Adding the CompareAdam command

CompareAdam is a command that takes two ADAM files as arguments.

It assumes that the same read names are present in *both* input ADAM files --
for example, if the two files were produced by different alignment tools run
on the same set of reads.

It calculates the number of reads that are unique to each ADAM file, as well as
the number of shared reads which have the same alignment position across the two
ADAM files.

This position is computed irrespective of the ordering of the Sequences in the
sequence dictionaries of the two ADAM files, and therefore is a useful end-to-end
test of the SequenceDictionary code, as well as being the basis for more useful
pairwise comparisons of ADAM files in the future.

(6) Extending ADAMContext with methods to automate the loading of SequenceDictionary
objects out of any conformant Avro schema.
@tdanford
Contributor Author

tdanford commented Dec 2, 2013

Matt, the latest commit should fix the issue I noted above -- it should also handle the comparison of your mouse example BAM, as well as the example SAMs I had posted before.

@tdanford
Contributor Author

tdanford commented Dec 3, 2013

Once again, I'll rebase this down to a single commit before merging.

@massie
Member

massie commented Dec 3, 2013

Thanks, Timothy and Carl. I've manually merged this to master now (rebasing to a single commit).

@massie closed this Dec 3, 2013
@tdanford deleted the sequence-dictionary branch December 3, 2013 15:25
ryan-williams added a commit to ryan-williams/adam that referenced this pull request Feb 5, 2017
merge upstream, upgrade hammerlab deps