
Fix read groups mapping and add Y as base type #48

Merged: 1 commit merged into bigdatagenomics:master on Jan 22, 2014

Conversation

@arahuja (Contributor) commented Jan 15, 2014

The current readGroups map is broken: individual workers develop their own recordGroupId -> ID mapping, which leads to incorrect aggregation.

Example:

Worker 1:
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062635 --> 0
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062634 --> 1
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062641 --> 2

Worker 2:

2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062635 --> 0
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062641 --> 1
2014-01-14 17:29:14 INFO QualByRG:33 - QUALBRG : RG : SRR062634 --> 2

There was a note that the .distinct().collect() operation was slow, but we saw it finish in ~30 seconds (on HG00096, a 16GB file, with 96 worker tasks) and generate a proper readGroup map.

Also added Y as a base that appears in MD tags; examples: MD:Z:96Y3, MD:Z:27Y72, MD:Z:25Y74.
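
For illustration, a simplified sketch of the failure mode (the helper below is hypothetical, not the actual ADAM code): with a mutable per-JVM map, each worker assigns indices in whatever order it first encounters the read groups.

var readGroups = Map[String, Int]()

// Hypothetical lazy assignment: the index a read group receives depends on the
// order in which this particular worker first sees it, so different workers
// can end up with different String -> Int mappings for the same read group.
def readGroupIndex(recordGroupId: String): Int = {
  if (!readGroups.contains(recordGroupId)) {
    readGroups += (recordGroupId -> readGroups.size)
  }
  readGroups(recordGroupId)
}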

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/31/

@@ -22,13 +22,9 @@ import org.apache.spark.rdd.RDD
// this class is required, not just standard. Baked in to recalibration.
class QualByRG(rdd: RDD[ADAMRecord]) extends Serializable {
// need to get the unique read groups todo --- this is surprisingly slow
//val readGroups = rdd.map(_.getRecordGroupId.toString).distinct().collect().sorted.zipWithIndex.toMap
var readGroups = Map[String, Int]()
val readGroups = rdd.map(_.getRecordGroupId.toString).distinct().collect().sorted.zipWithIndex.toMap

I was curious if you need to rerun the sort. Doesn't distinct already sort the collection?

Member

For readability, can we split this into several lines? This line is a bit long.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/32/

val readGroups = rdd.map(_.getRecordGroupId.toString)
.distinct()
.collect()
.sorted.zipWithIndex.toMap
Member

Is the sort necessary?

Contributor Author

No, not particularly, at least not here. It was there before, so I left it.

It does let us build the same map every time, though: if the RecalTable were ever saved and we wanted to recreate the read groups from it, having them in sorted order would allow us to do that.
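
For illustration, using the read group names from the logs above: distinct() makes no ordering guarantee, so the sort is what makes the assignment reproducible.

val ids = Seq("SRR062641", "SRR062634", "SRR062635")
val readGroups = ids.distinct.sorted.zipWithIndex.toMap
// Map(SRR062634 -> 0, SRR062635 -> 1, SRR062641 -> 2), regardless of input order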

Member

Taking ~30sec on a small BAM file is a bit worrisome since that implies it will take >10 minutes on a whole genome.

Maybe a better solution here is to:

  1. Rename RecordGroupId to RecordGroupName
  2. Have the RecordGroupId be an Int instead of the String type
  3. Use the BAM header to map the RecordGroupName:String -> RecordGroupId:Int
  4. Write the new RecordGroupId:Int for each ADAMRecord (RLE will make this very inexpensive in terms of storage)

With these changes, there will be no need to generate the intermediate map because the RecordGroupId:Int would be available for all records.
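
A minimal sketch of step 3, assuming a Picard SAMFileHeader is available during SAM-to-ADAM conversion (the import path and helper name are illustrative, not the actual ADAM code):

import scala.collection.JavaConverters._
import net.sf.samtools.SAMFileHeader

// getReadGroups preserves header order, so the same header always yields the
// same String -> Int assignment, independent of which worker sees which reads.
def readGroupIdMap(header: SAMFileHeader): Map[String, Int] =
  header.getReadGroups.asScala
    .map(_.getReadGroupId)
    .zipWithIndex
    .toMap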

Contributor

Matt, this is starting to feel (to me, and to Carl) a little like it might be isomorphic to the problem we think we've solved in the SequenceDictionary work.

Is there any chance that we should be looking at sharing some code, or at least a general approach, to that situation? Just wondering out loud here, that's a real (not rhetorical) question.

Contributor Author

@tdanford Yes, I think it is, but much more lightweight. I put together something that mirrors a piece of what you did there, though not as expansive, since we are already pushing all the record group information separately into the ADAM file. I'll commit it shortly; my question (and what I'm testing out) is whether changing the type of recordGroupId breaks already-created ADAM files.

Contributor

Another case I'd like us to keep in mind is the "fragmented BAM" case -- I'm not sure if your changes here, or the ones you're about to commit, are able to handle it or not but I thought I'd spell it out just in case.

I suspect most of us are used to dealing with "unitary" BAMs, where all the reads (and read groups) for a single sample are in a single BAM. Maybe I'm wrong on that assumption?

But one of the things Carl and I are starting to understand is that a lot of "custom" processing pipelines, which we (at least, I) would like to interact with using ADAM, are producing fragmented BAMs where different BAMs hold different read group data (and even cases where the BAMs are sharded from completely shuffled data, and there may be read groups that are scattered across multiple BAMs). For the pipeline comparison and validation work we're doing (again, work that still needs to show up here soon as a pull request), we pay attention to whether (e.g.) the Sequence Dictionaries are the same across the fragments and we use the SequenceDictionary code to unify them when they're not, as we build RDD[ADAMRecords] that are really unions of multiple underlying ADAM directories.

Just wanted to put that out here, because if we're looking at building a similar structure for read groups, we might need to think about this situation too. Just want to be sure that we're not implicitly assuming that (say) read groups with the same name will have the same ID across different ADAM files, etc.

Contributor Author

I would have to think about it more, but I believe the current setup would work (as it creates the readGroup ID mapping on the fly from the given ADAM records), whereas the new change would not, since it fixes a mapping based on what's in the header, and different ADAM records could have overlapping IDs from different read groups.

But my question is: right now this is being used in BQSR to optimize the way we store counts per read group. Is BQSR a well-defined operation across multiple BAMs/ADAM directories?

Contributor

Yeah, I don't see why it wouldn't be. You run BQSR over a single ADAM directory (we need a different name for this, maybe "ADAM RDD" or "ADAM set"?) and you get a particular model and a particular recalibration of the QSes.

Now, if you arbitrarily split the BAM N ways, and then recombined them before running BQSR, I don't see why that should affect the recalibration.

I don't have any reason to think that your solution here doesn't work; I was just wondering out loud. It seems like one of those requirements that Carl and I have been coming to grips with, and I wasn't sure whether it had already been part of everyone else's thinking or whether it deserved a space to be made more explicit.

Member

Changing the type of a schema field will definitely break backward compatibility.

I gave @tdanford misinformation about schema compatibility before. To keep the schema compatible, we need to make sure the same field is not defined with different types or incompatible repetition (meaning the field needs to stay at the same nesting level).

Since ADAM is still such a new project, we should feel free to break things to make them better for the long term. Once we mature and are used more broadly in production, we won't have that luxury.

I agree that we're starting to see a theme emerging here about how to best (de-)normalize the data. On the ADAM mailing list, I sent the following:

We could address this limitation by creating a separate sequence dictionary file during the conversion from BAM to ADAM. I think that Timothy and Carl at GenomeBridge would have a more informed idea of how to manage sequence dictionaries than I do. As far as I see it, we can: (1) keep records fully de-normalized, (2) add a sequence dictionary file separate from the de-normalized records with its own schema, (3) switch to normalized records with a separate sequence dictionary, or (4) have a separate dictionary and let users decide whether to normalize records. Which approach do you guys like best?

But it was buried deep inside a thread.

My preference is for (1) right now but I'm always happy to be told I'm wrong. Even if we keep everything de-normalized, we'll still be the same size as BAM. I don't think that compression is as important as correctness, performance and ease-of-use. If compression becomes an issue, it would be easy to add an ADAM "archive" format later that normalizes all data and uses maximum compression for the most compact size.

Re: performance. Having a unique "RecordGroupId" will make the Mark Duplicates code much faster (not that it's very slow now). Instead of doing a groupBy on Strings, we can do a groupBy on Ints. Any other transforms that group by record group will also be much faster (e.g. BQSR).
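
A rough illustration of the grouping point (a sketch only; records stands for an RDD[ADAMRecord], and the getter names are assumed from the schema discussed here):

// Grouping on the Int recordGroupId avoids hashing, comparing and shuffling
// the full read group name String for every record.
val byName = records.groupBy(_.getRecordGroupName) // String keys
val byId   = records.groupBy(_.getRecordGroupId)   // Int keys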

Member

In other words, the order of the schema isn't important. The name, type, and nesting of each field is.

@arahuja (Contributor, Author) commented Jan 16, 2014

I checked in the commit just to show what I was planning, but this does not address Timothy's point about multiple fragmented ADAM sets, since each would have its own read group mapping and those would be incompatible. Right now BQSR depends on sequential IDs from 0 to the number of read groups, but we can change that as well.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/39/

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/40/

@massie (Member) commented Jan 16, 2014

Assuming that @tdanford agrees with this approach, I think this is ready to merge. Really clean commit @arahuja, I love it.

One small thing: it appears that this will not merge cleanly. Is it possible for you to rebase this commit onto the latest master? If you don't feel like doing it, I'm happy to do it for you.

@@ -126,7 +126,8 @@ class SAMRecordConverter extends Serializable {
case None =>
}
recordGroup.getId
builder.setRecordGroupId(recordGroup.getReadGroupId)
builder.setRecordGroupId(readGroups(recordGroup.getReadGroupId))
.setRecordGroupName(recordGroup.getReadGroupId)
Member

I think this isn't correct. Shouldn't this be

.setRecordGroupName(recordGroup.getReadGroupName)

Member

You might want to do a "git grep setRecordGroupName" on the ADAM repo to make sure that this is correct elsewhere too.

Contributor Author

There is no .getRecordGroupName; maybe I'm not understanding the available fields, but recordGroup.getRecordGroupId currently returns things of the form SRR062634. Is there something that should be referred to as the name? Or should we leave Id as is and have a new field that is ... something named better?

Member

My mistake. I thought that was being pulled from an ADAMRecord (which has a recordGroupName).

This does raise a question about naming. In general, an ID tends to imply an Integer type whereas a NAME tends to imply a string. I think it's better to be consistent with the other fields in the ADAM schema (e.g. referenceName/referenceId, mateReferenceName/mateReferenceId, etc). It's just something we'll have to remember moving forward.

Sorry for the comment noise.

Member

Meaning: let's keep things as they are. *Id is the Int and *Name is the String, and we understand that the BAM specification isn't as consistent on this.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/42/

@massie (Member) commented Jan 21, 2014

This pull request looks good to me, but I want a second pair of eyes on it. @tdanford ? @carlyeks ? This is somewhat related to your sequence dictionary work, so I want to make sure it looks OK to you.

@arahuja (Contributor, Author) commented Jan 21, 2014

@tdanford had some comments earlier which I can't seem to find now after the rebase, but I moved around the constructor with the Picard dependency and pulled the mdtag change out into another pull request.

On the issue of fragmented ADAM processing: this doesn't address that. Right now BQSR is the only use case for readGroupId (the Int version), but, for example, BQSR over multiple ADAM directories would not give the correct result. This needs to be addressed by either (1) saving more information about the mapping up front (not sure exactly what this would look like) or (2) probably reworking BQSR a bit.

@tdanford (Contributor)

Arun, I think they were over here: #48 (comment)

Matt, Carl and I are reviewing now...

@massie (Member) commented Jan 21, 2014

Having the readGroupId (the Int version) is also very helpful for the Mark Duplicates code. Any code that needs to group by record group will benefit from your work, Arun.

@tdanford (Contributor)

Carl and I are cool with this request and think (as long as it's rebased off of master and the tests are passing) it's ready to merge.

I additionally have two requests --

  1. I think that we should focus on supporting fragmented BAMs for this kind of code sooner rather than later, and
  2. Carl and I are still working on a pull request which supports merging multiple ADAM files into a single RDD[ADAMRecord] with one, coherent SequenceDictionary. I suspect that we'll have to write similar code for RecordGroupDictionary and, when we do, I think it'll become apparent that there's a lot of logic shared between the two. We'll want to compare and probably pull out common code as a single parent class that both can inherit.

Do those points make sense? If they do, someone should make them into issues so that we can keep track of them in the future...

@arahuja (Contributor, Author) commented Jan 21, 2014

Makes sense, though (1) might be many issues. Curious to know whether there are other areas where you have seen support for fragmented BAMs lacking?

Agree on (2) as well, happy to refactor around that once you have the changes in.

Looking at the diff I seem to have inadvertently botched MdTag though ... let me fix that.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/45/

@tdanford (Contributor)

Can you rebase off the latest master, Arun?

@arahuja (Contributor, Author) commented Jan 21, 2014

Sure, just did, unless you are seeing issues?

@massie (Member) commented Jan 22, 2014

Jenkins, test this please.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/46/

massie added a commit that referenced this pull request Jan 22, 2014
Parse read groups and create mapping from bam/sam header
@massie massie merged commit 23446dd into bigdatagenomics:master Jan 22, 2014
@massie (Member) commented Jan 22, 2014

Thanks, Arun!

Completely agree with you @tdanford that we'll likely have code overlap between the SequenceDictionary and RecordGroup work moving forward. Not sure, though, what exact issues we should create to track it. I (and I'm sure others) will definitely be keeping an eye out for it.

@tdanford (Contributor)

@massie I guess I don't know how to turn my (1) into an issue -- you're right, we'll just have to keep an eye on it -- but I took a shot at (2). Let me know what you think.

@arahuja arahuja deleted the readgroupsmapping branch February 23, 2014 03:04