Fix read groups mapping and add Y as base type #48
Conversation
All automated tests passed.
@@ -22,13 +22,9 @@ import org.apache.spark.rdd.RDD
// this class is required, not just standard. Baked in to recalibration.
class QualByRG(rdd: RDD[ADAMRecord]) extends Serializable {
  // need to get the unique read groups todo --- this is surprisingly slow
  //val readGroups = rdd.map(_.getRecordGroupId.toString).distinct().collect().sorted.zipWithIndex.toMap
  var readGroups = Map[String, Int]()
  val readGroups = rdd.map(_.getRecordGroupId.toString).distinct().collect().sorted.zipWithIndex.toMap
I was curious if you need to rerun the sort. Doesn't distinct already sort the collection?
For readability, can we split this to several lines? This line is a bit long.
All automated tests passed.
val readGroups = rdd.map(_.getRecordGroupId.toString)
  .distinct()
  .collect()
  .sorted.zipWithIndex.toMap
Is the sort necessary?
No, not particularly, at least not here. It was there before, so I left it.
It would allow us to build the same map every time, so if the RecalTable was ever saved and we wanted to recreate the readgroups from it, having them in sorted order would allow us to do that.
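That determinism point can be sketched in plain Scala (hypothetical read-group names, with a `Seq` standing in for the collected RDD output): `distinct` makes no ordering guarantee, so sorting before `zipWithIndex` is what makes the name-to-index assignment reproducible across runs.

```scala
// Two hypothetical arrival orders of the same read groups, e.g. from
// different runs or workers; distinct() guarantees no particular order.
val runA = Seq("SRR062635", "SRR062634", "SRR062641")
val runB = Seq("SRR062634", "SRR062641", "SRR062635")

// Without sorting, each name's index depends on arrival order.
val unsortedA = runA.zipWithIndex.toMap // SRR062634 -> 1
val unsortedB = runB.zipWithIndex.toMap // SRR062634 -> 0

// Sorting first yields the same map regardless of input order.
assert(runA.sorted.zipWithIndex.toMap == runB.sorted.zipWithIndex.toMap)
```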
Taking ~30sec on a small BAM file is a bit worrisome since that implies it will take >10 minutes on a whole genome.
Maybe a better solution here is to:
- Rename RecordGroupId to RecordGroupName
- Have the RecordGroupId be an Int instead of the String type
- Use the BAM header to map the RecordGroupName:String -> RecordGroupId:Int
- Write the new RecordGroupId:Int for each ADAMRecord (RLE will make this very inexpensive in terms of storage)
With these changes, there will be no need to generate the intermediate map because the RecordGroupId:Int would be available for all records.
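A minimal sketch of steps 3 and 4, under the assumption that the @RG ID strings have already been pulled out of the BAM header (the record type below is a toy stand-in, not the actual ADAMRecord schema; the real change would live in the converter):

```scala
// Assumed input: @RG ID strings from the BAM header (hypothetical values).
val headerReadGroups = Seq("SRR062635", "SRR062634", "SRR062641")

// Step 3: a single, fixed RecordGroupName -> RecordGroupId map, built once
// from the header rather than from a pass over the records.
val readGroupIds: Map[String, Int] =
  headerReadGroups.sorted.zipWithIndex.toMap

// Step 4: stamp the Int id onto each record at conversion time.
case class AdamRec(recordGroupName: String, recordGroupId: Int)
def convert(rgName: String): AdamRec = AdamRec(rgName, readGroupIds(rgName))

assert(convert("SRR062634").recordGroupId == 0) // sorted order: 634, 635, 641
```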
Matt, this is starting to feel (to me, and to Carl) a little like it might be isomorphic to the problem we think we've solved in the SequenceDictionary work.
Is there any chance that we should be looking at sharing some code, or at least a general approach, to that situation? Just wondering out loud here, that's a real (not rhetorical) question.
@tdanford Yes, I think it is, but much more lightweight. I put together something that mirrors a piece of what you did there, though not as expansive, since we're already pushing all the record group information separately into the ADAM file. I'll commit it shortly; my question (and what I'm testing out) is: does changing the type of recordGroupId break already-created ADAM files?
Another case I'd like us to keep in mind is the "fragmented BAM" case -- I'm not sure if your changes here, or the ones you're about to commit, are able to handle it or not but I thought I'd spell it out just in case.
I suspect most of us are used to dealing with "unitary" BAMs, where all the reads (and read groups) for a single sample are in a single BAM. Maybe I'm wrong on that assumption?
But one of the things Carl and I are starting to understand is that a lot of "custom" processing pipelines, which we (at least, I) would like to interact with using ADAM, are producing fragmented BAMs where different BAMs hold different read group data (and even cases where the BAMs are sharded from completely shuffled data, and there may be read groups that are scattered across multiple BAMs). For the pipeline comparison and validation work we're doing (again, work that still needs to show up here soon as a pull request), we pay attention to whether (e.g.) the Sequence Dictionaries are the same across the fragments and we use the SequenceDictionary code to unify them when they're not, as we build RDD[ADAMRecords] that are really unions of multiple underlying ADAM directories.
Just wanted to put that out there, because if we're looking at building a similar structure for read groups, we might need to think about this situation too. Just want to be sure that we're not implicitly assuming that (say) read groups with the same name will have the same ID across different ADAM files, etc.
I'd have to think about it more, but I believe the current setup would work (as it creates the readGroup id mapping on the fly from the given ADAM records), but the new change would not, since it fixes a mapping based on what's in the header, and different ADAM records could have overlapping ids from different read groups.
But my question is: right now this is being used in BQSR to optimize the way we store counts per read group. Is BQSR a well-defined operation across multiple BAMs/ADAM directories?
Yeah, I don't see why it wouldn't be. You run BQSR over a single ADAM directory (we need a different name for this, maybe "ADAM RDD" or "ADAM set"?) and you get a particular model and a particular recalibration of the QSes.
Now, if you arbitrarily split the BAM N ways, and then recombined them before running BQSR, I don't see why that should affect the recalibration.
I don't have any reason to think that your solution here doesn't work, I just was wondering out loud. It seems like one of these kinds of requirements that Carl and I have been coming to grips with, and I wasn't sure if it had already been part of everyone else's thinking or if it deserved a space to be made more explicit.
Changing the type of a schema field will definitely break backward compatibility.
I gave @tdanford misinformation about schema compatibility before. To keep the schema compatible, make sure the same field is not defined with different types or incompatible repetition (meaning the field needs to stay at the same nesting level).
Since ADAM is still such a new project, we should feel free to break things to make them better for the long term. Once we mature and are used more broadly in production, we won't have that luxury.
I agree that we're starting to see a theme emerging here about how to best (de-)normalize the data. On the ADAM mailing list, I sent the following:
We could address this limitation by creating a separate sequence dictionary file during the conversion from BAM to ADAM. I think that Timothy and Carl at GenomeBridge would have a more informed idea of how to manage sequence dictionaries than I do. As far as I see it, we can: (1) keep records fully de-normalized, (2) add a sequence dictionary file separate from the de-normalized records with its own schema, (3) switch to normalized records with a separate sequence dictionary, or (4) have a separate dictionary and let users decide whether to normalize records.
Which approach do you guys like best?
But it was buried deep inside a thread.
My preference is for (1) right now but I'm always happy to be told I'm wrong. Even if we keep everything de-normalized, we'll still be the same size as BAM. I don't think that compression is as important as correctness, performance and ease-of-use. If compression becomes an issue, it would be easy to add an ADAM "archive" format later that normalizes all data and uses maximum compression for the most compact size.
Re: performance. Having a unique "RecordGroupId" will make the Mark Duplicates code much faster (not that it's very slow now). Instead of doing a groupBy on Strings, we can do a groupBy on Ints. Any other transforms that group by record group will also be much faster (e.g. BQSR).
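A toy illustration of that groupBy change (plain Scala collections standing in for the RDD; the record type and field names here are stand-ins, not the actual ADAMRecord schema): grouping on the Int id produces the same groups as grouping on the String name, while avoiding String hashing and comparison for every record.

```scala
// Toy stand-in for ADAMRecord with both the Int id and the String name.
case class Rec(recordGroupId: Int, recordGroupName: String, readName: String)

val recs = Seq(
  Rec(0, "SRR062634", "read1"),
  Rec(1, "SRR062635", "read2"),
  Rec(0, "SRR062634", "read3"))

// Cheap Int keys vs. more expensive String keys; same groups either way.
val byId   = recs.groupBy(_.recordGroupId)
val byName = recs.groupBy(_.recordGroupName)
assert(byId(0).map(_.readName) == byName("SRR062634").map(_.readName))
```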
In other words, the order of the fields in the schema isn't important; the name, type, and nesting of each field is.
I checked in the commit just to show what I was planning, but this does not address Timothy's point around multiple fragmented ADAM sets, since each would have its own read group mapping and those would be incompatible. Right now BQSR depends on sequential ids from 0 to the number of read groups, but we can change that as well.
All automated tests passed.
All automated tests passed.
Assuming that @tdanford agrees with this approach, I think this is ready to merge. Really clean commit @arahuja, I love it. One small thing: it appears that this will not cleanly merge. Is it possible for you to rebase this commit onto the latest master? If you don't feel like doing it, I'm happy to do it for you.
@@ -126,7 +126,8 @@ class SAMRecordConverter extends Serializable {
case None =>
}
recordGroup.getId
builder.setRecordGroupId(recordGroup.getReadGroupId)
builder.setRecordGroupId(readGroups(recordGroup.getReadGroupId))
  .setRecordGroupName(recordGroup.getReadGroupId)
I think this isn't correct. Shouldn't this be
.setRecordGroupName(recordGroup.getReadGroupName)
You might want to do a "git grep setRecordGroupName" on the ADAM repo to make sure that this is correct elsewhere too.
There is no .getRecordGroupName; maybe I'm not understanding the available fields, but recordGroup.getRecordGroupId currently returns things of the form SRR062634. Is there something that should be referred to as the name? Or should we leave Id as-is and have a new field that is ... something named better?
My mistake. I thought that was being pulled from an ADAMRecord (which has a recordGroupName).
This does raise a question about naming. In general, an ID tends to imply an Integer type whereas a NAME tends to imply a string. I think it's better to be consistent with the other fields in the ADAM schema (e.g. referenceName/referenceId, mateReferenceName/mateReferenceId, etc). It's just something we'll have to remember moving forward.
Sorry for the comment noise.
Meaning: let's keep things as they are. *Id is the Int and *Name is the String and understand that the BAM specification isn't as consistent on this.
All automated tests passed.
@tdanford had some comments earlier which I can't seem to find now after the rebase, but I moved around the constructor w/ the Picard dependency and pulled out the mdtag change into another pull request. On the issue of fragmented ADAM processing, this doesn't address that. Right now BQSR is the only use case for readGroupId (the int version), but, for example, BQSR on multiple ADAM dirs would not give the correct result. This needs to be addressed by either (1) saving more information on the mapping up front (not sure exactly what this would look like) or (2) probably reworking BQSR a bit.
Arun, I think they were over here: #48 (comment) Matt, Carl and I are reviewing now...
Having the readGroupId (the int version) is also very helpful for the Mark Duplicates code. Any code that needs to group by record group will benefit from your work, Arun.
Carl and I are cool with this request and think (as long as it's rebased off of master and the tests are passing) it's ready to merge. I additionally have two requests --
Do those points make sense? If they do, someone should make them into issues so that we can keep track of them in the future...
Makes sense - though (1) might be many issues. Curious to know if there are other areas you have seen lacking support for fragmented BAMs? Agree on (2) as well; happy to refactor around that once you have the changes in. Looking at the diff, I seem to have inadvertently botched MdTag though ... let me fix that.
All automated tests passed.
Can you rebase off the latest master, Arun?
Sure, just did, unless you are seeing issues?
Jenkins, test this please. |
All automated tests passed.
Parse read groups and create mapping from bam/sam header
Thanks, Arun! Completely agree with you @tdanford that we'll likely have code overlap between the SequenceDictionary and RecordGroup work moving forward. Not sure, though, what exact issues we should create to track it. I (and I'm sure others) will definitely be keeping an eye out for it moving forward.
@massie I guess I don't know how to turn my (1) into an issue -- you're right, we'll just have to keep an eye on it -- but I took a shot at (2). Let me know what you think.
The current readGroups map is broken in that individual workers develop their own recordGroupId -> ID mapping, which leads to incorrect aggregation.
Example:
Worker 1:
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062635 --> 0
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062634 --> 1
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062641 --> 2
Worker 2:
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062635 --> 0
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062641 --> 1
2014-01-14 17:29:14 INFO QualByRG:33 - QUALBRG : RG : SRR062634 --> 2
There was a note that the .distinct().collect() operation was slow, but we saw it finish in ~30 secs (on HG00096, a 16GB file, w/ 96 worker tasks) to generate a proper readGroup map.
Also added Y as a base that appears in MD tags; examples: MD:Z:96Y3, MD:Z:27Y72, MD:Z:25Y74.
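For reference, a small tokenizer sketch showing how an MD tag containing an IUPAC ambiguity code like Y (pyrimidine: C or T) splits into match lengths, mismatch bases, and deletion runs. The regex below illustrates the grammar; it is not ADAM's actual MdTag implementation.

```scala
// MD tag grammar (SAM spec): [0-9]+ ( ( [A-Z] | \^[A-Z]+ ) [0-9]+ )*
// Mismatch bases may be IUPAC codes such as Y, so the character class
// must accept more than just A, C, G, T, and N.
val mdToken = "([0-9]+)|(\\^[A-Za-z]+)|([A-Za-z])".r

// Split an MD string into its tokens: numbers, single mismatch bases,
// and ^-prefixed deletion runs.
def tokenize(md: String): List[String] = mdToken.findAllIn(md).toList

assert(tokenize("96Y3")  == List("96", "Y", "3"))
assert(tokenize("27Y72") == List("27", "Y", "72"))
```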