Don't flatten optional SAM tags into a string #240
Comments
Well, we've already moved a number of the technically "optional" fields in the SAM/BAM format into the ADAMRecord itself -- e.g. PU, LB, RG, MD -- for reasons probably having mostly to do with performance. So I think there's still a discussion to be had here. Personally, I think the fields that should most obviously be made "top-level" are those which will be repeated, identically, many times across the ADAMRecords, such as RG, PU, and LB. These'll be the values which Parquet will do the best with, compression-wise... |
Do people still think this is important? Is the Map[String,String] approach something that we want, due to incompatibility with Shark? |
I would prefer the Map[String, String] approach @iskandr suggested over the current approach. |
So we're ok moving from a flat to nested schema for the ADAMRecord? I'm ok with that if everyone else is. |
@massie The ADAMRecord schema is already nested via the inclusion of ADAMContig. |
Want me to take this one? |
Yes, thanks. On Saturday, January 10, 2015, calvertj notifications@github.com wrote:
|
+1, that'd be wonderful. |
Cool, working on it. |
Sorry for the delay, will work on it this weekend. |
I'm done shoveling, back in progress. |
Hey guys (@tdanford, @fnothaft), from my (limited) understanding of the SAM Format specification ( http://samtools.github.io/hts-specs/SAMv1.pdf ), there can be multiple lines with the same tag. Does that seem correct to you? A Map[String,String] could have collisions, and I am guessing you don't want to throw an exception if there are collisions? There are plenty of options: Also, are you sure you don't want to keep the type information that was so nicely parsed out? I am looking at this bit of code in org.bdgenomics.adam.converters.SAMRecordConverter: adam/adam-core/src/main/scala/org/bdgenomics/adam/converters/SAMRecordConverter.scala Line 165 in cd10066
I assume these are the tags I'm looking for. Thanks and sorry for the delay, J |
I wrote some version of this code back in the day, I remember thinking, "we should probably be keeping the type information around." It's required at least insofar as we will ever need to go back to BAM/SAM format. Also, J, I'm confused by what your question is about tag-uniqueness. A TAG (e.g. "RG" or "MD" etc) can definitely be repeated across lines -- for example, every record for every alignment of a read from a single lane of an Illumina sequencer will typically have an RG (= "Read Group") tag and the same value for that tag. On the other hand, the spec says: "Each TAG can only appear once in one alignment line." -- so I interpret that as saying that we won't ever find two "RG" tags (for instance) on the same line. |
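To make the two points above concrete (keeping the type so a record can go back to SAM/BAM, and each TAG appearing at most once per alignment line), here is a minimal sketch. It assumes the tags arrive in the SAM text layout `TAG:TYPE:VALUE`; `TypedTagSketch`, `TypedTag`, `parseField`, `parseLineTags`, and `toSamField` are hypothetical names for illustration, not ADAM or htsjdk API.

```scala
object TypedTagSketch {
  // Hypothetical container for illustration; not an ADAM or htsjdk class.
  final case class TypedTag(tag: String, tagType: Char, value: String)

  // Parse one optional field written in the SAM text layout TAG:TYPE:VALUE.
  // The split limit of 3 keeps any ':' characters inside the value intact.
  def parseField(field: String): TypedTag =
    field.split(":", 3) match {
      case Array(tag, tpe, value) if tag.length == 2 && tpe.length == 1 =>
        TypedTag(tag, tpe.head, value)
      case _ =>
        throw new IllegalArgumentException(s"Malformed optional field: $field")
    }

  // Build a per-line map keyed by TAG. Since a TAG may appear only once per
  // alignment line, a repeated key is treated here as malformed input.
  def parseLineTags(fields: Seq[String]): Map[String, TypedTag] = {
    val parsed = fields.map(parseField)
    val duplicates = parsed.groupBy(_.tag).collect { case (t, ts) if ts.size > 1 => t }
    require(duplicates.isEmpty, s"Repeated tags on one line: ${duplicates.mkString(", ")}")
    parsed.map(t => t.tag -> t).toMap
  }

  // Because the type character is retained, the SAM text form can be rebuilt.
  def toSamField(t: TypedTag): String = s"${t.tag}:${t.tagType}:${t.value}"
}
```

Rejecting a repeated tag is only one possible policy; a converter could equally keep the first or last occurrence. The same RG value recurring on many different lines is unaffected, since the map is built per line.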
Oops, I was thinking the SAMRecord was an iterator over the file, not a line. I guess I should not watch House of Cards while looking at code. As for what to store it as, are you sure you don't just want to keep the List[Attribute]?
Finally, should I change the attributes in the format or create a samTags or something? Thanks, J |
Jason, I'm confused by what you mean, when you write
|
Currently the List[Attribute] is turned into a tab-delimited string: adam/adam-core/src/main/scala/org/bdgenomics/adam/converters/SAMRecordConverter.scala Line 165 in cd10066
At least I think I am looking at the right bit of code. Are these the "tags" I am looking to update? [Edit] You don't have to give me an answer to the "why not"; just let me know what you wish to do. J |
Well, I am just going to use a Map[String,String] as we won't need a custom record in Avro. Path of least resistance. So is something like this ok? |
Jason, I think it sounds fine -- but we'll probably need to see the code :-) Why don't you whip something up, and we'll review it in detail there? |
Yea, less talk, more action. |
SGTM! |
Yeah, that sounds like the right direction -- but again, eager to see code :-) |
Closing as won't fix. See #1080 for most recent discussion. |
When converting from a SAM record to an ADAM record, we flatten the already parsed collection of tags into a string. This incurs a (seemingly significant) memory/performance overhead and leaves the tag info in a less accessible form. It would be preferable to simply store the tags in a Map[String,String]. I think this is also a simpler solution to #37, since we don't need to figure out which tags deserve "top-level" inclusion ahead of time, just present them in a uniform manner.
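As a rough illustration of the accessibility difference described above, the sketch below compares looking up a tag in a flattened string with looking it up in a Map[String,String]. It assumes the flattened form is a tab-delimited run of `TAG:TYPE:VALUE` fields, as in the SAM text format; `TagMapSketch`, `lookupFlattened`, and `toTagMap` are hypothetical names, not ADAM API.

```scala
object TagMapSketch {
  // Flattened form: every lookup has to re-split and scan the whole string.
  def lookupFlattened(attributes: String, tag: String): Option[String] =
    attributes.split("\t").iterator
      .map(_.split(":", 3))
      .collectFirst { case Array(`tag`, _, value) => value }

  // Map form: split once at conversion time, then look tags up by key.
  // Note that the TYPE character is dropped in a plain Map[String, String].
  def toTagMap(attributes: String): Map[String, String] =
    attributes.split("\t").iterator
      .map(_.split(":", 3))
      .collect { case Array(t, _, value) => t -> value }
      .toMap
}
```

For example, `TagMapSketch.toTagMap("RG:Z:lane1\tNM:i:2").get("NM")` yields the NM value directly, whereas the flattened form has to be re-parsed on each access.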