Don't flatten optional SAM tags into a string #240

Closed
iskandr opened this issue May 9, 2014 · 22 comments

Comments

@iskandr

iskandr commented May 9, 2014

When converting from a SAM record to an ADAM record, we flatten the already parsed collection of tags into a string. This incurs a (seemingly significant) memory/performance overhead and leaves the tag info in a less accessible form. It would be preferable to simply store the tags in a Map[String,String]. I think this is also a simpler solution to #37, since we don't need to figure out ahead of time which tags deserve "top-level" inclusion; we just present them all in a uniform manner.
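
To make the trade-off concrete, here is a minimal sketch contrasting the two representations. The `Tag` case class and all function names are hypothetical stand-ins for illustration, not ADAM's actual converter code, and this Map form keeps only the value, not the SAM type code.

```scala
object FlattenVsMap {
  // Hypothetical, simplified stand-in for an already-parsed SAM optional field.
  final case class Tag(name: String, tagType: Char, value: String)

  // Flattened form: tab-delimited TAG:TYPE:VALUE triples, as in SAM text.
  def flatten(tags: Seq[Tag]): String =
    tags.map(t => s"${t.name}:${t.tagType}:${t.value}").mkString("\t")

  // Looking a tag up in the flattened form means re-splitting the string.
  def lookupFlattened(flat: String, name: String): Option[String] =
    flat.split("\t").iterator
      .map(_.split(":", 3))
      .collectFirst { case Array(n, _, v) if n == name => v }

  // Map form: one entry per tag name, direct lookup by key.
  def toMap(tags: Seq[Tag]): Map[String, String] =
    tags.map(t => t.name -> t.value).toMap
}
```

With the Map form, a lookup is just `FlattenVsMap.toMap(tags).get("MD")`, whereas the flattened form has to be parsed again on every access.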

@tdanford
Contributor

Well, we've already moved a number of the technically "optional" fields in the SAM/BAM format into the ADAMRecord itself -- e.g. PU, LB, RG, MD -- for reasons probably having mostly to do with performance. So I think there's still a discussion to be had here.

Personally, I think the fields that should most obviously be made "top-level" are those which will be repeated, identically, many times across the ADAMRecords, such as RG, PU, and LB. These'll be the values which Parquet will do the best with, compression-wise...
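
As a rough illustration of why identically repeated values compress well in a columnar store, here is a toy dictionary encoding of a single column. It only sketches the general idea; it is not Parquet's actual implementation, and the object and function names are made up for this example.

```scala
object DictionarySketch {
  // Encode a column as (distinct values, one small integer index per row).
  def encode(column: Seq[String]): (Vector[String], Vector[Int]) = {
    val dictionary = column.distinct.toVector      // first-occurrence order
    val indexOf = dictionary.zipWithIndex.toMap    // value -> dictionary index
    (dictionary, column.map(indexOf).toVector)
  }
}
```

A column holding the same read group for every record reduces to a one-entry dictionary plus a run of identical indices, which is the case where a top-level field pays off most.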

@karenfeng
Contributor

Do people still think this is important? Is the Map[String,String] approach still something that we want, given the incompatibility with Shark?

@fnothaft
Member

I would prefer the Map[String, String] approach @iskandr suggested over the current approach.

@massie
Member

massie commented Jun 18, 2014

So we're ok moving from a flat to nested schema for the ADAMRecord? I'm ok with that if everyone else is.

@fnothaft
Member

@massie The ADAMRecord schema is already nested via the inclusion of ADAMContig.

@calvertj
Contributor

Want me to take this one?

@karenfeng
Contributor

Yes, thanks.

On Saturday, January 10, 2015, calvertj notifications@github.com wrote:

Want me to take this one?



@fnothaft
Member

+1, that'd be wonderful.

@calvertj
Contributor

Cool, working on it.

@calvertj
Contributor

Sorry for the delay, will work on it this weekend.

@calvertj
Contributor

I'm done shoveling, back in progress.

@calvertj
Contributor

calvertj commented Mar 1, 2015

Hey guys, (@tdanford, @fnothaft)

From my (limited) understanding of the SAM format specification ( http://samtools.github.io/hts-specs/SAMv1.pdf ), there can be multiple lines with the same tag. Does that seem correct to you? A Map[String,String] could have collisions, and I am guessing you don't want to throw an exception when there are collisions?

There are plenty of options:
Map[String,String] - after grouping and concatenating like tag values
Map[String,List[String]]
List[(String, String)]
...
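
The options listed above differ mainly in how they treat a repeated tag name. A small sketch, assuming the input is a list of (tag, value) pairs that may contain duplicates; the names and the comma separator are assumptions for illustration only:

```scala
object CollisionOptions {
  // Hypothetical input: (tag, value) pairs that may repeat a tag name.
  type Pairs = List[(String, String)]

  // Option 1: group like tags and concatenate their values.
  def concatenated(pairs: Pairs, sep: String = ","): Map[String, String] =
    pairs.groupBy(_._1).map { case (tag, kvs) => tag -> kvs.map(_._2).mkString(sep) }

  // Option 2: keep every value per tag in a list.
  def multi(pairs: Pairs): Map[String, List[String]] =
    pairs.groupBy(_._1).map { case (tag, kvs) => tag -> kvs.map(_._2) }

  // Option 3: List[(String, String)] is the input itself, so order and
  // duplicates are retained without any extra work.
}
```

Option 1 loses the boundary between values if a value can contain the separator; options 2 and 3 avoid that at the cost of a less direct lookup.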

Also, are you sure you don't want to keep the type information that was so nicely parsed out?

I am looking at this bit of code in org.bdgenomics.adam.converters.SAMRecordConverter:

I assume these are the tags I'm looking for.

Thanks and sorry for the delay,

J

@tdanford
Contributor

tdanford commented Mar 1, 2015

When I wrote some version of this code back in the day, I remember thinking, "we should probably be keeping the type information around." It's required at least insofar as we will ever need to go back to BAM/SAM format.

Also, J, I'm confused by what your question is about tag-uniqueness. A TAG (e.g. "RG" or "MD" etc) can definitely be repeated across lines -- for example, every record for every alignment of a read from a single lane of an Illumina sequencer will typically have an RG (= "Read Group") tag and the same value for that tag.

On the other hand, the spec says: "Each TAG can only appear once in one alignment line." -- so I interpret that as saying that we won't ever find two "RG" tags (for instance) on the same line.
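
That per-line uniqueness rule can be checked directly on a text SAM line. A minimal sketch, assuming a tab-delimited alignment line whose optional fields follow the 11 mandatory columns in TAG:TYPE:VALUE form; the object and function names are hypothetical:

```scala
object TagUniqueness {
  // Optional fields start at column 12 of a tab-delimited SAM alignment line.
  def optionalFields(samLine: String): List[String] =
    samLine.split("\t").drop(11).toList

  // Tag names that occur more than once on the line (empty if the line obeys the spec).
  def duplicateTags(samLine: String): Set[String] =
    optionalFields(samLine)
      .map(_.take(2))                 // a TAG is the first two characters
      .groupBy(identity)
      .collect { case (tag, hits) if hits.size > 1 => tag }
      .toSet
}
```

If the result is always empty for valid input, a Map keyed by tag name cannot collide within a single record.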

@calvertj
Contributor

calvertj commented Mar 1, 2015

@tdanford

Oops, I was thinking the SAMRecord was an iterator over the file, not a line. I guess I should not watch House of Cards while looking at code.

As for what to store it as, are you sure you don't just want to keep the List[Attribute]?

case class Attribute(tag: String, tagType: TagType.Value, value: Any) {

Finally, should I change the attributes field in the format, or create a samTags field or something?
https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl#L218

Thanks,

J

@tdanford
Contributor

tdanford commented Mar 1, 2015

Jason, I'm confused by what you mean, when you write

> As for what to store it as, are you sure you don't just want to keep the List[Attribute]?

@calvertj
Contributor

calvertj commented Mar 2, 2015

Currently the List[Attribute] is turned into a tab-delimited string:

At least I think I am looking at the right bit of code. Are these the "tags" I am looking to update?
I was wondering why not just keep the List[Attribute] instead of a Map[String,String]? Or maybe a Map[String,Attribute]. That is all.

[Edit], you don't have to give me an answer to the "why not", just let me know what you wish to do.

J

@calvertj
Contributor

calvertj commented Mar 4, 2015

Well, I am just going to use a Map[String,String], since then we won't need a custom record in Avro. Path of least resistance. So is something like this ok?
("tagName" -> "tagType:tagValue") ?
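
A minimal sketch of what that encoding could look like, round-tripping the type code through the map value. The simplified `Attr` case class (a Char type code and a String value) and the function names are assumptions for illustration, not the converter's real types:

```scala
object TypedTagMap {
  // Hypothetical simplified attribute: SAM type code (e.g. 'i', 'Z') plus raw text value.
  final case class Attr(tag: String, tagType: Char, value: String)

  // Encode each attribute as tagName -> "tagType:tagValue".
  def toMap(attrs: Seq[Attr]): Map[String, String] =
    attrs.map(a => a.tag -> s"${a.tagType}:${a.value}").toMap

  // Decode, splitting on the first colon only so values may contain ':'.
  def fromMap(m: Map[String, String]): Seq[Attr] =
    m.toSeq.sortBy(_._1).flatMap { case (tag, encoded) =>
      encoded.split(":", 2) match {
        case Array(t, v) if t.length == 1 => Some(Attr(tag, t.head, v))
        case _                            => None // malformed entries are skipped in this sketch
      }
    }
}
```

Keeping the type code in the value is what makes it possible to write the tags back out in SAM's TAG:TYPE:VALUE form later; note that the map does not preserve the original tag order, so this sketch returns tags sorted by name.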

@tdanford
Contributor

tdanford commented Mar 4, 2015

Jason, I think it sounds fine -- but we'll probably need to see the code :-)

Why don't you whip something up, and we'll review it in detail there?

@calvertj
Contributor

calvertj commented Mar 4, 2015

Yea, less talk, more action.

@fnothaft
Member

fnothaft commented Mar 4, 2015

> ("tagName" -> "tagType:tagValue") ?

SGTM!

@tdanford
Contributor

tdanford commented Mar 4, 2015

Yeah, that sounds like the right direction -- but again, eager to see code :-)

@fnothaft
Member

Closing as won't fix. See #1080 for most recent discussion.
