The 'attributes' field of ADAMRecord drops the attribute type information. #92
This is related to issue #37
Thanks for the reminder, Matt. I'm filing this in advance of my next pull request.
tdanford added a commit to broadinstitute/adam that referenced this issue Feb 13, 2014
…mmand. This commit fixes issue 92 (bigdatagenomics#92). The old style of encoding the "optional fields" from the SAM/BAM was to store them as key=value pairs in the ADAMRecord.attributes string. However, this loses information about the _type_ of the tag/value, which is necessary if we want to reconstruct the original value type (for example, for re-exporting BAM files from ADAM files). This update is non-backwards-compatible, changing the format of the attributes field to tag:type:value and introducing a new Attribute class for parsing and handling these values. It also adds functions to AdamRDDFunctions to allow for filtering and subsetting of reads based on their tags, or to count the number of distinct tags or tag-values across a set of reads.
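To make the `tag:type:value` format concrete, here is a minimal sketch of parsing such strings into typed values and filtering on tags. It is written in Python for illustration only and is not the `Attribute` class or the `AdamRDDFunctions` code from the commit. The names `parse_attribute`, `parse_attributes`, and `has_tag` are hypothetical, the tab separator between attributes is an assumption, and only the SAM type codes `A`, `i`, `f`, `Z`, and `H` are handled (`B` arrays are left out).

```python
from dataclasses import dataclass
from typing import Any, List

# Converters for the SAM optional-field type codes this sketch supports.
# 'B' (typed numeric arrays) is deliberately omitted to keep the example short.
_CONVERTERS = {
    "A": str,            # single printable character
    "i": int,            # signed integer
    "f": float,          # floating-point number
    "Z": str,            # printable string
    "H": bytes.fromhex,  # byte array written as hex digits
}


@dataclass(frozen=True)
class Attribute:
    """Hypothetical stand-in for a parsed tag:type:value triple."""
    tag: str
    type_code: str
    value: Any


def parse_attribute(encoded: str) -> Attribute:
    # Split at most twice so that a 'Z' value may itself contain ':'.
    tag, type_code, raw = encoded.split(":", 2)
    if type_code not in _CONVERTERS:
        raise ValueError(f"unsupported type code: {type_code!r}")
    return Attribute(tag, type_code, _CONVERTERS[type_code](raw))


def parse_attributes(field: str, sep: str = "\t") -> List[Attribute]:
    # Assumption: attributes in one field are separated by tabs.
    return [parse_attribute(part) for part in field.split(sep) if part]


def has_tag(field: str, tag: str) -> bool:
    # A predicate of the kind one could use to subset reads by tag.
    return any(attr.tag == tag for attr in parse_attributes(field))
```

With this sketch, `parse_attributes("NM:i:5\tMD:Z:10A5")` yields an integer-valued `NM` and a string-valued `MD`, which is the distinction the old key=value encoding could not express.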
What is additionally needed to fully fix #37?
A list of "standard" optional fields from the SAM file format spec, and an encoding of those fields in ADAMRecord's schema (some of them, such as byte array or character, will be annoying -- it's not a 100% trivial task).
Ah; gotcha!
tdanford added a commit to broadinstitute/adam that referenced this issue Feb 14, 2014
Fixed by PR #99
The 'attributes' field in the ADAMRecord is used to capture the optional fields on a SAMRecord during conversion from SAM/BAM. However, the optional fields in the SAM file format contain type information, along with a tag and a value -- but the encoding into the 'attributes' field only maintains the tag and value, and drops the type information.
This is a problem for two reasons -- first, because the tags aren't canonical or standard, so there may be "new" or unknown tags whose type information we're not already aware of by introspection, and second, because we'd like to be able to convert back to a BAM or SAM from ADAM in the future, and we'll need this type information to do the conversion.
This ticket is for adding that type information back into the attributes field in a reasonable way.
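The information loss described above can be shown with a small sketch. It is a Python illustration under stated assumptions, not ADAM code: `encode_untyped` and `encode_typed` are hypothetical helpers, and `XA` is an arbitrary user-defined tag chosen for the example. An integer field and a string field that print identically collapse to the same text when only the tag and value are kept, so the original type cannot be recovered when writing SAM/BAM again.

```python
def encode_untyped(tag: str, value) -> str:
    # Tag and value only; the SAM type code is dropped.
    return f"{tag}={value}"


def encode_typed(tag: str, type_code: str, value) -> str:
    # Tag, SAM type code, and value, as in a SAM optional field.
    return f"{tag}:{type_code}:{value}"


# Two different optional fields: an integer 5 and the string "5".
as_int = ("XA", "i", 5)
as_str = ("XA", "Z", "5")

untyped_int = encode_untyped(as_int[0], as_int[2])
untyped_str = encode_untyped(as_str[0], as_str[2])
typed_int = encode_typed(*as_int)
typed_str = encode_typed(*as_str)

# The untyped encodings are indistinguishable; the typed ones are not.
print(untyped_int == untyped_str)
print(typed_int == typed_str)
```

For a standard tag the type could be looked up from the SAM specification, but for a new or unknown tag there is nothing to look up, which is the first of the two reasons given above.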