The 'attributes' field of ADAMRecord drops the attribute type information. #92

Closed
tdanford opened this issue Feb 10, 2014 · 7 comments

@tdanford
Contributor

The 'attributes' field in the ADAMRecord captures the optional fields of a SAMRecord during conversion from SAM/BAM. Each optional field in the SAM file format carries type information along with a tag and a value, but the encoding into the 'attributes' field keeps only the tag and value and drops the type.

This is a problem for two reasons. First, the tags aren't canonical or standard, so there may be "new" or unknown tags whose type information we don't already know. Second, we'd like to be able to convert back to BAM or SAM from ADAM in the future, and we'll need this type information to do the conversion.

This ticket is for adding that type information back into the attributes field in a reasonable way.
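
To make the loss concrete, here is a minimal sketch in Python (not ADAM's actual conversion code). It assumes the SAM text layout `TAG:TYPE:VALUE` for optional fields and uses a hypothetical `encode_without_type` helper as a stand-in for a tag=value encoding. The tag `XX` is made up for the example.

```python
def encode_without_type(sam_field: str) -> str:
    """Hypothetical stand-in for a tag=value encoding that discards the type."""
    # maxsplit=2 so that a ':' inside the value is left untouched
    tag, _type_code, value = sam_field.split(":", 2)
    return f"{tag}={value}"


# The same tag and value, once typed as an integer and once as a string.
as_int = encode_without_type("XX:i:5")
as_str = encode_without_type("XX:Z:5")

# Both collapse to the same text, so the original type cannot be recovered.
print(as_int, as_str, as_int == as_str)
```

Once both fields are reduced to `XX=5`, nothing in the record says whether to write an integer or a string back out to BAM.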

@tdanford tdanford self-assigned this Feb 10, 2014
@massie
Member

massie commented Feb 10, 2014

This is related to issue #37

@tdanford
Contributor Author

Thanks for the reminder, Matt. I'm filing this in advance of my next pull request.

tdanford added a commit to broadinstitute/adam that referenced this issue Feb 13, 2014
…mmand.

This commit fixes issue 92 (bigdatagenomics#92).

The old style of encoding the "optional fields" from the SAM/BAM was to store
them as key=value pairs in the ADAMRecord.attributes string. However, this
loses information about the _type_ of the tag/value, which is necessary if
we want to reconstruct the original value type (for example, for re-exporting
BAM files from ADAM files).

This update is non-backwards-compatible, changing the format of the attributes
field to tag:type:value and introducing a new Attribute class for parsing and
handling these values.  It also adds functions to AdamRDDFunctions to allow for
filtering and subsetting of reads based on their tags, or to count the number of
distinct tags or tag-values across a set of reads.
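
As a rough illustration of the tag:type:value format described above, here is a minimal Python sketch. It is not the Attribute class or the AdamRDDFunctions methods from the commit; the names `Attribute`, `parse_attributes`, `filter_by_tag`, and `count_tags` are hypothetical, reads are modelled as bare attribute strings, and the tab separator between attributes is an assumption.

```python
from collections import Counter
from typing import List, NamedTuple


class Attribute(NamedTuple):
    tag: str
    type_code: str
    value: str


def parse_attribute(encoded: str) -> Attribute:
    # maxsplit=2 keeps any ':' inside the value intact
    tag, type_code, value = encoded.split(":", 2)
    return Attribute(tag, type_code, value)


def parse_attributes(field: str, sep: str = "\t") -> List[Attribute]:
    # Assumption: attributes within one record are separated by `sep`.
    return [parse_attribute(part) for part in field.split(sep) if part]


def filter_by_tag(reads: List[str], tag: str) -> List[str]:
    # Keep only the reads that carry at least one attribute with this tag.
    return [r for r in reads if any(a.tag == tag for a in parse_attributes(r))]


def count_tags(reads: List[str]) -> Counter:
    # Count how often each tag occurs across all reads.
    return Counter(a.tag for r in reads for a in parse_attributes(r))


reads = ["NM:i:0\tMD:Z:76", "NM:i:2", "XS:A:+"]
with_nm = filter_by_tag(reads, "NM")
tag_counts = count_tags(reads)
```

Here `with_nm` holds the first two reads, and `tag_counts` reports `NM` twice and `MD` and `XS` once each.
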
@tdanford
Contributor Author

As discussed on the call yesterday, PR #99 is the first step towards issue #37, but does not fix it...

@fnothaft
Member

What is additionally needed to fully fix #37?

@tdanford
Contributor Author

A list of the "standard" optional fields from the SAM file format spec, and an encoding of those fields in ADAMRecord's schema. Some of them, such as byte arrays or characters, will be annoying; it's not a 100% trivial task.
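
To show why some types are more awkward than others, here is a minimal Python sketch of decoding a value by its SAM type code. It assumes the type codes as I read them in the SAM spec (`A` character, `i` integer, `f` float, `Z` string, `H` hex-encoded byte array, `B` typed numeric array). The `decode_value` helper is hypothetical and says nothing about how ADAMRecord's schema would actually store these.

```python
from typing import List, Union

Decoded = Union[str, int, float, bytes, List[int], List[float]]


def decode_value(type_code: str, raw: str) -> Decoded:
    """Hypothetical decoder from a SAM text value to a typed Python value."""
    if type_code == "A":
        # A single printable character
        if len(raw) != 1:
            raise ValueError(f"type A expects one character, got {raw!r}")
        return raw
    if type_code == "i":
        return int(raw)
    if type_code == "f":
        return float(raw)
    if type_code == "Z":
        return raw
    if type_code == "H":
        # Byte array written as hex digits
        return bytes.fromhex(raw)
    if type_code == "B":
        # The first element names the numeric subtype, the rest are the values
        subtype, *items = raw.split(",")
        convert = float if subtype == "f" else int
        return [convert(item) for item in items]
    raise ValueError(f"unknown SAM type code {type_code!r}")


print(decode_value("i", "42"))
print(decode_value("H", "1AE301"))
print(decode_value("B", "c,1,-2,3"))
```

The scalar types map directly onto schema primitives, while `A`, `H`, and `B` each need their own convention, which is the annoying part mentioned above.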

@fnothaft
Member

Ah; gotcha!

tdanford added a commit to broadinstitute/adam that referenced this issue Feb 14, 2014
…mmand.

@tdanford
Contributor Author

Fixed by PR #99

3 participants