The 'attributes' field of ADAMRecord drops the attribute type information. #92

tdanford · 2014-02-10T21:27:07Z

The 'attributes' field in the ADAMRecord is used to capture the optional fields on a SAMRecord during conversion from SAM/BAM. However, the optional fields in the SAM file format contain type information, along with a tag and a value -- but the encoding into the 'attributes' field only maintains the tag and value, and drops the type information.

This is a problem for two reasons. First, the tags aren't canonical or standard, so there may be "new" or unknown tags whose types we don't already know and can't infer from the tag alone. Second, we'd like to be able to convert back to BAM or SAM from ADAM in the future, and we'll need this type information to do the conversion.
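To make the loss concrete, here is a minimal sketch in Python (not ADAM's actual Scala conversion code) of a key=value encoding applied to two SAM optional fields. The tags `XA` and `XB` and the helper `to_key_value` are hypothetical, chosen only for illustration; the `TAG:TYPE:VALUE` layout of the input is the SAM text format for optional fields.

```python
# Two SAM optional fields with the same value text but different types:
# XA is a string (Z), XB is an integer (i). Both tags are made up for this example.
sam_fields = ["XA:Z:3", "XB:i:3"]


def to_key_value(field: str) -> str:
    """Lossy encoding: keep the tag and value, discard the type code."""
    # Split at most twice so that a value containing ':' stays intact.
    tag, _type_code, value = field.split(":", 2)
    return f"{tag}={value}"


lossy = [to_key_value(f) for f in sam_fields]
print(lossy)
```

After this encoding, nothing in the stored text distinguishes the string from the integer, so a writer going back to SAM/BAM would have to guess the type for any tag it doesn't already know.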

This ticket is for adding that type information back into the attributes field in a reasonable way.

massie · 2014-02-10T22:14:05Z

This is related to issue #37.

tdanford · 2014-02-10T22:31:16Z

Thanks for the reminder, Matt. I'm filing this in advance of my next pull request.

…mmand. This commit fixes issue 92 (bigdatagenomics#92). The old style of encoding the "optional fields" from the SAM/BAM was to store them as key=value pairs in the ADAMRecord.attributes string. However, this loses information about the _type_ of the tag/value, which is necessary if we want to reconstruct the original value type (for example, for re-exporting BAM files from ADAM files). This update is non-backwards-compatible, changing the format of the attributes field to tag:type:value and introducing a new Attribute class for parsing and handling these values. It also adds functions to AdamRDDFunctions to allow for filtering and subsetting of reads based on their tags, or to count the number of distinct tags or tag-values across a set of reads.
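As a rough illustration of the tag:type:value format described in this commit message, the following Python sketch parses such strings into a small record type and filters and counts by tag. It is not the Attribute class from PR #99 (which is Scala); the names `parse_attribute`, `parse_attributes`, `filter_by_tag`, and `count_tags`, and the tab separator between attributes, are assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Attribute:
    """Hypothetical stand-in for a parsed tag:type:value triple."""
    tag: str
    type_code: str
    value: Union[str, int, float]


def parse_attribute(encoded: str) -> Attribute:
    """Parse one 'tag:type:value' string, converting numeric types."""
    # Split at most twice: string (Z) values may themselves contain ':'.
    parts = encoded.split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"expected tag:type:value, got {encoded!r}")
    tag, type_code, raw = parts
    if type_code == "i":
        value: Union[str, int, float] = int(raw)
    elif type_code == "f":
        value = float(raw)
    else:
        # A (character), Z (string), H (hex), and B (array) are kept
        # as raw text in this sketch.
        value = raw
    return Attribute(tag, type_code, value)


def parse_attributes(field: str, sep: str = "\t") -> List[Attribute]:
    """Parse a whole attributes string; the separator is an assumption."""
    return [parse_attribute(p) for p in field.split(sep) if p]


def filter_by_tag(attrs: List[Attribute], tag: str) -> List[Attribute]:
    """Keep only the attributes carrying the given tag."""
    return [a for a in attrs if a.tag == tag]


def count_tags(attrs: List[Attribute]) -> Counter:
    """Count how many times each tag occurs."""
    return Counter(a.tag for a in attrs)


attrs = parse_attributes("NM:i:3\tMD:Z:10A5\tXT:Z:a:b")
print(filter_by_tag(attrs, "NM"))
print(count_tags(attrs))
```

Because the type code travels with the value, an exporter can write each field back in its original SAM type without consulting a table of known tags.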

tdanford · 2014-02-13T17:24:48Z

As discussed on the call yesterday, PR #99 is the first step towards issue #37, but does not fix it...

fnothaft · 2014-02-13T17:34:06Z

What is additionally needed to fully fix #37?

tdanford · 2014-02-13T17:47:35Z

A list of "standard" optional fields from the SAM file format spec, and an encoding of those fields in ADAMRecord's schema (some of them, such as byte array or character, will be annoying -- it's not a 100% trivial task).

fnothaft · 2014-02-13T17:59:48Z

Ah; gotcha!

tdanford · 2014-02-15T15:06:22Z

Fixed by PR #99

tdanford added the enhancement label Feb 10, 2014

tdanford self-assigned this Feb 10, 2014

tdanford mentioned this issue Feb 13, 2014

Encoding tag types in the ADAMRecord attributes, adding the 'tags' command #99

Merged

tdanford closed this as completed Feb 15, 2014
