Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New BAM files can support more than 65535 cigar opcodes #12

Closed
ghuls opened this issue May 26, 2018 · 2 comments
Closed

New BAM files can support more than 65535 cigar opcodes #12

ghuls opened this issue May 26, 2018 · 2 comments
Assignees
Milestone

Comments

@ghuls
Copy link

ghuls commented May 26, 2018

New BAM files can support more than 65535 cigar opcodes
samtools/hts-specs#40

There was discussion on the samtools-dev mailing list about this last year: http://sourceforge.net/p/samtools/mailman/message/30672431/ The main crux of the discussion there was to reuse the 16bit bin field to act as the top 16 bits of ncigar, possibly using the top flag bit as an indicator. There are some other discussions (internal) regarding this, including possibly removing bin completely given that it has no meaning with CSI indices, so this issue is largely just a note to track the issue and collate ideas/fixes. Note that the problem is definitely a real one. I have hit this with real projects, caused when a user merged consensus sequences from an assembly into a BAM file, but it is also not too far away with actual sequence reads too.

Newer technologies (PacBio, ONT, maybe more) offer substantially longer read lengths but also with higher indel rates leading to substantially more cigar opcodes. A 320Kb with with a 10% indel rate would lead to 2 changes (D/I to M) every 10 bases, giving 64k cigar ops. (Those aren't figures for real technologies, but it's not inconceivable.)

htslib spec:
samtools/hts-specs#227
samtools/hts-specs@ff8b54d

htslib patch:
samtools/htslib@aea349a

bamtools patch:

Support BAMs with >65535 CIGAR operations

Due to a design flaw, the original BAM format is unable to store an alignment
with >65535 CIGAR operations. The SAM/BAM specification maintainers have
decided to move the actual CIGAR to a CG optional tag and write a fake CIGAR
<readLen>S<refLen>N
at the original CIGAR place.

This PR recognizes the CG tag and seamlessly moves the real CIGAR back to its
right place and update the bin field accordingly. Library users need not take
any actions.

The convert and sort commands of command-line bamtools have been tested on BAMs
containing the CG tag.

pezmaster31/bamtools@cbfd3e8

The CIGAR length problem is specific for the BAM format, not for SAM format (can handle arbitrary long CIGAR strings).

If you add support for long CIGAR strings, you probably should use @hd VN:1.6 in the header too:
jmarshall/hts-specs@3f5a63e

Bump version to VN:1.6 due to >65535 CIGAR strings

The new CG tag field BAM representation for long CIGAR strings
(PR #227, merged as dab57f4)
will be unnoticed by older code. Such code will see the placeholder
CIGAR string, so it needs to be possible to signal the presence of
CG tags via the @HD-VN version number header field.

https://github.com/BioJulia/BioAlignments.jl/blob/master/src/bam/record.jl contains a typo:

fixed-length fields (see BMA specs for the details)

@TransGirlCodes
Copy link
Member

Thanks for flagging this up @ghuls, we certainly should make sure BioAlignments can support newer technologies. I will take a look at implementing this as soon as I am able - I have to deal with some other packages first, but I think this should go into the next release version of BioAlignments.

@TransGirlCodes TransGirlCodes added this to the Version 0.3 milestone May 30, 2018
@TransGirlCodes TransGirlCodes self-assigned this May 30, 2018
@TransGirlCodes
Copy link
Member

@ghuls This has been implemented by #13 and will be in the next release. Thank you for bringing this to my attention. I'll close this now. Feel free to re-open if anything hasn't been addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants