Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BAM version 2 proposal #259

Closed
wants to merge 3 commits into from
Closed

Conversation

daviesrob
Copy link
Member

Two commits here. The first one is just preparatory work on the table that describes the BAM format. The bin_mq_nl and flag_nc are split into their component parts, with appropriate sizes. The BIN calculation and auxiliary tag descriptions are moved out of footnotes and into their own subsubsections. No changes are made to the actual format.

The second commit introduces the version 2 BAM format:

  • The magic number is changed to "BAM\2" to allow software to detect the new version.
  • The size of n_cigar_op is increased to uint32_t.
  • An extra field bam2_hdr_flags is added to the header, and bam2_flags is added to the alignment records. Both of these fields are reserved for future enhancements.
  • The obsolete bin field is removed from version 2 files.
  • A new BV tag is added to the SAM @HD line as a (somewhat ugly, but effective) way of hinting which BAM version should be used.

Split combined fields (bin_mq_nl and flag_nc) into their component
parts.

Move BIN calculation and Auxiliary tag encoding out of footnotes
and into their own sections.
The magic number is changed to "BAM\2" to allow software to detect
the new version.

The size of `n_cigar_op` is increased to uint32_t.

An extra field `bam2_hdr_flags` is added to the header, and
`bam2_flags` is added to the alignment records.  Both of these fields
are reserved for future enhancements.

The obsolete `bin` field is removed from version 2 files.

A new `BV` tag is added to the SAM `@HD` line as a (rather ugly)
way of hinting which BAM version should be used.
@yfarjoun
Copy link
Contributor

yfarjoun commented Oct 26, 2017 via email

@jkbonfield
Copy link
Contributor

jkbonfield commented Oct 26, 2017

I envisage we'll switch all development to v2 if and when agreed upon and v1 is just left as-is. It's highly unlikely to need development time so I don't see that as a driving force for switching. We should only feel the need to switch all output formats to v2 is there is a tangible benefit for most data sets, eg speed or size.

We have seen this already with CRAMv3.0 vs 2.1 infact. When we added 3.0 we made it the default (because it's better for all cases, rather than just needed for a few - so different to BAM2.0), but most of the code is still shared and maintaining CRAM2.1 is trivial provided we still wish to maintain CRAM3.0, almost non-existant effort.

[At some point I'd like a CRAM4 with codec improvements, or maybe 3.2 depending on how they can be achieved. This would be about 10% space saving, but I also see a need to remove some CRAM size limitations too. The alternative is a completely new infrastructure - DAM, or whatever - that is a rewrite from the ground up and nothing to do with BAM nor CRAM except learning from their mistakes, but that's a huge effort.]

@daviesrob
Copy link
Member Author

I put in the bit about recommending v1 because there seemed to be a desire to keep compatibility with as much existing software as possible. I would see it as a transitional position - once enough programs and libraries understand BAM v2, we could change the text to recommend its use.

I think we've had a very good demonstration that BAM is difficult to upgrade in a safe way. In fact, the only route to do this in the existing format is to change the magic number (as done here). So I would expect all future developments to happen in version 2 as it's designed to be upgradable.

It shouldn't be difficult to support the two versions in parallel, at least for a while. The format only changes very slightly, so it's easy to make functions that can read and write both. As BAM version 1 isn't going to change after this, I'm not expecting to have to make many changes to the code that implements it.

@magicDGS
Copy link
Member

Maybe completely unrelated (I am following the conversation in #40, so I know that this PR is high priority so I do not want to stop it): I think that BAM2 is a good opportunity to modify some problems in the SAM format (such as #124) and bump also the SAM version.

But maybe I am completely wrong here, because I am not an expert in formats...

jkbonfield added a commit to jkbonfield/io_lib that referenced this pull request Oct 27, 2017
@daviesrob
Copy link
Member Author

@magicDGS No, #124 is a SAM problem.

BAM is only concerned with how to convert the data in a SAM file into a binary format. It should be able to represent anything that SAM can, although in practice there are limitations due to the use of fixed-length data types (#40 is all about how to relax one of those limits).

What is or is not a valid contig name is solely a concern for SAM. BAM just needs to be able to support whatever SAM does.

@magicDGS
Copy link
Member

Thanks @daviesrob, now I understand the differences between SAM and BAM specifications. Nevertheless, the separation of SAM/BAM in this proposal is not that clear, because the BV tag is a SAM header spec, but it is used only in BAM. That'w why I got confused.

@vadimzalunin
Copy link
Contributor

vadimzalunin commented Oct 30, 2017 via email

@daviesrob
Copy link
Member Author

@magicDGS Yes, the BV tag does leak some BAM implementation details into SAM world. In that respect, it's a bit of an ugly hack. However, it's useful to be able to signal the need for version 2 features somehow, and BV does this in a way that survives conversion to SAM or CRAM format and back again. Basically it's a pragmatic solution to how writers decide which version to output, and less painful than some of the others (e.g. setting command-line parameters everywhere).

@vadimzalunin From a computer science perspective, possibly, but the original specification wasn't written that way. That isn't necessarily a bad thing as the result is fairly easy to understand. Having BAM and SAM in the same document does appear to cause some confusion as to their respective roles, though.

@jkbonfield
Copy link
Contributor

Closing as the bit of this we wish to include is now in PR #274

@jkbonfield jkbonfield closed this Jan 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants