BAM version 2 proposal #259
Split combined fields (bin_mq_nl and flag_nc) into their component parts. Move BIN calculation and Auxiliary tag encoding out of footnotes and into their own sections.
The magic number is changed to "BAM\2" to allow software to detect the new version. The size of `n_cigar_op` is increased to uint32_t. An extra field `bam2_hdr_flags` is added to the header, and `bam2_flags` is added to the alignment records. Both of these fields are reserved for future enhancements. The obsolete `bin` field is removed from version 2 files. A new `BV` tag is added to the SAM `@HD` line as a (rather ugly) way of hinting which BAM version should be used.
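As a rough illustration of how software could use the magic number to detect the version, here is a minimal sketch. It assumes the four magic bytes have already been read from the decompressed BGZF stream; `bam_version` is a hypothetical helper name, and the value `"BAM\2"` is only what this PR proposes, not something in the published specification.

```python
BAM_MAGIC_PREFIX = b"BAM"

# Version 1 is the existing "BAM\1" magic; version 2 is the "BAM\2"
# value proposed in this PR (an assumption, never a published format).
KNOWN_VERSIONS = (1, 2)


def bam_version(magic: bytes) -> int:
    """Return the BAM version encoded in the last byte of the 4-byte magic."""
    if len(magic) != 4 or magic[:3] != BAM_MAGIC_PREFIX:
        raise ValueError("not a BAM magic number")
    version = magic[3]  # indexing bytes yields an int
    if version not in KNOWN_VERSIONS:
        raise ValueError(f"unsupported BAM version {version}")
    return version
```

A reader would call `bam_version(b"BAM\x01")` or `bam_version(b"BAM\x02")` on the first four decompressed bytes and pick the matching record layout; anything else is rejected rather than guessed at.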
I disagree with the attitude that writers should prefer version 1 to
version 2. I think that it's the wrong path forward. v2 should be the
"better" version and thus we should be encouraging everyone to move to it.
I do not want to have to maintain two versions in the long run. Once we have
v2, we should stop developing v1 and move our development efforts to
v2. Creating a v2 that is only for some folks to use is the worst option
of all (in my view). If that's the case, this version should be called
something else, not v2.
Our approach to new versions should be, "This is the best way forward and
we recommend that everyone move to using v2", rather than "This is the best
we could come up with that solves a pesky little problem that a few people
have; if you are not one of these people, move along, nothing to see here".
On Thu, Oct 26, 2017 at 12:34 AM, daviesrob wrote:
Two commits here. The first one is just preparatory work on the table that
describes the BAM format. The bin_mq_nl and flag_nc are split into their
component parts, with appropriate sizes. The BIN calculation and auxiliary
tag descriptions are moved out of footnotes and into their own
subsubsections. No changes are made to the actual format.
The second commit introduces the version 2 BAM format:
- The magic number is changed to "BAM\2" to allow software to detect
the new version.
- The size of n_cigar_op is increased to uint32_t.
- An extra field bam2_hdr_flags is added to the header, and bam2_flags
is added to the alignment records. Both of these fields are reserved for
future enhancements.
- The obsolete bin field is removed from version 2 files.
- A new BV tag is added to the SAM @HD line as a (somewhat ugly, but
effective) way of hinting which BAM version should be used.
Commit Summary
- Tidy up BAM definition
- Add BAM version 2
File Changes
- *M* SAMv1.tex
<https://github.com/samtools/hts-specs/pull/259/files#diff-0> (155)
Patch Links:
- https://github.com/samtools/hts-specs/pull/259.patch
- https://github.com/samtools/hts-specs/pull/259.diff
I envisage we'll switch all development to v2 if and when it is agreed upon, and v1 is just left as-is. It's highly unlikely to need development time, so I don't see that as a driving force for switching. We should only feel the need to switch all output formats to v2 if there is a tangible benefit for most data sets, e.g. speed or size. We have seen this already with CRAMv3.0 vs 2.1, in fact. When we added 3.0 we made it the default (because it's better for all cases, rather than just needed for a few - so different to BAM2.0), but most of the code is still shared, and maintaining CRAM2.1 is trivial provided we still wish to maintain CRAM3.0: the effort is almost nonexistent. [At some point I'd like a CRAM4 with codec improvements, or maybe 3.2 depending on how they can be achieved. This would be about 10% space saving, but I also see a need to remove some CRAM size limitations too. The alternative is a completely new infrastructure - DAM, or whatever - that is a rewrite from the ground up and has nothing to do with BAM or CRAM except learning from their mistakes, but that's a huge effort.]
I put in the bit about recommending v1 because there seemed to be a desire to keep compatibility with as much existing software as possible. I would see it as a transitional position - once enough programs and libraries understand BAM v2, we could change the text to recommend its use. I think we've had a very good demonstration that BAM is difficult to upgrade in a safe way. In fact, the only route to do this in the existing format is to change the magic number (as done here). So I would expect all future developments to happen in version 2 as it's designed to be upgradable. It shouldn't be difficult to support the two versions in parallel, at least for a while. The format only changes very slightly, so it's easy to make functions that can read and write both. As BAM version 1 isn't going to change after this, I'm not expecting to have to make many changes to the code that implements it.
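To illustrate the point about shared read/write functions, here is a minimal sketch of packing and unpacking the fixed-length part of an alignment record for both versions. The version 1 layout follows the published BAM record (the fields after `block_size`). The version 2 layout is purely an assumption made for this example: the thread only says that `bin` is dropped, `n_cigar_op` becomes uint32_t and `bam2_flags` is added, so the field order and the 32-bit width of `bam2_flags` are guesses, and `read_core`/`write_core` are hypothetical names.

```python
import struct

# Version 1: fixed-length fields after block_size, little-endian.
V1_CORE = struct.Struct("<iiBBHHHIiii")
V1_NAMES = ("refID", "pos", "l_read_name", "mapq", "bin", "n_cigar_op",
            "flag", "l_seq", "next_refID", "next_pos", "tlen")

# HYPOTHETICAL version 2 layout (field order and bam2_flags width assumed):
# no bin, n_cigar_op widened to uint32, bam2_flags added.
V2_CORE = struct.Struct("<iiBBHIIIiii")
V2_NAMES = ("refID", "pos", "l_read_name", "mapq", "flag", "n_cigar_op",
            "bam2_flags", "l_seq", "next_refID", "next_pos", "tlen")

LAYOUTS = {1: (V1_CORE, V1_NAMES), 2: (V2_CORE, V2_NAMES)}


def read_core(buf: bytes, version: int) -> dict:
    """Unpack the fixed-length fields using the layout for `version`."""
    layout, names = LAYOUTS[version]
    return dict(zip(names, layout.unpack_from(buf)))


def write_core(fields: dict, version: int) -> bytes:
    """Pack the fixed-length fields; fields absent from the record are 0.

    A real version 1 writer would compute `bin` from the alignment
    coordinates instead of defaulting it to 0.
    """
    layout, names = LAYOUTS[version]
    return layout.pack(*(fields.get(name, 0) for name in names))
```

Only the layout table differs between the two versions; the calling code is the same. Under these assumptions, packing a record with more than 65535 CIGAR operations raises `struct.error` for version 1 and succeeds for version 2.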
Maybe completely unrelated (I am following the conversation in #40, so I know that this PR is high priority and I do not want to stop it): I think that BAM2 is a good opportunity to fix some problems in the SAM format (such as #124) and also bump the SAM version. But maybe I am completely wrong here, because I am not an expert in formats...
@magicDGS No, #124 is a SAM problem. BAM is only concerned with how to convert the data in a SAM file into a binary format. It should be able to represent anything that SAM can, although in practice there are limitations due to the use of fixed-length data types (#40 is all about how to relax one of those limits). What is or is not a valid contig name is solely a concern for SAM. BAM just needs to be able to support whatever SAM does.
Thanks @daviesrob, now I understand the differences between SAM and BAM specifications. Nevertheless, the separation of SAM/BAM in this proposal is not that clear, because the BV tag is a SAM header spec, but it is used only in BAM. That's why I got confused.
Imo it seems a bit odd to assume a text serialization as the model. Shouldn't
there be a format-agnostic model?
On Mon, 30 Oct 2017, 10:55 Daniel Gómez-Sánchez wrote:
Thanks @daviesrob <https://github.com/daviesrob>, now I understand the
differences between SAM and BAM specifications. Nevertheless, the
separation of SAM/BAM in this proposal is not that clear, because the BV
tag is a SAM header spec, but it is used only in BAM. That's why I got
confused.
@magicDGS Yes, the [...]
@vadimzalunin From a computer science perspective, possibly, but the original specification wasn't written that way. That isn't necessarily a bad thing as the result is fairly easy to understand. Having BAM and SAM in the same document does appear to cause some confusion as to their respective roles, though.
Closing as the bit of this we wish to include is now in PR #274.