-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sam: Consider more conventional format versioning #725
Comments
The format itself is versioned in a conventional manner: generally when syntax is introduced that previous implementations would fail to parse, the version number is bumped to indicate that. This is indicated by the bold items in Appendix B. For example, when The SAM format is designed to be an extensible format. Thus it defines syntax for various kinds of optional SAM record fields, and invites tools and research groups to define meanings for particularly named tagged fields. Similarly it defines (simpler untyped) syntax for SAM header fields, and invites users to define meanings for particularly named fields. There are also particular tagged fields and header fields that are predefined by the SAM format authors and maintainers because they (and the shared vocabulary they define) are of general use; for convenience, these are listed in the same specification documents. There is no issue with a particular tagged or header field defined later appearing in a SAM/BAM/etc file that declares an earlier Similar comments apply to other semantic information such as the list of (valid) well-known values for such fields, as mentioned in #718 (comment). This means that the specification document contains a mixture of categories of information: there is the format proper that determines syntax and what implementations need to parse, and semantic definitions of particular tags perhaps accompanied by lists of valid values and their definitions. (The definitions of optional record fields used to be a section within SAMv1.pdf too BTW. SAMtags.pdf was moved to a separate document for two main reasons: so that it could be more reasonably referred to from the CRAM specification; to somewhat decouple it and its maintenance from the SAM document and its versioning. These header field definitions could be moved to SAMtags.pdf too for similar reasons, but IMHO that would be overkill — I have a backburner PR to reformat the multipage header table as not-a-table, which would mean we had more space to expand on what “valid values” really means and clarify this.)
What I think you are suggesting implies that absolutely everything defined in the specification documents would no longer be considered extensible, in the sense that fields and valid values defined in a particular edition of the specification would not be allowed in SAM files that predated that edition (or that declared an The question is who this would be an improvement for. There are at least three categories of people with an interest in these specifications:
It is true that tools that validate things carefully may wish to say things like “I support PL:ULTIMA because I support edition XYZ of the SAM specification”. But SAM philosophy has always been that such statements are not particularly useful for the users working with SAM files, and that where it means that otherwise compatible new fields can't be used in old files it's actively counterproductive for users. TL;DR There is a difference between versioning of the format and versioning of the specification document. This has been discussed ad nauseam before, for example on #194 and #454. In practice, specification documents are not as immutable as you might think. Typo fixes and formatting improvements do not change the format described by a specification. Clarification of unclear text should not and generally does not change the format described by a specification. Thus not all changes in a specification document should be reflected in the version number declared by a file in that format. (These are examples of when specification documents are not usefully considered immutable.) Hence the SAM specification separates the SAM version number (like the current 1.6) from the specification's own versioning. The specification document itself is lightly versioned: its title page mentions the Git commit hash it was built from and the date of that commit. This light versioning has IMHO proved sufficient for readers' needs. That is the status quo. If the status quo were to be revisited, those advocating for a change ought to familiarise themselves with the previous discussions and consider the following:
|
Yes, but this is not extensibility but backwards compatibility. Both examples were already accepted before they were formally defined.
Validations are quite useful when expecting conformance. If a new field is incompatible with an old format, then that compatibility wasn't there in the first place. I suspect this knowledge actually improves productivity for users.
I think these issues are unrelated to this discussion? My proposal is about format versioning, not the standard operating procedure of document releases.
A format is as extensible as it is allowed to be. For example,
No, I consider this all part of the grammar. You're arguing that the lexical analysis overrules all of it. Your first example is a counterexample to your argument:
My proposal still stands from the opening message: "I propose 1.6 be finalized and future changes made to a draft of the next version of SAM." Ambiguity and undefined behavior in a specification is detrimental to interoperability. Barring naive usages and implementations, a lack of guidance makes this bad for both users and implementors. Changing the behavior of the format within the same version is not conventional and confusing for all parties (except the spec maintainers?). |
Well, that is a view that is not held by others. It is IMHO an implementor-centric view, and not one that is especially useful for users of SAM or for most implementors of bioinformatics tools. If it is not clear that things like the lists of valid well-known values for Certainly there is a tension here for implementations that want to validate PL values: it is useful that there exist implementations that produce diagnostics for unknown values, because that is how we actually get consistency in the values written by users; but there is a deployment issue for such implementations when new platforms arrive and are added to the spec. The spec just lists the canonical values (imperfectly) and leaves decisions about whether and how strictly to validate this field value up to the implementations. Users would not be helped by having to set
With our spec maintainers hats on, we need to think about what SAM as a Platonic ideal is and means, and what the spec text means to say and ought to say, not just what the imperfect text as it currently stands actually says. Here In the light of those considerations, it is obvious that TAG:B:VALUE could not already have been accepted. Hence the version bump. Footnotes
|
The spec maintainers would claim that they have not substantially changed the behaviour of the format within the same version. Your views on PL are clear. You do have a point and the description of this field should be expanded and improved in some approximation of the ways talked about. Are there other specific problems you have in mind? (Or is this a storm in a PL-sized teacup…?) |
SAM and BAM are two different specs, with different version numbers, but there is some rather poorly defined tie between them. BAM has its own binary version number, but that is more for the higher encapsulation and encoding of fixed fields. There is also the SAM version number within the text header within BAM. For B, arguably both BAM and SAM could have changed version, but given BAM encapsulation elsewhere doesn't change and the SAM header version can be specified within BAM, I guess we decided to stick with SAM only. An example with B: In SAM, it's just a text string with an array of ASCII numbers. Basically it's just a new type as you say. The syntax doesn't change, and we can still skip from SAM tag to SAM tag trivially by looking for the tabs and recognising TAG:TYPE:VALUE. Even if we get a TYPE we don't know, our parser could be written to be liberal and just skip over the tag, or store it in pure text and regurgitate it back out as-is without any knowledge of the new type code. In BAM this is fundamentally impossible. BAM tags are all squished together without a separator like TAB. We must know what the new TYPE means and how to parse it, otherwise our entire file is unparsable. To that degree, the B tag is a hard stop. It's not just a case of interpretation of data and recognising new key-value names, but of the fundamental data layout. I agree it's nuanced, and maybe not desireable, to have both BAM and SAM intertwined in this way, but it's where we are and we cannot change history. So do remember this when thinking in pure abstract terms of syntax and semantics - we're not just in SAM world. |
I'll also add my views here, which are pretty much a copy of what John has already summarised. You have to be pragmatic and think about the software life cycle. Let us imagine a case where we add OMNICORP as a new platform to the PL field. What should our existing infrastructure do? Should it simply fail? Should it warn? Or should it swallow the data unless we're actively doing something that specifically needs to do platform-specific processing and it's now met one it isn't aware of? Personally I'd say the users will likely accept the latter, and reject the former as being user-hostile. I know from looking at download logs that we get a VAST number of downloads of ancient legacy copies of software. I wish people didn't, but they do. Would we really be serving the community better if our software just flat out broke when given a file with a syntacically valid data stream but having a new field or new value? Obviously not. So what is the point therefore of updating the version number to reflect this field has been added? I cannot see one. We separated out the aux tags into their own document primarily because we wanted to use a shared specification that could be referenced from both SAM/BAM and CRAM. However it served a rather useful purpose in largely divorcing the changes that happen there from the changes in the SAM/BAM and CRAM version numbers. When we added base modification tags (MM/ML/MN) we didn't need to change version. The only tag change that altered version was when the actual encapsulation had to change to accommodate new encoding byte streams - B in BAM basically. Arguably we could be clearer with the SAM header, and this is also shared between formats. However I largely view it in a similar vein as the aux tags. The key:value there (not key:type:value) should be considered as high level syntax within the specification, with the specific keys and values not being deemed to be something that changes the specification versions. If however we do something that we know would seriously break implementations, such as a new Obviously you could, and maybe should, question how adding |
This is in regard to Sequence Alignment/Map Format Specification (2022-08-02) § "Appendix B: SAM Version History".
Specification changes are currently made to the latest version. This is both confusing and unconventional, as versioned formats should be considered immutable.
I propose 1.6 be finalized and future changes made to a draft of the next version of SAM.
The text was updated successfully, but these errors were encountered: