-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sam: Clarify case-sensitivity of read group platform values #679
Comments
This is an important question that, though simple, could render entire swaths of genomics data out of spec. I would encourage us to consider it carefully. I, too, have come across many datasets that use the lowercase version of the platform unit ( There are three arguments I'd use to advocate for this position.
Note that, if this lowercase values are determined not to be spec compliant, then literally tens of thousands (if not more) files will need to be reheadered to become spec-compliant. |
I think it is intended to be case sensitive, and would argue that people treating it as case insensitive are making huge assumptions given there is nothing in the specification to state that. I wouldn't argue that the spec is ambiguous. It states it must be one of these list of words and while not explicitly stating "as written in this case", assuming the spec permits lowercase is reading something which simply doesn't exist. That said, if it's that dominant in the wild already, then the ship has sailed and the horse has bolted, so I don't see any value in strictly enforcing it. It would be very useful if one of the main data archives could trawl their headers to give us some guidance here. Is it purely 1 single project that chose lowercase, or is this a widespread usage? Finally, I think we shouldn't advocate changing the spec to make this case insensitive, but rather state it should be written in uppercase but due to historic lack of compliance tools may wish to perform any checking in a case-insensitive manner. |
I'll start looking for Australian Genomics. From previous runs I know .. which we are of course also seeing downstream. That does not bode well wrt enforcing |
Ok that's plain wrong from Illumina there, and completely unhelpful to users. I'll raise an issue with them to fix this. Thanks for reporting it. |
Certainly the intention is that they be written exactly as shown. Historically samtools/htslib has not enforced anything here, as it leaves that sort of thing to end-user applications that actually use the PL value. OTOH HTSJDK and hence Picard requires these to be one of the listed keywords, case insensitively, and has always done so. [Edited to add: it turns out that this is only enforced in So the practical clarification to be made (if any is warranted) would be to say that these values should be written in uppercase, and on reading may be accepted in lowercase. Files containing that |
@jmarshall I don't think that's right about HTSJDK. It didn't sound right to me, so I just took a small SAM file and edited the PL tag to |
I mostly can't check fully myself at the moment (travelling), so that was just based on a quick check of the code and my recollections. Perhaps the code I saw is only exercised in |
Following the chain of execution, it appears that the validation is occurring in the Tracing is all the way back, the relevant lines are here: https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/SamFileValidator.java#L616-L623. You can see that EDIT: When using
|
Yes, like I said in my first comment, HTSJDK's check on these values is (unfortunately) case insensitive. The interesting question is what you see with ValidateSamFile for |
With respect to the line of logic and after reviewing the spec again, I agree that it's clear that these should be uppercase. I also agree that something needs to be done to practically address the situation, as rigidity here would mean a significant chunk of data out there is just plain out of spec for little practical gain. I'm mixed on what the solution is here. On the one hand, I think I like the idea in principle about specifying that new values written should be uppercase and reading values should allow lowercase. Practically, I think leaving the guidance at that may be hard to operationalize or may cause further problems down the road. For example, how should a library report back the value to the user if it reads a lowercase There is another potential solution which is that, while The potential downside to this approach is that data changes from underneath the user without their knowing: I don't know if I think that's a good idea or not, and I don't know if there is any precedent for this in the community. |
Without setting stringency to Strict - nothing. At Strict it emits:
|
@claymcleod: Those library API questions are considerations within implementations, but IMHO they are beyond the spec's domain of interest. @tfenne: Thanks. So HTSJDK is permissive by default but enforces (case insensitive) valid PL values in strict mode. So from an interoperability POV, the question is whether people testing their SAM files with wacky PL values also validate in strict mode to do so… |
I grabbed a (mostly random) assortment of 10,000 BAMs from the Australian Genomics' repository. Of those, 2/3 did not set
These come form ~10 different sequencing centres/cores and 20 different flagships so at least for AG this is widespread. |
This could be true, and that would be up for you all to decide. I was thinking it could be within scope if it was a recommendation made within § 2 “Recommended Practice for the SAM Format”, which contains similar implementation-specific recommendations. Beyond allowing the lowercase items to be read, I'm rooting for a consistent user experience (or, at least as much as possible) from the ecosystem; my worry is simply that leaving the guidance at "you must write uppercase, but you can read lowercase" would lead to a situation where every implementation handles this in a different way (not much different than today). |
Ouch! So dominated by out-of-spec tags. I'm guessing it's mainly specific pipelines and software, with some being correct and some not, but the incorrect ones are clearly heavily used in some places. Edit: it's also sad to see most files don't bother documenting this. Data provenance is paramount, but it's pretty woeful in our industry. :( I think that basically nails it; the ship has sailed already. Note above I didn't say "you must write uppercase" but "it should be written in uppercase". While we don't explicitly state that we use the RFC 2119 language terms, I think the common parlance is in line with requirement vs recommendation here. Whether or not a specific tool decides it wants to "fix" incoming data when doing read/write code is an implementation specific detail. Here we simply define the specification and give recommendations. It's up to tools to try and stick to the requirements and make judgement calls on the advisory things, with appropriate documentation. We can of course (and indeed should) give appropriate recommendations where necessary, and this is one of those, but I wouldn't get too bogged down in the fine details of who corrects what and where. If it's at all debateable, it'll just cause arguments and some people won't honour it anyway meaning you get more spec breakage, not less (as we've seen with TLEN in the past, even from the the same author of the spec change and tools!). In summary my proposal would be to add a footnote to valid values: The |
I really like this, FWIW. |
To be somewhat generous to people who have developed pipelines for NGS for many years, the restriction of enumerated values in the In another instance of provenance being lost (or at least hard to find), the addition of the This may be giving non-conformant files too much of a benefit of the doubt (since that was a very long time ago in sequencing technology years), but my 2 cents would also be to be less strict about this in the specification. While I can see the benefit of a controlled vocabulary here, it also leads to the side effect that a group incorporating a new sequencing platform will not only have to validate that platform but also need to incorporate and validate new versions of other software just so they can use a newer library (htsjdk, noodles, or something else) that uses the updated spec version that includes that new So FWIW, as an external observer/user, I'd favor the "should" for uppercase at the very least so that existing files don't need to be re-headered, and would also advocate for consideration that the used of a controlled vocabulary might be a "should" rather than "must". Also, as pointed out in the noodles issue discussion, the current wording is ambiguous and could bear clarification one way or the other: "This field should be omitted when the technology is not in this list (though the PM field may still be present in this case) or is unknown." If this field SHOULD be omitted when the technology is not in this list (as written), are files that incorporate other PL values truly non-conformant? If you really want to restrict to enumerated values, then it would be better to have "This field MUST be omitted when the technology is not in this list..." |
We need to be careful to separate two things here - what the spec says is correct and what the implementations do. (And personally I don't view the spec as ambiguous, but accept if others deem it so then I guess it must be!) For tool writers, it is very useful to have a controlled vocabulary. Is it OXFORD_NANOPORE? OXFORD_NANOPORE_TECHNOLOGY? ONT? ILLUMINA vs SOLEXA? PB vs PACBIO? Etc. It's a total minefield unless an ontology is used. This helps a lot for people trying to fit the incoming data to precomputed training sets (accepting that platform alone may not be perfect, it's still better than no knowledge). So as a data consumer, it's valuable. However that doesn't mean the software tools should strictly enforce these things. Maybe if the tool absolutely needs to know the platform, then sure, but sensibly it'd request the user to specify a command line parameter instead, or fall back to a default. However this repo is focusing on the specification and not specific implementations. It's not really our place to dictate how strict implementations can be, but language such as should can give useful guidance. Regarding adding new terms to the specification, #651 and #662 show it's a simple process if the vendors engage, which we strongly encourage them to do. Getting these added to tools that care (which obviously most won't) is up to the maintainers of those tools, but as you point out it's probably in their best interest to offer some degree of flexibility so they can cope with new instruments arriving on the scene. |
This Git repository contains the *.tex and SAM1.pages history from the old samtools subversion repository. The list of PL values was not present in the SAM1.pages document (from 2009), which just said “PL — Platform/technology used to produce the read”. (The final version of this identified itself as “Version 0.1.2-draft (20090702)” and defined SAMv1.tex, which replaced the Pages document with a LaTeX rewrite, was composed mostly offline by Heng Li in July 2010, as recorded in this samtools-devel thread. Most of this flurry of activity existed only on Heng's laptop of the time. The first version of this that discusses SAM headers appeared in this posting, is committed as 07dc1c6, identified itself as “v1.3 draft” (and an update later the same day defined
So this list of valid PL values has been in the spec since July 2010, for the entire lifetime of the LaTeX SAM specification. As James notes and as mentioned in #454 (comment) (and elsewhere in that thread; see also that thread for further discussion of PL upper/lower case), the reason for controlling the vocabulary here is so that tools can infer the underlying technology's error model etc without having to look for 18 different spellings/capitalisations of I personally agree that it is somewhat unfortunate if implementations contain a list of values from the spec at a particular point in time, and validate against that list. OTOH implementations doing that sort of thing is about the only reason there is any consistency in these values and hence why analysis tools can depend on that PL value being “ILLUMINA” (case insensitively, sigh) rather than “SOLEXA” or some other variant. It is interesting to compare For Re “This field MUST be omitted when […]”-style wording, note that (1) SAM is not a communications protocol and the SAM spec does not explicitly use the RFC 2119 shouty verbs; (2) the spec is a compromise with reality, in which there is a dominant implementation whose |
@jkbonfield -- I completely agree that getting a new platform added to the list of acceptable terms is an easy process. The potential concern with tying the terminology directly to the spec version (and this may be a devil's advocate position is like this: Say a lab has validated an extensive software pipeline that could potentially incorporate several libraries that work with hts data. Now they want to use a different platform in a way that shouldn't affect those early steps of the pipeline, but in order to have their bam/cram files be completely spec-compliant and work with the various libraries they've used, they now need to update htslib from version X to Y, htsjdk from V to W, noodles from T to U... If the only thing that changed in each of those libraries was the acceptable values of the PL field, then re-validation could be relatively straightforward, but if between versions V and W of htsjdk there were more extensive changes in behavior, then that process would be harder. To be fair, the re-validation should be in-depth anyway, but since the libraries themselves typically don't use this info (from my understanding), it's fair to expect them to pass it along. @jmarshall -- thanks for looking into the additional history. I hadn't realized that SAM1.pages was the next step back prior to the tex document.
Again, I completely understand the benefits of having a controlled vocabulary, but would also note that the current specification does not require that the technology, as described by the Regarding the shouty verbs-- I also understand that the SAM/BAM/CRAM formats are not RFCs and aren't explicitly using those verbs-- I was just commenting from the standpoint that those verbs do have a common usage in specifications, such that some readers may expect that the word "should" implies preferred behavior, but that files or tools not following that would not necessarily be out-of-specification. So to suggest another possibility: maybe the spec could define a strict subset (along the lines of what Picard already does in FWIW, for all of the pipelines I'm invovled in, I've checked to make sure that the files we produce going forward will use the shouty ILLUMINA, but am not sure when we might get around to reheadering old files. Or if the decision is to go with @jkbonfield 's suggestion then maybe we won't have to. |
Both SAM maintainers have proposed essentially the same thing (#679 (comment), #679 (comment)), so that is the likely direction of an eventual decision. There are still some details to be thrashed out: for example, I would prefer to say “may/should also accept lowercase when reading” rather than “accept case insensitively” so as to avoid explicitly blessing ILlumInA and the like. (Maybe the people who checked their PL archives could comment on whether they found any such abominations!)
Indeed. In general it is beyond the SAM spec's abilities to adjudicate on such matters. But there is unfinished business here in allowing such things to be specified; see e.g. #454 (comment). |
This is why the libraries probably shouldn't be strictly validating by default. It can be a useful feature to enable in order for people to do data validation, but it's not the file format implementations which are processing this data. They present it to a tool and really it's up to that tool to decide how to interpret the data given to it. In this scenario, you don't update your version of e.g. htsjdk in order to get one that accepts PL:MyNewInstrument, but in the process get a bunch of other changes. Instead in a perfect world you carry on using the same version you validated everything against before and if a validation failure triggers then you ignore that specific known issue and continue as before. It needs a tool chain to be aware of future changes, but that's reasonable. Agreed on error models and PL alone being insufficient. One very recent example of this is PacBio announce two new machines. Revio is essentially Sequel III, with very comparable error models, and SBB which is a totally different underlying technology. Theoretically the former is addressed using the platform model, while the latter would be best dealt with by adding a new platform (or revised as in John's A:B nomenclature). I think it may be useful for someone somewhere to maintain a list of known models (not necessarily hts-specs, but we should link to something if not). The ENA, SRA, etc dictionary tables used for their databases are probably the sensible starting point, assuming they even match. Just knowing what strings to look for is a start, but as far as I know this is totally undocumented and only accessable by people internal to the archives, which is far from desireable. |
ENA's list can be found in https://ftp.ebi.ac.uk/pub/databases/ena/doc/xsd/sra_1_5/SRA.common.xsd (look for |
Or after some hunting, I found this: https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#platform So not totally undocumented :) |
Also, prod to @raskoleinonen - EBI's list of accepted machine types needs to be updated to include the two new platforms SAM supports. Ideally using the same identifiers. |
So update from meeting. We're in agreement basically. We should continue to list them in uppercase and advocate for this usage, but will suggest that implementations should be case insensitive due to the existance of data using lowercase. I'll produce a PR for this. |
This is not changing what is valid / permitted, and indeed this hopefully clarifies it further. However the practicality of dealing with wide-spread non-compliant data with lowercase PL values is that tools may wish to be lenient and use case-insensitive matching. Fixes samtools#679
Thank you @jkbonfield. ENA will add support for ELEMENT (Element AVITI) and ULTIMA (UG 100). |
This is not changing what is valid / permitted, and indeed this hopefully clarifies it further. However the practicality of dealing with wide-spread non-compliant data with lowercase PL values is that tools may wish to be lenient and use case-insensitive matching. Fixes samtools#679
This is not changing what is valid / permitted, and indeed this hopefully clarifies it further. However the practicality of dealing with wide-spread non-compliant data with lowercase PL values is that tools may wish to be lenient and use case-insensitive matching. Fixes samtools#679
This is not changing what is valid / permitted, and indeed this hopefully clarifies it further. However the practicality of dealing with wide-spread non-compliant data with lowercase PL values is that tools may wish to be lenient and use case-insensitive matching. Also removes test/sam/failed/hdr.RG6.sam due to explicitly testing against the use of lower-case PL fields. While strictly not conforming, it's overly harsh if we are advocating a more spec-tolerant testing regime for PL. Fixes samtools#679
This is not changing what is valid / permitted, and indeed this hopefully clarifies it further. However the practicality of dealing with wide-spread non-compliant data with lowercase PL values is that tools may wish to be lenient and use case-insensitive matching. Also removes test/sam/failed/hdr.RG6.sam due to explicitly testing against the use of lower-case PL fields. While strictly not conforming, it's overly harsh if we are advocating a more spec-tolerant testing regime for PL. Fixes samtools#679
This is not changing what is valid / permitted, and indeed this hopefully clarifies it further. However the practicality of dealing with wide-spread non-compliant data with lowercase PL values is that tools may wish to be lenient and use case-insensitive matching. Also removes test/sam/failed/hdr.RG6.sam due to explicitly testing against the use of lower-case PL fields. While strictly not conforming, it's overly harsh if we are advocating a more spec-tolerant testing regime for PL. Fixes #679.
In summary, 94500cf now recommends platform values to be in written in uppercase. They can be parsed as either uppercase or lowercase but not mixed case, e.g., |
This is regarding Sequence Alignment/Map Format Specification (2022-08-22).
§ 1.3 "The header section" lists valid values for read group platforms:
Given the typewriter (
tt
)/monospace font used for these values, I interpret these are to be parsed as is, i.e., they are case-sensitive. However, there seems to be many files and datasets that failed to follow the spec in this way. For example, the 1000 Genomes Project uses the lowercase form ofILLUMINA
:Are read group platform values supposed to be case-insensitive, or are headers using the lowercase form not spec-compliant?
The text was updated successfully, but these errors were encountered: