sam: Clarify case-sensitivity of read group platform values #679

zaeleus · 2022-10-23T18:24:14Z

This is regarding Sequence Alignment/Map Format Specification (2022-08-22).

§ 1.3 "The header section" lists valid values for read group platforms:

Valid values: CAPILLARY, DNBSEQ (MGI/BGI), ELEMENT, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), SOLID, and ULTIMA.

Given the typewriter (tt)/monospace font used for these values, I interpret these are to be parsed as is, i.e., they are case-sensitive. However, there seems to be many files and datasets that failed to follow the spec in this way. For example, the 1000 Genomes Project uses the lowercase form of ILLUMINA:

$ samtools head https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/high_coverage_alignment/HG00096.wgs.ILLUMINA.bwa.GBR.high_cov_pcr_free.20140203.bam | grep "^@RG"
@RG ID:H7AGF.1	PL:illumina	PU:H7AGFADXX131213.1	LB:Solexa-206008	SM:HG00096	CN:BI
@RG ID:H7AGF.2	PL:illumina	PU:H7AGFADXX131213.2	LB:Solexa-206008	SM:HG00096	CN:BI

Are read group platform values supposed to be case-insensitive, or are headers using the lowercase form not spec-compliant?

The text was updated successfully, but these errors were encountered:

claymcleod · 2022-10-23T23:19:13Z

This is an important question that, though simple, could render entire swaths of genomics data out of spec. I would encourage us to consider it carefully.

I, too, have come across many datasets that use the lowercase version of the platform unit (illumina in particular). There is a related discussion on this noodles issue. As I explain in my comment there, unless there is a compelling reason not to, I would recommend not enforcing the case sensitivity on these values.

There are three arguments I'd use to advocate for this position.

Lowercasing the PL is already pervasive in the community. Whereas the spec could be interpreted either way (hence this question), the default behavior of the core libraries in the ecosystem appears to be to allow this behavior. Indeed, I have never run into any issue related to this value until working with @zaeleus's noodles library (of which I am very thankful is spec-compliant and have advocated it remain that way—even at the expensive of much of our data being considered non-compliant within noodles).
Lack of consistency. The other purely descriptive data fields (of which I consider PL) within a @RG header line do not enforce case-sensitivity. I loosely define a data field as purely descriptive if it serves no other purpose than file providence (i.e., the field has no bearing on how to interpret the contents of the file based on the value assigned). Looking through the list, I considered ID, CN, DS, DT, LB, PG, PM, PU, and SM to fall into this category. Note that FO does require case-sensitivity, but I think it makes sense in that case, and I would not consider that purely descriptive.
Lack of value. That I can see, there is no compelling reason to enforce this behavior, as this is merely a descriptive field. Perhaps there is a good reason to enforce this, but I have not been able to think of one. So, from an idealistic standpoint (without the constraints of the ecosystem already allowing this behavior as I describe earlier), I still would not advocate for this behavior.

Note that, if this lowercase values are determined not to be spec compliant, then literally tens of thousands (if not more) files will need to be reheadered to become spec-compliant.

jkbonfield · 2022-10-24T08:26:40Z

I think it is intended to be case sensitive, and would argue that people treating it as case insensitive are making huge assumptions given there is nothing in the specification to state that. I wouldn't argue that the spec is ambiguous. It states it must be one of these list of words and while not explicitly stating "as written in this case", assuming the spec permits lowercase is reading something which simply doesn't exist.

That said, if it's that dominant in the wild already, then the ship has sailed and the horse has bolted, so I don't see any value in strictly enforcing it. It would be very useful if one of the main data archives could trawl their headers to give us some guidance here. Is it purely 1 single project that chose lowercase, or is this a widespread usage?

Finally, I think we shouldn't advocate changing the spec to make this case insensitive, but rather state it should be written in uppercase but due to historic lack of compliance tools may wish to perform any checking in a case-insensitive manner.

ohofmann · 2022-10-24T10:18:18Z

It would be very useful if one of the main data archives could trawl their headers to give us some guidance here. Is it purely 1 single project that chose lowercase, or is this a widespread usage?

I'll start looking for Australian Genomics. From previous runs I know bcbio uses lowercase for illumina; unless someone overrides this in the configuration the majority of SAM/BAM files generated by it will not be spec-compliant. Looking at a re-analysis of Australian Genomics data Michael Franklin (@illusional) kindly pointed out that Dragmap itself sets PL:PL0 as a default:

https://github.com/Illumina/DRAGMAP/blob/2e0ec6856bf88f869e89283b650ada53bac75d80/src/include/sam/SamGenerator.hpp#L173

.. which we are of course also seeing downstream. That does not bode well wrt enforcing PL at all, regardless of the upper/lowercase discussion.

jkbonfield · 2022-10-24T10:26:08Z

Ok that's plain wrong from Illumina there, and completely unhelpful to users. I'll raise an issue with them to fix this. Thanks for reporting it.

jmarshall · 2022-10-24T11:52:40Z

Certainly the intention is that they be written exactly as shown.

Historically samtools/htslib has not enforced anything here, as it leaves that sort of thing to end-user applications that actually use the PL value. OTOH HTSJDK and hence Picard requires these to be one of the listed keywords, case insensitively, and has always done so. [Edited to add: it turns out that this is only enforced in ValidateSamFile; other Picard tools do not validate PL when merely streaming through the headers.]

So the practical clarification to be made (if any is warranted) would be to say that these values should be written in uppercase, and on reading may be accepted in lowercase.

Files containing that PL0 value will already be rejected by HTSJDK-based tools.

tfenne · 2022-10-24T13:53:08Z

@jmarshall I don't think that's right about HTSJDK. It didn't sound right to me, so I just took a small SAM file and edited the PL tag to whatever and ran it through the lastest picard's SortSam, which ran to completion without a problem, and carried the PL value through just fine. It's possible that some GATK tools may fail that need to know the platform, but unless there's a super recent change I think HTSJDK is quite permissive.

jmarshall · 2022-10-24T14:03:20Z

I mostly can't check fully myself at the moment (travelling), so that was just based on a quick check of the code and my recollections. Perhaps the code I saw is only exercised in ValidateSamFile etc. (Search for INVALID_PLATFORM_VALUE.)

claymcleod · 2022-10-24T16:37:15Z

picard ValidateSamFile run against one of these non-compliant illumina values exits cleanly for me, even with --VALIDATION_STRINGENCY STRICT (which is the default). Note that I did have to run the command with --IGNORE INVALID_MAPPING_QUALITY for the files I was looking at from 1000 genomes, but that shouldn't affect the issue at hand.

Following the chain of execution, it appears that the validation is occurring in the SamFileValidator class, which is retreived from htsjdk: https://github.com/broadinstitute/picard/blob/master/src/main/java/picard/sam/ValidateSamFile.java#L267.

Tracing is all the way back, the relevant lines are here: https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/SamFileValidator.java#L616-L623. You can see that htsjdk actually converts the value to uppercase before doing the check, which is effectively allowing case insensitivity.

EDIT: When using PL:whatever, I indeed get

ERROR::INVALID_PLATFORM_VALUE: The platform (PL) attribute (whatever) + was not one of the valid values for read group

jmarshall · 2022-10-24T16:51:39Z

Yes, like I said in my first comment, HTSJDK's check on these values is (unfortunately) case insensitive.

The interesting question is what you see with ValidateSamFile for PL:whatever.

claymcleod · 2022-10-24T16:53:18Z

With respect to the line of logic and after reviewing the spec again, I agree that it's clear that these should be uppercase. I also agree that something needs to be done to practically address the situation, as rigidity here would mean a significant chunk of data out there is just plain out of spec for little practical gain.

I'm mixed on what the solution is here. On the one hand, I think I like the idea in principle about specifying that new values written should be uppercase and reading values should allow lowercase. Practically, I think leaving the guidance at that may be hard to operationalize or may cause further problems down the road.

For example, how should a library report back the value to the user if it reads a lowercase illumina—should it report it verbatim to the user? You can probably see where this might cause problems: if I wrote a simple program to read a SAM file and output the same SAM file, the library would allow me to read the file but not write it. I personally would find that confusing.

There is another potential solution which is that, while PL values can be read as lowercase, they should be converted to uppercase internally by the library before giving the value to a user. This matches, for instance, what htsjdk does, so no changes would be required to make that library spec compliant. Further, I think this is probably cleaner to implement from a library maintainer's perspective rather than trying to two separate valid values: one when read and one when writing. Last, this would effectively "self-heal" these files as they get passed through libraries that handle them appropriately.

The potential downside to this approach is that data changes from underneath the user without their knowing: I don't know if I think that's a good idea or not, and I don't know if there is any precedent for this in the community.

tfenne · 2022-10-24T16:54:29Z

Without setting stringency to Strict - nothing. At Strict it emits:

ERROR::INVALID_PLATFORM_VALUE:Read name A, The platform (PL) attribute (whatever) + was not one of the valid values for read group

jmarshall · 2022-10-24T17:17:39Z

@claymcleod: Those library API questions are considerations within implementations, but IMHO they are beyond the spec's domain of interest.

@tfenne: Thanks. So HTSJDK is permissive by default but enforces (case insensitive) valid PL values in strict mode. So from an interoperability POV, the question is whether people testing their SAM files with wacky PL values also validate in strict mode to do so…

ohofmann · 2022-10-25T01:21:25Z

That said, if it's that dominant in the wild already, then the ship has sailed and the horse has bolted, so I don't see any value in strictly enforcing it. It would be very useful if one of the main data archives could trawl their headers to give us some guidance here. Is it purely 1 single project that chose lowercase, or is this a widespread usage?

I grabbed a (mostly random) assortment of 10,000 BAMs from the Australian Genomics' repository. Of those, 2/3 did not set RG:PL at all; of the remaining 3,736 BAMs we have:

151 PL:ILLUMINA
3487 PL:illumina
remaining set to invalid entries, usually a combination of sequencer/lane

These come form ~10 different sequencing centres/cores and 20 different flagships so at least for AG this is widespread.

claymcleod · 2022-10-25T01:28:02Z

@claymcleod: Those library API questions are considerations within implementations, but IMHO they are beyond the spec's domain of interest.

This could be true, and that would be up for you all to decide. I was thinking it could be within scope if it was a recommendation made within § 2 “Recommended Practice for the SAM Format”, which contains similar implementation-specific recommendations.

Beyond allowing the lowercase items to be read, I'm rooting for a consistent user experience (or, at least as much as possible) from the ecosystem; my worry is simply that leaving the guidance at "you must write uppercase, but you can read lowercase" would lead to a situation where every implementation handles this in a different way (not much different than today).

jkbonfield · 2022-10-25T08:54:53Z

I grabbed a (mostly random) assortment of 10,000 BAMs from the Australian Genomics' repository. Of those, 2/3 did not set RG:PL at all; of the remaining 3,736 BAMs we have:
* 151 PL:ILLUMINA

* 3487 PL:illumina

* remaining set to invalid entries, usually a combination of sequencer/lane
These come form ~10 different sequencing centres/cores and 20 different flagships so at least for AG this is widespread.

Ouch! So dominated by out-of-spec tags. I'm guessing it's mainly specific pipelines and software, with some being correct and some not, but the incorrect ones are clearly heavily used in some places. Edit: it's also sad to see most files don't bother documenting this. Data provenance is paramount, but it's pretty woeful in our industry. :(

I think that basically nails it; the ship has sailed already.

Note above I didn't say "you must write uppercase" but "it should be written in uppercase". While we don't explicitly state that we use the RFC 2119 language terms, I think the common parlance is in line with requirement vs recommendation here.

Whether or not a specific tool decides it wants to "fix" incoming data when doing read/write code is an implementation specific detail. Here we simply define the specification and give recommendations. It's up to tools to try and stick to the requirements and make judgement calls on the advisory things, with appropriate documentation. We can of course (and indeed should) give appropriate recommendations where necessary, and this is one of those, but I wouldn't get too bogged down in the fine details of who corrects what and where. If it's at all debateable, it'll just cause arguments and some people won't honour it anyway meaning you get more spec breakage, not less (as we've seen with TLEN in the past, even from the the same author of the spec change and tools!).

In summary my proposal would be to add a footnote to valid values:

The PL field is a controlled vocabulary and should be written in uppercase. However any validation should be case insensitive due to the existance of many non-compliant files.

claymcleod · 2022-10-26T04:23:14Z

In summary my proposal would be to add a footnote to valid values:

The PL field is a controlled vocabulary and should be written in uppercase. However any validation should be case insensitive due to the existance of many non-compliant files.

I really like this, FWIW.

bpow · 2022-10-28T18:26:36Z

To be somewhat generous to people who have developed pipelines for NGS for many years, the restriction of enumerated values in the PL field wasn't always part of the specification. So there are SAM/BAM files out there that were compliant with the specification as written at the time they were written.

In another instance of provenance being lost (or at least hard to find), the addition of the PL enumerated values predates the svn -> git migration, so it's hard to find out when. It was probably around v1.2 of the spec, some time in late 2010 from what I can tell from some discussions at the time.

This may be giving non-conformant files too much of a benefit of the doubt (since that was a very long time ago in sequencing technology years), but my 2 cents would also be to be less strict about this in the specification.

While I can see the benefit of a controlled vocabulary here, it also leads to the side effect that a group incorporating a new sequencing platform will not only have to validate that platform but also need to incorporate and validate new versions of other software just so they can use a newer library (htsjdk, noodles, or something else) that uses the updated spec version that includes that new PL enumeration including this technology.

So FWIW, as an external observer/user, I'd favor the "should" for uppercase at the very least so that existing files don't need to be re-headered, and would also advocate for consideration that the used of a controlled vocabulary might be a "should" rather than "must".

Also, as pointed out in the noodles issue discussion, the current wording is ambiguous and could bear clarification one way or the other: "This field should be omitted when the technology is not in this list (though the PM field may still be present in this case) or is unknown."

If this field SHOULD be omitted when the technology is not in this list (as written), are files that incorporate other PL values truly non-conformant? If you really want to restrict to enumerated values, then it would be better to have "This field MUST be omitted when the technology is not in this list..."

jkbonfield · 2022-10-31T09:38:36Z

We need to be careful to separate two things here - what the spec says is correct and what the implementations do. (And personally I don't view the spec as ambiguous, but accept if others deem it so then I guess it must be!)

For tool writers, it is very useful to have a controlled vocabulary. Is it OXFORD_NANOPORE? OXFORD_NANOPORE_TECHNOLOGY? ONT? ILLUMINA vs SOLEXA? PB vs PACBIO? Etc. It's a total minefield unless an ontology is used. This helps a lot for people trying to fit the incoming data to precomputed training sets (accepting that platform alone may not be perfect, it's still better than no knowledge). So as a data consumer, it's valuable.

However that doesn't mean the software tools should strictly enforce these things. Maybe if the tool absolutely needs to know the platform, then sure, but sensibly it'd request the user to specify a command line parameter instead, or fall back to a default. However this repo is focusing on the specification and not specific implementations. It's not really our place to dictate how strict implementations can be, but language such as should can give useful guidance.

Regarding adding new terms to the specification, #651 and #662 show it's a simple process if the vendors engage, which we strongly encourage them to do. Getting these added to tools that care (which obviously most won't) is up to the maintainers of those tools, but as you point out it's probably in their best interest to offer some degree of flexibility so they can cope with new instruments arriving on the scene.

jmarshall · 2022-10-31T10:10:25Z

This Git repository contains the *.tex and SAM1.pages history from the old samtools subversion repository.

The list of PL values was not present in the SAM1.pages document (from 2009), which just said “PL — Platform/technology used to produce the read”. (The final version of this identified itself as “Version 0.1.2-draft (20090702)” and defined VN:1.0.)

SAMv1.tex, which replaced the Pages document with a LaTeX rewrite, was composed mostly offline by Heng Li in July 2010, as recorded in this samtools-devel thread. Most of this flurry of activity existed only on Heng's laptop of the time. The first version of this that discusses SAM headers appeared in this posting, is committed as 07dc1c6, identified itself as “v1.3 draft” (and an update later the same day defined VN:1.3), and says

PL — Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

So this list of valid PL values has been in the spec since July 2010, for the entire lifetime of the LaTeX SAM specification.

As James notes and as mentioned in #454 (comment) (and elsewhere in that thread; see also that thread for further discussion of PL upper/lower case), the reason for controlling the vocabulary here is so that tools can infer the underlying technology's error model etc without having to look for 18 different spellings/capitalisations of ILLUMINA/HISEQ/etc.

I personally agree that it is somewhat unfortunate if implementations contain a list of values from the spec at a particular point in time, and validate against that list. OTOH implementations doing that sort of thing is about the only reason there is any consistency in these values and hence why analysis tools can depend on that PL value being “ILLUMINA” (case insensitively, sigh) rather than “SOLEXA” or some other variant.

It is interesting to compare @SQ-PL to @HD-SO, which contains a similar list of valid values — this time listed all in lowercase and similarly without any commentary pointing that out. However I suspect you would find that essentially all SAM files that specify e.g. @HD VN:1.3 SO:coordinate do so in lowercase — because Picard validates this field and accepts only lowercase as written in the spec (in strict mode; in other modes SO:COORDINATE would be turned into unknown).

For @SQ-PL, Picard validates it only in ValidateSamFile and accepts either upper or lowercase (for reasons that are lost in the mists of time; see #454 (comment)). Hence where we are today.

Re “This field MUST be omitted when […]”-style wording, note that (1) SAM is not a communications protocol and the SAM spec does not explicitly use the RFC 2119 shouty verbs; (2) the spec is a compromise with reality, in which there is a dominant implementation whose ValidateSamFile command by default still does not accept @RG headers that do not have a PL field.

bpow · 2022-10-31T14:11:16Z

@jkbonfield -- I completely agree that getting a new platform added to the list of acceptable terms is an easy process. The potential concern with tying the terminology directly to the spec version (and this may be a devil's advocate position is like this: Say a lab has validated an extensive software pipeline that could potentially incorporate several libraries that work with hts data. Now they want to use a different platform in a way that shouldn't affect those early steps of the pipeline, but in order to have their bam/cram files be completely spec-compliant and work with the various libraries they've used, they now need to update htslib from version X to Y, htsjdk from V to W, noodles from T to U... If the only thing that changed in each of those libraries was the acceptable values of the PL field, then re-validation could be relatively straightforward, but if between versions V and W of htsjdk there were more extensive changes in behavior, then that process would be harder. To be fair, the re-validation should be in-depth anyway, but since the libraries themselves typically don't use this info (from my understanding), it's fair to expect them to pass it along.

@jmarshall -- thanks for looking into the additional history. I hadn't realized that SAM1.pages was the next step back prior to the tex document.

... the reason for controlling the vocabulary here is so that tools can infer the underlying technology's error model etc without having to look for 18 different spellings/capitalisations of ILLUMINA/HISEQ/etc.

Again, I completely understand the benefits of having a controlled vocabulary, but would also note that the current specification does not require that the technology, as described by the @SQ-PL alone, uniquely identify an error model. So it is possible that a different PM for the same PL could have a different error model, in which case tools might make invalid assumptions anyway.

Regarding the shouty verbs-- I also understand that the SAM/BAM/CRAM formats are not RFCs and aren't explicitly using those verbs-- I was just commenting from the standpoint that those verbs do have a common usage in specifications, such that some readers may expect that the word "should" implies preferred behavior, but that files or tools not following that would not necessarily be out-of-specification.

So to suggest another possibility: maybe the spec could define a strict subset (along the lines of what Picard already does in ValidateSamFile and specify that libraries should allow users to specify whether the strict or lenient subset is permitted. I'm not sure if that's worth the effort, though, or what other fields might be specified as strict or lenient subsets.

FWIW, for all of the pipelines I'm invovled in, I've checked to make sure that the files we produce going forward will use the shouty ILLUMINA, but am not sure when we might get around to reheadering old files. Or if the decision is to go with @jkbonfield 's suggestion then maybe we won't have to.

jmarshall · 2022-10-31T14:25:49Z

Both SAM maintainers have proposed essentially the same thing (#679 (comment), #679 (comment)), so that is the likely direction of an eventual decision.

There are still some details to be thrashed out: for example, I would prefer to say “may/should also accept lowercase when reading” rather than “accept case insensitively” so as to avoid explicitly blessing ILlumInA and the like. (Maybe the people who checked their PL archives could comment on whether they found any such abominations!)

would also note that the current specification does not require that the technology, as described by the @SQ-PL alone, uniquely identify an error model. So it is possible that a different PM for the same PL could have a different error model, in which case tools might make invalid assumptions anyway.

Indeed. In general it is beyond the SAM spec's abilities to adjudicate on such matters. But there is unfinished business here in allowing such things to be specified; see e.g. #454 (comment).

jkbonfield · 2022-10-31T15:12:15Z

in order to have their bam/cram files be completely spec-compliant and work with the various libraries they've used, they now need to update htslib from version X to Y, htsjdk from V to W, noodles from T to U... If the only thing that changed in each of those libraries was the acceptable values of the PL field, then re-validation could be relatively straightforward, but if between versions V and W of htsjdk there were more extensive changes in behavior, then that process would be harder.

This is why the libraries probably shouldn't be strictly validating by default. It can be a useful feature to enable in order for people to do data validation, but it's not the file format implementations which are processing this data. They present it to a tool and really it's up to that tool to decide how to interpret the data given to it. In this scenario, you don't update your version of e.g. htsjdk in order to get one that accepts PL:MyNewInstrument, but in the process get a bunch of other changes. Instead in a perfect world you carry on using the same version you validated everything against before and if a validation failure triggers then you ignore that specific known issue and continue as before. It needs a tool chain to be aware of future changes, but that's reasonable.

Agreed on error models and PL alone being insufficient.

One very recent example of this is PacBio announce two new machines. Revio is essentially Sequel III, with very comparable error models, and SBB which is a totally different underlying technology. Theoretically the former is addressed using the platform model, while the latter would be best dealt with by adding a new platform (or revised as in John's A:B nomenclature).

I think it may be useful for someone somewhere to maintain a list of known models (not necessarily hts-specs, but we should link to something if not). The ENA, SRA, etc dictionary tables used for their databases are probably the sensible starting point, assuming they even match. Just knowing what strings to look for is a start, but as far as I know this is totally undocumented and only accessable by people internal to the archives, which is far from desireable.

daviesrob · 2022-10-31T15:33:19Z

ENA's list can be found in https://ftp.ebi.ac.uk/pub/databases/ena/doc/xsd/sra_1_5/SRA.common.xsd (look for PlatformType, and scroll down for the various instrument models).

jkbonfield · 2022-10-31T16:16:15Z

Or after some hunting, I found this: https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#platform

So not totally undocumented :)

jkbonfield · 2022-10-31T16:20:17Z

Also, prod to @raskoleinonen - EBI's list of accepted machine types needs to be updated to include the two new platforms SAM supports. Ideally using the same identifiers.

jkbonfield · 2022-11-08T17:00:53Z

So update from meeting.

We're in agreement basically. We should continue to list them in uppercase and advocate for this usage, but will suggest that implementations should be case insensitive due to the existance of data using lowercase.

I'll produce a PR for this.

This is not changing what is valid / permitted, and indeed this hopefully clarifies it further. However the practicality of dealing with wide-spread non-compliant data with lowercase PL values is that tools may wish to be lenient and use case-insensitive matching. Fixes samtools#679

raskoleinonen · 2022-11-21T08:54:40Z

Thank you @jkbonfield. ENA will add support for ELEMENT (Element AVITI) and ULTIMA (UG 100).

This is not changing what is valid / permitted, and indeed this hopefully clarifies it further. However the practicality of dealing with wide-spread non-compliant data with lowercase PL values is that tools may wish to be lenient and use case-insensitive matching. Fixes samtools#679

This is not changing what is valid / permitted, and indeed this hopefully clarifies it further. However the practicality of dealing with wide-spread non-compliant data with lowercase PL values is that tools may wish to be lenient and use case-insensitive matching. Also removes test/sam/failed/hdr.RG6.sam due to explicitly testing against the use of lower-case PL fields. While strictly not conforming, it's overly harsh if we are advocating a more spec-tolerant testing regime for PL. Fixes samtools#679

This is not changing what is valid / permitted, and indeed this hopefully clarifies it further. However the practicality of dealing with wide-spread non-compliant data with lowercase PL values is that tools may wish to be lenient and use case-insensitive matching. Also removes test/sam/failed/hdr.RG6.sam due to explicitly testing against the use of lower-case PL fields. While strictly not conforming, it's overly harsh if we are advocating a more spec-tolerant testing regime for PL. Fixes #679.

zaeleus · 2024-11-05T21:46:29Z

In summary, 94500cf now recommends platform values to be in written in uppercase. They can be parsed as either uppercase or lowercase but not mixed case, e.g., ILLUMINA is always valid, illumina is possibly valid, and Illumina is either invalid or undefined.

jkbonfield mentioned this issue Oct 24, 2022

RG PL field value is not spec compliant Illumina/DRAGMAP#46

Open

jmarshall added the sam label Oct 24, 2022

claymcleod mentioned this issue Oct 29, 2022

sam/alignment_reader: Changes read_alignment_header to return String zaeleus/noodles#119

Closed

jkbonfield mentioned this issue Nov 9, 2022

SAM: add a sentence on case-insensitivity of RG PL (PR #684) #684

Merged

claymcleod mentioned this issue Mar 28, 2023

revise: remove custom header checks stjude-rust-labs/ngs#32

Closed

brainstorm mentioned this issue Apr 28, 2023

sam: Add PL value "MGIG400" as seen in the wild #718

Closed

zaeleus mentioned this issue May 29, 2023

Read group platform unnecessarily strict zaeleus/noodles#171

Closed

zaeleus mentioned this issue Jan 25, 2024

Header parsing too strict? zaeleus/noodles#111

Closed

jmarshall closed this as completed in #684 Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sam: Clarify case-sensitivity of read group platform values #679

sam: Clarify case-sensitivity of read group platform values #679

zaeleus commented Oct 23, 2022

claymcleod commented Oct 23, 2022 •

edited

Loading

jkbonfield commented Oct 24, 2022

ohofmann commented Oct 24, 2022 •

edited

Loading

jkbonfield commented Oct 24, 2022

jmarshall commented Oct 24, 2022 •

edited

Loading

tfenne commented Oct 24, 2022

jmarshall commented Oct 24, 2022 •

edited

Loading

claymcleod commented Oct 24, 2022 •

edited

Loading

jmarshall commented Oct 24, 2022

claymcleod commented Oct 24, 2022 •

edited

Loading

tfenne commented Oct 24, 2022

jmarshall commented Oct 24, 2022

ohofmann commented Oct 25, 2022 •

edited

Loading

claymcleod commented Oct 25, 2022

jkbonfield commented Oct 25, 2022 •

edited

Loading

claymcleod commented Oct 26, 2022

bpow commented Oct 28, 2022

jkbonfield commented Oct 31, 2022 •

edited

Loading

jmarshall commented Oct 31, 2022 •

edited

Loading

bpow commented Oct 31, 2022

jmarshall commented Oct 31, 2022

jkbonfield commented Oct 31, 2022

daviesrob commented Oct 31, 2022

jkbonfield commented Oct 31, 2022

jkbonfield commented Oct 31, 2022

jkbonfield commented Nov 8, 2022 •

edited

Loading

raskoleinonen commented Nov 21, 2022

zaeleus commented Nov 5, 2024

sam: Clarify case-sensitivity of read group platform values #679

sam: Clarify case-sensitivity of read group platform values #679

Comments

zaeleus commented Oct 23, 2022

claymcleod commented Oct 23, 2022 • edited Loading

jkbonfield commented Oct 24, 2022

ohofmann commented Oct 24, 2022 • edited Loading

jkbonfield commented Oct 24, 2022

jmarshall commented Oct 24, 2022 • edited Loading

tfenne commented Oct 24, 2022

jmarshall commented Oct 24, 2022 • edited Loading

claymcleod commented Oct 24, 2022 • edited Loading

jmarshall commented Oct 24, 2022

claymcleod commented Oct 24, 2022 • edited Loading

tfenne commented Oct 24, 2022

jmarshall commented Oct 24, 2022

ohofmann commented Oct 25, 2022 • edited Loading

claymcleod commented Oct 25, 2022

jkbonfield commented Oct 25, 2022 • edited Loading

claymcleod commented Oct 26, 2022

bpow commented Oct 28, 2022

jkbonfield commented Oct 31, 2022 • edited Loading

jmarshall commented Oct 31, 2022 • edited Loading

bpow commented Oct 31, 2022

jmarshall commented Oct 31, 2022

jkbonfield commented Oct 31, 2022

daviesrob commented Oct 31, 2022

jkbonfield commented Oct 31, 2022

jkbonfield commented Oct 31, 2022

jkbonfield commented Nov 8, 2022 • edited Loading

raskoleinonen commented Nov 21, 2022

zaeleus commented Nov 5, 2024

claymcleod commented Oct 23, 2022 •

edited

Loading

ohofmann commented Oct 24, 2022 •

edited

Loading

jmarshall commented Oct 24, 2022 •

edited

Loading

jmarshall commented Oct 24, 2022 •

edited

Loading

claymcleod commented Oct 24, 2022 •

edited

Loading

claymcleod commented Oct 24, 2022 •

edited

Loading

ohofmann commented Oct 25, 2022 •

edited

Loading

jkbonfield commented Oct 25, 2022 •

edited

Loading

jkbonfield commented Oct 31, 2022 •

edited

Loading

jmarshall commented Oct 31, 2022 •

edited

Loading

jkbonfield commented Nov 8, 2022 •

edited

Loading