Commit
Minimal support for CRAM files with missing @RG headers.
The SAMtags spec states that RG:Z: lines should match an RG ID if RG headers are present, but doesn't explicitly *require* them to be present. The SAM spec itself recommends that RG headers are present. Sadly this means CRAM may need to cope with this semantically inconsistent edge case.

Given that CRAM stores RG as an integer data series acting as an index into the corresponding header, in much the same way that BAM stores chromosomes as numeric "tid" values, this makes things challenging. However CRAM can also store text tags, so it's possible to round-trip with missing headers by claiming RG is -1 (unspecified) and then adding a verbatim RG:Z string tag. This is perhaps a bit of a CRAM spec loophole, so it's questionable whether this is the correct solution.

This works and is decodable by both htslib and htsjdk, but it'll break things like cram_transcode_rg as used by samtools cat. I think this is a pretty unlikely combination of events. Note picard's SamFormatConverter also drops these RG fields.

This code also whinges, *once for each and every problematic alignment record*, when RG is absent in the SAM header. It's considerably more work to track which ones we've warned about before and to track all that meta-data across threads in a robust manner, plus this really could be considered to be a poor SAM file. Were it not for the SAM spec explicitly permitting such things (even if recommending against it) I'd reject it outright. Instead, brow-beating the SAM creators into fixing the headers could be considered to be a positive outcome.

Fixes samtools#1479
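As a rough illustration of the fallback described above, here is a minimal, self-contained sketch in C. It is not the htslib implementation: the `rg_encoding` struct and the `find_rg`, `encode_rg` and `decode_rg` functions are hypothetical names invented for this example, and the real CRAM encoder works on data series and tag blocks rather than a simple struct. It only shows the idea of storing the header index when one exists, and otherwise storing -1 plus the verbatim RG:Z text, warning once per record.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the read-group part of a record. */
typedef struct {
    int  rg_index;     /* index into the header's @RG lines, or -1 (unspecified) */
    char rg_text[64];  /* verbatim RG:Z value, only filled when rg_index == -1 */
} rg_encoding;

/* Return the position of id among the header's @RG IDs, or -1 if absent. */
int find_rg(const char *const *hdr_ids, int n_ids, const char *id) {
    for (int i = 0; i < n_ids; i++)
        if (strcmp(hdr_ids[i], id) == 0)
            return i;
    return -1;
}

/* Encode a record's RG:Z value (id may be NULL when the record has no RG). */
rg_encoding encode_rg(const char *const *hdr_ids, int n_ids, const char *id) {
    rg_encoding e = { -1, "" };
    if (!id)
        return e;  /* no RG tag at all: plain "unspecified" */

    e.rg_index = find_rg(hdr_ids, n_ids, id);
    if (e.rg_index < 0) {
        /* No matching @RG header: warn (once per record) and keep the
         * text as an ordinary string tag so it survives a round trip. */
        fprintf(stderr, "warning: RG:Z:%s has no matching @RG header\n", id);
        snprintf(e.rg_text, sizeof e.rg_text, "%s", id);
    }
    return e;
}

/* Recover the RG ID: prefer the header index, else the verbatim text. */
const char *decode_rg(const char *const *hdr_ids, int n_ids,
                      const rg_encoding *e) {
    if (e->rg_index >= 0 && e->rg_index < n_ids)
        return hdr_ids[e->rg_index];
    return e->rg_text[0] ? e->rg_text : NULL;
}
```

In this sketch a record whose RG matches a header line is stored as an index only, while an orphaned RG value comes back unchanged from `decode_rg` even though no header entry exists for it.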