RG tag gets lost in cram format #1479
Comments
Are the RG tags in your SAM headers? It's possible this is a corner case (I haven't explored it yet) that we didn't think of. In CRAM RG tags are indices to the |
They are no
The RG value appears to be appearing correctly if I print it around within the aforementioned codeblock, so I guess since it doesn't exist in the index (dynamically assigned?) it doesn't get written. It also appears the |
This does sound like a I checked the specifications and this seems to be a weird corner case. The SAM spec states in the recommendations section:
The SAMtags document however states:
I.e. if the An "un-nice" way would be to skip the RG CRAM data series and just add a verbatim "RG:Z:" string to the aux records, and it'll likely just work as intended. However it may break some things which make the assumption that RG is held in its own data series in CRAM, so I don't think this is a good solution. Rather we should just encourage better practices, which in our case means improving samtools merge. |
I can confirm a noddy SAM file with an RG line and no corresponding It's trivial to make it warn, but then it warns a lot in the case of missing headers (once per record). Keeping track of which warnings we've emitted is complex given it's multi-threaded, so I may just go with spammage as a way to repeatedly persuade people to fix their headers! :-) I have a simple fix which can also store them verbatim (with additional warning spammage too). I'm still unsure if this is wise or not, but it does mostly work. |
The SAMtags spec states that RG:Z: lines should match an RG ID if RG headers are present, but doesn't explicitly *require* them to be present. The SAM spec itself recommends that RG headers are present. Sadly this means CRAM may need to cope with this semantically inconsistent edge case.

Given CRAM stores RG as an integer data series as an index into the corresponding header, in much the same way that BAM stores chromosomes as numeric "tid" values, this makes things challenging. However CRAM can also store text tags, so it's possible to round-trip with missing headers by claiming RG is -1 (unspecified) and then adding a verbatim RG:Z string tag. This is perhaps a bit of a CRAM spec loophole, so it's questionable if this is the correct solution.

This works and is decodable by both htslib and htsjdk, but it'll break things like cram_transcode_rg as used by samtools cat. I think this is a pretty unlikely combination of events. Note picard's SamFormatConverter also drops these RG fields.

This code also whinges, *once for each and every problematic alignment record*, when RG is absent in the SAM header. It's considerably more work to track which ones we've warned about before and to track all that metadata across threads in a robust manner, plus this really could be considered to be a poor SAM file. Were it not for the SAM spec explicitly permitting such things (even if recommending against it) I'd reject it outright. Instead brow-beating the SAM creators into fixing the headers could be considered to be a positive outcome.

Fixes samtools#1479
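To make the fallback described in this commit message concrete, here is a minimal Python sketch of the idea. It is not the htslib code: the function names `encode_rg` and `decode_rg` and the plain-list representation of the header are assumptions made purely for illustration. A read group listed in the header is stored as its integer index; one that is missing from the header is stored as -1 plus a verbatim RG:Z: text tag.

```python
from typing import List, Optional, Tuple


def encode_rg(header_rg_ids: List[str], rg_value: Optional[str]) -> Tuple[int, List[str]]:
    """Hypothetical encoder: return (rg_index, verbatim_aux_tags) for one record."""
    if rg_value is None:
        # Record carries no RG tag at all.
        return -1, []
    if rg_value in header_rg_ids:
        # Normal case: RG is stored as an index into the header's @RG lines.
        return header_rg_ids.index(rg_value), []
    # Edge case from this issue: RG tag present but no matching header line.
    # Claim "unspecified" (-1) and keep the value as an ordinary text tag.
    return -1, [f"RG:Z:{rg_value}"]


def decode_rg(header_rg_ids: List[str], rg_index: int, aux_tags: List[str]) -> Optional[str]:
    """Hypothetical decoder: recover the RG value written by encode_rg."""
    if rg_index >= 0:
        return header_rg_ids[rg_index]
    for tag in aux_tags:
        if tag.startswith("RG:Z:"):
            return tag[len("RG:Z:"):]
    return None
```

With this scheme a value such as `cell1` that has no header entry still survives an encode/decode round trip, which is the behaviour the commit is aiming for, at the cost of storing the string once per record.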
It looks like the @ASLeonard can you confirm that your input files didn't have any |
Correct, there are no RG header tags in the files being merged, only per alignment RG:Z:... tags. |
How were the RG tags originally created on those files? It rather feels like "garbage in garbage out", although that's perhaps a little harsh! (Sorry) It's hard though for merge to be populating RG headers on merged output when they don't exist in the incoming headers either. I don't know if we even have a mechanism to scan through an entire file finding RG tags and populating the header. So really fixing whatever produced the initial tags without adding appropriate header entries is the best fix. Edit: caveat, it could of course be populating a list of RG tags found during merging and write out a header as a separate file at the end, with a note that the user may wish to use the reheader command before streaming into their next part of the pipeline. That's a different feature though I think. |
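As a rough illustration of the "scan for RG tags and populate the header" idea floated above, the following Python sketch works on SAM text lines and reports @RG header lines for any RG:Z: values that have no header entry. It is only a sketch under stated assumptions: the function name `missing_rg_header_lines` is hypothetical, it is not an existing samtools feature, and it emits only the ID field.

```python
from typing import Iterable, List


def missing_rg_header_lines(sam_lines: Iterable[str]) -> List[str]:
    """Return '@RG\\tID:<id>' lines for RG:Z: tags that lack a header entry."""
    declared = set()       # IDs already present in @RG header lines
    seen: List[str] = []   # RG values found on records, in first-seen order
    for line in sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            # Header line; only @RG lines matter here.
            if line.startswith("@RG\t"):
                for field in line.split("\t")[1:]:
                    if field.startswith("ID:"):
                        declared.add(field[3:])
            continue
        # Alignment record: optional tags start after the 11 mandatory fields.
        for field in line.split("\t")[11:]:
            if field.startswith("RG:Z:"):
                rg = field[5:]
                if rg not in seen:
                    seen.append(rg)
    return [f"@RG\tID:{rg}" for rg in seen if rg not in declared]
```

The returned lines could be appended to a copy of the existing header and applied with the reheader command mentioned above; whether that is preferable to fixing the tool that produced the tags is a separate question.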
It was generated with the There are no RG tags in the individual files, but they appear in the merged -o bam file. They don't appear in the merged -o cram file. So samtools itself is adding RG tags to alignments but not adding a header line |
Yes, |
Oh my apologies, I forgot what -r did; it's creating the tags itself, so yes definitely it is the thing at fault here. Looks like @daviesrob is already on the case. (Thanks) |
samtools/samtools#1683 is my fix for the |
I was initially thinking this was an issue with samtools merge -r as I could successfully merge several bam files into one bam file, with the respective RG:Z:<name of cell> appearing fine in the merged bam file, but they wouldn't appear if the final output was cram. Likewise converting with samtools view also lost the RG tags (and never were restored converting back, so truly lost).
I believe I narrowed down the cause to here from the cram encoding
But even trying version 3.1 and version 4.0 didn't record the RG tag. I commented these lines out, and then it seemingly worked fine with my final cram file correctly having the RG tags. So unless I am missing something, this appears to be a safe/legitimate option (if not highly redundant) until RG tags are correctly stored in cram.