Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RG tag gets lost in cram format #1479

Closed
ASLeonard opened this issue Jul 21, 2022 · 11 comments · Fixed by #1480
Closed

RG tag gets lost in cram format #1479

ASLeonard opened this issue Jul 21, 2022 · 11 comments · Fixed by #1480

Comments

@ASLeonard
Copy link

I was initially thinking this was an issue with samtools merge -r as I could successfully merge several bam files into one bam file, with the respective RG:Z:<name of cell> appearing fine in the merged bam file, but they wouldn't appear if the final output was cram. Likewise converting with samtools view also lost the RG tags (and never were restored converting back, so truly lost).

I believe I narrowed down the cause to here from the cram encoding

       // RG:Z
        if (aux[0] == 'R' && aux[1] == 'G' && aux[2] == 'Z') {
            rg = &aux[3];
            while (*aux++);
            if (CRAM_MAJOR_VERS(fd->version) >= 4)
                BLOCK_APPEND(td_b, "RG*", 3);
            continue;
        }

But even trying version 3.1 and version 4.0 didn't record the RG tag. I commented these lines out, and then it seemingly worked fine with my final cram file correctly having the RG tags. So unless I am missing something, this appears to be a safe/legitimate option (if not highly redundant) until RG tags are correctly stored in cram.

  • Alex
@jkbonfield
Copy link
Contributor

Are the RG tags in your SAM headers?

It's possible this is a corner case (I haven't explored it yet) that we didn't think of. In CRAM RG tags are indices to the @RG headers, in the same way that BAM uses indices to the @SQ for chromosomes.

@ASLeonard
Copy link
Author

They are no @RG in the headers, they are only coming from the -r flag in merge which

The tag value is inferred from file names.

The RG value appears to be appearing correctly if I print it around within the aforementioned codeblock, so I guess since it doesn't exist in the index (dynamically assigned?) it doesn't get written.

It also appears the @RG is not written in the header by samtools merge -r when the output is bam, so maybe this is a samtools merge issue after all

@jkbonfield
Copy link
Contributor

This does sound like a samtools merge bug then. It ought to be updating the headers.

I checked the specifications and this seems to be a weird corner case.

The SAM spec states in the recommendations section:

When a RG tag appears anywhere in the alignment section, there should be a single corresponding @RG line with matching ID tag in the header.

The SAMtags document however states:

RG:Z:readgroup
The read group to which the read belongs. If @RG headers are present, then readgroup must match the RG-ID field of one of the headers.

I.e. if the @RG headers exist at all then they must be complete, but they are permitted to be totally absent (although this is not recommended practice). This is probably a nuance which CRAM has never dealt with correctly since day one. I don't see a nice way to fix it, but it probably should warn about the behaviour.

An "un-nice" way would be to skip the RG CRAM data series and just add a verbatim "RG:Z:" string to the aux records, and it'll likely just work as intended. However it may break some things which make the assumption that RG is held in its own data series in CRAM, so I don't think this is a good solution. Rather we should just encourage better practices, which in our case means improving samtools merge.

@jkbonfield
Copy link
Contributor

I can confirm a noddy SAM file with an RG line and no corresponding @RG header will silently lose the RG tag.

It's trivial to make it warn, but then it warns a lot in the case of missing headers (once per record). Keeping track of which warnings we've emitted is complex given it's multi-threaded, so many just go with spammage as a way to repeatedly persuade people to fix their headers! :-)

I have a simple fix which can also store them verbatim (with additional warning spammage too). I'm still unsure if this is wise or not, but it does work mostly.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jul 21, 2022
The SAMtags spec states that RG:Z: lines should point match an RG ID
if RG headers are present, but doesn't explicitly *require* them to be
present.  The SAM spec itself recommends that RG headers are present.
Sadly this means CRAM may need to cope with this semantically
inconsistent edge case.

Given CRAM stores RG as an integer data series as an index into the
corresponding header, in much the same way that BAM stores chromosomes
as numeric "tid" values, this makes things challenging.  However CRAM
can also store text tags, so it's possible to round-trip with missing
headers by claiming RG is -1 (unspecified) and then adding a verbatim
RG:Z string tag.  This is perhaps a bit of a CRAM spec loop hole so
it's questionable if this is the correct solution.

This works and is decodable by both htslib and htsjdk, but it'll break
things like cram_transcode_rg as used by samtools cat.  I think this
is a pretty unlikely combination of events.  Note picard's
SamFormatConverter also drops these RG fields.

This code also whinges, *once for each and every problematic alignment
record*, when RG is absent in the SAM header.  It's considerably more
work to track which ones we've warned about before and to track all
that meta-data across threads in a robust manner, plus this really
could be considered to be a poor SAM file.  Were it not for the SAM
spec explicitly permitting such things (even if recommending against
it) I'd reject it outright.  Instead brow-beating the SAM creators
into fixing the headers could be considered to be a positive outcome.

Fixes samtools#1479
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jul 21, 2022
The SAMtags spec states that RG:Z: lines should point match an RG ID
if RG headers are present, but doesn't explicitly *require* them to be
present.  The SAM spec itself recommends that RG headers are present.
Sadly this means CRAM may need to cope with this semantically
inconsistent edge case.

Given CRAM stores RG as an integer data series as an index into the
corresponding header, in much the same way that BAM stores chromosomes
as numeric "tid" values, this makes things challenging.  However CRAM
can also store text tags, so it's possible to round-trip with missing
headers by claiming RG is -1 (unspecified) and then adding a verbatim
RG:Z string tag.  This is perhaps a bit of a CRAM spec loop hole so
it's questionable if this is the correct solution.

This works and is decodable by both htslib and htsjdk, but it'll break
things like cram_transcode_rg as used by samtools cat.  I think this
is a pretty unlikely combination of events.  Note picard's
SamFormatConverter also drops these RG fields.

This code also whinges, *once for each and every problematic alignment
record*, when RG is absent in the SAM header.  It's considerably more
work to track which ones we've warned about before and to track all
that meta-data across threads in a robust manner, plus this really
could be considered to be a poor SAM file.  Were it not for the SAM
spec explicitly permitting such things (even if recommending against
it) I'd reject it outright.  Instead brow-beating the SAM creators
into fixing the headers could be considered to be a positive outcome.

Fixes samtools#1479
@daviesrob
Copy link
Member

It looks like the merge problem could be down to an incorrect way of checking for an empty klist here. It shouldn't be too difficult to fix.

@ASLeonard can you confirm that your input files didn't have any RG headers or tags, so merge would be adding them rather than replacing?

@ASLeonard
Copy link
Author

Correct, there are no RG header tags in the files being merged, only per alignment RG:Z:... tags.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 21, 2022

How were the RG tags originally created on those files? It rather feels like "garbage in garbage out", although that's perhaps a little harsh! (Sorry)

It's hard though for merge to be populating RG headers on merged output when they don't exist in the incoming headers either. I don't know if we even have a mechanism to scan through an entire file finding RG tags and populating the header.

So really fixing whatever produced the initial tags without adding appropriate header entries is the best fix.

Edit: caveat, it could of course be populating a list of RG tags found during merging and write out a header as a separate file at the end, with a note that the user may wish to use the reheader command before streaming into their next part of the pipeline. That's a different feature though I think.

@ASLeonard
Copy link
Author

It was generated with the -r flag of samtools.

There are no RG tags in the individual files, but they appear in the merged -o bam file. They don't appear in the merged -o cram file. So samtools itself is adding RG tags to alignments but not adding a header line

@daviesrob
Copy link
Member

Yes, merge -r definitely has a problem, no matter what other tools are doing.

@jkbonfield
Copy link
Contributor

Oh my apologies, I forgot what -r did; it's creating the tags itself, so yes definitely it is the thing at fault here. Looks like @daviesrob is already on the case. (Thanks)

@daviesrob
Copy link
Member

samtools/samtools#1683 is my fix for the samtools merge bit of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants