Skip to content

Commit

Permalink
Minimal support for CRAM files with missing @rg headers.
Browse files Browse the repository at this point in the history
The SAMtags spec states that RG:Z: lines should point match an RG ID
if RG headers are present, but doesn't explicitly *require* them to be
present.  The SAM spec itself recommends that RG headers are present.
Sadly this means CRAM may need to cope with this semantically
inconsistent edge case.

Given CRAM stores RG as an integer data series as an index into the
corresponding header, in much the same way that BAM stores chromosomes
as numeric "tid" values, this makes things challenging.  However CRAM
can also store text tags, so it's possible to round-trip with missing
headers by claiming RG is -1 (unspecified) and then adding a verbatim
RG:Z string tag.  This is perhaps a bit of a CRAM spec loop hole so
it's questionable if this is the correct solution.

This works and is decodable by both htslib and htsjdk, but it'll break
things like cram_transcode_rg as used by samtools cat.  I think this
is a pretty unlikely combination of events.  Note picard's
SamFormatConverter also drops these RG fields.

This code also whinges, *once for each and every problematic alignment
record*, when RG is absent in the SAM header.  It's considerably more
work to track which ones we've warned about before and to track all
that meta-data across threads in a robust manner, plus this really
could be considered to be a poor SAM file.  Were it not for the SAM
spec explicitly permitting such things (even if recommending against
it) I'd reject it outright.  Instead brow-beating the SAM creators
into fixing the headers could be considered to be a positive outcome.

Fixes samtools#1479
  • Loading branch information
jkbonfield committed Jul 21, 2022
1 parent b7addd3 commit 31dceef
Showing 1 changed file with 24 additions and 17 deletions.
41 changes: 24 additions & 17 deletions cram/cram_encode.c
Original file line number Diff line number Diff line change
Expand Up @@ -2534,15 +2534,17 @@ static int cram_add_insertion(cram_container *c, cram_slice *s, cram_record *r,
* Encodes auxiliary data. Largely duplicated from above, but done so to
* keep it simple and avoid a myriad of version ifs.
*
* Returns the read-group parsed out of the BAM aux fields on success
* Returns the RG header line pointed to by the BAM aux fields on success,
* NULL on failure or no rg present, also sets "*err" to non-zero
*/
static char *cram_encode_aux(cram_fd *fd, bam_seq_t *b, cram_container *c,
cram_slice *s, cram_record *cr,
int verbatim_NM, int verbatim_MD,
int NM, kstring_t *MD, int cf_tag,
int *err) {
static sam_hrec_rg_t *cram_encode_aux(cram_fd *fd, bam_seq_t *b,
cram_container *c,
cram_slice *s, cram_record *cr,
int verbatim_NM, int verbatim_MD,
int NM, kstring_t *MD, int cf_tag,
int *err) {
char *aux, *orig, *rg = NULL;
sam_hrec_rg_t *brg = NULL;
int aux_size = bam_get_l_aux(b);
cram_block *td_b = c->comp_hdr->TD_blk;
int TD_blk_size = BLOCK_SIZE(td_b), new;
Expand Down Expand Up @@ -2578,10 +2580,16 @@ static char *cram_encode_aux(cram_fd *fd, bam_seq_t *b, cram_container *c,
// RG:Z
if (aux[0] == 'R' && aux[1] == 'G' && aux[2] == 'Z') {
rg = &aux[3];
while (*aux++);
if (CRAM_MAJOR_VERS(fd->version) >= 4)
BLOCK_APPEND(td_b, "RG*", 3);
continue;
brg = sam_hrecs_find_rg(fd->header->hrecs, rg);
if (brg) {
while (*aux++);
if (CRAM_MAJOR_VERS(fd->version) >= 4)
BLOCK_APPEND(td_b, "RG*", 3);
continue;
} else {
// RG:Z tag will be stored verbatim
hts_log_warning("Missing @RG header for RG \"%s\"", rg);
}
}

// MD:Z
Expand Down Expand Up @@ -2938,8 +2946,7 @@ static char *cram_encode_aux(cram_fd *fd, bam_seq_t *b, cram_container *c,

if (err) *err = 0;

// rg from within bam_aux, not rg from our aux copy.
return rg ? (char *)bam_aux(b) + (rg - orig) : NULL;
return brg;

err:
block_err:
Expand Down Expand Up @@ -3426,15 +3433,15 @@ static int process_one_read(cram_fd *fd, cram_container *c,

cr->ntags = 0; //cram_stats_add(c->stats[DS_TC], cr->ntags);
int err = 0;
rg = cram_encode_aux(fd, b, c, s, cr, verbatim_NM, verbatim_MD, NM, MD,
cf_tag, &err);
sam_hrec_rg_t *brg =
cram_encode_aux(fd, b, c, s, cr, verbatim_NM, verbatim_MD, NM, MD,
cf_tag, &err);
if (err)
goto block_err;

/* Read group, identified earlier */
if (rg) {
sam_hrec_rg_t *brg = sam_hrecs_find_rg(fd->header->hrecs, rg);
cr->rg = brg ? brg->id : -1;
if (brg) {
cr->rg = brg->id;
} else if (CRAM_MAJOR_VERS(fd->version) == 1) {
sam_hrec_rg_t *brg = sam_hrecs_find_rg(fd->header->hrecs, "UNKNOWN");
if (!brg) goto block_err;
Expand Down

0 comments on commit 31dceef

Please sign in to comment.