Skip to content

Commit

Permalink
Fix Cram compression container substitution matrix generation.
Browse files Browse the repository at this point in the history
The matrix is meant to turn ref + code into seq. Eg ref C and BS code
1 may mean seq is T.  Instead of writing the codes for the non-ref
bases in order ACGTN, we wrote the Nth base number in numerical order
of the codes.

For ref C + BS code we have 4 alternatives A,G,T and N (C->C is absent
as it's not a substitution).  So e.g. we may have C: 0=G 1=T 2=A 3=N.
We were writing GTAN as 01 10 00 11, from A(c)GTN.  We should have
been writing the code numbers in A(c)GTN order hence 10 00 01 11.

However, we don't actually change or optimise this in htslib, so it's
hard coded in cram_structs.h.

    #define CRAM_SUBST_MATRIX "CGTNAGTNACTNACGNACGT"

Reformatting it's:

    A: CGTN
    C:A GTN
    G:AC TN
    T:ACG N
    N:ACGT

That basically boils down to 0123 (00 01 10 11 or 0x3b) for all rows.
The incorrect order of writing the table made no difference as every
row is sorted by both code 0,1,2,3 and nucleotide A,C,G,T,N.
  • Loading branch information
jkbonfield committed Feb 13, 2023
1 parent 3fe2a59 commit 01dcac3
Showing 1 changed file with 28 additions and 21 deletions.
49 changes: 28 additions & 21 deletions cram/cram_encode.c
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@ static int process_one_read(cram_fd *fd, cram_container *c,
static int sub_idx(char *key, char val) {
int i;

for (i = 0; *key && *key++ != val; i++);
for (i = 0; i < 4 && *key++ != val; i++);
return i;
}

Expand Down Expand Up @@ -203,31 +203,38 @@ cram_block *cram_encode_compression_header(cram_fd *fd, cram_container *c,

case CRAM_KEY('S','M'): {
char smat[5], *mp = smat;
// Output format is for order ACGTN (minus ref base)
// to store the code value 0-3 for each symbol.
//
// Note this is different to storing the symbols in order
// that the codes occur from 0-3, which is what we used to
// do. (It didn't matter as we always had a fixed table in
// the order.)
*mp++ =
(sub_idx("CGTN", h->substitution_matrix[0][0]) << 6) |
(sub_idx("CGTN", h->substitution_matrix[0][1]) << 4) |
(sub_idx("CGTN", h->substitution_matrix[0][2]) << 2) |
(sub_idx("CGTN", h->substitution_matrix[0][3]) << 0);
(sub_idx(h->substitution_matrix[0], 'C') << 6) |
(sub_idx(h->substitution_matrix[0], 'G') << 4) |
(sub_idx(h->substitution_matrix[0], 'T') << 2) |
(sub_idx(h->substitution_matrix[0], 'N') << 0);
*mp++ =
(sub_idx("AGTN", h->substitution_matrix[1][0]) << 6) |
(sub_idx("AGTN", h->substitution_matrix[1][1]) << 4) |
(sub_idx("AGTN", h->substitution_matrix[1][2]) << 2) |
(sub_idx("AGTN", h->substitution_matrix[1][3]) << 0);
(sub_idx(h->substitution_matrix[1], 'A') << 6) |
(sub_idx(h->substitution_matrix[1], 'G') << 4) |
(sub_idx(h->substitution_matrix[1], 'T') << 2) |
(sub_idx(h->substitution_matrix[1], 'N') << 0);
*mp++ =
(sub_idx("ACTN", h->substitution_matrix[2][0]) << 6) |
(sub_idx("ACTN", h->substitution_matrix[2][1]) << 4) |
(sub_idx("ACTN", h->substitution_matrix[2][2]) << 2) |
(sub_idx("ACTN", h->substitution_matrix[2][3]) << 0);
(sub_idx(h->substitution_matrix[2], 'A') << 6) |
(sub_idx(h->substitution_matrix[2], 'C') << 4) |
(sub_idx(h->substitution_matrix[2], 'T') << 2) |
(sub_idx(h->substitution_matrix[2], 'N') << 0);
*mp++ =
(sub_idx("ACGN", h->substitution_matrix[3][0]) << 6) |
(sub_idx("ACGN", h->substitution_matrix[3][1]) << 4) |
(sub_idx("ACGN", h->substitution_matrix[3][2]) << 2) |
(sub_idx("ACGN", h->substitution_matrix[3][3]) << 0);
(sub_idx(h->substitution_matrix[3], 'A') << 6) |
(sub_idx(h->substitution_matrix[3], 'C') << 4) |
(sub_idx(h->substitution_matrix[3], 'G') << 2) |
(sub_idx(h->substitution_matrix[3], 'N') << 0);
*mp++ =
(sub_idx("ACGT", h->substitution_matrix[4][0]) << 6) |
(sub_idx("ACGT", h->substitution_matrix[4][1]) << 4) |
(sub_idx("ACGT", h->substitution_matrix[4][2]) << 2) |
(sub_idx("ACGT", h->substitution_matrix[4][3]) << 0);
(sub_idx(h->substitution_matrix[4], 'A') << 6) |
(sub_idx(h->substitution_matrix[4], 'C') << 4) |
(sub_idx(h->substitution_matrix[4], 'G') << 2) |
(sub_idx(h->substitution_matrix[4], 'T') << 0);
BLOCK_APPEND(map, smat, 5);
break;
}
Expand Down

0 comments on commit 01dcac3

Please sign in to comment.