Fix CRAM compression container substitution matrix generation.
The matrix is meant to turn ref + code into seq. E.g. ref C and BS code 1 may mean seq is T.

Instead of writing the codes for the non-ref bases in order ACGTN, we wrote the Nth base number in numerical order of the codes.

For ref C + BS code we have 4 alternatives: A, G, T and N (C->C is absent as it's not a substitution). So e.g. we may have C: 0=G 1=T 2=A 3=N. We were writing GTAN as 01 10 00 11, i.e. the position of each base within A(c)GTN. We should have been writing the code numbers in A(c)GTN order, hence 10 00 01 11.

However, we don't actually change or optimise this in htslib, so it's hard coded in cram_structs.h:

    #define CRAM_SUBST_MATRIX "CGTNAGTNACTNACGNACGT"

Reformatted, it's:

    A:  CGTN
    C: A GTN
    G: AC TN
    T: ACG N
    N: ACGT

That basically boils down to 0123 (00 01 10 11, or 0x1b) for all rows. The incorrect order of writing the table made no difference as every row is sorted by both code 0,1,2,3 and nucleotide A,C,G,T,N.