-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Cram compression container substitution matrix generation. #1562
Conversation
And now we can do e.g. this change to the constant (not even optimised per data set) table, which prefers to keep AT and GC mutations with the same code and is symmetric.
Tested with HiSeqX, NovaSeq, ONT (R10.4), Revio, Ultima and Element this gives improvements to BS compression ratio across the board. It varied between 2.4% and 10.7%. Not a PR yet as I don't know what the general best strategy is and maybe auto-tuning is correct instead, but it's basically zero cost. Overall file size reduction is much smaller though as we're not usually dominanted by BS data series. |
The change looks good, although I see that the back-stop loop termination in |
Can make that trivial change. Should we also consider just changing the #define to a better one? I've been doing some randomised+hill-climb trials this morning and it seems to consistently pick another matrix for working better. I was just about to try it out on a larger range of instrument types. Something like optimising it per container is quite a bit of work, but simply changing the default matrix has absolutely zero CPU cost and if it's on-average a win then it's easy to do. |
I expect that depends on how universal the result turns out to be. |
Some data for human GRCh38 alignments, from a mixture of technologies and a mixture of regions of chr1. So not a full analysis, but a broad spectrum at least (as far as human goes). Almost certainly we could do better on extreme things like malaria, but the previous matrix was totally arbitrary too. CRAM 3.0, normal mode
A more aggressive compression via CRAM 3.1,small,use_arith,level=7
The 3 matrices look like:
So the default is just an ACGT with the ref base absent. It generally clusters by nucleotide so BS=0 is mostly A, BS=1 is mostly C. The second one I tried, which I quoted above, was just a manual attempt at getting A<>T and C<>G together and having complementarity for the other columns too. It ends up with ACGT in every column (ie BS=0, or in BS=1). The third matrix was determined by repeated random trials and hill climbing matrix element swapping on a mixed data set. It appears to have chosen columns that are all A/T or C/G for 2 of the 3 columns (the best it can do as you cannot get this for BS 0, 1 and 2). So it seems likely this is matching regions together that are AT-rich or GC-rich. It's noticable that the CRAM 3.0 compression switches from order-0 across the board to sometimes preferring order-1 codecs (particularly novaseq). I saw that at compress level 9 (not shown) too, but the difference was small so likely the auto-tuning didn't consider it worth the extra CPU. Letting it use the arithmetic coder generally didn't use it universally, often preferring r4x16 rans. I'm thinking of the 3rd option, although AVITI is marginally ahead with the 2nd. It's not a huge difference there though, and novaseq is probably still dominant in quantity and it has a big preference for the 3rd. We're also less likely to make any meaningful difference to file size on the platforms that don't aggressively quantise their quality values, as the BS data series becomes a tiny percentage of overall file size. It's still small fry and icing on the cake frankly. For CRAM v3.1 on Novaseq PCR-free test set, the matrix change saved 9.2% for BS, but that's just 0.06% of the CRAM file size. For CRAM 3.0 it was 11% and 0.07%. Not exactly stellar! |
The matrix is meant to turn ref + code into seq. Eg ref C and BS code 1 may mean seq is T. Instead of writing the codes for the non-ref bases in order ACGTN, we wrote the Nth base number in numerical order of the codes. For ref C + BS code we have 4 alternatives A,G,T and N (C->C is absent as it's not a substitution). So e.g. we may have C: 0=G 1=T 2=A 3=N. We were writing GTAN as 01 10 00 11, from A(c)GTN. We should have been writing the code numbers in A(c)GTN order hence 10 00 01 11. However, we don't actually change or optimise this in htslib, so it's hard coded in cram_structs.h. #define CRAM_SUBST_MATRIX "CGTNAGTNACTNACGNACGT" Reformatting it's: A: CGTN C:A GTN G:AC TN T:ACG N N:ACGT That basically boils down to 0123 (00 01 10 11 or 0x3b) for all rows. The incorrect order of writing the table made no difference as every row is sorted by both code 0,1,2,3 and nucleotide A,C,G,T,N.
The old table equates to: 0 1 2 3 A : C G T N C : A G T N G : A C T N T : A C G N N : A C G T The new one is: 0 1 2 3 A : T C G N C : A G T N G : T C A N T : A G C N N : A C G T This affects the generation of BS codes for Ref/Seq combinations. The idea is we want common substitutions to be sharing the same code value so compression improves. Mostly this is a (tiny) win for compression, across a multitude of technologies and organisms. There are a few exceptions (one of the Streptococcus samples grew, and AVITI had a marginal growth, but generally it's an irrelevance on the platforms that don't have aggressive quality quantisation as the files become dominated elsewhere. Even with this on Illumina, it's generally of the order of a 0.1% to total file size. However it's completely free and has no real CPU impact either.
018b117
to
ba07a7e
Compare
I've checked that picard and old samtools understand the revised matrix, so it should be OK. |
The matrix is meant to turn ref + code into seq. Eg ref C and BS code 1 may mean seq is T. Instead of writing the codes for the non-ref bases in order ACGTN, we wrote the Nth base number in numerical order of the codes.
For ref C + BS code we have 4 alternatives A,G,T and N (C->C is absent as it's not a substitution). So e.g. we may have C: 0=G 1=T 2=A 3=N. We were writing GTAN as 01 10 00 11, from A(c)GTN. We should have been writing the code numbers in A(c)GTN order hence 10 00 01 11.
However, we don't actually change or optimise this in htslib, so it's hard coded in cram_structs.h.
Reformatting it's:
That basically boils down to 0123 (00 01 10 11 or 0x3b) for all rows. The incorrect order of writing the table made no difference as every row is sorted by both code 0,1,2,3 and nucleotide A,C,G,T,N.