Tweak the CRAM_SUBST_MATRIX table.
The old table equates to:

        0 1 2 3
    A : C G T N
    C : A G T N
    G : A C T N
    T : A C G N
    N : A C G T
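The layout above can be reproduced from the macro string itself. The following is a minimal sketch, assuming the 20-character string is read as five rows of four bases, one row per reference base in the order A, C, G, T, N (which is what the old table suggests). The names `subst_row`, `print_subst_table` and `OLD_SUBST_MATRIX` are hypothetical and are not part of htslib.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Old value of CRAM_SUBST_MATRIX, copied from this commit's diff. */
#define OLD_SUBST_MATRIX "CGTNAGTNACTNACGNACGT"

/*
 * Hypothetical helper (not an htslib function).
 * Assumes the 20-character string holds five rows of four bases, one row
 * per reference base in the order A, C, G, T, N.  Copies the row for
 * reference base "ACGTN"[row] into out as a NUL-terminated string.
 * Returns 0 on success, -1 if the string or row index is invalid.
 */
static int subst_row(const char *matrix, int row, char out[5]) {
    assert(out != NULL);
    if (matrix == NULL || strlen(matrix) != 20 || row < 0 || row > 4)
        return -1;
    memcpy(out, matrix + 4 * row, 4);
    out[4] = '\0';
    return 0;
}

/* Print the matrix in the same layout as the tables in this message. */
static int print_subst_table(FILE *fp, const char *matrix) {
    char row[5];

    /* Validate the string before printing anything. */
    if (subst_row(matrix, 0, row) != 0)
        return -1;

    fprintf(fp, "        0 1 2 3\n");
    for (int r = 0; r < 5; r++) {
        subst_row(matrix, r, row);
        fprintf(fp, "    %c : %c %c %c %c\n",
                "ACGTN"[r], row[0], row[1], row[2], row[3]);
    }
    return 0;
}
```

Calling `print_subst_table(stdout, OLD_SUBST_MATRIX)` prints the old table shown above.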

The new one is:

        0 1 2 3
    A : T C G N
    C : A G T N
    G : T C A N
    T : A G C N
    N : A C G T

This affects the generation of BS codes for Ref/Seq combinations.  The
idea is that common substitutions should share the same code value,
which improves compression.
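As an illustration of how a Ref/Seq pair maps to a code, here is a minimal sketch, not the htslib implementation. It assumes the macro string is read as five rows of four bases in reference order A, C, G, T, N, with the column index as the code. The helpers `subst_code` and `transition_codes` are hypothetical, and transitions (A<->G, C<->T) are used only as an example of substitutions that tend to be common.

```c
#include <assert.h>
#include <string.h>

/* Both values of CRAM_SUBST_MATRIX, copied from this commit's diff. */
#define OLD_SUBST_MATRIX "CGTNAGTNACTNACGNACGT"
#define NEW_SUBST_MATRIX "CGTNGTANCATNGCANACGT"

/*
 * Hypothetical helper: return the code (0..3) for reference base `ref`
 * read as `seq`, or -1 if the pair is not in the table.  ref == seq is
 * not a substitution, so it also returns -1.
 */
static int subst_code(const char *matrix, char ref, char seq) {
    static const char bases[] = "ACGTN";
    const char *r, *row;

    if (matrix == NULL || strlen(matrix) != 20 || ref == '\0')
        return -1;
    r = strchr(bases, ref);
    if (r == NULL)
        return -1;

    row = matrix + 4 * (r - bases);   /* four entries per reference base */
    for (int code = 0; code < 4; code++)
        if (row[code] == seq)
            return code;
    return -1;
}

/* Fill out[] with the codes for A>G, G>A, C>T and T>C, in that order. */
static void transition_codes(const char *matrix, int out[4]) {
    assert(out != NULL);
    out[0] = subst_code(matrix, 'A', 'G');
    out[1] = subst_code(matrix, 'G', 'A');
    out[2] = subst_code(matrix, 'C', 'T');
    out[3] = subst_code(matrix, 'T', 'C');
}
```

Under this reading, the old string gives the four transitions the codes 1, 0, 2, 1, while the new string gives all four the code 1, so they fall into a single symbol of the code distribution.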

Mostly this is a (tiny) win for compression, across a multitude of
technologies and organisms.  There are a few exceptions (one of the
Streptococcus samples grew, and AVITI had a marginal growth), but
generally it's an irrelevance on the platforms that don't have
aggressive quality quantisation, as those files are dominated by
other data.  Even with such quantisation on Illumina, the gain is
generally of the order of 0.1% of total file size.  However it's
completely free and has no real CPU impact either.
jkbonfield committed Feb 13, 2023
1 parent 01dcac3 commit ba07a7e
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion cram/cram_structs.h
@@ -88,7 +88,7 @@ struct hFILE;
#define BASES_PER_SLICE (SEQS_PER_SLICE*500)
#define SLICE_PER_CNT 1

-#define CRAM_SUBST_MATRIX "CGTNAGTNACTNACGNACGT"
+#define CRAM_SUBST_MATRIX "CGTNGTANCATNGCANACGT"

#define MAX_STAT_VAL 1024
//#define MAX_STAT_VAL 16