-
Notifications
You must be signed in to change notification settings - Fork 3
BWT Formats
Many different BWT file formats exist. Some of them are mutually incompatible, while others are essentially different encodings of the same information. There are two main sources of incompatibility:
-
Alphabet: Many bioinformatics tools use the alphabet
$ACGTN
(in that order). Tools written by computer scientists often assume that the alphabet is sorted, e.g.$ACGNT
. Because the alphabetic order between charactersN
andT
is different, the BWTs are incompatible. -
Endmarkers: Some BWTs contain explicit endmarkers (usually denoted by
$
) at the end of each sequence, while others handle the end of sequence implicitly. The endmarker is usually the first character in the alphabet, but it can also be the last character.
The native format of BWT-merge uses numeric values 0-5 as its internal alphabet. By default, these comp values are interpreted as $ACGTN
. Other alphabets of size at most 6 can also be supported, as long as the endmarker is the first character in the alphabet. The following formats are currently supported.
Format | Alphabet | Details |
---|---|---|
native |
any | run-length encoded; includes rank/select support |
plain_default |
default | array of characters |
plain_sorted |
sorted | array of characters |
rfm |
sorted |
Relative FM-index: int_vector_buffer<8> of comp values |
ropebwt |
default |
RopeBWT (binary format; option -b ): byte array with 3 bits for the character and 5 bits for the length of the run |
sdsl |
sorted |
SDSL: int_vector_buffer<8> of characters |
sga |
default | SGA: byte array with 3 bits for the character and 5 bits for the length of the run |
A BWT/FMI file in the native format contains the header, the BWT data, the rank/select indexes, and the alphabet.
// Header
NativeHeader header;
// Data
BlockArray data;
// Rank/select indexes
CumulativeArray samples[6];
sdsl::sd_vector<> block_boundaries;
sdsl::sd_vector<>::rank_1_type block_rank;
sdsl::sd_vector<>::select_1_type block_select;
// Alphabet
Alphabet alpha;
struct NativeHeader
{
uint32_t tag;
uint32_t flags;
uint64_t sequences;
uint64_t bases;
};
-
tag
identifies the file format. Its value must be0x54574221
, which translates into!BWT
. -
flags
is used to encode various information. Only 8 bits are used at the moment.- The low 8 bits identify the alphabet. Value
0
indicates the default alphabet, while value1
indicates a sorted alphabet.
- The low 8 bits identify the alphabet. Value
-
sequences
is the number of sequences in the BWT. -
bases
is the number of bases (characters) in the BWT, including the endmarkers.
The BWT is run-length encoded using a byte-level code (struct Run
). The data is stored in 64-byte blocks and 8-megabyte pages (struct BlockArray
). The encoding of a run never crosses block boundaries.
-
uint64_t bytes
is the number of bytes in the encoding. - The pages containing the encoding. The last page must be padded (with any byte values) if it is not full.
Each run is a pair (c
, l
), where c
is a comp value and l
is the length of the run. If l <= 41
, the run is encoded in a single byte as 6 * (l - 1) + c
. Otherwise the first byte is 6 * 41 + c
, and the following bytes encode the remaining run length l - 42
(struct ByteCode
). In that encoding, the low 7 bits of each byte contain data, while the high bit tells whether the encoding continues in the next byte (1
) or not (0
). The encoding starts with the least significant 7 bits and continues upwards.
CumulativeArray
encodes the prefix sums of an array of m_size
unsigned integers. Each element A[i]
is encoded in unary as A[i]
0-bits (items), followed by an 1-bit.
class CumulativeArray
{
sdsl::sd_vector<> data;
sdsl::sd_vector<>::rank_1_type rank;
sdsl::sd_vector<>::select_1_type select_1;
sdsl::sd_vector<>::select_0_type select_0;
size_type m_size;
};
-
samples[c]
encodes the number of occurrences of comp valuec
in the BWT. Each element represents a block. -
block_boundaries
is a bitvector of lengthbases
. The last character encoded in each block is marked with an 1-bit. -
block_rank
andblock_select
are rank/select indexes forblock_boundaries
.
Alphabet
is a mapping between characters (bytes) and comp values (integers from 0
to sigma - 1
). The range of comp values must be contiguous, but more than one character can map to the same comp value.
class Alphabet
{
sdsl::int_vector<8> char2comp, comp2char;
sdsl::int_vector<64> C;
size_type sigma;
};
-
char2comp
is an array of size256
that maps characters to comp values. -
comp2char
is an array of sizesigma
that maps each comp value to the canonical character representing it. -
C
is an array of sizesigma + 1
. ValueC[i]
is the number of comp values from0
toi - 1
in the BWT.