Skip to content
Jouni Siren edited this page Jan 15, 2020 · 3 revisions

BWT file formats

Many different BWT file formats exist. Some of them are mutually incompatible, while others are essentially different encodings of the same information. There are two main sources of incompatibility:

  • Alphabet: Many bioinformatics tools use the alphabet $ACGTN (in that order). Tools written by computer scientists often assume that the alphabet is sorted, e.g. $ACGNT. Because the alphabetic order between characters N and T is different, the BWTs are incompatible.
  • Endmarkers: Some BWTs contain explicit endmarkers (usually denoted by $) at the end of each sequence, while others handle the end of sequence implicitly. The endmarker is usually the first character in the alphabet, but it can also be the last character.

The native format of BWT-merge uses numeric values 0-5 as its internal alphabet. By default, these comp values are interpreted as $ACGTN. Other alphabets of size at most 6 can also be supported, as long as the endmarker is the first character in the alphabet. The following formats are currently supported.

Format Alphabet Details
native any run-length encoded; includes rank/select support
plain_default default array of characters
plain_sorted sorted array of characters
rfm sorted Relative FM-index: int_vector_buffer<8> of comp values
ropebwt default RopeBWT (binary format; option -b): byte array with 3 bits for the character and 5 bits for the length of the run
sdsl sorted SDSL: int_vector_buffer<8> of characters
sga default SGA: byte array with 3 bits for the character and 5 bits for the length of the run

Native format

A BWT/FMI file in the native format contains the header, the BWT data, the rank/select indexes, and the alphabet.

  // Header
  NativeHeader                     header;

  // Data
  BlockArray                       data;

  // Rank/select indexes
  CumulativeArray                  samples[6];
  sdsl::sd_vector<>                block_boundaries;
  sdsl::sd_vector<>::rank_1_type   block_rank;
  sdsl::sd_vector<>::select_1_type block_select;

  // Alphabet
  Alphabet                         alpha;

Header

struct NativeHeader
{
  uint32_t tag;
  uint32_t flags;
  uint64_t sequences;
  uint64_t bases;
};
  • tag identifies the file format. Its value must be 0x54574221, which translates into !BWT.
  • flags is used to encode various information. Only 8 bits are used at the moment.
    • The low 8 bits identify the alphabet. Value 0 indicates the default alphabet, while value 1 indicates a sorted alphabet.
  • sequences is the number of sequences in the BWT.
  • bases is the number of bases (characters) in the BWT, including the endmarkers.

Data

The BWT is run-length encoded using a byte-level code (struct Run). The data is stored in 64-byte blocks and 8-megabyte pages (struct BlockArray). The encoding of a run never crosses block boundaries.

  • uint64_t bytes is the number of bytes in the encoding.
  • The pages containing the encoding. The last page must be padded (with any byte values) if it is not full.

Each run is a pair (c, l), where c is a comp value and l is the length of the run. If l <= 41, the run is encoded in a single byte as 6 * (l - 1) + c. Otherwise the first byte is 6 * 41 + c, and the following bytes encode the remaining run length l - 42 (struct ByteCode). In that encoding, the low 7 bits of each byte contain data, while the high bit tells whether the encoding continues in the next byte (1) or not (0). The encoding starts with the least significant 7 bits and continues upwards.

Rank/select indexes

CumulativeArray encodes the prefix sums of an array of m_size unsigned integers. Each element A[i] is encoded in unary as A[i] 0-bits (items), followed by an 1-bit.

class CumulativeArray
{
  sdsl::sd_vector<>                data;
  sdsl::sd_vector<>::rank_1_type   rank;
  sdsl::sd_vector<>::select_1_type select_1;
  sdsl::sd_vector<>::select_0_type select_0;
  size_type                        m_size;
};
  • samples[c] encodes the number of occurrences of comp value c in the BWT. Each element represents a block.
  • block_boundaries is a bitvector of length bases. The last character encoded in each block is marked with an 1-bit.
  • block_rank and block_select are rank/select indexes for block_boundaries.

Alphabet

Alphabet is a mapping between characters (bytes) and comp values (integers from 0 to sigma - 1). The range of comp values must be contiguous, but more than one character can map to the same comp value.

class Alphabet
{
  sdsl::int_vector<8>  char2comp, comp2char;
  sdsl::int_vector<64> C;
  size_type            sigma;
};
  • char2comp is an array of size 256 that maps characters to comp values.
  • comp2char is an array of size sigma that maps each comp value to the canonical character representing it.
  • C is an array of size sigma + 1. Value C[i] is the number of comp values from 0 to i - 1 in the BWT.
Clone this wiki locally