Beam Search Decoder

  • AM - acoustic model
  • LM - language model
  • WER - Word Error Rate
  • LER - Letter Error Rate


After an AM is trained, one can obtain the transcription of an audio sample either by computing the greedy path (the best path using only acoustic model predictions; the code refers to it as Viterbi) or by running beam-search decoding with an LM incorporated.
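To make the idea of a greedy path concrete, here is a minimal Python sketch for a CTC-style model. It picks the highest-scoring token in every frame, collapses consecutive repeats and drops the blank token. The function name, the blank symbol "#" and the list-of-lists emission layout are assumptions made for illustration; this is not the code behind the Test binary, and ASG models, which have no blank token, are not covered.

```python
def greedy_path(emissions, tokens, blank="#"):
    """Greedy (best-path) decoding for a CTC-style emission matrix.

    emissions: list of frames, each frame a list with one score per token.
    tokens:    list of token strings, indexed like the scores in a frame.
    blank:     hypothetical blank symbol; it must be one of `tokens`.
    """
    # Index of the best-scoring token in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in emissions]

    path = []
    previous = None
    for index in best:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if index != previous and tokens[index] != blank:
            path.append(tokens[index])
        previous = index
    return path
```

The blank frame between two identical tokens is what allows a repeated letter to survive the collapsing step.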

Greedy Path

To get the greedy path, run the Test binary as follows:

wav2letter/build/Test \
  --am path/to/train/am.bin \
  --maxload 10 \
  --test path/to/test/list/file 

In this example, greedy paths are computed for 10 random samples (--maxload=10) from the test list file, and the WER and LER are printed to the screen. To run on all samples, set --maxload=-1.
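For reference, WER and LER are both edit-distance-based error rates. The Python sketch below computes them with a standard Levenshtein distance, normalized by the reference length. It is an illustration only: the function names are hypothetical, and treating every character of the string (including spaces) as a letter for LER is an assumption that may differ from how the Test binary counts tokens.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(ref, hyp):
    """Edit distance as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(ref_text, hyp_text):
    # Word Error Rate: compare whitespace-separated words.
    return error_rate(ref_text.split(), hyp_text.split())


def ler(ref_text, hyp_text):
    # Letter Error Rate: compare characters (spaces included, by assumption).
    return error_rate(list(ref_text), list(hyp_text))
```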

When the Test binary runs, it loads the AM together with the flags saved with it, and those saved flags (for example, the tokens and lexicon paths) are used unless you specify them on the command line. To override them, pass them explicitly:

wav2letter/build/Test \
  --am path/to/train/am.bin \
  --maxload 10 \
  --test path/to/test/list/file \
  --tokensdir path/to/tokens/dir \
  --tokens tokens.txt \
  --lexicon path/to/the/lexicon/file

The Test binary can also generate an Emission Set, which contains the emission matrix and other target-related information for each sample. All flags are stored in the Emission Set as well. For CTC/ASG models the emission matrix is the posterior; for seq2seq models it is the encoded audio, a series of embeddings. The Emission Set can be fed directly into the Decoder binary to generate transcripts without running the AM forward pass again. To set the directory where the Emission Set is stored, use the flag --emission_dir path/to/emission/dir (default value is ''); the value of --test is used as the file name.

Summary of flags for the Test binary:

| Flags | Flag Type | Default Value | Flag Example Value | Reused from the AM training / Emission Set | Description |
|---|---|---|---|---|---|
| am | string | '' | --am path/to/am/file | N | Full path to the acoustic model binary file |
| emission_dir | string | '' | --emission_dir path/to/emission/dir | N | Path to the directory where the Emission Set will be stored, so that the AM forward pass need not be run during beam-search decoding |
| datadir | string | '' | --datadir path/to/the/list/file/dir | Y | Prefix used to build the full path to the test list. Set it to '' if you specify the full path in --test |
| test | string | '' | --test path/to/the/test/list/file | Y | Path to the test list file (each entry stores id, path, duration and transcription; the transcription can be empty). --datadir is used as a prefix for this path (the paths are concatenated) |
| maxload | int | -1 | --maxload 300 | N | Number of random samples to process (-1 means all samples) |
| show | bool | false | --show | N | Print word transcriptions (target and predicted) for each sample to stdout |
| showletters | bool | false | --showletters | N | Print token transcriptions (target and predicted) for each sample to stdout |
| sclite | string | '' | --sclite path/to/file | N | Path where the logs are saved, including the stdout log and the hypotheses and references in sclite format (trn) |
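Each entry of the test list stores id, path, duration and transcription. As a rough illustration, the Python sketch below parses one such entry, assuming the four fields are whitespace-separated and appear in that order on a single line. The function name and the exact layout are assumptions and are not taken from the wav2letter code; paths containing spaces, for instance, would not survive this simple split.

```python
def parse_list_line(line):
    """Parse one test-list entry of the assumed form:

        <id> <path> <duration> [<transcription ...>]

    The transcription is optional and may contain spaces.
    """
    parts = line.strip().split(None, 3)
    if len(parts) < 3:
        raise ValueError("expected at least id, path and duration: %r" % line)
    sample_id, path, duration = parts[:3]
    transcription = parts[3] if len(parts) == 4 else ""
    return {
        "id": sample_id,
        "path": path,
        "duration": float(duration),
        "transcription": transcription,
    }
```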

Beam-search Decoders

1. Beam-search Decoder Types

We support lexicon-based and lexicon-free beam-search decoders for acoustic models trained with the CTC, ASG and Seq2Seq criteria.

1.1 Lexicon-based beam-search decoder (uselexicon=true)

For a lexicon-based decoder we restrict our search by the lexicon provided by a user. In other words, generated transcriptions only contain words from the lexicon.

The lexicon is a mapping from words to their token sequences (spellings). The token set should be identical to the one used in AM training; see Data Preparation for details. For example, if the AM is trained with letters as the token set {a-z}, then the word “world” should have the spelling “w o r l d”.

To optimize decoding performance, the spellings of the words are stored in a trie. Each node in the trie corresponds to a token. Some nodes, usually the leaf nodes, represent valid words whose spelling is the sequence of tokens on the path from the trie root to that node. For example, if the lexicon contains “hello”, “world”, “pineapple” and “pine”, and letters are the token set, the trie is:

root → h → e → l → l → o ([hello])
root → w → o → r → l → d ([world])
root → p → i → n → e ([pine]) → a → p → p → l → e ([pineapple])
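The same structure can be sketched in a few lines of Python. The dictionary-based node layout and the helper names below are illustrative assumptions, not the decoder's actual trie implementation; the point is only that a word is attached to the node reached by following its spelling, so "pine" sits on an inner node of the path that leads to "pineapple".

```python
def new_node():
    # One trie node: outgoing edges keyed by token, plus the words that end here.
    return {"children": {}, "words": []}


def build_trie(lexicon):
    """lexicon: dict mapping each word to its spelling (a list of tokens)."""
    root = new_node()
    for word, spelling in lexicon.items():
        node = root
        for token in spelling:
            node = node["children"].setdefault(token, new_node())
        node["words"].append(word)
    return root


def words_at(root, tokens):
    """Words whose spelling is exactly `tokens` (empty list if there are none)."""
    node = root
    for token in tokens:
        node = node["children"].get(token)
        if node is None:
            return []
    return node["words"]


lexicon = {w: list(w) for w in ["hello", "world", "pineapple", "pine"]}
trie = build_trie(lexicon)
```

During lexicon-based decoding, a hypothesis can only be extended with tokens that are children of its current trie node, which is how the search stays inside the lexicon.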

1.2 Lexicon-free beam-search decoder (uselexicon=false)

The lexicon-free beam-search decoder considers any token as a candidate, and there is no notion of words during decoding. In this case, a word separator should be set with --wordseparator and included in the token set used for AM training. The word separator is treated and predicted like any other token. After the transcription is obtained in tokens, the word separator is used to split the sequence into words. When word-pieces are used as target units, the word separator is usually part of a token rather than a separate token. To handle this case correctly, set --usewordpiece=true.
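As a small illustration of the final splitting step, the Python sketch below turns a token transcription into words. The function name is hypothetical, and the word-piece example assumes that the separator is a prefix character of word-initial pieces; the decoder's own handling of --usewordpiece involves more than this post-processing.

```python
def tokens_to_words(tokens, wordseparator="|"):
    """Join the predicted tokens and split the result on the word separator.

    Works both when the separator is a token of its own and when it is
    embedded in word-piece tokens; empty words (from leading, trailing or
    repeated separators) are dropped.
    """
    text = "".join(tokens)
    return [word for word in text.split(wordseparator) if word]


# Separator as its own token (letter tokens).
letters = ["h", "e", "l", "l", "o", "|", "w", "o", "r", "l", "d", "|"]
# Separator embedded in a word-piece token (assumed "_" prefix convention).
pieces = ["he", "llo", "_world"]
```

Here `tokens_to_words(letters)` and `tokens_to_words(pieces, wordseparator="_")` both yield the two words "hello" and "world".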

2. Beam-search Optimization

At each decoding step, we keep only the top beamsize hypotheses in the beam according to their accumulated scores. In addition to the beam size, beamthreshold limits the score range of the hypotheses in the beam: hypotheses whose score gap from the best one is larger than this threshold are also removed. We can also restrict the number of tokens proposed for each hypothesis: the token beam size --beamsizetoken limits the search to the top tokens according to AM scores. This is especially useful for lexicon-free decoding, where there are no lexicon constraints.
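The three pruning rules can be sketched as follows. This is a simplified Python illustration with hypothetical function names and a plain (score, text) tuple per hypothesis; the real decoder keeps richer state per hypothesis and may apply the rules in a different order.

```python
def prune_beam(hypotheses, beamsize, beamthreshold):
    """Keep hypotheses within `beamthreshold` of the best score,
    then keep at most `beamsize` of them.

    hypotheses: list of (score, text) tuples; higher scores are better.
    """
    if not hypotheses:
        return []
    best = max(score for score, _ in hypotheses)
    kept = [h for h in hypotheses if best - h[0] <= beamthreshold]
    kept.sort(key=lambda h: h[0], reverse=True)
    return kept[:beamsize]


def top_tokens(frame_scores, beamsizetoken):
    """Indices of the `beamsizetoken` best tokens in one frame of AM scores."""
    order = sorted(range(len(frame_scores)),
                   key=lambda i: frame_scores[i], reverse=True)
    return order[:beamsizetoken]
```

With a small beamsizetoken, each hypothesis is extended with only a handful of tokens per frame, which matters most when no lexicon trie narrows the candidates.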

3. Language Model Incorporation

In the beam-search decoder, a language model trained with external data can be included and its scores (log-probability) will be accumulated together with AM scores.
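A minimal Python sketch of this accumulation is shown below, using the lmweight and wordscore flags from the flag tables in this document. The function name and the exact way the terms are combined are assumptions for illustration; other bonuses such as silscore, unkscore and eosscore would be added in the same manner when the corresponding event occurs.

```python
def extend_score(am_score, lm_logprob, lmweight, wordscore=0.0, word_finished=False):
    """Score of a hypothesis after one extension.

    am_score:      accumulated acoustic model score, including the new token
    lm_logprob:    LM log-probability of the newly proposed word or token
    lmweight:      weight applied to the LM score
    wordscore:     bonus added when the extension completes a word
    """
    score = am_score + lmweight * lm_logprob
    if word_finished:
        score += wordscore
    return score
```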

3.1 Level of language model tokens

The LM can operate on either words or tokens (the same tokens that are used to train the AM). In other words, the LM is queried each time a new word or a new token is proposed. This is set via --decodertype. Note that a word-based LM can be used only with the lexicon-based beam-search decoder: if --decodertype=wrd, the uselexicon flag is ignored.

If the LM is word-based, the LM score is applied only when a completed word is proposed. To keep the scores of all hypotheses in the beam on the same scale and to rank partial words properly, we approximate the LM score of a partial word by its highest possible unigram score. This is computed by recursively smearing the real unigram scores, which sit on the trie nodes with valid words, upward through the trie. Three types of smearing are supported: logadd (logadd(a, b) = log(exp(a) + exp(b))), max (pick the maximum among the children's scores and the current node's score) and none (no smearing). The type is set by --smearing.
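The smearing step can be sketched as a post-order traversal of the trie. The Python code below is an illustration under assumed data structures (dictionary nodes, with -inf as the score of nodes that do not end a word) and hypothetical helper names; it is not the decoder's implementation.

```python
import math


def new_node():
    # Nodes that do not end a word carry a score of -inf.
    return {"score": -math.inf, "children": {}}


def insert(root, spelling, unigram_score):
    node = root
    for token in spelling:
        node = node["children"].setdefault(token, new_node())
    node["score"] = unigram_score


def logadd(a, b):
    # log(exp(a) + exp(b)), computed stably and safe for -inf inputs.
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def smear(node, mode="max"):
    """Store in node["smeared"] the score seen from this node and return it."""
    total = node["score"]
    for child in node["children"].values():
        child_score = smear(child, mode)
        if mode == "max":
            total = max(total, child_score)
        elif mode == "logadd":
            total = logadd(total, child_score)
        # mode == "none": nothing is propagated upward.
    node["smeared"] = total
    return total
```

With max smearing, every prefix of "pine" and "pineapple" receives the better of the two unigram scores, so a partial word is never ranked below the best word it can still become.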

3.2 Types of language models

Decoding is currently supported with the following language models: ZeroLM, KenLM and ConvLM. To specify the LM type, use --lmtype [kenlm, convlm]. To use ZeroLM, set --lm=''.

ZeroLM is a dummy LM that always returns 0 as its score. It serves as a proxy for running beam search on AM scores only, without breaking the API.

A KenLM language model can be trained standalone with the KenLM library. The text data should be prepared consistently with the acoustic model data. For example, for a word-level LM, if the AM token set does not contain punctuation, remove all punctuation from the data. For a token-level LM, first split the words into token sequences and only then train the LM on that data, so that the LM predicts the probability of a token (not of a word). Both .arpa and binarized .bin LMs can be used in wav2letter.

ConvLM models are convolutional neural networks. They are currently trained in fairseq and then converted into flashlight-serializable models (an example of this conversion is available) so that they can be loaded in wav2letter. --lm_vocab must be specified: it is the dictionary that maps tokens to indices in ConvLM training. Note that this token set is usually different from the one used in wav2letter AM training. Each line of this file is a single token (char, word, word-piece, etc.), and the token index is exactly its line number.
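The vocabulary layout described above (one token per line, index equal to the line number) can be read with a few lines of Python. The sketch below is illustrative only: the function name is hypothetical, and whether line numbering starts at 0 or 1 is not stated here, so the starting index is left as a parameter whose default of 0 is an assumption.

```python
def load_lm_vocab(lines, first_index=0):
    """Map each token to the number of the line it appears on.

    lines:       any iterable of text lines, for example an open file
    first_index: index assigned to the first line (0 is an assumption)
    """
    vocab = {}
    for index, line in enumerate(lines, start=first_index):
        vocab[line.rstrip("\n")] = index
    return vocab
```

For a real file this would be called as `load_lm_vocab(open(path, encoding="utf-8"))`, where `path` is the value passed to --lm_vocab.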

Running the ConvLM forward pass is expensive. To decode efficiently, we use a dynamic cache that holds the probabilities over all tokens given the candidates generated from the previous frame. When new candidates are proposed, their pre-computed LM scores are looked up in the cache. In other words, the ConvLM forward pass only needs to run in batches at the end of decoding each frame, once all possible new candidates have been gathered. Batching and caching together greatly reduce the total number of forward passes.

Usually, the cache holds beam size x number of classes in ConvLM entries in main memory. If beam size samples cannot be fed to ConvLM in a single batch, --lm_memory limits the size of the input batch: it requires input batch size x LM context size < --lm_memory. For example, if the context size (receptive field) of a ConvLM is 50 and --lm_memory is set to 5000, then no matter what the beam size or the number of new candidates is, only 100 samples can be fed in a single batch.
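The arithmetic behind --lm_memory can be sketched in Python as below. The helper names are hypothetical, and the sketch follows the worked example (5000 / 50 = 100 samples per batch), which treats the bound as inclusive; that reading is an assumption about the exact comparison used in the code.

```python
def max_lm_batch(lm_memory, context_size):
    """Largest batch such that batch size x context size fits in lm_memory
    (inclusive bound, as in the 5000 / 50 = 100 example)."""
    return int(lm_memory // context_size)


def split_into_batches(candidates, lm_memory, context_size):
    """Split the candidates gathered in one frame into ConvLM-sized batches."""
    size = max_lm_batch(lm_memory, context_size)
    if size < 1:
        raise ValueError("lm_memory is too small for this context size")
    return [candidates[i:i + size] for i in range(0, len(candidates), size)]
```

For instance, 250 new candidates with a context size of 50 and --lm_memory 5000 would be scored in three forward passes of 100, 100 and 50 samples.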

| Flags | ZeroLM | KenLM | ConvLM |
|---|---|---|---|
| lm | '' | path/to/lm/model | path/to/lm/model |
| lmtype | X | kenlm | convlm |
| lm_vocab | X | X | V |
| lm_memory | X | X | V |
| decodertype | X | V | V |

4. Distributed Decoding

We support decoding a dataset with several threads by setting --nthread_decoder. The samples in the dataset are distributed equally among the threads. When decoding CTC/ASG models with a KenLM language model, --nthread_decoder is simply the number of CPU threads that run beam-search decoding. To decode Seq2Seq models, or to decode with ConvLM, flashlight is needed to run the forward pass in each thread. Since forwarding is not thread-safe, each thread acquires its own resources, and a copy of the acoustic model (the seq2seq criterion) and of the LM is stored on the device it requested. Specifically, if flashlight is built with the CUDA backend, one GPU is required per thread, and --nthread_decoder should be no larger than the number of visible GPUs.

We now support a producer-consumer scheme for parallel computation. --nthread_decoder_am_forward defines the number of threads for the AM forward pass: these threads place forward results into a queue, whose maximum size is --emission_queue_size, for processing by the beam-search decoder. When the forward pass runs on GPUs, --nthread_decoder_am_forward defines the number of GPUs used for the parallel forward pass. The --nthread_decoder threads read from the queue and perform beam-search decoding.
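The queue-based scheme can be illustrated with Python's standard threading and queue modules. In this sketch, `forward` and `decode` are hypothetical stand-ins for the AM forward pass and the beam-search decoder, and the three keyword arguments play the roles of --nthread_decoder_am_forward, --nthread_decoder and --emission_queue_size; it shows the pattern, not wav2letter's implementation.

```python
import queue
import threading


def run_pipeline(samples, forward, decode, n_forward=2, n_decode=2, queue_size=4):
    """Forward threads produce emissions; decoder threads consume them."""
    todo = queue.Queue()
    for sample in samples:
        todo.put(sample)
    emissions = queue.Queue(maxsize=queue_size)  # bounded emission queue
    results = {}
    lock = threading.Lock()
    done = object()  # sentinel telling a decoder thread to stop

    def producer():
        while True:
            try:
                sample = todo.get_nowait()
            except queue.Empty:
                return
            # Blocks while the emission queue is full.
            emissions.put((sample, forward(sample)))

    def consumer():
        while True:
            item = emissions.get()
            if item is done:
                return
            sample, emission = item
            text = decode(emission)
            with lock:
                results[sample] = text

    producers = [threading.Thread(target=producer) for _ in range(n_forward)]
    consumers = [threading.Thread(target=consumer) for _ in range(n_decode)]
    for thread in producers + consumers:
        thread.start()
    for thread in producers:
        thread.join()
    for _ in consumers:
        emissions.put(done)  # one sentinel per decoder thread
    for thread in consumers:
        thread.join()
    return results
```

The bounded queue keeps the forward threads from running arbitrarily far ahead of the decoder threads, which limits how many emissions are held in memory at once.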

5. Online beam-search decoding

All decoders except the Seq2Seq decoder now support online decoding, which consumes small chunks of audio emissions as input. Whenever we want to look at the transcript so far, we can get the best transcript, prune the hypothesis space and continue decoding.

Full list of flags related to the beam-search decoding

Beam-search decoder options to be specified (with examples)

| Flags | CTC / ASG criterion (Lexicon-based) | Seq2Seq criterion (Lexicon-based) | CTC / ASG criterion (Lexicon-free) | Seq2Seq criterion (Lexicon-free) |
|---|---|---|---|---|
| criterion | ctc / asg | seq2seq | ctc / asg | seq2seq |
| lmweight | V | V | V | V |
| beamsize | V | V | V | V |
| beamsizetoken | V | V | V | V |
| beamthreshold | V | V | V | V |
| uselexicon | true | true | false | false |
| lexicon | path/to/the/lexicon/file | path/to/the/lexicon/file | '' | '' |
| smearing | none / max / logadd | none / max / logadd | X | X |
| wordscore | V | V | X | X |
| wordseparator | X | X | V | V |
| unkscore | V | X | V | X |
| silscore | V | X | V | X |
| eosscore | X | V | X | V |
| attentionthreshold | X | V | X | V |
| smoothingtemperature | X | V | X | V |

Common flags

| Flags | Flag Type | Default Value | Flag Example Value | Reused from the AM training / Emission Set | Description |
|---|---|---|---|---|---|
| am | string | '' | --am path/to/am/file | N | Full path to the acoustic model binary file. Ignored if emission_dir is specified |
| emission_dir | string | '' | --emission_dir path/to/emission/dir | N | Path to the directory with emission data stored by the Test binary, so that the AM forward pass need not be run during beam-search decoding |
| datadir | string | '' | --datadir path/to/the/list/file/dir | Y | Prefix used to build the full path to the test list. Set it to '' if you specify the full path in --test |
| test | string | '' | --test path/to/the/test/list/file | Y | Path to the test list file (each entry stores id, path, duration and transcription; the transcription can be empty). --datadir is used as a prefix for this path (the paths are concatenated) |
| maxload | int | -1 | --maxload 300 | N | Number of random samples to process (-1 means all samples) |
| show | bool | false | --show | N | Print word transcriptions (target and predicted) for each sample to stdout |
| showletters | bool | false | --showletters | N | Print token transcriptions (target and predicted) for each sample to stdout |
| nthread_decoder | int | 1 | --nthread_decoder 4 | N | Number of threads that run beam-search decoding (see the Distributed Decoding section) |
| nthread_decoder_am_forward | int | 1 | --nthread_decoder_am_forward 2 | N | Number of threads that run the AM forward pass (see the Distributed Decoding section) |
| emission_queue_size | int | 3000 | --emission_queue_size 1000 | N | Maximum size of the emission queue (see the Distributed Decoding section) |
| sclite | string | '' | --sclite path/to/file | N | Path where the logs are saved, including the stdout log and the hypotheses and references in sclite format (trn) |

Flags related to the beam-search algorithm

| Flags | Flag Type | Default Value | Flag Example Value | Reused from the AM training / Emission Set | Description |
|---|---|---|---|---|---|
| uselexicon | bool | true | --uselexicon | N | True for lexicon-based beam-search decoding, false for lexicon-free |
| lexicon | string | '' | --lexicon path/to/the/lexicon/file | Y | Path to the lexicon file that maps words to tokens (used for lexicon-based beam-search decoding) |
| lm | string | '' | --lm path/to/the/lm/file | N | Full path to the language model binary file (use '' for ZeroLM) |
| lm_vocab | string | '' | --lm_vocab path/to/lm/vocab/file | N | Path to the vocabulary file that defines the mapping between indices and the tokens of a neural LM |
| lm_memory | double | 5000 | --lm_memory 3000 | N | Total memory that defines the batch size used to run the forward pass of a neural LM |
| lmtype | string: kenlm / convlm | kenlm | --lmtype kenlm | N | Language model type |
| decodertype | string: wrd / tkn | wrd | --decodertype tkn | N | Language model token type: wrd for a word-level LM, tkn for a token-level LM (the tokens should be the same as the acoustic model token set). If wrd is set, the uselexicon flag is ignored and lexicon-based beam-search decoding is used |
| wordseparator | string | '|' (vertical bar) | --wordseparator _ | Y | Token used as the word separator (used to obtain the word transcription from the token transcription in the lexicon-free beam-search decoder) |
| usewordpiece | bool | false | --usewordpiece false | Y | Whether the acoustic model is trained with tokens in which the word separator is not a separate token (for example, with word-pieces, hello world -> he llo _world, where _ marks word separation) |
| smoothingtemperature | double | 1 | --smoothingtemperature 1.2 | Y | Smooths the posterior distribution of the acoustic model (Seq2Seq criterion only) |
| attentionthreshold | int | -infinity | --attentionthreshold 30 | Y | Limit on the distance between the peak attention locations on the encoded audio for 2 consecutive tokens (Seq2Seq criterion only) |

Parameters to optimize for beam-search decoder

| Flags | Flag Type | Default Value | Flag Example Value | Description |
|---|---|---|---|---|
| beamsize | int | 2500 | --beamsize 100 | Number of top hypotheses to keep at each decoding step |
| beamsizetoken | int | 250000 | --beamsizetoken 10 | Number of top tokens, ranked by acoustic model score, to consider at each decoding step |
| beamthreshold | double | 25 | --beamthreshold 15 | Removes hypotheses whose current score is further than this value from the score of the best hypothesis |
| lmweight | double | 0 | --lmweight 1.1 | Language model weight used when accumulating the LM score with the acoustic model score |
| wordscore | double | 0 | --wordscore -0.2 | Score to add when a word finishes (lexicon-based beam-search decoder only) |
| eosscore | double | 0 | --eosscore 0.5 | Score to add when the end of sentence is generated (Seq2Seq criterion) |
| silscore | double | 0 | --silscore 0.5 | Score to add when silence appears (CTC/ASG models) |
| unkscore | double | -infinity | --unkscore 0.5 | Score to add when an unknown word appears (CTC/ASG with the lexicon-based beam-search decoder) |
| smearing | string: none / max / logadd | none | --smearing none | Smearing procedure (lexicon-based beam-search decoder only) |

Template to run beam-search decoder

We assume that the saved datadir, tokensdir and tokens flags stored inside the AM point to existing paths (otherwise, redefine them on the command line). The criterion, wordseparator and usewordpiece flags are also loaded from the model. To use a previously saved Emission Set, replace the am flag with emission_dir.

wav2letter/build/Decoder \
  --am path/to/train/am.bin \
  --test path/to/test/list/file \
  --maxload 10 \
  --nthread_decoder 2 \
  --show \
  --showletters \
  --lexicon path/to/the/lexicon/file \
  --uselexicon [true, false] \
  --lm path/to/lm/file \
  --lmtype [kenlm, convlm] \
  --decodertype [wrd, tkn] \
  --beamsize 100 \
  --beamsizetoken 100 \
  --beamthreshold 20 \
  --lmweight 1 \
  --wordscore 0 \
  --eosscore 0 \
  --silscore 0 \
  --unkscore 0 \
  --smearing max

Configuration file support

One can simply put all the flags into a file (for example, a file named decode.cfg) and then run the Decoder binary with this file; other flags can also be added on the command line:

wav2letter/build/Decoder \
  --flagsfile decode.cfg \
  --lmweight 1 \
  --wordscore 0 \
  --eosscore 0 \
  --silscore 0 \
  --unkscore 0 \
  --smearing max
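As an illustration of what a flags file such as decode.cfg might contain, the Python sketch below renders one from a dictionary. It assumes the file follows the gflags flag-file convention of one --flag=value entry per line; the helper name is hypothetical, and the paths and values are placeholders taken from the template above.

```python
def render_flagsfile(flags):
    """Render a dict of flags as flag-file text, one --name=value per line.

    Booleans are written in lower case (true / false).
    """
    lines = []
    for name, value in flags.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append("--%s=%s\n" % (name, value))
    return "".join(lines)


decode_cfg = render_flagsfile({
    "am": "path/to/train/am.bin",
    "test": "path/to/test/list/file",
    "lexicon": "path/to/the/lexicon/file",
    "uselexicon": True,
    "lm": "path/to/lm/file",
    "lmtype": "kenlm",
    "decodertype": "wrd",
    "beamsize": 100,
    "beamthreshold": 20,
})
```

Writing `decode_cfg` to a file named decode.cfg gives a file that can be passed with --flagsfile, while the tuning flags stay on the command line as in the command above.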