Beam Search Decoder
- Notation
- Introduction
- Greedy Path
- Beam-search Decoders
- Full list of flags related to the beam-search decoding
- Template to run beam-search decoder
- AM - acoustic model
- LM - language model
- WER - Word Error Rate
- LER - Letter Error Rate
After an AM is trained, one can get the transcription of an audio file by running either the greedy path (the best path using only the acoustic model predictions; in the code this is referred to as Viterbi) or the beam-search decoding with an LM incorporated.
To get the greedy path, one should use the Test binary in the following way:
wav2letter/build/Test \
--am path/to/train/am.bin \
--maxload 10 \
--test path/to/test/list/file
For this particular example, greedy paths will be computed on 10 random samples (--maxload=10) from the test list file, and the WER and LER will be printed to the screen. To run on all samples, set --maxload=-1.
While running the Test binary, the AM is loaded and all of its saved flags (for example, the tokens and lexicon paths) are used unless you specify them on the command line. To override them, specify them explicitly:
wav2letter/build/Test \
--am path/to/train/am.bin \
--maxload 10 \
--test path/to/test/list/file \
--tokensdir path/to/tokens/dir \
--tokens tokens.txt \
--lexicon path/to/the/lexicon/file
The Test binary can also be used to generate an Emission Set, which includes the emission matrix as well as other target-related information for each sample. All flags are also stored in the Emission Set. Specifically, the emission matrix of a CTC/ASG model is the posterior, while for seq2seq models it is the encoded audio, a sequence of embeddings. The Emission Set can be fed into the Decode binary directly to generate transcripts without running the AM forward pass again. To set the directory where the Emission Set is stored, use the flag --emission_dir path/to/emission/dir (default value is ''); the --test value will be used as the file name.
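For example (with placeholder paths, using only flags described in this document), a Test run that stores the Emission Set for later decoding might look like:
wav2letter/build/Test \
--am path/to/train/am.bin \
--test path/to/test/list/file \
--emission_dir path/to/emission/dir \
--maxload -1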
Summary of the flags used to run the Test binary:
| Flags | Flag Type | Default Value | Flag Example Value | Reused from the AM training / Emission Set | Description |
|---|---|---|---|---|---|
| am | string | '' | --am path/to/am/file | N | Full path to the acoustic model binary file |
| emission_dir | string | '' | --emission_dir path/to/emission/dir | N | Path to the directory where the Emission Set will be stored, to avoid running the AM forward pass during beam-search decoding |
| datadir | string | '' | --datadir path/to/the/list/file/dir | Y | This prefix is used to define the full path to the test list. Set it to '' if you specify the full path in --test |
| test | string | '' | --test path/to/the/test/list/file | Y | Path to the test list file (where id path duration transcription are stored; transcription can be empty). The --datadir parameter is used as a prefix for this path (the paths are concatenated) |
| maxload | int | -1 | --maxload 300 | N | Number of random samples to process (-1 means all samples) |
| show | bool | false | --show | N | Print word transcriptions (target and predicted) for each sample to stdout |
| showletters | bool | false | --showletters | N | Print token transcriptions (target and predicted) for each sample to stdout |
| sclite | string | '' | --sclite path/to/file | N | Path to save the logs, including the stdout log and the hypotheses and references in sclite (trn) format |
We support a lexicon-based beam-search decoder and a lexicon-free beam-search decoder for acoustic models trained with the CTC, ASG and Seq2Seq criteria.
For a lexicon-based decoder, we restrict the search to the lexicon provided by the user. In other words, generated transcriptions contain only words from the lexicon.
The lexicon is a mapping from words to their token sequences, i.e. spellings. The token set should be identical to the one used in AM training; see the Data Preparation for details. For example, if we train the AM with letters as the token set {a-z}, then the word "world" should have the spelling "w o r l d".
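For instance, with letters as the token set, a few lexicon entries (one word per line, followed by its spelling) might look like the following; the exact format should match what your data preparation produces:
hello h e l l o
world w o r l d
pine p i n e
pineapple p i n e a p p l e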
To optimize the decoding performance, the spellings of the words are stored in a Trie. Each node in the Trie corresponds to a token. Some nodes, usually the leaf nodes, represent valid words: their spelling is given by the tokens on the path from the Trie root to that node. If we have "hello", "world", "pineapple" and "pine" in the lexicon and letters as our token set, we will have the following trie:
root → h → e → l → l → o ([hello])
root → w → o → r → l → d ([world])
root → p → i → n → e ([pine]) → a → p → p → l → e ([pineapple])
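As a rough illustration of this structure (a minimal sketch, not the actual wav2letter implementation), building such a trie from a lexicon could look like:

```python
# Minimal lexicon-trie sketch: children map tokens to child nodes, and a node
# stores the words whose spelling ends exactly at that node.
class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.words = []     # words whose spelling ends at this node

def build_trie(lexicon):
    """lexicon: dict mapping a word to the list of tokens in its spelling."""
    root = TrieNode()
    for word, spelling in lexicon.items():
        node = root
        for token in spelling:
            node = node.children.setdefault(token, TrieNode())
        node.words.append(word)
    return root

trie = build_trie({
    "hello": list("hello"),
    "world": list("world"),
    "pine": list("pine"),
    "pineapple": list("pineapple"),
})
```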
The lexicon-free beam-search decoder considers any possible token as a candidate, and there is no notion of words during decoding. In this case, a word separator should be set via --wordseparator and included in the token set used for AM training. The word separator is treated and predicted like any other normal token. After obtaining the transcription in tokens, the word separator is used to split the sequence into words. Usually, when we use word-pieces as target units, the word separator can be part of a token. To correctly handle this case, one should set --usewordpiece=true.
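As a toy illustration of the difference (a sketch only, assuming _ is the word separator; not the wav2letter implementation), splitting a decoded token sequence into words could look like:

```python
# Toy sketch: recover words from a decoded token sequence.
def tokens_to_words(tokens, wordseparator="_", usewordpiece=False):
    if not usewordpiece:
        # The separator is its own token: split the sequence on it.
        words, current = [], []
        for token in tokens:
            if token == wordseparator:
                if current:
                    words.append("".join(current))
                    current = []
            else:
                current.append(token)
        if current:
            words.append("".join(current))
        return words
    # Word-piece case: the separator is part of a token and marks a word start.
    text = "".join(tokens)
    return [word for word in text.split(wordseparator) if word]

print(tokens_to_words(list("hello") + ["_"] + list("world")))       # ['hello', 'world']
print(tokens_to_words(["_he", "llo", "_world"], usewordpiece=True))  # ['hello', 'world']
```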
At each decoding step, we preserve only the top beamsize hypotheses in the beam according to their accumulated scores. Apart from the beam size, beamthreshold is used to limit the score range of the hypotheses in the beam: hypotheses whose score gap from the best one is larger than this threshold are also removed from the beam. In addition, we can restrict the number of tokens proposed for each hypothesis: the token beam size --beamsizetoken limits the search space to only the top tokens according to the AM scores. This is extremely useful for lexicon-free decoding, since there are no lexicon constraints.
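A simplified sketch of how these three limits act at a single decoding step (illustrative Python, not the actual decoder):

```python
import math

def prune_step(hypotheses, token_scores, beamsize, beamsizetoken, beamthreshold):
    """hypotheses: list of (prefix, score) pairs; token_scores: AM score per token
    for the current frame. Returns the pruned beam for the next step."""
    # --beamsizetoken: only the top tokens by AM score are proposed.
    top_tokens = sorted(token_scores, key=token_scores.get, reverse=True)[:beamsizetoken]

    # Extend every hypothesis with every allowed token (scoring is simplified here).
    candidates = [(prefix + [token], score + token_scores[token])
                  for prefix, score in hypotheses for token in top_tokens]

    # --beamsize: keep only the top-scoring hypotheses.
    candidates.sort(key=lambda cand: cand[1], reverse=True)
    candidates = candidates[:beamsize]

    # --beamthreshold: drop hypotheses scoring too far below the best one.
    best = candidates[0][1] if candidates else -math.inf
    return [cand for cand in candidates if best - cand[1] <= beamthreshold]
```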
In the beam-search decoder, a language model trained on external data can be included, and its scores (log-probabilities) are accumulated together with the AM scores.
The LM can operate on either words or tokens (the same tokens used to train the AM). In other words, the LM can be queried each time a new word or token is proposed. One can set this via --decodertype. Note that a word-based LM can be used only with the lexicon-based beam-search decoder, i.e. if --decodertype=wrd then the uselexicon flag is ignored.
If the LM is word-based, the LM score is applied only when a completed word is proposed. In order to maintain the score scale of all the hypotheses in the beam and properly rank partial words, we approximate the LM score of a partial word by its highest possible unigram score. This can be easily computed by recursively smearing the real unigram scores of the nodes with valid words upward through the trie. Three types of smearing are supported: logadd (i.e. logadd(a, b) = log(exp(a) + exp(b))), max (pick the maximum among the children node scores and the current node score) and none (no smearing). It can be set via --smearing.
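A minimal sketch of this smearing pass (assuming a simple trie node that holds an optional unigram score for complete words; not the actual wav2letter code):

```python
import math

class Node:
    def __init__(self, word_score=None):
        self.children = {}            # token -> Node
        self.word_score = word_score  # unigram log-probability if a valid word ends here
        self.smeared = None           # filled in by the smearing pass

def logadd(a, b):
    # log(exp(a) + exp(b)), computed in a numerically stable way
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def smear(node, mode="max"):
    """Propagate word scores upward so that every node stores the best ("max")
    or log-sum ("logadd") of the scores reachable below it."""
    scores = [] if node.word_score is None else [node.word_score]
    for child in node.children.values():
        scores.append(smear(child, mode))
    if not scores:
        node.smeared = -math.inf
    elif mode == "max":
        node.smeared = max(scores)
    else:  # "logadd"
        acc = scores[0]
        for s in scores[1:]:
            acc = logadd(acc, s)
        node.smeared = acc
    return node.smeared
```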
Currently we support decoding with the following language models: ZeroLM, KenLM and ConvLM. To specify the LM type, use --lmtype [kenlm, convlm]. To use ZeroLM, set --lm=''.
ZeroLM is a fake LM which always returns 0 as the score. It serves as a proxy to run beam-search on AM scores only without breaking the API.
A KenLM language model can be trained standalone with the KenLM library. The text data should be prepared consistently with the acoustic model data. For example, in the case of a word-level LM, if your AM token set doesn't contain punctuation, then remove all punctuation from the data. In the case of token-level LM training, one should first split the words into their token sequences and only then train the LM on such data, so that the LM predicts a probability for a token (not for a word). Both the .arpa and the binarized .bin LM can be used in wav2letter.
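As an illustration of the token-level preparation (a sketch assuming letters as tokens and | as the word separator token; adjust both to your actual token set):

```python
# Sketch: turn word-level text into token-level text for LM training, so that
# the LM assigns a probability per token instead of per word.
def to_token_level(line, wordseparator="|"):
    tokens = []
    for word in line.strip().lower().split():
        tokens.extend(list(word))
        tokens.append(wordseparator)
    return " ".join(tokens[:-1]) if tokens else ""

print(to_token_level("Hello world"))  # h e l l o | w o r l d
```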
ConvLM models are convolutional neural networks. They are currently trained in fairseq and then converted into flashlight-serializable models (see the example of how we do this) so that they can be loaded in wav2letter. --lm_vocab should be specified, as it is the dictionary that maps tokens to the indices used in ConvLM training. Note that this token set is usually different from the one used in wav2letter AM training. Each line of this file is a single token (char, word, word-piece, etc.), and the token index is exactly its line number.
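For illustration, the mapping implied by such a vocabulary file could be loaded as follows (a sketch; whether indices are 0- or 1-based must match the ConvLM training setup):

```python
# Sketch: read an lm_vocab file where each line holds one token and the token's
# index is its line number (0-based here).
def load_lm_vocab(path):
    with open(path, "r", encoding="utf-8") as vocab_file:
        tokens = [line.strip() for line in vocab_file]
    return {token: index for index, token in enumerate(tokens)}
```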
To decode efficiently with ConvLM, whose forward pass is quite expensive, we use a dynamic cache to hold the probabilities over all tokens given the candidates generated from the previous frame. This way, when we want to propose new candidates, we can simply look up their pre-computed LM scores in the cache. In other words, we only need to run the ConvLM forward pass in batches at the end of decoding each frame, once all the possible new candidates have been gathered. Thus, batching and caching greatly reduce the total number of forward passes we need to run.
Usually, the cache has size beam size x number of classes in ConvLM in main memory. If we cannot feed beam size samples to ConvLM in a single batch, --lm_memory is used to limit the input batch size. --lm_memory is an integer that enforces input batch size x LM context size < --lm_memory. For example, if the context size (receptive field) of a ConvLM is 50, then no matter what the beam size or the number of new candidates is, we can only feed 100 samples in a single batch if --lm_memory is set to 5000.
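The batch-size limit above is simple arithmetic; a short sketch of the computation:

```python
# Sketch: the largest batch allowed by --lm_memory given the ConvLM context size.
def max_lm_batch_size(lm_memory, context_size):
    return lm_memory // context_size

print(max_lm_batch_size(5000, 50))  # 100, matching the example above
```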
| Flags | ZeroLM | KenLM | ConvLM |
|---|---|---|---|
| lm | '' | path/to/lm/model | path/to/lm/model |
| lmtype | X | kenlm | convlm |
| lm_vocab | X | X | V |
| lm_memory | X | X | V |
| decodertype | X | V | V |
We support decoding a dataset using several threads by setting --nthread_decoder. The samples in the dataset are dispatched equally to the threads. When decoding CTC/ASG models with a KenLM language model, --nthread_decoder is simply the number of CPU threads used to run beam-search decoding. If one wants to decode Seq2Seq models, or to decode with ConvLM, we need to use flashlight to run the forward pass in each thread. Since forwarding is not thread-safe, each thread needs to acquire its own resources, and a copy of the acoustic model (the seq2seq criterion) and of the LM will be stored on the device it requested. Specifically, if flashlight is built with the CUDA backend, 1 GPU is required per thread and --nthread_decoder should be no larger than the number of visible GPUs.
We support a producer-consumer scheme for parallel computation. --nthread_decoder_am_forward defines the number of threads for the AM forward pass: all these threads place their forward results into a queue, whose maximum size is --emission_queue_size, to be processed by the beam-search decoder. When the forward pass is run on GPUs, --nthread_decoder_am_forward defines the number of GPUs to use for the parallel forward pass. --nthread_decoder threads read from the queue and perform beam-search decoding.
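Schematically (a simplified Python sketch of the scheme, not the actual C++ implementation; run_am_forward and run_beam_search are placeholder names):

```python
import queue

# --emission_queue_size bounds how many forward results can wait to be decoded.
emission_queue = queue.Queue(maxsize=3000)

def am_forward_worker(samples):
    """Producer: one of --nthread_decoder_am_forward threads (or GPUs)."""
    for sample in samples:
        emissions = run_am_forward(sample)  # placeholder for the AM forward pass
        emission_queue.put(emissions)       # blocks while the queue is full

def decoder_worker():
    """Consumer: one of --nthread_decoder beam-search decoding threads."""
    while True:
        emissions = emission_queue.get()
        if emissions is None:               # sentinel meaning "no more work"
            break
        run_beam_search(emissions)          # placeholder for beam-search decoding
```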
All decoders except the Seq2Seq decoder now support online decoding. The decoder consumes small chunks of emissions as input; whenever we want to look at the transcript produced so far, we can get the best transcript, prune the hypothesis space, and keep decoding further.
| Flags | CTC / ASG criterion (Lexicon-based) | Seq2Seq criterion (Lexicon-based) | CTC / ASG criterion (Lexicon-free) | Seq2Seq criterion (Lexicon-free) |
|---|---|---|---|---|
| criterion | ctc / asg | seq2seq | ctc / asg | seq2seq |
| lmweight | V | V | V | V |
| beamsize | V | V | V | V |
| beamsizetoken | V | V | V | V |
| beamthreshold | V | V | V | V |
| uselexicon | true | true | false | false |
| lexicon | path/to/the/lexicon/file | path/to/the/lexicon/file | '' | '' |
| smearing | none / max / logadd | none / max / logadd | X | X |
| wordscore | V | V | X | X |
| wordseparator | X | X | V | V |
| unkscore | V | X | V | X |
| silscore | V | X | V | X |
| eosscore | X | V | X | V |
| attentionthreshold | X | V | X | V |
| smoothingtemperature | X | V | X | V |
| Flags | Flag Type | Default Value | Flag Example Value | Reused from the AM training / Emission Set | Description |
|---|---|---|---|---|---|
| am | string | '' | --am path/to/am/file | N | Full path to the acoustic model binary file. Ignored if emission_dir is specified |
| emission_dir | string | '' | --emission_dir path/to/emission/dir | N | Path to the directory with the emission data stored by the Test binary, to avoid running the AM forward pass during beam-search decoding |
| datadir | string | '' | --datadir path/to/the/list/file/dir | Y | This prefix is used to define the full path to the test list. Set it to '' if you specify the full path in --test |
| test | string | '' | --test path/to/the/test/list/file | Y | Path to the test list file (where id path duration transcription are stored; transcription can be empty). The --datadir parameter is used as a prefix for this path (the paths are concatenated) |
| maxload | int | -1 | --maxload 300 | N | Number of random samples to process (-1 means all samples) |
| show | bool | false | --show | N | Print word transcriptions (target and predicted) for each sample to stdout |
| showletters | bool | false | --showletters | N | Print token transcriptions (target and predicted) for each sample to stdout |
| nthread_decoder | int | 1 | --nthread_decoder 4 | N | Number of threads to run beam-search decoding (details in the Distributed running section) |
| nthread_decoder_am_forward | int | 1 | --nthread_decoder_am_forward 2 | N | Number of threads to run the AM forward pass (details in the Distributed running section) |
| emission_queue_size | int | 3000 | --emission_queue_size 1000 | N | Maximum size of the emission queue (details in the Distributed running section) |
| sclite | string | '' | --sclite path/to/file | N | Path to save the logs, including the stdout log and the hypotheses and references in sclite (trn) format |
| Flags | Flag Type | Default Value | Flag Example Value | Reused from the AM training / Emission Set | Description |
|---|---|---|---|---|---|
| uselexicon | bool | true | --uselexicon | N | True to use lexicon-based beam-search decoding, false to use lexicon-free decoding |
| lexicon | string | '' | --lexicon path/to/the/lexicon/file | Y | Path to the lexicon file where the mapping of words into tokens is given (used for lexicon-based beam-search decoding) |
| lm | string | '' | --lm path/to/the/lm/file | N | Full path to the language model binary file (use '' to use the zero LM) |
| lm_vocab | string | '' | --lm_vocab path/to/lm/vocab/file | N | Path to the vocabulary file that defines the mapping between indices and neural-based LM tokens |
| lm_memory | double | 5000 | --lm_memory 3000 | N | Total memory that defines the batch size used to run the forward pass of a neural-based LM |
| lmtype | string: kenlm / convlm | kenlm | --lmtype kenlm | N | Language model type |
| decodertype | string: wrd / tkn | wrd | --decodertype tkn | N | Language model token type: wrd for a word-level LM, tkn for a token-level LM (the tokens should be the same as the acoustic model token set). If wrd is set, the uselexicon flag is ignored and lexicon-based beam-search decoding is used |
| wordseparator | string | \| | --wordseparator _ | Y | Token used as a word separator (used to get the word transcription from the token transcription for the lexicon-free beam-search decoder) |
| usewordpiece | bool | false | --usewordpiece false | Y | Defines whether the acoustic model is trained with tokens where the word separator is not a separate token (for example, with word-pieces hello world -> _he llo _world, where _ corresponds to word separation). Default false |
| smoothingtemperature | double | 1 | --smoothingtemperature 1.2 | Y | Smooths the posterior distribution of the acoustic model (Seq2Seq criterion only) |
| attentionthreshold | int | -infinity | --attentionthreshold 30 | Y | Limit on the distance between the peak attention locations on the encoded audio for 2 consecutive tokens (Seq2Seq criterion only) |
| Flags | Flag Type | Default Value | Flag Example Value | Description |
|---|---|---|---|---|
| beamsize | int | 2500 | --beamsize 100 | The number of top hypotheses to preserve at each decoding step |
| beamsizetoken | int | 250000 | --beamsizetoken 10 | The number of top tokens (by acoustic model score) to consider at each decoding step |
| beamthreshold | double | 25 | --beamthreshold 15 | Cut off hypotheses whose current score is too far from that of the best hypothesis |
| lmweight | double | 0 | --lmweight 1.1 | Language model weight to accumulate with the acoustic model score |
| wordscore | double | 0 | --wordscore -0.2 | Score to add when a word finishes (lexicon-based beam-search decoder only) |
| eosscore | double | 0 | --eosscore 0.5 | Score to add when the end of sentence is generated (Seq2Seq criterion only) |
| silscore | double | 0 | --silscore 0.5 | Silence appearance score to add (CTC/ASG models only) |
| unkscore | double | -infinity | --unkscore 0.5 | Unknown word appearance score to add (CTC/ASG with the lexicon-based beam-search decoder) |
| smearing | string: none / max / logadd | none | --smearing none | Smearing procedure (lexicon-based beam-search decoder only) |
We assume that the saved datadir, tokensdir and tokens stored inside the AM model point to existing paths (otherwise you should redefine them on the command line). Also, criterion, wordseparator and usewordpiece will be loaded from the model. To use a previously saved Emission Set, replace the am flag with emission_dir.
wav2letter/build/Decoder \
--am path/to/train/am.bin \
--test path/to/test/list/file \
--maxload 10 \
--nthread_decoder 2 \
--show \
--showletters \
--lexicon path/to/the/lexicon/file \
--uselexicon [true, false] \
--lm path/to/lm/file \
--lmtype [kenlm, convlm] \
--decodertype [wrd, tkn] \
--beamsize 100 \
--beamsizetoken 100 \
--beamthreshold 20 \
--lmweight 1 \
--wordscore 0 \
--eosscore 0 \
--silscore 0 \
--unkscore 0 \
--smearing max
One can simply put all the flags into a file, for example decode.cfg:
--am=path/to/train/am.bin
--test=/absolute/path/to/test/list/file
--maxload=10
--nthread_decoder=2
--show
--showletters
--lexicon=path/to/the/lexicon/file
--uselexicon=true
--lm=path/to/lm/file
--lmtype=kenlm
--decodertype=wrd
--beamsize=100
--beamsizetoken=100
--beamthreshold=20
and then run the Decoder binary with these flags (other flags can also be added on the command line):
wav2letter/build/Decoder \
--flagsfile decode.cfg \
--lmweight 1 \
--wordscore 0 \
--eosscore 0 \
--silscore 0 \
--unkscore 0 \
--smearing max