We revisit self-training in the context of end-to-end speech recognition. We demonstrate that training with pseudo-labels can substantially improve the accuracy of a baseline model. Key to our approach are a strong baseline acoustic and language model used to generate the pseudo-labels, filtering mechanisms tailored to common errors from sequence-to-sequence models, and a novel ensemble approach to increase pseudo-label diversity. Experiments on the LibriSpeech corpus show that with an ensemble of four models and label filtering, self-training yields a 33.9% relative improvement in WER compared with a baseline trained on 100 hours of labeled data in the noisy speech setting. In the clean speech setting, self-training recovers 59.3% of the gap between the baseline and an oracle model, a relative improvement of at least 93.8% over previous approaches.
Acoustic model configuration files are provided for each dataset to reproduce the training and decoding steps from the paper.
Pretrained convolutional language models used in the paper are also included, along with steps to generate the corpus on which those language models were trained and steps to reproduce acoustic model training.
Training and decoding broadly follow the existing TDS seq2seq recipes (see the example below).
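As a rough illustration (not the recipe's exact commands), training and decoding with wav2letter++ look like the following; the build directory and `.cfg` flags-file names are placeholders for the configuration files shipped with this recipe:

```bash
# Sketch: train an acoustic model from a recipe flags file.
# <build_dir> and the .cfg names are placeholders, not files from this recipe.
<build_dir>/Train train --flagsfile=am_tds_s2s_librispeech.cfg

# Sketch: decode a trained model (the LM, lexicon, etc. are set in the flags file).
<build_dir>/Decode --flagsfile=decode_tds_s2s_gcnn.cfg
```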
All results from the paper can be reproduced exactly with the following project commits:
- flashlight - commit `77ad2f79249c6833875f57865712de4666617d00`
- wav2letter - commit `57b4904c8c4a808d393f047a9352c2d5be57ae8f`

Each commit contains versioned documentation for building and installing the requisite dependencies.
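To pin both projects to these exact commits before building, check them out directly; the GitHub URLs below are assumed to be the upstream repositories (adjust if you use forks or mirrors):

```bash
# Clone and pin flashlight and wav2letter to the commits listed above.
# Repository URLs are an assumption of this sketch.
git clone https://github.com/facebookresearch/flashlight.git
git -C flashlight checkout 77ad2f79249c6833875f57865712de4666617d00

git clone https://github.com/facebookresearch/wav2letter.git
git -C wav2letter checkout 57b4904c8c4a808d393f047a9352c2d5be57ae8f
```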
Dataset | Train Set | Lexicon | Tokens |
---|---|---|---|
LibriSpeech | train-clean-100 Baseline | Lexicon | Tokens |
LibriSpeech | train-clean-100 + train-clean-360 Oracle | Lexicon | Tokens |
LibriSpeech | train-clean-100 + train-other-500 Oracle | Lexicon | Tokens |
Tokens and lexicon files generated in the `$MODEL_DST/am/` and `$MODEL_DST/decoder/` directories following the LibriSpeech recipe are the same as those in the table.
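For orientation, the generated files can be inspected in place; `$MODEL_DST` is the destination directory defined in the LibriSpeech recipe:

```bash
# List the generated token/lexicon artifacts ($MODEL_DST comes from the
# LibriSpeech recipe; exact file names depend on that recipe's outputs).
ls "$MODEL_DST/am/"       # tokens and lexicon for acoustic model training
ls "$MODEL_DST/decoder/"  # lexicon and LM artifacts for decoding
```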
Components of the baseline and oracle models, trained only on labeled LibriSpeech training sets, are below.
Dataset | Train Set | Acoustic Model: dev-clean | Acoustic Model: dev-other |
---|---|---|---|
LibriSpeech | train-clean-100 Baseline | dev-clean | dev-other |
LibriSpeech | train-clean-100 + train-clean-360 Oracle | dev-clean | dev-other |
LibriSpeech | train-clean-100 + train-other-500 Oracle | dev-clean | dev-other |
Below are models trained on pseudo-labels. All train sets include the base train-clean-100 set in addition to generated pseudo-labels. Steps for generating pseudo-labels can be found here; a rough sketch of the decoding step appears after the table.
Dataset | Pseudo-Labeled Set | AM: dev-clean | AM: dev-other | Synthetic Lexicon |
---|---|---|---|---|
LibriSpeech | train-clean-100 + train-clean-360 PLs (single) | dev-clean | dev-other | Synthetic Lexicon |
LibriSpeech | train-clean-100 + train-other-500 PLs (single) | dev-clean | dev-other | Synthetic Lexicon |
LibriSpeech | train-clean-100 + train-clean-360 (ensemble: 2+3+5+7+8) | dev-clean | dev-other | |
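As a hedged sketch of the pseudo-label generation step (the flags file and output directory below are placeholders, and the filtering itself follows the paper's heuristics rather than any single command):

```bash
# Sketch: decode the unlabeled audio with the baseline AM + LM and keep the
# hypotheses as pseudo-labels. Flags file and output path are placeholders.
<build_dir>/Decode --flagsfile=decode_unlabeled.cfg --sclite=pseudo_labels/

# The dumped hypotheses are then filtered (e.g., removing looping n-gram
# transcriptions, per the paper) and merged with the labeled train-clean-100
# list to form the self-training set.
```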
The LibriSpeech recipe instructions contain steps to reproduce the language model training corpus. Below are the components of the GCNN language model used for decoding:
LM type | Language model | Vocabulary | Architecture | LM fairseq | Dict fairseq |
---|---|---|---|---|---|
GCNN | word-piece GCNN | 4k WP | Archfile | fairseq LM | fairseq Dict |
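Since the LM artifacts above are fairseq models, a convolutional LM of this kind can be trained with the fairseq CLI. A minimal sketch, assuming a word-piece corpus already split into `lm_corpus.train`/`lm_corpus.valid` and using the stock `fconv_lm` architecture; the paper's exact architecture is defined by the archfile linked above:

```bash
# Sketch: binarize a word-piece LM corpus and train a GCNN-style conv LM with
# fairseq. Paths, architecture, and hyperparameters here are assumptions.
fairseq-preprocess --only-source \
  --trainpref lm_corpus.train --validpref lm_corpus.valid \
  --destdir lm_data-bin

fairseq-train lm_data-bin \
  --task language_modeling \
  --arch fconv_lm \
  --optimizer nag --lr 0.5 --clip-norm 0.1 \
  --max-tokens 2048 --tokens-per-sample 512 \
  --save-dir gcnn_lm_checkpoints
```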
@article{kahn2019selftraining,
  title={Self-Training for End-to-End Speech Recognition},
  author={Jacob Kahn and Ann Lee and Awni Hannun},
  year={2019},
  eprint={1909.09116},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}