# Self-Training for End-to-End Speech Recognition

## Abstract

We revisit self-training in the context of end-to-end speech recognition. We demonstrate that training with pseudo-labels can substantially improve the accuracy of a baseline model. Key to our approach are a strong baseline acoustic and language model used to generate the pseudo-labels, filtering mechanisms tailored to common errors from sequence-to-sequence models, and a novel ensemble approach to increase pseudo-label diversity. Experiments on the LibriSpeech corpus show that with an ensemble of four models and label filtering, self-training yields a 33.9% relative improvement in WER compared with a baseline trained on 100 hours of labelled data in the noisy speech setting. In the clean speech setting, self-training recovers 59.3% of the gap between the baseline and an oracle model, which is at least 93.8% relatively higher than what previous approaches can achieve.
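To make the two headline metrics concrete, here is a small Python sketch (not from the repository; the WER values are hypothetical placeholders, not numbers from the paper) showing how relative WER improvement and the fraction of the baseline-to-oracle gap recovered are computed:

```python
# Illustrative arithmetic for the two metrics quoted in the abstract.
# The WER values below are hypothetical placeholders, NOT numbers from the paper.

def relative_wer_improvement(wer_baseline: float, wer_self_trained: float) -> float:
    """Relative WER reduction of the self-trained model over the baseline."""
    return (wer_baseline - wer_self_trained) / wer_baseline

def oracle_gap_recovered(wer_baseline: float, wer_self_trained: float,
                         wer_oracle: float) -> float:
    """Fraction of the baseline-to-oracle WER gap closed by self-training."""
    return (wer_baseline - wer_self_trained) / (wer_baseline - wer_oracle)

if __name__ == "__main__":
    base, st, oracle = 20.0, 13.2, 9.0  # hypothetical WERs (%)
    print(f"relative improvement: {relative_wer_improvement(base, st):.1%}")
    print(f"oracle gap recovered: {oracle_gap_recovered(base, st, oracle):.1%}")
```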

## Reproduction

Acoustic model configuration files are provided for each dataset to reproduce the training and decoding results from the paper.

Pretrained convolutional language models used in the paper are also included, along with steps to generate the corpus on which they were trained and steps to reproduce acoustic model training.

Training and decoding broadly follow the existing TDS seq2seq recipes.
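As a rough illustration of that workflow, the sketch below drives the wav2letter `Train` and `Decode` binaries from Python. The build and config paths are hypothetical placeholders; consult the recipe's shipped flagsfiles for the exact invocations.

```python
# Sketch: drive wav2letter training and decoding from Python. The build and
# config paths are hypothetical placeholders; see the recipe configs shipped
# with this repository for the real flagsfiles.
import subprocess

W2L_BUILD = "/path/to/wav2letter/build"  # placeholder

# Train the acoustic model using a recipe flagsfile.
subprocess.run([f"{W2L_BUILD}/Train", "train",
                "--flagsfile=configs/train.cfg"], check=True)

# Decode with the trained acoustic model and the language model.
subprocess.run([f"{W2L_BUILD}/Decode",
                "--flagsfile=configs/decode.cfg"], check=True)
```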

## Dependencies

All results from the paper can be reproduced exactly with the following project commits:

Each commit contains versioned documentation for building and installing requisite dependencies.

## Tokens and Lexicon Sets

| Dataset | Unlabeled Set | Lexicon | Tokens |
| --- | --- | --- | --- |
| LibriSpeech | `train-clean-100` (Baseline) | Lexicon | Tokens |
| LibriSpeech | `train-clean-100` + `train-clean-360` (Oracle) | Lexicon | Tokens |
| LibriSpeech | `train-clean-100` + `train-other-500` (Oracle) | Lexicon | Tokens |

The tokens and lexicon files generated in the `$MODEL_DST/am/` and `$MODEL_DST/decoder/` directories when following the LibriSpeech recipe are identical to those in the table above.
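As a quick sanity check that a generated lexicon is consistent with its token set, a sketch like the following can be used. The file names and the `word<TAB>spelling` line format are assumptions based on common wav2letter conventions, not guarantees of this recipe:

```python
# Check that every lexicon spelling uses only tokens from the token set.
# File names below are hypothetical; the lexicon line format assumed is
# "WORD<TAB>tok tok ..." and the tokens file lists one token per line.
import os

model_dst = os.environ.get("MODEL_DST", ".")
tokens_path = os.path.join(model_dst, "am", "tokens.txt")    # hypothetical name
lexicon_path = os.path.join(model_dst, "am", "lexicon.txt")  # hypothetical name

with open(tokens_path) as f:
    tokens = {line.strip() for line in f if line.strip()}

with open(lexicon_path) as f:
    for lineno, line in enumerate(f, 1):
        word, _, spelling = line.rstrip("\n").partition("\t")
        unknown = [t for t in spelling.split() if t not in tokens]
        if unknown:
            print(f"line {lineno}: '{word}' uses unknown tokens {unknown}")
```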

## Pre-Trained Models

### Acoustic Models

Below are components of the baseline and oracle models trained only on labeled LibriSpeech training sets.

| Dataset | Unlabeled Set | Acoustic Model: dev-clean | Acoustic Model: dev-other |
| --- | --- | --- | --- |
| LibriSpeech | `train-clean-100` (Baseline) | dev-clean | dev-other |
| LibriSpeech | `train-clean-100` + `train-clean-360` (Oracle) | dev-clean | dev-other |
| LibriSpeech | `train-clean-100` + `train-other-500` (Oracle) | dev-clean | dev-other |

Below are models trained on pseudo-labels. All train sets include the base `train-clean-100` set in addition to generated pseudo-labels. Steps for generating pseudo-labels can be found here; a conceptual sketch of the filtering and ensemble combination follows the table.

| Dataset | Pseudo-Labeled Set | AM: dev-clean | AM: dev-other | Synthetic Lexicon |
| --- | --- | --- | --- | --- |
| LibriSpeech | `train-clean-100` + `train-clean-360` PLs (single) | dev-clean | dev-other | Synthetic Lexicon |
| LibriSpeech | `train-clean-100` + `train-other-500` PLs (single) | dev-clean | dev-other | Synthetic Lexicon |
| LibriSpeech | `train-clean-100` + `train-clean-360` (ensemble: 2+3+5+7+8) | dev-clean | dev-other | |
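For intuition, here is a minimal conceptual sketch (not the repository's implementation) of the two pseudo-label ideas described in the paper: heuristic filtering of common seq2seq failure modes and sample-level combination of pseudo-labels from an ensemble of models. The n-gram order and confidence threshold are illustrative assumptions; see the paper and the steps above for the actual procedure.

```python
# Conceptual sketch of pseudo-label filtering and ensemble combination.
# Thresholds and the n-gram order are illustrative assumptions.
import random
from typing import Dict, List

def has_repeated_ngram(words: List[str], n: int = 4) -> bool:
    """Flag the looping failure mode of seq2seq decoders: any n-gram twice."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(ngrams) != len(set(ngrams))

def keep_pseudo_label(words: List[str], log_prob: float,
                      min_confidence: float = -1.0) -> bool:
    """Drop empty, looping, or low-confidence (length-normalized) hypotheses."""
    if not words or has_repeated_ngram(words):
        return False
    return log_prob / len(words) >= min_confidence

def sample_ensemble_label(labels_by_model: Dict[str, List[str]]) -> List[str]:
    """Sample-level combination: draw one model's pseudo-label per utterance
    (e.g. each epoch) to increase pseudo-label diversity."""
    model = random.choice(list(labels_by_model))
    return labels_by_model[model]
```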

### Language Models

The LibriSpeech recipe instructions contain steps to reproduce the language model training corpus. Below are components of the GCNN language model used for decoding; a sketch of the fusion objective follows the table:

| LM type | Language model | Vocabulary | Architecture | LM fairseq | Dict fairseq |
| --- | --- | --- | --- | --- | --- |
| GCNN word-piece | GCNN | 4k WP | Archfile | fairseq LM | fairseq Dict |
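During decoding, the external LM is combined with the acoustic model via the standard shallow-fusion objective. Below is a sketch of the per-hypothesis score; `lm_weight` and `word_score` here are placeholder values, with the tuned values coming from the decoding configs above:

```python
# Sketch of the shallow-fusion score used to rank beam-search hypotheses:
#   log p_AM(Y|X) + lm_weight * log p_LM(Y) + word_score * |Y|
# The weight values below are placeholders, not tuned numbers from the paper.

def fused_score(am_log_prob: float, lm_log_prob: float, num_words: int,
                lm_weight: float = 0.5, word_score: float = 0.3) -> float:
    """Combined hypothesis score: acoustic model + weighted LM + length bonus."""
    return am_log_prob + lm_weight * lm_log_prob + word_score * num_words
```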

## Citation

```bibtex
@article{kahn2019selftraining,
    title={Self-Training for End-to-End Speech Recognition},
    author={Jacob Kahn and Ann Lee and Awni Hannun},
    year={2019},
    eprint={1909.09116},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```