This repository contains an implementation of Minimum Word Error Rate (MWER) Training based on Monotonic RNN-T Loss for audio transduction. The solution is based on the TensorFlowASR library (
NOTE: We assume that Python is installed and configured (and Docker, if you access the code via Docker) before you install the solution.
- Install requirements
pip install -r requirements-pre.txt && pip install -r requirements.txt
- Download LibriSpeech test dataset.
tar -xzvf test-clean.tar.gz
- Add working directory to PYTHONPATH.
- Create transcriptions (.tsv files) for downloaded datasets.
python ./scripts/ -d <extracted_dataset_dir> <target_dir>
python ./scripts/ -d ./LibriSpeech/test-clean ./LibriSpeech/test_transcriptions/test.tsv
- Provide paths to the generated transcriptions in the config file ('data_path' in train, eval and test subconfigs)
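Before pointing 'data_path' at a generated file, it can help to check that the file is well-formed. The sketch below assumes a TensorFlowASR-style layout: a header row followed by tab-separated rows with three columns (audio path, duration, transcript). That layout is an assumption, so adjust `expected_columns` if your files differ. The helper name `check_transcript_tsv` is hypothetical and not part of the repository.

```python
import csv
import os


def check_transcript_tsv(tsv_path, expected_columns=3, check_audio=False):
    """Return (number_of_rows, list_of_problems) for a transcription file.

    Assumes a header row followed by tab-separated rows whose first
    column is the audio path (an assumption about the file layout).
    """
    problems = []
    rows = 0
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)  # skip the header row
        if header is None:
            return 0, ["file is empty"]
        for line_no, row in enumerate(reader, start=2):
            rows += 1
            if len(row) != expected_columns:
                problems.append(
                    f"line {line_no}: expected {expected_columns} columns, got {len(row)}"
                )
                continue
            # Optionally verify that the referenced audio file exists.
            if check_audio and not os.path.isfile(row[0]):
                problems.append(f"line {line_no}: missing audio file {row[0]}")
    return rows, problems
```

Calling `check_transcript_tsv("./LibriSpeech/test_transcriptions/test.tsv")` returns the number of data rows and a list of human-readable problems. An empty list suggests the file is safe to reference from the config.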
For convenience, we provide a way to access the code using Docker. The guide assumes the user has installed docker, docker-compose and nvidia-docker (
NOTE: By default, the Docker installation downloads only the test dataset.
- Build and run docker container.
docker compose run tensorflow_asr
- Download LibriSpeech train and dev datasets. NOTE: The datasets are quite large (around 24GB), so the download might take a while. Training will also fail on the majority of consumer-tier GPUs due to insufficient memory.
tar -xzvf dev-clean.tar.gz
tar -xzvf train-clean-360.tar.gz
- Create transcriptions (.tsv files) for downloaded datasets.
python ./scripts/ -d ./LibriSpeech/train-clean-360 /data/LibriSpeech/test_transcriptions/train.tsv
python ./scripts/ -d ./LibriSpeech/dev-clean /data/LibriSpeech/test_transcriptions/dev.tsv
- To start training, use the training script under examples/conformer/. To train with the MWER procedure, set the boolean mwer_training under model_config in config.yml to True; otherwise the script runs the standard training procedure with the regular RNN-T loss. The script accepts several arguments. The most important are:
- --config: path to the model config.yml file.
- --sentence_piece: flag that enables sentence_piece as the text tokenizer.
- --bs: batch size.
- --devices: which GPU devices to use.
The remaining arguments are described in the script file. An example command that works under the default setup:
python examples/conformer/ --config examples/conformer/config.yml --sentence_piece --devices 0
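To make the mwer_training switch more concrete, here is a minimal pure-Python sketch of the commonly used N-best formulation of the MWER objective. Hypothesis probabilities are renormalized over the N-best list, and each hypothesis is weighted by its word-error count relative to the list average. This illustrates the general technique only and is not the repository's implementation; the names `word_errors` and `mwer_loss` are hypothetical. In actual training the log-probabilities would come from the model (for example, from the RNN-T loss of each hypothesis) and gradients would flow through them.

```python
import math


def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def mwer_loss(reference, hypotheses, log_probs):
    """Expected relative word errors over an N-best list (illustrative)."""
    # Renormalize probabilities over the N-best list (stable softmax).
    peak = max(log_probs)
    weights = [math.exp(lp - peak) for lp in log_probs]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    # Subtract the mean error count as a variance-reducing baseline.
    errors = [word_errors(reference, h) for h in hypotheses]
    mean_errors = sum(errors) / len(errors)
    return sum(p * (e - mean_errors) for p, e in zip(posteriors, errors))
```

The loss is negative when most of the probability mass sits on hypotheses with fewer word errors than the N-best average, and positive when it sits on worse-than-average hypotheses. Minimizing it therefore shifts probability toward lower-error transcripts.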
To start testing, use the inference script under examples/conformer/. The script accepts several arguments for testing. The most important are:
- --saved: path to the saved model.
- --config: path to the model config.yml file.
- --sentence_piece: flag that enables sentence_piece as the text tokenizer.
- --bs: batch size.
- --output: path to the output transcriptions.
An example command that works under the default setup:
python ./examples/conformer/ --config ./examples/conformer/config.yml \
--saved predefined_checkpoints/weights.hdf5 \
--sentence_piece \
--output test_result.tsv \
--bs 1
To run a demonstration on an actual FLAC file, run the following command from the root directory:
python examples/demonstration/ --config ./examples/conformer/config.yml \
--saved predefined_checkpoints/weights.hdf5 \
--sentence_piece \
--subwords ./vocabularies/librispeech/spm_512 \
--beam_width 1 \
This script demonstrates how to use the model on real-world data.