A new automatic speech recognizer for Brazilian Portuguese based on deep neural networks and transfer learning
Pytorch code of "A new automatic speech recognizer for Brazilian Portuguese based on deep neural networks and transfer learning" submitted to AES-LAC 2018
TL;DR
The paper demonstrates how to perform transfer learning from a pre-trained model based on Deep Speech 2 for English to Brazilian Portuguese, outperforming previous work and achieving a character error rate (CER) of ~16%.
This paper addresses the problem of training deep learning models for automatic speech recognition on languages with few available resources, such as Brazilian Portuguese, by employing transfer learning strategies. Starting from a backbone model trained on English, the best fine-tuned network reduces the character error rate by 8.5%, outperforming previous work.
Several libraries need to be installed for training to work. The instructions below assume everything is being installed into an Anaconda installation on Ubuntu.
Install PyTorch if you haven't already.
Clone this repo and run this within the repo:
pip install -r requirements.txt
Install this fork for Warp-CTC bindings:
git clone https://github.com/SeanNaren/warp-ctc.git
cd warp-ctc
mkdir build; cd build
cmake ..
make
export CUDA_HOME="/usr/local/cuda"
cd ../pytorch_binding
python setup.py install
Install pytorch audio:
sudo apt-get install sox libsox-dev libsox-fmt-all
git clone https://github.com/pytorch/audio.git
cd audio
python setup.py install
Install ignite:
git clone https://github.com/pytorch/ignite.git && \
cd ignite && \
python setup.py install && \
cd .. && \
rm -rf ignite
We also provide a Dockerfile. See here for how to set up Docker correctly.
Five datasets were used in this paper. LibriSpeech was used to train the backbone model; VoxForge PT-BR, Sid, and Spoltech were used to fine-tune the pre-trained model to Brazilian Portuguese; and LapsBM was used for validation and testing.
To download and set up the LibriSpeech dataset, run the command below in the root folder of the repo:
python -m data.librispeech
Note that this dataset does not come with a validation dataset or test dataset.
To download and set up the VoxForge dataset, run the command below in the root folder of the repo:
python -m data.voxforge
Note that this dataset does not come with a validation dataset or test dataset.
To download and set up the Sid dataset, run the command below in the root folder of the repo:
python -m data.sid
Note that this dataset does not come with a validation dataset or test dataset.
The Spoltech dataset is not publicly available, so you need to purchase and download it here. Then, extract it into data/spoltech_dataset/downloads/extracted/files. Finally, run the command below in the root folder of the repo:
python -m data.spoltech
Note that this dataset does not come with a validation dataset or test dataset.
To download and set up the LapsBM dataset, run the command below in the root folder of the repo:
python -m data.lapsbm
To create a custom dataset you must create a CSV file containing the locations of the training data, in the following format:
/path/to/audio.wav,/path/to/text.txt
/path/to/audio2.wav,/path/to/text2.txt
...
The first path is to the audio file, and the second path is to a text file containing the transcript on one line. This can then be used as stated below.
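As a minimal sketch, such a manifest can be generated with a few lines of Python, assuming a hypothetical layout in which every .wav file has a transcript .txt with the same stem (the paths and layout below are illustrative only):

```python
import csv
from pathlib import Path

# Illustrative layout: data/my_dataset/audio/*.wav with matching
# transcripts in data/my_dataset/text/*.txt (same file stem).
root = Path("data/my_dataset")
rows = []
for wav in sorted((root / "audio").glob("*.wav")):
    txt = root / "text" / (wav.stem + ".txt")
    if txt.exists():
        rows.append([str(wav.resolve()), str(txt.resolve())])

with open("data/my_dataset.train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```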
The PT-BR training manifest is an ensemble of three smaller datasets: VoxForge, Spoltech and Sid.
To create bigger manifest files (to train/test on multiple datasets at once), you can merge manifests as shown below, running from a directory that contains all the manifests you want to merge. You can also prune short and long clips out of the new manifest; a sketch of what such a merge amounts to follows the commands.
cd data/
python merge_manifests.py --output-path pt_BR.train.csv sid.train.csv spoltech.train.csv voxforge.train.csv
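For intuition only, merging boils down to concatenating the manifest rows and optionally filtering clips by duration. The snippet below is a hedged re-implementation using only the standard library, not the actual merge_manifests.py (its options and defaults may differ):

```python
import csv
import wave

def merge_manifests(inputs, output, min_s=1.0, max_s=15.0):
    """Concatenate manifest CSVs, keeping only clips whose duration
    lies within [min_s, max_s] seconds (thresholds are illustrative)."""
    rows = []
    for manifest in inputs:
        with open(manifest) as f:
            for wav_path, txt_path in csv.reader(f):
                with wave.open(wav_path) as w:
                    duration = w.getnframes() / w.getframerate()
                if min_s <= duration <= max_s:
                    rows.append([wav_path, txt_path])
    with open(output, "w", newline="") as f:
        csv.writer(f).writerows(rows)

# e.g. merge_manifests(["sid.train.csv", "spoltech.train.csv",
#                       "voxforge.train.csv"], "pt_BR.train.csv")
```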
The script train.py trains the Deep Speech 2 model with a variety of hyperparameters on arbitrary datasets, taking a configuration .json file as input. You can check several examples in the scripts folder.
Options such as checkpointing and the location of the results folder are passed through the command line. You may run
python train.py --help
or just check out train.py for more details.
Training supports saving checkpoints of the model to continue training. To enable epoch checkpoints use:
python train.py --checkpoint
To continue from a checkpointed model that has been saved:
python train.py --continue-from path/to/model.pth.tar
Also note that there is no final softmax layer in the model, since warp-ctc applies the softmax internally during training. Any decoder built on top of the model will therefore have to apply it explicitly, so take this into consideration; see the sketch below.
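As a minimal sketch of that point, the snippet below applies the missing softmax before a simple greedy CTC decode; it assumes the network outputs raw logits of shape (T, N, C) and that index 0 is the CTC blank (both are assumptions, not guaranteed by this repo):

```python
import torch
import torch.nn.functional as F

def greedy_decode(logits, labels, blank_index=0):
    """Greedy CTC decoding from raw (pre-softmax) network outputs.

    logits: tensor of shape (T, N, C), raw scores straight from the model.
    labels: sequence mapping class indices to characters.
    """
    # The softmax is missing from the model itself, so apply it here.
    probs = F.log_softmax(logits, dim=-1)
    best = probs.argmax(dim=-1)  # (T, N) index of the best class per frame
    decoded = []
    for n in range(best.size(1)):
        prev, chars = blank_index, []
        for t in range(best.size(0)):
            idx = best[t, n].item()
            # Collapse repeated indices and drop CTC blanks.
            if idx != prev and idx != blank_index:
                chars.append(labels[idx])
            prev = idx
        decoded.append("".join(chars))
    return decoded
```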
To train the backbone model, run
python train.py scripts/librispeech-from_scratch.json --data-dir data/ --train-manifest data/librispeech.train.csv --val-manifest data/librispeech.val.csv --local --checkpoint
and a folder called results/librispeech-from_scratch will be created containing the model checkpoints and the best model file. In the paper, our best backbone model achieved a word error rate (WER) of 11.66% on test-clean and 30.70% on test-other.
The commands listed below correspond to the experiments conducted in Sec. 5.2 of the paper:
python train.py scripts/pt_BR-from_scratch.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --local --checkpoint
python train.py scripts/pt_BR-finetune-freeze.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --continue-from results/librispeech-from_scratch/models/model_best-ckpt_5.pth --local --checkpoint
With lr=3e-4:
python train.py scripts/pt_BR-finetune.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --continue-from results/librispeech-from_scratch/models/model_best-ckpt_5.pth --local --checkpoint
With lr=3e-5:
python train.py scripts/pt_BR-finetune-lower-lr.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --continue-from results/librispeech-from_scratch/models/model_best-ckpt_5.pth --local --checkpoint
The results of these models on the test set are listed below.
The commands listed below correspond to the experiments conducted in Sec. 5.3 of the paper:
python train.py scripts/pt_BR-from_scratch-accents.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --local --checkpoint
python train.py scripts/pt_BR-finetune-accents-random-fc.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --continue-from results/librispeech-from_scratch/models/model_best-ckpt_5.pth --local --checkpoint
python train.py scripts/pt_BR-finetune-accents-map-fc.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --continue-from results/librispeech-from_scratch/models/model_best-ckpt_5.pth --local --checkpoint
|  | scratch | random FC weights | non-random FC weights |
| --- | --- | --- | --- |
| CER | 22.78% | 17.73% | 17.72% |
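For reference, CER here is the character-level edit (Levenshtein) distance between the predicted and reference transcripts, normalized by the reference length; a minimal sketch of how it can be computed:

```python
def cer(hypothesis, reference):
    """Character error rate: edit distance between hypothesis and
    reference strings, divided by the reference length."""
    # Classic Levenshtein dynamic programming over characters.
    prev = list(range(len(reference) + 1))
    for i, h in enumerate(hypothesis, 1):
        curr = [i]
        for j, r in enumerate(reference, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

# e.g. cer("ola mundo", "olá mundo") -> 0.111... (1 substitution / 9 chars)
```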
To evaluate a trained model on a test set (has to be in the same format as the training set):
python test.py --model-path models/deepspeech.pth --manifest /path/to/test_manifest.csv --cuda
Pre-trained models can be found under releases here.
If you use this code in your research, please use the following BibTeX entry:
@inproceedings{quintanilha2018,
author = {Quintanilha, I. M. and Biscainho, L. W. P. and Netto, S. L.},
title = "A new automatic speech recognizer for Brazilian Portuguese based on deep neural networks and transfer learning",
booktitle = "Congreso Latinoamericano de Ingenier\'{i}a de Audio",
address = {Montevideo, Uruguay},
month = {September},
year = {2018},
note = {(Submitted)}
}
This research was partially supported by CNPq and CAPES.
Thanks to SeanNaren, whose implementation inspired ours.