This README is a quick-start guide to training or fine-tuning an STT model with the Coqui toolkit on Kinyarwanda speech data.
To avoid problems setting up Coqui STT in your environment and to sidestep compatibility issues, we recommend building and running the Coqui STT Docker image (tagged stt-train:latest) from the provided training Dockerfile:
$ git clone --recurse-submodules https://github.com/coqui-ai/STT
$ cd STT
$ docker build -f Dockerfile.train . -t stt-train:latest
$ docker run -it stt-train:latest
After downloading and extracting the dataset, we find the following contents:
- .tsv files, containing metadata such as text transcripts
- .mp3 audio files, located in the clips directory
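If you want to inspect the metadata before importing, the .tsv files can be read with pandas. A quick peek (the file name and the path/sentence column names are assumptions based on typical Common Voice releases):
import pandas as pd

# "path" points at an .mp3 inside the clips directory; "sentence" holds the transcript.
df = pd.read_csv("/path/to/extracted/common-voice/archive/validated.tsv", sep="\t")
print(df[["path", "sentence"]].head())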
Coqui STT cannot directly work with Common Voice data, so we need the Coqui importer script bin/import_cv2.py to format the data correctly:
$ bin/import_cv2.py --validate_label_locale /path/to/validate_locale_rw.py /path/to/extracted/common-voice/archive
The importer script above creates .csv files from the .tsv files, and .wav files from the .mp3 files.
The --validate_label_locale flag is optional but recommended for data cleaning. Details on the script passed to this flag can be found in the data cleaning section below.
- To clean the data, we need to validate the text: each sentence is checked to see whether it can be converted and, where possible, the encoding is normalized, special characters are removed, and so on. For this we use the commonvoice-utils tool to clean the text for Kinyarwanda (rw).
The file (script) below is passed as an argument to the --validate_label_locale flag in the importer command above:
# validate_locale_rw.py
from cvutils import Validator

def validate_label(label):
    v = Validator("rw")  # rw - locale for Kinyarwanda; change accordingly for other languages.
    return v.validate(label)
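As a quick illustrative check (the sample sentence is arbitrary), the validator is expected to return a cleaned, normalized sentence, or None when the text cannot be validated:
print(validate_label("Muraho, murakaza neza!"))  # cleaned sentence, or None if it cannot be validated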
- We also need to ensure that each audio clip (input) is longer than its transcript (output). This step is important to avoid training errors. As a result, we remove rows from the train, dev, and test CSVs that don't meet this criterion.
$ python3 /path/to/remove_outliers.py /path/to/train.csv --clips_dir /path/to/clips
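The remove_outliers.py script is not part of Coqui STT. A minimal sketch of what such a script might look like is shown below, assuming the standard Coqui CSV columns (wav_filename, wav_filesize, transcript) and the default 32 ms / 20 ms feature windows; the actual script may differ.
# remove_outliers.py -- illustrative sketch only; the real script may differ.
# Drops rows whose audio is too short for its transcript (fewer feature
# frames than output characters), which otherwise causes training errors.
import argparse
import os
import wave

import pandas as pd

WIN_LEN_MS = 32   # Coqui STT default --feature_win_len
WIN_STEP_MS = 20  # Coqui STT default --feature_win_step

def n_frames(wav_path):
    # Approximate number of feature frames extracted from the clip.
    with wave.open(wav_path, "rb") as w:
        duration_ms = 1000.0 * w.getnframes() / w.getframerate()
    return max(0, int((duration_ms - WIN_LEN_MS) / WIN_STEP_MS) + 1)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("csv_file")
    parser.add_argument("--clips_dir", default=".")
    args = parser.parse_args()

    df = pd.read_csv(args.csv_file)

    def keep(row):
        wav = os.path.join(args.clips_dir, os.path.basename(row["wav_filename"]))
        return n_frames(wav) > len(str(row["transcript"]))

    kept = df[df.apply(keep, axis=1)]
    print(f"Removed {len(df) - len(kept)} of {len(df)} rows")
    kept.to_csv(args.csv_file, index=False)

if __name__ == "__main__":
    main()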
Since we will be training (and validating) our model within the Docker image we built initially, we first need to create and run a container:
$ docker run -it --name sttcontainer -v ~/data:/code/data/host_data --gpus all stt-train:latest
The above command does the following:
- creates a container named sttcontainer
- bind-mounts the ~/data directory on the host to /code/data/host_data in the Docker environment
- gives the Docker environment access to all the host GPUs
The following assumes we are within the Docker environment. If not, run docker exec -it sttcontainer bash to enter it:
# directory to save the training (loss) results
$ mkdir data/host_data/tensorboard
$ python -m coqui_stt_training.train \
--load_checkpoint_dir data/host_data/jan-8-2021-best-kinya-deepspeech \
--save_checkpoint_dir data/host_data/best-kinya-checkpoint \
--alphabet_config_path data/host_data/kinyarwanda_alphabet.txt \
--n_hidden 2048 \
--train_cudnn true \
--train_files data/host_data/misc/lg-rw-oct2021/rw/clips/train.csv \
--dev_files data/host_data/misc/lg-rw-oct2021/rw/clips/dev.csv \
--epochs 20 \
--train_batch_size 128 \
--dev_batch_size 128 \
--summary_dir data/host_data/tensorboard
The flags below were explored experimentally to obtain a better model. You may wish to consider them as well:
--learning_rate 0.00001 \
--reduce_lr_on_plateau true \
--plateau_epochs 5 \
--dropout_rate 0.5
By default, if a test file (and test batch size) is specified in the training command above, the trained model is evaluated on the test data once training finishes. However, if you choose to omit the test file and test separately, you can evaluate a previously saved model on some test data:
$ python -m coqui_stt_training.evaluate \
--show_progressbar true \
--train_cudnn true \
--test_batch_size 128 \
--test_files data/host_data/misc/lg-rw-oct2021/rw/clips/test.csv \
--checkpoint_dir data/host_data/best-kinya-checkpoint
The above script will test only the acoustic model on the test data. This produces the WER for the acoustic model alone.
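For reference, WER is the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. A minimal illustration (not the actual evaluation code Coqui STT uses):
def wer(reference, hypothesis):
    # Levenshtein distance over words, normalized by reference length.
    r, h = reference.split(), hypothesis.split()
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(r)][len(h)] / max(1, len(r))

print(wer("muraho neza cyane", "muraho neza"))  # 0.33..., one word deleted out of three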
If you have trained and generated a Language Model (scorer) previously, you can use it to produce a combined (overall) WER:
$ python -m coqui_stt_training.evaluate \
--show_progressbar true \
--train_cudnn true \
--test_batch_size 128 \
--test_output_file data/host_data/test_output \
--test_files data/host_data/misc/lg-rw-oct2021/rw/clips/test.csv \
--checkpoint_dir data/host_data/best-kinya-checkpoint \
--scorer data/host_data/kinyarwanda.scorer
If you have optimized your generated language model and previously found optimized --default_alpha and --default_beta values, you can pass them (as --lm_alpha and --lm_beta) to produce a better combined (overall) WER:
$ python -m coqui_stt_training.evaluate \
--show_progressbar true \
--train_cudnn true \
--test_batch_size 128 \
--test_output_file data/host_data/test_output \
--test_files data/host_data/misc/lg-rw-oct2021/rw/clips/test.csv \
--checkpoint_dir data/host_data/best-kinya-checkpoint \
--scorer data/host_data/kinyarwanda_optm.scorer \
--lm_alpha 0.7169565760990836 \
--lm_beta 1.750652309533554
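For intuition, alpha is the weight given to the language model and beta is a word-insertion bonus used by the CTC beam-search decoder. Very roughly (a simplified sketch; the real decoder scores partial hypotheses during the beam search rather than full sentences):
def combined_score(acoustic_log_prob, lm_log_prob, num_words, alpha, beta):
    # alpha scales the language model's contribution; beta rewards longer
    # hypotheses so the LM does not overly penalize adding words.
    return acoustic_log_prob + alpha * lm_log_prob + beta * num_words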
You will usually want to deploy a language model in production. A good language model will improve transcription accuracy by correcting predictable spelling and grammatical mistakes. If you can predict what kind of speech your STT will encounter, you can make great gains in terms of accuracy with a custom language model.
This section assumes that you are using a Docker image and container for training, as outlined in the environment section. If you are not using the Docker image, then some of the scripts, such as generate_lm.py, will not be available in your environment.
This section assumes that you have already trained an (acoustic) model and have a set of checkpoints for that model.
The following assumes we are within the Docker environment. If not, run docker exec -it sttcontainer bash to enter it:
$ python3 data/lm/generate_lm.py \
--input_txt data/host_data/common_voice_kinyarwanda_kinnews_corpus.txt \
--output_dir data/host_data/kinya_lm \
--top_k 500000 \
--kenlm_bins kenlm/build/bin \
--arpa_order 5 \
--max_arpa_memory "85%" \
--arpa_prune "0|0|1" \
--binary_a_bits 255 \
--binary_q_bits 8 \
--binary_type trie
The above script will save the new language model as two files in the specified output directory: lm.binary and vocab-500000.txt. The value 500000 comes from the value specified in the --top_k flag.
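As a quick sanity check (paths taken from the command above; the exact layout of the vocabulary file is an assumption), you can confirm the vocabulary size:
# Count the entries in the generated vocabulary file.
with open("data/host_data/kinya_lm/vocab-500000.txt", encoding="utf-8") as f:
    words = f.read().split()
print(len(words))  # should be close to the --top_k value (500000)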
To package the generated language model for use, we have to satisfy some environment requirements. To achieve this, we do:
$ docker exec -it sttcontainer bash
$ export PATH=${STT_DIR_PATH}:$PATH
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${STT_DIR_PATH}:${KENLM_BIN_PATH}:${STT_DIR_PATH}/data/lm
# E.g.
# Since we are using the docker environment,
# STT_DIR_PATH = /code
# KENLM_BIN_PATH = /code/kenlm/build/bin
$ export PATH=/code:$PATH
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/code:/code/kenlm/build/bin:/code/data/lm
Note: The above steps need to be repeated every time we re-enter the Docker environment.
After this is done, we proceed to generate the scorer:
$ data/lm/generate_scorer_package \
--checkpoint data/host_data/best-kinya-checkpoint \
--lm data/host_data/kinya_lm/lm.binary \
--vocab data/host_data/kinya_lm/vocab-500000.txt \
--package data/host_data/kinyarwanda.scorer \
--default_alpha 0.931289039105002 \
--default_beta 1.1834137581510284
The above script will create a scorer called “kinyarwanda.scorer” in the data/host_data/ directory.
The --checkpoint flag should point to the acoustic model checkpoint with which you will use the generated scorer.
The --default_alpha and --default_beta parameters shown above are optimized values that were found earlier with the lm_optimizer.py script (on some data) and are used here as a starting point. However, if you want to generate optimized alpha and beta values specific to your data, do the following.
The following assumes we are within the Docker environment. If not, run docker exec -it sttcontainer bash to enter it:
$ python3 lm_optimizer.py \
--show_progressbar true \
--train_cudnn true \
--test_batch_size 128 \
--alphabet_config_path data/host_data/kinyarwanda_alphabet.txt \
--scorer_path data/host_data/kinyarwanda.scorer \
--test_files data/host_data/misc/lg-rw-oct2021/rw/clips/test.csv \
--checkpoint_dir data/host_data/best-kinya-checkpoint \
--n_hidden 2048 \
--n_trials 300
--n_hidden should be the same as specified when training your (acoustic) model.
--n_trials specifies how many trials lm_optimizer.py should run to find the optimal values of --default_alpha and --default_beta. You may wish to reduce --n_trials.
If you have generated optimized alpha and beta values specific to your data, you can pass them as values to --default_alpha and --default_beta.
For example, on the Kinyarwanda data, the following alpha and beta values were found to be the best, and were therefore used to generate an optimized scorer:
$ data/lm/generate_scorer_package \
--checkpoint data/host_data/best-kinya-checkpoint \
--lm data/host_data/kinya_lm/lm.binary \
--vocab data/host_data/kinya_lm/vocab-500000.txt \
--package data/host_data/kinyarwanda_optm.scorer \
--default_alpha 0.7169565760990836 \
--default_beta 1.750652309533554
After you train an STT model, your model will be stored on disk as a checkpoint file. Model checkpoints are useful for resuming training at a later date, but they are not the correct format for deploying a model into production. The model format for deployment is a TFLite file.
To export your model as a TFLite file:
$ python3 -m coqui_stt_training.export \
--show_progressbar true \
--checkpoint_dir data/host_data/best-kinya-checkpoint \
--export_dir data/host_data \
--export_author_id DigitalUmuganda \
--export_file_name kinyarwanda_am_lm \
--export_model_name kinyarwanda_model \
--scorer data/host_data/kinyarwanda_optm.scorer \
--lm_alpha 0.7169565760990836 \
--lm_beta 1.750652309533554
In the above command, we included the trained (and optimized) scorer together with the optimized alpha and beta values we generated earlier. While this is entirely optional, it helps to produce a better model.
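To confirm the exported model works, you can load it with the Coqui STT inference package (pip install stt) and transcribe a clip. A minimal sketch, assuming a 16 kHz mono 16-bit WAV and that the export above produced kinyarwanda_am_lm.tflite (adjust paths to your actual files):
import wave

import numpy as np
from stt import Model

model = Model("data/host_data/kinyarwanda_am_lm.tflite")
model.enableExternalScorer("data/host_data/kinyarwanda_optm.scorer")
model.setScorerAlphaBeta(0.7169565760990836, 1.750652309533554)

# Read a 16 kHz, 16-bit mono clip into a numpy int16 buffer.
with wave.open("data/host_data/sample.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(model.stt(audio))  # prints the transcription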