This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.
The above information is from the LJSpeech website.
This recipe provides a VITS model trained on the LJSpeech dataset.
A pretrained model can be found here.
For a tutorial and more details, please refer to the VITS documentation.
The training command is given below:
export CUDA_VISIBLE_DEVICES=0,1,2,3
./vits/train.py \
--world-size 4 \
--num-epochs 1000 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir vits/exp \
--max-duration 500
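Training progress can be monitored with TensorBoard. icefall recipes typically write logs under the experiment directory; the exact subdirectory used below is an assumption, so check what the training script actually creates:

# Assumption: logs land in vits/exp/tensorboard, the usual icefall layout.
tensorboard --logdir vits/exp/tensorboard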
To run inference, use:
./vits/infer.py \
--exp-dir vits/exp \
--epoch 1000 \
--tokens data/tokens.txt
If you feel that the trained model is slow at runtime, you can specify the argument `--model-type` during training. Possible values are:
- `low` means low quality. The resulting model is very small in file size and runs very fast. The following is a wave file generated by a `low` quality model: low.mp4

  The text is "Ask not what your country can do for you; ask what you can do for your country."

  The exported onnx model has a file size of 26.8 MB (float32); see the export sketch after this list.

- `medium` means medium quality. The following is a wave file generated by a `medium` quality model: medium.mp4

  The text is "Ask not what your country can do for you; ask what you can do for your country."

  The exported onnx model has a file size of 70.9 MB (float32).

- `high` means high quality. This is the default value. The following is a wave file generated by a `high` quality model: high.mp4

  The text is "Ask not what your country can do for you; ask what you can do for your country."

  The exported onnx model has a file size of 113 MB (float32).
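The ONNX file sizes quoted above come from exporting a trained checkpoint. A minimal export sketch is given below; the script name `vits/export-onnx.py` and its flags are assumptions modeled on the training and inference commands above, so verify them against the recipe:

# Hypothetical export command; check the script name and flags in the recipe.
./vits/export-onnx.py \
  --exp-dir vits/exp \
  --epoch 1000 \
  --tokens data/tokens.txt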
A pre-trained `low` model, trained on 4 x V100 32GB GPUs with the following command, can be found at
https://huggingface.co/csukuangfj/icefall-tts-ljspeech-vits-low-2024-03-12
export CUDA_VISIBLE_DEVICES=0,1,2,3
./vits/train.py \
--world-size 4 \
--num-epochs 1601 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir vits/exp \
--model-type low \
--max-duration 800
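To fetch this checkpoint, you can clone the Hugging Face repository; the large model files are stored with git-lfs:

# git-lfs is needed to pull the checkpoint files.
git lfs install
git clone https://huggingface.co/csukuangfj/icefall-tts-ljspeech-vits-low-2024-03-12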
A pre-trained `medium` model, trained on 4 x V100 32GB GPUs with the following command, can be found at
https://huggingface.co/csukuangfj/icefall-tts-ljspeech-vits-medium-2024-03-12
export CUDA_VISIBLE_DEVICES=4,5,6,7
./vits/train.py \
--world-size 4 \
--num-epochs 1000 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir vits/exp-medium \
--model-type medium \
--max-duration 500
# (Note: training was stopped after `epoch-820.pt` was saved.)
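Since training was stopped after `epoch-820.pt`, inference with this model would point at that checkpoint. A sketch using the same flags as the inference command above (whether a matching `--model-type medium` flag is also required is not confirmed here):

./vits/infer.py \
  --exp-dir vits/exp-medium \
  --epoch 820 \
  --tokens data/tokens.txt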