🎤✨ Official implementation of ✨🎤
“The T05 System for The VoiceMOS Challenge 2024:
Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech”
🏅🎉 accepted by IEEE Spoken Language Technology Workshop (SLT) 2024. 🎉🏅
ꔫ・-・ꔫ・-・ꔫ・-・ꔫ・-・ꔫ・-・ꔫ・-・ꔫ・-・ꔫ
✨ UTMOSv2 achieved 1st place in 7 out of 16 metrics ✨
✨🏆 and 2nd place in the remaining 9 metrics 🏆✨
✨ in the VoiceMOS Challenge 2024 Track1! ✨
✨ You can easily use the pretrained UTMOSv2 model, ✨
✨ allowing you to quickly create models and make predictions with minimal effort! ✨

> [!NOTE]
> To clone the repository and use the pretrained UTMOSv2 model, make sure you have git lfs installed. If it is not installed, you can follow the instructions at https://git-lfs.github.com/ to install it.
If you want to make predictions using the UTMOSv2 library, follow these steps:
1. Install the UTMOSv2 library from GitHub:

   ```shell
   # Prevent LFS files from being downloaded during the installation process
   GIT_LFS_SKIP_SMUDGE=1 pip install git+https://github.com/sarulab-speech/UTMOSv2.git
   ```
2. Make predictions

   - To predict the MOS of a single `.wav` file:

     ```python
     import utmosv2

     model = utmosv2.create_model(pretrained=True)
     mos = model.predict(input_path="/path/to/wav/file.wav")
     ```

   - To predict the MOS of all `.wav` files in a folder:

     ```python
     import utmosv2

     model = utmosv2.create_model(pretrained=True)
     mos = model.predict(input_dir="/path/to/wav/dir/")
     ```

> [!NOTE]
> Either `input_path` or `input_dir` must be specified, but not both.
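The mutual-exclusion rule above can be sketched as a small validation helper. This is a hypothetical illustration of the argument contract, not part of the utmosv2 API:

```python
from pathlib import Path


def resolve_inputs(input_path=None, input_dir=None):
    """Return the list of .wav files to score.

    Hypothetical helper: enforces that exactly one of input_path
    and input_dir is given, mirroring model.predict's contract.
    """
    if (input_path is None) == (input_dir is None):
        # Both given or both missing -> invalid call
        raise ValueError(
            "Either input_path or input_dir must be specified, but not both."
        )
    if input_path is not None:
        return [Path(input_path)]
    # Collect every .wav file in the directory, in a stable order
    return sorted(Path(input_dir).glob("*.wav"))
```

Passing both arguments (or neither) fails fast with a `ValueError` instead of silently preferring one input source.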
If you want to make predictions using the inference script, follow these steps:
1. Clone this repository and navigate to the UTMOSv2 folder:

   ```shell
   git clone https://github.com/sarulab-speech/UTMOSv2.git
   cd UTMOSv2
   ```
2. Install the package:

   ```shell
   pip install --upgrade pip   # enable PEP 660 support
   pip install -e .[optional]  # install with optional dependencies
   ```
3. Make predictions

   - To predict the MOS of a single `.wav` file:

     ```shell
     python inference.py --input_path /path/to/wav/file.wav --out_path /path/to/output/file.csv
     ```

   - To predict the MOS of all `.wav` files in a folder:

     ```shell
     python inference.py --input_dir /path/to/wav/dir/ --out_path /path/to/output/file.csv
     ```
> [!NOTE]
> If you are using zsh, make sure to escape the square brackets:
>
> ```shell
> pip install -e '.[optional]'
> ```

> [!TIP]
> If `--out_path` is not specified, the prediction results are written to the standard output. This is particularly useful when the number of files to be predicted is small.

> [!NOTE]
> Either `--input_path` or `--input_dir` must be specified, but not both.
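The `--out_path` behaviour described in the tip above — CSV to a file when the flag is given, otherwise to standard output — can be sketched as follows. The `write_results` helper and its two-column layout are assumptions for illustration; the real `inference.py` may format its output differently:

```python
import csv
import sys


def write_results(results, out_path=None):
    """Write (file_path, predicted_mos) rows as CSV.

    Hypothetical sketch: writes to out_path when given,
    otherwise falls back to standard output.
    """
    target = open(out_path, "w", newline="") if out_path else sys.stdout
    try:
        writer = csv.writer(target)
        writer.writerow(["file_path", "predicted_mos"])
        for path, mos in results:
            writer.writerow([path, f"{mos:.4f}"])
    finally:
        # Only close handles we opened; never close stdout
        if out_path:
            target.close()
```

Falling back to stdout keeps quick one-off checks pipeable (e.g. into `less` or `grep`) without creating files.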
> [!NOTE]
> These methods provide quick and simple predictions. For more accurate predictions and detailed usage of the inference script, please refer to the inference guide.
🤗 You can try a simple demonstration on Hugging Face Space:
If you want to train UTMOSv2 yourself, please refer to the training guide. To reproduce the training as described in the paper or used in the competition, please refer to this document.
Details of the datasets used in this project can be found in the datasets documentation.
If you find UTMOSv2 useful in your research, please cite the following paper:
```bibtex
@inproceedings{baba2024utmosv2,
  title     = {The T05 System for The {V}oice{MOS} {C}hallenge 2024: Transfer Learning from Deep Image Classifier to Naturalness {MOS} Prediction of High-Quality Synthetic Speech},
  author    = {Baba, Kaito and Nakata, Wataru and Saito, Yuki and Saruwatari, Hiroshi},
  booktitle = {IEEE Spoken Language Technology Workshop (SLT)},
  year      = {2024},
}
```