
How to train models using Multilingual LibriSpeech #879

Closed
justkowal opened this issue Oct 29, 2021 · 11 comments

@justkowal

I want to train my own model using the dataset from this website.
How can I adapt it for training?

@ghost

ghost commented Oct 29, 2021

  1. For each audio file, you'll need to make a corresponding .txt file using the data in transcripts.txt (a sketch of this step is shown below).
  2. Write the text files to the same location as the .flac files.
  3. Update the directory structure so it looks like LibriSpeech.
  4. Run synthesizer_preprocess_audio.py with the --no_alignments option.

In my limited experience with MLS, the audio files were not cut well and often stopped in the middle of a word or contained extra sounds. This is probably an issue of the automatic segmentation method used by the dataset authors. The extra sounds caused problems when training the stop token prediction.
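For step 1, here is a minimal sketch (illustrative only, not the exact script I used). It assumes the standard MLS layout of `<root>/audio/<speaker>/<book>/<id>.flac` plus a tab-separated `<root>/transcripts.txt`, and the root path is a placeholder:

```python
# Minimal sketch: create one .txt per MLS utterance next to its .flac file.
# Assumes the standard MLS layout <root>/audio/<speaker>/<book>/<id>.flac
# and <root>/transcripts.txt with lines of the form "<id>\t<transcript>".
from pathlib import Path

mls_root = Path("mls_spanish")  # placeholder, point this at your MLS root

with open(mls_root / "transcripts.txt", encoding="utf-8") as f:
    for line in f:
        utt_id, transcript = line.rstrip("\n").split("\t", maxsplit=1)
        speaker, book, _ = utt_id.split("_")
        utt_dir = mls_root / "audio" / speaker / book
        if not (utt_dir / f"{utt_id}.flac").exists():
            continue  # skip transcripts without a matching audio file
        (utt_dir / f"{utt_id}.txt").write_text(transcript, encoding="utf-8")
```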

@justkowal

So it's useless as training data?

@ghost

ghost commented Oct 29, 2021

It can still be used, but if this is your first model from scratch, find a different dataset.

Also don't forget to update symbols.py with characters not found in the English alphabet.

_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'\"(),-.:;? "

@AlexSteveChungAlvarez

  1. For each audio file, you'll need to make a corresponding .txt file using the data in transcripts.txt.

  2. Write the text files to the same location as the .flac files.

  3. Update the directory structure so it looks like LibriSpeech.

  4. Run synthesizer_preprocess_audio.py with the --no_alignments option.

In my limited experience with MLS, the audio files were not cut well and often stopped in the middle of a word or contained extra sounds. This is probably an issue of the automatic segmentation method used by the dataset authors. The extra sounds caused problems when training the stop token prediction.

Hello! I have been reading the issues all day to figure out what to do. I want to train on the Spanish datasets tux-100h and Common Voice, which were suggested in some past issues about other languages. Over the next few days I will try to figure out how to rearrange the datasets into the same structure as LibriSpeech. If you have any handy code that could help with making the .txt files from transcripts.txt, it would be very useful, since first I will play around with train-clean-100 to get familiar with the process, as you suggested in another issue. Thanks for all the effort; I found this project yesterday and I have seen how much you have contributed to keeping it alive.

@ghost

ghost commented Nov 3, 2021

If you have any handy code that could help with making the .txt files from transcripts.txt

@AlexSteveChungAlvarez https://gist.github.com/blue-fish/11552e89e95f32c14a370935c58f426c

@ghost

ghost commented Nov 3, 2021

I have also shared modifications to support audio preprocessing of the compressed .opus files: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/b4e6c11c429bc6f8cdd86c048cba32413ab4109e
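If you would rather not patch the preprocessing code, another option (just a sketch, assuming ffmpeg is on your PATH and that 16 kHz mono matches your synthesizer settings) is to convert the .opus files to .wav up front:

```python
# Sketch: convert MLS .opus files to 16 kHz mono .wav before preprocessing.
# Assumes ffmpeg is installed and on the PATH; adjust the root path and
# sample rate to match your setup.
import subprocess
from pathlib import Path

mls_root = Path("mls_spanish")  # placeholder, point this at your MLS root

for opus_path in mls_root.rglob("*.opus"):
    wav_path = opus_path.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(opus_path), "-ar", "16000", "-ac", "1", str(wav_path)],
        check=True,
    )
```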

@AlexSteveChungAlvarez

If you have any handy code that could help with making the .txt files from transcripts.txt

@AlexSteveChungAlvarez https://gist.github.com/blue-fish/11552e89e95f32c14a370935c58f426c

Thank you very much! I will be using it this weekend.

I have also shared modifications to support audio preprocessing of the compressed .opus files: blue-fish@b4e6c11

I've found out that the program itself also accepts .ogg files (the format of most audio sent via WhatsApp).

@Ontopic

Ontopic commented Nov 4, 2021

Ah, perfect timing. Was just deciding between ogg and m4a before turning it into wav, but if ogg will work without change, that's an easy choice.

Thanks from over here as well. Everything running really smoothly. Gonna start training now though for a different language, so hope things stay that way 🤞

Are there any other resources besides the research result page and the data in the repo? This repo seems weirdly underused...

@AlexSteveChungAlvarez

Ah, perfect timing. Was just deciding between ogg and m4a before turning it into wav, but if ogg will work without change, that's an easy choice.

Thanks from over here as well. Everything running really smoothly. Gonna start training now though for a different language, so hope things stay that way 🤞

Are there any other resources besides the research result page and the data in the repo? This repo seems weirdly underused...

I spent the entire Friday searching for more up-to-date repos that have their own papers. I took a look at Mozilla's, Tacotron's, and Tacotron 2's, among others based on those repos, but for all of them you need a dataset to train the vocoder (or at least that's what I understood from their documentation and discussions). With the code in this repo you only need one sample to hear a voice very similar to the target voice you want to clone. Which language are you going to train? It would be very helpful if you shared your experience after doing it, since I will start training with Spanish this weekend, and I have seen that many more people wanted to do so but haven't shared their experiences afterwards (if they ever did it).

@ghost

ghost commented Nov 5, 2021

I can offer the following observations for MLS Spanish:

  1. Using the default max_mel_frames = 900 causes utterances longer than 11.25 sec (900 frames × 12.5 ms per mel frame) to be discarded. This can be a problem because the audios are evenly distributed in duration between 10-20 sec (see the MLS paper). The raw dataset has 917 hours of Spanish audio, but using the defaults will cut that to 202 hours (a sketch for estimating this on your copy follows this list).
  2. Here are the unique symbols in the transcripts (for symbols.py):
    • _characters = "'-aábcdeéfghiíjklmnñoópqrstuúüvwxyz "
  3. Frequently used abbreviations in the transcripts need to be added to text cleaners.
    • numbers.py also needs to be updated for inference
  4. The transcripts have serious quality control issues, but the resulting TTS was still usable.
    • rand om space s appearing within a word
    • sometimes entire words have spaces i n t e r s p e r s e d throughout
    • failure to normalize numbers when they appear as Roman numerals, e.g. xiii
    • accented letters substituted with non-accented versions in some areas, e.g. using a in place of á
  5. If there are problems with the synthesizer generating extra sounds, the stop threshold can be lowered to help prevent this. A threshold of 0.00001 seems to work well.
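To check point 1 on your own copy of the dataset, here is a minimal sketch. It assumes the usual 12.5 ms mel hop (16 kHz audio with a hop size of 200 samples) and the MLS directory layout; the root path is a placeholder:

```python
# Sketch: estimate how much MLS audio survives the max_mel_frames cutoff.
# Assumes 12.5 ms per mel frame (16 kHz sampling, hop size of 200 samples),
# so max_mel_frames = 900 corresponds to 900 * 0.0125 = 11.25 seconds.
from pathlib import Path

import soundfile as sf

mls_root = Path("mls_spanish")   # placeholder, point this at your MLS root
max_mel_frames = 900
frame_shift_s = 0.0125           # 12.5 ms hop
cutoff_s = max_mel_frames * frame_shift_s

kept_s = 0.0
total_s = 0.0
for flac_path in (mls_root / "audio").rglob("*.flac"):
    info = sf.info(str(flac_path))
    duration = info.frames / info.samplerate
    total_s += duration
    if duration <= cutoff_s:
        kept_s += duration

print(f"Cutoff: {cutoff_s:.2f} s")
print(f"Kept {kept_s / 3600:.1f} h out of {total_s / 3600:.1f} h")
```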

@ghost

ghost commented Nov 12, 2021

Closing inactive issue.

@ghost ghost closed this as completed Nov 12, 2021