
How to train models using Multilingual LibriSpeech #879

Closed
justkowal opened this issue Oct 29, 2021 · 11 comments

@justkowal

I want to train my own model using the dataset from this website.
How can I adapt it for training?

@ghost

ghost commented Oct 29, 2021

  1. For each audio file, you'll need to make a corresponding .txt file using the data in transcripts.txt (a sketch of this step is shown below).
  2. Write the text files to the same location as the .flac files.
  3. Update the directory structure so it looks like LibriSpeech.
  4. Run synthesizer_preprocess_audio.py with the --no_alignments option.

In my limited experience with MLS, the audio files were not cut well and often stopped in the middle of a word or contained extra sounds. This is probably an issue of the automatic segmentation method used by the dataset authors. The extra sounds caused problems when training the stop token prediction.
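For step 1, here is a minimal sketch (illustrative only, not the exact script I used). It assumes the standard MLS layout of `<root>/audio/<speaker>/<book>/<id>.flac` plus a tab-separated `<root>/transcripts.txt`, and the root path is a placeholder:

```python
# Minimal sketch: create one .txt per MLS utterance next to its .flac file.
# Assumes the standard MLS layout <root>/audio/<speaker>/<book>/<id>.flac
# and <root>/transcripts.txt with lines of the form "<id>\t<transcript>".
from pathlib import Path

mls_root = Path("mls_spanish")  # placeholder, point this at your MLS root

with open(mls_root / "transcripts.txt", encoding="utf-8") as f:
    for line in f:
        utt_id, transcript = line.rstrip("\n").split("\t", maxsplit=1)
        speaker, book, _ = utt_id.split("_")
        utt_dir = mls_root / "audio" / speaker / book
        if not (utt_dir / f"{utt_id}.flac").exists():
            continue  # skip transcripts without a matching audio file
        (utt_dir / f"{utt_id}.txt").write_text(transcript, encoding="utf-8")
```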

@justkowal

So it's useless as training data?

@ghost

ghost commented Oct 29, 2021

It can still be used, but if this is your first model from scratch, find a different dataset.

Also don't forget to update symbols.py with characters not found in the English alphabet.

_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'\"(),-.:;? "

@AlexSteveChungAlvarez

  1. For each audio file, you'll need to make a corresponding .txt file using the data in transcripts.txt.

  2. Write the text files to the same location as the .flac files.

  3. Update the directory structure so it looks like LibriSpeech.

  4. Run synthesizer_preprocess_audio.py with the --no_alignments option.

In my limited experience with MLS, the audio files were not cut well and often stopped in the middle of a word or contained extra sounds. This is probably an issue of the automatic segmentation method used by the dataset authors. The extra sounds caused problems when training the stop token prediction.

Hello! I have been reading the issues all day to figure out what to do. I want to train on the Spanish datasets tux-100h and Common Voice, which were suggested in some past issues about other languages. Over the next few days I will try to figure out how to rearrange the datasets into the same structure as LibriSpeech. If you have any handy code that could help with making the .txt files from transcripts.txt, it would be very useful, since first I will play around with train-clean-100 to get familiar with the process, as you suggested in another issue. Thanks for all the effort; I found this project yesterday and I have seen how much you have contributed to keeping it alive.

@ghost

ghost commented Nov 3, 2021

If you have any handy code that could help with making the .txt files from transcripts.txt

@AlexSteveChungAlvarez https://gist.github.com/blue-fish/11552e89e95f32c14a370935c58f426c

@ghost

ghost commented Nov 3, 2021

I have also shared modifications to support audio preprocessing of the compressed .opus files: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/b4e6c11c429bc6f8cdd86c048cba32413ab4109e
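If you would rather not patch the preprocessing code, another option (just a sketch, assuming ffmpeg is on your PATH and that 16 kHz mono matches your synthesizer settings) is to convert the .opus files to .wav up front:

```python
# Sketch: convert MLS .opus files to 16 kHz mono .wav before preprocessing.
# Assumes ffmpeg is installed and on the PATH; adjust the root path and
# sample rate to match your setup.
import subprocess
from pathlib import Path

mls_root = Path("mls_spanish")  # placeholder, point this at your MLS root

for opus_path in mls_root.rglob("*.opus"):
    wav_path = opus_path.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(opus_path), "-ar", "16000", "-ac", "1", str(wav_path)],
        check=True,
    )
```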

@AlexSteveChungAlvarez

If you have any handy code that could help with making the .txt files from transcripts.txt

@AlexSteveChungAlvarez https://gist.github.com/blue-fish/11552e89e95f32c14a370935c58f426c

Thank you very much! I will be using it this weekend.

I have also shared modifications to support audio preprocessing of the compressed .opus files: blue-fish@b4e6c11

I've found out that the program itself also accepts .ogg files (the format of most audio sent via WhatsApp).

@Ontopic

Ontopic commented Nov 4, 2021

Ah, perfect timing. Was just deciding between ogg and m4a before turning it into wav, but if ogg will work without change, that's an easy choice.

Thanks from over here as well. Everything running really smoothly. Gonna start training now though for a different language, so hope things stay that way 🤞

Are there any other resources besides the research result page and the data in the repo? This repo seems weirdly underused...

@AlexSteveChungAlvarez

Ah, perfect timing. Was just deciding between ogg and m4a before turning it into wav, but if ogg will work without change, that's an easy choice.

Thanks from over here as well. Everything running really smoothly. Gonna start training now though for a different language, so hope things stay that way 🤞

Are there any other resources besides the research result page and the data in the repo? This repo seems weirdly underused...

I spent the entire Friday searching for more up-to-date repos that have their own papers. I took a look at Mozilla's, Tacotron's, and Tacotron 2's, among others based on those repos, but for all of them you need a dataset to train the vocoder (or at least that's what I understood from their documentation and discussions). With the code in this repo you only need one sample to hear a voice very similar to the target voice you want to clone. Which language are you going to train? It would be very helpful if you shared your experience after doing it, since I will start training with Spanish this weekend, and I have seen that many more people wanted to do so but haven't shared their experiences afterwards (if they ever did it).

@ghost

ghost commented Nov 5, 2021

I can offer the following observations for MLS Spanish:

  1. Using the default max_mel_frames = 900 causes utterances longer than 11.25 sec (900 frames × 12.5 ms per mel frame) to be discarded. This can be a problem because the audios are evenly distributed in duration between 10-20 sec (see the MLS paper). The raw dataset has 917 hours of Spanish audio, but using the defaults will cut that to 202 hours (a sketch for estimating this on your copy follows this list).
  2. Here are the unique symbols in the transcripts (for symbols.py):
    • _characters = "'-aábcdeéfghiíjklmnñoópqrstuúüvwxyz "
  3. Frequently used abbreviations in the transcripts need to be added to text cleaners.
    • numbers.py also needs to be updated for inference
  4. The transcripts have serious quality control issues, but the resulting TTS was still usable.
    • rand om space s appearing within a word
    • sometimes entire words have spaces i n t e r s p e r s e d throughout
    • failure to normalize numbers when they appear as Roman numerals, e.g. xiii
    • accented letters substituted with non-accented versions in some areas, e.g. using a in place of á
  5. If there are problems with the synthesizer generating extra sounds, the stop threshold can be lowered to help prevent this. A threshold of 0.00001 seems to work well.
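To check point 1 on your own copy of the dataset, here is a minimal sketch. It assumes the usual 12.5 ms mel hop (16 kHz audio with a hop size of 200 samples) and the MLS directory layout; the root path is a placeholder:

```python
# Sketch: estimate how much MLS audio survives the max_mel_frames cutoff.
# Assumes 12.5 ms per mel frame (16 kHz sampling, hop size of 200 samples),
# so max_mel_frames = 900 corresponds to 900 * 0.0125 = 11.25 seconds.
from pathlib import Path

import soundfile as sf

mls_root = Path("mls_spanish")   # placeholder, point this at your MLS root
max_mel_frames = 900
frame_shift_s = 0.0125           # 12.5 ms hop
cutoff_s = max_mel_frames * frame_shift_s

kept_s = 0.0
total_s = 0.0
for flac_path in (mls_root / "audio").rglob("*.flac"):
    info = sf.info(str(flac_path))
    duration = info.frames / info.samplerate
    total_s += duration
    if duration <= cutoff_s:
        kept_s += duration

print(f"Cutoff: {cutoff_s:.2f} s")
print(f"Kept {kept_s / 3600:.1f} h out of {total_s / 3600:.1f} h")
```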

@ghost

ghost commented Nov 12, 2021

Closing inactive issue.

@ghost ghost closed this as completed Nov 12, 2021