Fine-tuning for hindi #525

Closed · hetpandya opened this issue Sep 11, 2020 · 32 comments

hetpandya commented Sep 11, 2020

Hi @blue-fish, I am trying to fine-tune the model to clone voices of Hindi speakers. I wanted to know the steps to follow, and how much data I'd need for the model to work well.

Edit: I will be using Google Colab for fine-tuning.


ghost commented Sep 11, 2020

Hi @thehetpandya, please start by reading this: #431 (comment)

It is not possible to fine-tune the English model to another language; a new model needs to be trained from scratch. This is because the model relates the letters of the alphabet to their associated sounds, so what the model knows about English does not transfer over to Hindi. At a minimum, you will need a total of 50 hours of transcribed speech from at least 100 speakers. For a better model, get 10 times that amount.

This is what you need to do. Good luck and have fun!

  1. Replicate the training of the English synthesizer to learn how to use the data processing and training scripts (see the command sketch after this list). I have no idea how to do this with Google Colab, but it should be possible.
  2. Assemble and preprocess your dataset.
  3. Train a synthesizer model.
  4. Troubleshoot problems with the model.
  5. Repeat steps 3 and 4 until satisfied.
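
For concreteness, here is a minimal sketch of the commands behind steps 2 and 3, assuming the synthesizer_preprocess_audio.py, synthesizer_preprocess_embeds.py, and synthesizer_train.py scripts in this repo; `<datasets_root>` and the run name are placeholders, and each script's --help has the exact arguments:

```
# Step 2: compute mel spectrograms from the wavs and transcripts
python synthesizer_preprocess_audio.py <datasets_root>

# Step 2 (cont.): compute a speaker embedding per utterance with the pretrained encoder
python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer

# Step 3: train the synthesizer from scratch under a new run name
python synthesizer_train.py hindi_run <datasets_root>/SV2TTS/synthesizer
```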


hetpandya commented Sep 11, 2020

@blue-fish Thanks a lot for the response! Yes, I have begun exploring the issues to get a better understanding of the workflow before beginning the training process.

I also read in #492 (comment) that training the synthesizer first is a good start, and that only if the encoder doesn't seem to give proper results should one proceed to training/fine-tuning the encoder. Does the same apply to a totally different language too, like Hindi in my case?


ghost commented Sep 11, 2020

I agree with that suggestion. Encoder training requires a lot of data, time and effort. You can see #126 and #458 to get an idea. If your results are good enough without it, best to avoid that hassle.

@lawrence124

@thehetpandya

I'm working on a forked version of SV2TTS to train a local dialect of Chinese. Using the dataset from Common Voice (about 22k utterances), I couldn't get the model to converge. But if I add the local dialect on top of a pre-trained model (for the main dialect of Chinese), the result is actually quite good. FYI, the local dialect and the main dialect have different but similar alphabet romanization systems (for example, the main dialect has 4 tones, but the local dialect has 8).

Using Common Voice data only: [image]

Using the pre-trained model and then adding the local dataset: [image]

@blue-fish not sure if I'm abusing the model, but at least it works :)


ghost commented Sep 12, 2020

@lawrence124 Interesting, thanks for sharing that result! Occasionally the model fails to learn attention; you might try restarting the training from scratch with a different random seed. It might also help to trim the starting and ending silences. If your data is at 16 kHz, then webrtcvad can do that for you (see the trim_long_silences function in encoder/audio.py).
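
For anyone who wants to try the trimming suggestion, here is a minimal sketch of VAD-based edge trimming with webrtcvad, assuming 16 kHz float32 mono audio. It is a simplified stand-in for trim_long_silences (which also removes long internal silences); the function name and frame size here are my own choices:

```python
import numpy as np
import webrtcvad

def trim_edge_silences(wav: np.ndarray, sample_rate: int = 16000,
                       frame_ms: int = 30) -> np.ndarray:
    """Cut leading/trailing silence from a float32 waveform in [-1, 1]."""
    vad = webrtcvad.Vad(3)                      # mode 3 = most aggressive filtering
    frame_len = sample_rate * frame_ms // 1000  # samples per 30 ms frame
    n_frames = len(wav) // frame_len
    # webrtcvad expects 16-bit mono PCM bytes in 10/20/30 ms frames
    pcm = (wav[:n_frames * frame_len] * 32767).astype(np.int16).tobytes()
    flags = [vad.is_speech(pcm[2 * i * frame_len:2 * (i + 1) * frame_len], sample_rate)
             for i in range(n_frames)]
    if not any(flags):
        return wav  # no speech detected; leave the clip untouched
    first = flags.index(True) * frame_len
    last = (len(flags) - flags[::-1].index(True)) * frame_len
    return wav[first:last]
```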


hetpandya commented Sep 14, 2020

Thanks @blue-fish I went through the issues you mentioned. You gave me a good amount of resources for a start. Much appreciated!

@hetpandya

@lawrence124 Glad to see your results! Did you have to train the encoder from scratch, or did the pre-trained encoder/synthesizer work for you?


lawrence124 commented Sep 14, 2020

I'm using the pretrained encoder from Kuangdd, but judging by the file size and date, it seems to be the same as the pretrained encoder from here.

@hetpandya

Okay, thanks @lawrence124! Seems like using the pretrained encoder is good to go for now.


lawrence124 commented Sep 14, 2020

BTW, I modified a script from adueck a bit. This script converts video/audio with an SRT file into audio clips with transcripts for training. I'm not quite sure about the format SV2TTS expects, but I think you may find it useful if you're trying to get more data to train on.

https://github.com/adueck/split-video-by-srt

srt-split.zip
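
For context, this is roughly what such a splitter does; a hedged sketch assuming pysrt and a system ffmpeg are available (the function, file names, and 16 kHz mono output are my own illustrative choices, not the actual script):

```python
import subprocess
import pysrt

def split_by_srt(media_path: str, srt_path: str, out_dir: str) -> None:
    """Cut media into one wav + one transcript file per subtitle line."""
    for i, sub in enumerate(pysrt.open(srt_path)):
        start = sub.start.ordinal / 1000.0                   # ms -> seconds
        dur = (sub.end.ordinal - sub.start.ordinal) / 1000.0
        clip = f"{out_dir}/utt_{i:04d}"
        subprocess.run(["ffmpeg", "-y", "-i", media_path,
                        "-ss", str(start), "-t", str(dur),
                        "-ac", "1", "-ar", "16000",          # mono, 16 kHz
                        clip + ".wav"], check=True)
        with open(clip + ".txt", "w", encoding="utf-8") as f:
            f.write(sub.text.replace("\n", " "))
```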

@lawrence124

@blue-fish

I would like to ask a rather random question: have you tried the demo TTS from https://www.readspeaker.com/?

From my point of view, the result in Chinese/Cantonese is pretty good, and I would like to discuss it. Is their proprietary algorithm simply superior, or do they simply have the resources to build a better dataset to train on?

Based on the job description, what they are doing is not too different from Tacotron / SV2TTS.

https://www.isca-speech.org/iscapad/iscapad.php?module=article&id=17363&back=p,250


ghost commented Sep 15, 2020

@lawrence124 That website demo uses a different algorithm that probably does not involve machine learning. It sounds like a concatenative method of synthesis, where prerecorded sounds are joined together. Listening closely, it is unnatural and obviously computer-generated. To their credit, they do use high-quality audio samples to build the output.

Here's a wav of the demo text synthesized by zhrtvc, using Griffin-Lim as the vocoder. Tacotron speech flows a lot more smoothly than their demo. zhrtvc could sound better than the demo TTS if 1) it is trained on higher quality audio, and 2) a properly configured vocoder is available.
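
For anyone curious what "Griffin-Lim as the vocoder" means in practice, here is a rough sketch of inverting a mel spectrogram with librosa; this is not zhrtvc's actual code, and the STFT parameters are illustrative and must match whatever the synthesizer used:

```python
import librosa
import soundfile as sf

def mel_to_wav_griffin_lim(mel, sr=16000, n_fft=1024, hop_length=256):
    """mel: power mel spectrogram of shape (n_mels, frames)."""
    # mel_to_audio inverts the mel filterbank, then runs Griffin-Lim
    # phase estimation for n_iter rounds to recover a waveform.
    wav = librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length, n_iter=60)
    sf.write("griffin_lim_out.wav", wav, sr)
    return wav
```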


lawrence124 commented Sep 15, 2020

@blue-fish Yeah, as with other data analysis, getting a good, clean dataset is always difficult. (The preliminary result of adding YouTube clips is not good.)

20200915-204053_melgan_10240ms.zip

This is an example of using "Mandarin + Cantonese" as the synthesizer, along with the MelGAN vocoder. I don't know if it's just my ear, but I don't really like the Griffin-Lim output from zhrtvc; it has a "robotic" noise in the background.

BTW, it seems like you are updating the synthesizer of SV2TTS? Is the backbone still Tacotron?

@hetpandya

@lawrence124 Thanks, I'll take a look at it, since I might need more data if I can't find a public dataset.

@GauriDhande

@thehetpandya were you able to generate the model for cloning Hindi sentences?

@hetpandya

@GauriDhande I'm still looking for a good Hindi speech dataset. Do you have any sources?

@GauriDhande

I was going to ask the same thing. I haven't found an open Hindi speech dataset on the internet yet.


ghost commented Sep 23, 2020

You might be able to combine the two sources below: first train a single-speaker model on Source 1, then tune the voice cloning aspect on Source 2. Some effort and experimentation will be required.

Source 1 (24 hours single-speaker): https://cvit.iiit.ac.in/research/projects/cvit-projects/text-to-speech-dataset-for-indian-languages
Source 2 (100 voices, 6 utterances each, untranscribed): https://github.com/shivam-shukla/Speech-Dataset-in-Hindi-Language

@hetpandya

Thanks @blue-fish, I've already applied for Source 1. Will also check out the second one. Your efforts on this project are much appreciated!


ghost commented Oct 6, 2020

Hi @thehetpandya, have you made any progress on this recently?


hetpandya commented Oct 10, 2020

Hi @blue-fish, no, I couldn't make progress on this one. I tried fine-tuning https://github.com/Kyubyong/dc_tts instead, which gave clearer pronunciation of Hindi words.
Edit: I fine-tuned https://github.com/Kyubyong/dc_tts on Source 1, i.e. https://cvit.iiit.ac.in/research/projects/cvit-projects/text-to-speech-dataset-for-indian-languages


ghost commented Oct 12, 2020

Thanks for trying @thehetpandya. If you decide to work on this later, please reopen the issue and I'll try to help.

ghost closed this as completed Oct 12, 2020
@amrahsmaytas

Greetings @thehetpandya,

Were you able to do real-time voice cloning of given English text with an Indian accent in your experiment?

Could you please help/guide me with cloning English text in my voice, with an Indian accent?

Thanks

@hetpandya

Hi @amrahsmaytas, no, I couldn't get to good results, and then I had to shift to another task. Still, I'd be glad if I could be of any help.

@amrahsmaytas

Thanks for the reply, het!
I need your help with training. Could you please check your mail (sent from greetsatyamsharma@gmail.com) and connect with me there for further discussion?

Thanks ✌,
Awaiting your response,
Satyam

@rajuc110

@GauriDhande and @thehetpandya were you guys able to generate the model for cloning Hindi sentences? Please reply.

Thanks.

@hetpandya

Hi @rajuc110, sorry for the delayed response. No, I couldn't reproduce the results in Hindi and had to shift to another task in the meantime.

ghost mentioned this issue Oct 8, 2021
@SayaliNagwkar17

@hetpandya Can you share your work?

@SohumKaliaCoder

I am also facing this issue. Does anyone have an update on it?

@divyendrajadoun

Hey guys, has anyone found a solution for Hindi voice cloning? Thanks

@Harsh-Holy9

Has anybody already trained a model for the Hindi language?

@Chetan-5ehgal

Any progress on training real-time voice cloning on a Hindi dataset?
