How can I train my audio files to use an Indian accent? #429

Closed
ash1407 opened this issue Jul 18, 2020 · 9 comments

Comments


ash1407 commented Jul 18, 2020

How can I train my own audio files as data for the encoder and vocoder, so that the output uses an Indian accent? The Indian accent is quite different, so the result does not feel like my own voice when I listen to it.


ghost commented Jul 18, 2020

This is not an easy undertaking so before you start, make sure you satisfy the prerequisites. You must be able to answer "yes" to all questions below:

  • Does your computer have an NVIDIA GPU?
  • Do you have coding experience?
  • Are you willing to devote at least 20 hours to the task?

I have not gone through the process myself, but I'll try to outline it since we don't have a good explanation. What you need to do is to fine-tune the pretrained synthesizer and vocoder models on a suitable dataset.

  1. Find a suitable dataset. Freely available resources include AccentDB (https://accentdb.org/) for Indian accents and VCTK (https://datashare.is.ed.ac.uk/handle/10283/3443) for other English accents. For best results on your own voice, record your own dataset, though this will take many hours.
  2. Follow the steps in README.md to enable GPU support.
  3. Go to the training wiki page and follow the steps for the synthesizer and vocoder training on the LibriSpeech dataset.
    • Review the preprocessing code and understand what it is doing.
    • Understand the format of the files in the <datasets_root>/SV2TTS folder.
  4. Preprocess your dataset from step 1 to generate training data for the synthesizer.
    • At a minimum, this requires editing the preprocessing scripts.
    • You will likely need to write your own code to process the data into a suitable format for the toolbox.
    • We do not have a tutorial for this. You are on your own here! (A rough sketch of the idea is shown right after this list.)
  5. Continue training the pretrained synthesizer model on your dataset until it has converged.
  6. Using your new synthesizer model, preprocess your dataset to generate training data for the vocoder.
  7. Continue training the pretrained vocoder model on your dataset until the output is satisfactory.
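
To make step 4 more concrete, here is a rough sketch of the data-preparation idea. It assumes librosa and soundfile are installed; the paths, speaker/session names, and transcript format are only placeholders that mimic LibriSpeech, so check the toolbox's preprocessing code for the exact layout it expects.

```python
# Sketch only: reorganize raw recordings into a LibriSpeech-like layout
# (speaker/session/utterance) with 16 kHz mono WAVs plus a transcript file.
# All names and paths here are assumptions, not the toolbox's specification.
from pathlib import Path

import librosa
import soundfile as sf

RAW_DIR = Path("my_recordings")  # raw .wav files with matching .txt transcripts (assumed)
OUT_DIR = Path("datasets_root/MyAccent/speaker-001/session-001")
OUT_DIR.mkdir(parents=True, exist_ok=True)

transcript_lines = []
for i, wav_path in enumerate(sorted(RAW_DIR.glob("*.wav"))):
    # Resample to 16 kHz mono, the sample rate used by the pretrained models.
    audio, _ = librosa.load(wav_path, sr=16000, mono=True)
    utt_id = f"speaker-001-session-001-{i:04d}"
    sf.write(OUT_DIR / f"{utt_id}.wav", audio, 16000)

    # Read the matching transcript (assumed to sit next to the wav as <name>.txt).
    text = wav_path.with_suffix(".txt").read_text(encoding="utf-8").strip().upper()
    transcript_lines.append(f"{utt_id} {text}")

# LibriSpeech-style combined transcript file for this session.
(OUT_DIR / "speaker-001-session-001.trans.txt").write_text(
    "\n".join(transcript_lines) + "\n", encoding="utf-8"
)
```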

With luck, your trained models will now generalize to your voice and impart the desired accent. There are no guarantees this will work.

If you succeed, please share your models and I will add them to the list in #400.
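
For steps 5 to 7, "continue training" simply means resuming from the pretrained checkpoint on your new data, usually with a smaller learning rate than training from scratch. The toolbox's own training scripts handle this when pointed at the pretrained models; the snippet below is only a generic PyTorch illustration of the idea, with a tiny stand-in model and random data so it runs on its own.

```python
# Conceptual fine-tuning sketch in plain PyTorch (not the toolbox's training code).
import torch
import torch.nn as nn

# Stand-in "model"; the real synthesizer is a Tacotron-style network.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

# Fine-tuning idea: load the pretrained weights instead of starting from scratch.
# (The checkpoint path is a placeholder.)
# model.load_state_dict(torch.load("pretrained_synthesizer.pt", map_location="cpu"))

# Use a lower learning rate than when training from scratch.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.MSELoss()

model.train()
for step in range(100):                  # keep going until the loss plateaus
    mel_in = torch.randn(16, 80)         # stand-in for a batch from your dataset
    mel_target = torch.randn(16, 80)
    optimizer.zero_grad()
    loss = loss_fn(model(mel_in), mel_target)
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "finetuned_synthesizer.pt")  # save checkpoints as you go
```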


ash1407 commented Jul 18, 2020

(quotes the fine-tuning guidance above)

I will give it a try. Thanks for the guidance, friend.


ghost commented Jul 23, 2020

@ash1407 Are you still trying? When you get to step 4 (synthesizer preprocessing on new dataset), pull the latest master. The #441 changes should make this step a lot easier.

If using AccentDB, will you fine-tune a single accent or just throw them all into the mix? It would be interesting to find out whether this is enough voices to generalize well for cloning. Also see my latest reply in #437; it is a promising result to see the synthesizer acquire the accent after a small number of steps (with the caveat that I fine-tuned with data from a single speaker).

[screenshot of the fine-tuning result]


ash1407 commented Jul 23, 2020

(quotes the reply above)

I don't have an NVIDIA GPU. Any idea which GPU I should purchase for machine learning? (I have a budget of 4,000 INR.)


ghost commented Jul 23, 2020

@ash1407 My fine-tuning in #437 is done using CPU only, and the models are converging quickly enough. Do not get a GPU unless you find it to be much too slow.
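
If you want to confirm what PyTorch will use, a quick check looks like this (the toolbox already requires PyTorch, so nothing extra is needed); it falls back to the CPU when no CUDA device is found:

```python
# Check whether a CUDA-capable GPU is visible to PyTorch.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training will run on: {device}")
```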


ghost commented Jul 26, 2020

So I've got some good news and bad news.


ghost commented Jul 29, 2020

@ash1407 If you're not working on this actively then I'll close the issue for now. Reopen it when you're ready to give it a try.

ghost closed this as completed Jul 29, 2020
ghost mentioned this issue Oct 8, 2021
@Vinotha638

(quotes the fine-tuning guidance above)

Has anyone got results for training an Indian accent? Please let me know.

@shah0eer

(quotes the fine-tuning guidance above)

Hi, I have looked through your comments. I need to clone my own voice with its accent so I can produce speech from text. Can you share step-by-step directions? I also opened issue #1228.
