Single speaker fine-tuning process and results #437

Closed

ghost opened this issue Jul 22, 2020 · 72 comments

@ghost

ghost commented Jul 22, 2020

Summary

A relatively easy way to improve the quality of the toolbox output is to fine-tune the multispeaker pretrained models on a dataset of a single target speaker. Although it is no longer voice cloning, it is a shortcut for obtaining a single-speaker TTS model with less training data than training from scratch would require. This idea is not original, but a sample single-speaker model is presented here along with a process and data for replicating it.

Improvement in quality is obtained by taking the pretrained synthesizer model and training a few thousand steps on a single-speaker dataset. This amount of training can be done in less than a day on a CPU, and even faster with a GPU.

Procedure

Pretrained models and all files and commands needed to replicate this training can be found here: https://www.dropbox.com/s/bf4ti3i1iczolq5/logs-singlespeaker.zip?dl=0

  1. First, create a dataset of a single speaker from LibriSpeech. All embeddings are updated to reference the same file. (I'm not sure if this helps or not, but the idea is to get it to converge faster.)
    • It doesn't have to be LibriSpeech. This demonstrates the concept with minimal changes to existing files.
    • Total of 13.28 minutes (train-clean-100/211/122425/*)
  2. Next, continue training of the pretrained synthesizer model using the restricted dataset. Running overnight on a CPU, loss decreased from 0.70 to 0.50 over 2,600 steps. I plan to go further in subsequent tests.
  3. Generate new training data for the vocoder using the updated synthesizer model.
  4. Continue training of the pretrained vocoder. I only added 1,000 steps for now because I was eager to see if it worked, but the difference is noticeable even with a little fine-tuning.
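
For reference, the four steps above correspond roughly to the following commands (a sketch assuming the default repository layout and a copied model folder named logs-singlespeaker; the exact files and commands are in the Dropbox archive):

# 1. Preprocess the single-speaker data into SV2TTS format (mels and embeds)
python synthesizer_preprocess_audio.py <datasets_root>
python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer

# 2. Continue training the copied pretrained synthesizer on the restricted dataset
python synthesizer_train.py singlespeaker <datasets_root>/SV2TTS/synthesizer

# 3. Generate new vocoder training data with the updated synthesizer
python vocoder_preprocess.py <datasets_root>

# 4. Continue training the vocoder (after copying the pretrained vocoder weights
#    into a matching run folder so training resumes rather than starting fresh)
python vocoder_train.py singlespeaker <datasets_root>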

Results

Download audio samples: samples.zip

These were generated with demo_toolbox.py and demonstrate the effect of synthesizer fine-tuning. "Pretrained" uses the original models, and "singlespeaker" uses the fine-tuned synthesizer model with the original vocoder model. I found the #432 changes helpful for benchmarking: all samples are generated with seed=1 and silence trimming disabled. The single-speaker model is noticeably better, with fewer long gaps and artifacts on short utterances. However, gaps still occur sometimes: one example is "this is a big red apple." Output is also somewhat better with a fine-tuned vocoder model, though no samples with the new vocoder are shared at this time.

Discussion

This work helps to demonstrate the following points:

  1. Deficiencies in the synthesizer and its pretrained model can be compensated for, to some extent, by fine-tuning on a single speaker. This is much easier than implementing a new synthesizer and requires far less training.
  2. A small dataset of 0.2 hours is sufficient for fine-tuning the synthesizer.
  3. Better single-speaker performance can be obtained with just a few thousand steps of additional synthesizer training.

The major obstacle preventing single-speaker fine-tuning is the lack of a suitable tool for creating a custom dataset. The existing preprocessing scripts are suited to batch processing of organized, labeled datasets, and are not helpful unless the target speaker is already part of a supported dataset. The preprocessing does not need to be fully automated, because a small dataset on the order of 100 utterances is sufficient for fine-tuning. I am going to write a tool that will allow users to manually select or record files to add to a custom dataset, and that facilitates transcription (maybe using DeepSpeech). This tool will be hosted in a separate repository.
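
As a rough illustration of the transcription part, here is a minimal sketch using the DeepSpeech Python package (the model path and file names are hypothetical, and the eventual tool may work differently):

import wave
import numpy as np
from deepspeech import Model

# Load a released DeepSpeech acoustic model (path is hypothetical)
ds = Model("deepspeech-models.pbmm")

def transcribe(wav_path):
    # DeepSpeech expects 16 kHz, 16-bit mono PCM audio
    with wave.open(wav_path, "rb") as wav:
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    return ds.stt(audio)

# Write a draft transcript next to each utterance for manual correction
with open("utterance-001.txt", "w") as f:
    f.write(transcribe("utterance-001.wav"))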

Acknowledgements

@ghost
Author

ghost commented Jul 23, 2020

Pretrained synthesizer + 200 steps of training on VCTK p240 samples (0.34 hours of speech). Still using the original vocoder model. This is just a few minutes of CPU time for fine-tuning. It is remarkable that the synthesizer is already imparting the accent on the result. This is good news for anyone who is fine-tuning an accent: it should not take too long, even for multispeaker.

I did notice a lot more gaps and sound artifacts than usual with the finetuned model (this result is cherry-picked). Is it because I did not hardcode all the samples to a single utterance embedding?

samples_vctkp240_200steps.zip

@ghost
Author

ghost commented Jul 24, 2020

Single-speaker finetuning using VCTK dataset: samples_vctkp240.zip

Here are some samples from the latest experiment. VCTK p240 is used to add 4.4k steps to the synthesizer, and 1.0k to the vocoder. Synthesized audios have filename speaker_utterance_SYN_VOC.wav and use all combinations of pretrained ("pre") and finetuned ("fin") models for the synthesizer and vocoder, respectively.

Synthesized utterances using speaker p240's hardcoded embedding (derived from p240_001_mic1.flac) show the success in finetuning to match the voice, including the accent. Samples made from speaker p260's embedding demonstrate how much quality is lost when finetuning a single-speaker model.

In these samples, the synthesizer has far more impact on quality, though this result could be due to insufficient finetuning of the vocoder. Though the finetuned vocoder has only a slight advantage over the original for p240, it severely degrades voice cloning quality for p260.

Also compare to the samples for p240 and p260 in the Google SV2TTS paper: https://google.github.io/tacotron/publications/speaker_adaptation/

Replicating this experiment

Here is a preprocessed p240 dataset if you would like to repeat this experiment. The embeds for utterances 002-380 are overwritten with the one for 001, as the hardcoding makes for a more consistent result. Use the audio file p240_001.flac to generate embeddings for inference. The audios are not included to keep the file size down, so if you care to do vocoder training you will need to get and preprocess VCTK.
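
For anyone building a similar dataset from scratch, the embed hardcoding amounts to copying one embedding file over the others. A minimal sketch, assuming the SV2TTS folder layout and hypothetical embed file names:

from pathlib import Path
import shutil

embed_dir = Path("dataset_p240/SV2TTS/synthesizer/embeds")
reference = embed_dir / "embed-p240_001_mic1.npy"   # embedding to keep (file name is an assumption)

# Overwrite every other utterance embedding with the reference embedding
for embed_file in embed_dir.glob("embed-*.npy"):
    if embed_file != reference:
        shutil.copyfile(reference, embed_file)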

Directions:

  1. Copy the folder synthesizer/saved_models/logs-pretrained to logs-vctkp240 in the same location. This will make a copy of your pretrained model to be finetuned.
  2. Unzip the dataset files to dataset_p240 in your Real-Time-Voice-Cloning folder (or somewhere else if you desire)
  3. Train the model: python synthesizer_train.py vctkp240 dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100
  4. Let it run for 200 to 400 iterations, then stop the program.
    • This should complete in a reasonable amount of time even on CPU.
    • You can safely stop and resume training at any time, though you will lose any progress since the last checkpoint
  5. Test the finetuned model in the toolbox using dataset_p240/p240_001.flac to generate the embedding

@mbdash
Collaborator

mbdash commented Jul 24, 2020

Wow that is amazing... I only asked your opinion and you actually did it!

The difference is incredible.

Now I just need to dumb down all you wrote to be able to reproduce it.

Also try your_input_text.replace('hi', 'eye'); it is a little cheat that I find gives better results currently.
At least in the multi-speaker model.

@ghost
Author

ghost commented Jul 24, 2020

Now I just need to dumb down all you wrote to be able to reproduce it.

@mbdash In the first post I included a dropbox link that has fairly detailed instructions for the single-speaker LibriSpeech example. You can try that and ask if you have any trouble reproducing the results. If you want VCTKp240 I can make a zip file for you tomorrow.

This was much easier and faster than expected. I am sharing the results to generate interest, so we can collaborate on how much training is needed, best values of hparams, etc.

@mbdash
Collaborator

mbdash commented Jul 24, 2020

Thank you,
I will look at it tomorrow morning. I am only staying up for a few more minutes, and I am a bit too tired to think straight right now.

Tonight I am trying to keep it simple and see if I can jam a regular "hand-modeled" 3D head mesh into VOCA (Voice Operated Character Animation), another GitHub project.

Update: nope it exploded.

@ghost
Author

ghost commented Jul 26, 2020

Some general observations to share:

  1. Finetuning improves both quality and similarity with the target voice, and transfers accent.
  2. Decent single-speaker models require as little as 5 min of audio and 400 steps of synthesizer training.
  3. Finetuning the vocoder is not as impactful as finetuning the synthesizer. In fact, given the quality limitations of the underlying models (see the poor performance compared to the main paper, #411), I would not bother with additional vocoder training.

Also I did another experiment and trained the synthesizer for about 5,000 additional steps on the entire VCTK dataset (trying to help out on #388). The accent still does not transfer for zero-shot cloning. I suspect the synthesizer needs to be trained from scratch if that is the goal.

P.S. @mbdash I updated the VCTKp240 post with a single-speaker dataset if you would like to try that out. #437 (comment)

@ghost
Author

ghost commented Jul 28, 2020

Also I did another experiment and trained the synthesizer for about 5,000 additional steps on the entire VCTK dataset (trying to help out on #388). The accent still does not transfer for zero-shot cloning. I suspect the synthesizer needs to be trained from scratch if that is the goal.

Changing my mind on training from scratch: I think we just need to add an extra input parameter to the synthesizer which indicates the accent, or more accurately the dataset it is trained on. A simple implementation might be a single bit representing LibriSpeech or VCTK. Next, finetune the existing models on VCTK with the added parameter. Then for inference, specify the dataset that you want the result to sound like. I'm at a loss how to implement this with the current set of models, but I think this repo will have clues: https://github.com/Tomiinek/Multilingual_Text_to_Speech
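
To make the idea concrete, here is a very rough sketch of what the "dataset bit" could look like at the embedding level. This is not how the current models work; the function and names are hypothetical:

import numpy as np

LIBRISPEECH, VCTK = 0.0, 1.0

def condition_embedding(speaker_embed: np.ndarray, dataset_flag: float) -> np.ndarray:
    # Append a one-element dataset indicator to the speaker embedding.
    # The synthesizer input layer would need one extra unit, and the existing
    # models would then be finetuned with this extra input.
    return np.concatenate([speaker_embed, [dataset_flag]])

# At inference time, ask for VCTK-style (British-accented) output:
# conditioned = condition_embedding(embed, VCTK)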

I'm all done with accent experiments for now but I hope this is helpful to anyone who wants to continue this work.

@Adam-Mortimer

Adam-Mortimer commented Jul 29, 2020

"I am going to write a tool that will allow users to manually select or record files to add to a custom dataset, and facilitate transcription (maybe using DeepSpeech). This tool will be hosted in a separate repository."

Thank you for all your hard work on this repo - even as an almost complete newcomer to deep learning, I've been able to decipher some things, but I'm still stymied by the inability to create custom datasets from scratch. Are you still working on this "custom dataset" tool that you mention here?

@Ori-Pixel

@blue-fish any reason why I'm getting the following error: "synthesizer_train.py: error: the following arguments are required: synthesizer_root"? I'm trying to run:

synthesizer_train.py H:\ttss\Real-Time-Voice-Cloning-master\dataset_p240\SV2TTS\synthesizer --checkpoint_interval 100

the second argument is the folder that contains embeds, mels, and train.txt

Never mind, I fixed it while writing this. The argument isn't --synthesizer_root like all of the other arguments, but actually just a positional synthesizer_root. Also, the above testing instructions are thus wrong (or at least not working for me). The command should be:

python synthesizer_train.py synthesizer_root dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100

(it at least bumped me to a dll error - still working through that one)

@ghost
Author

ghost commented Jul 30, 2020

I'm still stymied by the inability to create custom datasets from scratch. Are you still working on this "custom dataset" tool that you mention here?

Hi @Adam-Mortimer. The custom dataset tool is still planned, but currently on hold as I've just started working on #447 (switching out the synthesizer for fatchord's Tacotron). #447 will be bigger than all of my existing pull requests combined, if it ever gets finished. In other words, it's going to take quite some time.

I started writing the custom dataset tool for a voice cloning experiment. I didn't get very far with the tool before I added LibriTTS support in #441 which made it much easier to create a dataset by putting your data in this kind of directory structure:

datasets_root
    * LibriTTS
        * train-clean-100
            * speaker-001
                * book-001
                    * utterance-001.wav
                    * utterance-001.txt
                    * utterance-002.wav
                    * utterance-002.txt
                    * utterance-003.wav
                    * utterance-003.txt

Where each utterance-###.wav is a short utterance (2-10 sec) and the utterance-###.txt contains the corresponding transcript. Then you can process this dataset using:

python synthesizer_preprocess_audio.py datasets_root --datasets_name LibriTTS --subfolders train-clean-100 --no_alignments

When this completes, your dataset is in the SV2TTS format and subsequent preprocessing commands (synthesizer_preprocess_embeds.py, vocoder_preprocess.py) will work as described on the training wiki page.
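
If you are scripting the layout above, here is a small sketch that copies a flat folder of wav files plus a transcript CSV into the LibriTTS-style structure (the CSV format and all paths are assumptions for illustration):

import csv
import shutil
from pathlib import Path

src = Path("my_recordings")   # flat folder of wav files plus transcripts.csv (assumed)
dst = Path("datasets_root/LibriTTS/train-clean-100/speaker-001/book-001")
dst.mkdir(parents=True, exist_ok=True)

# transcripts.csv rows: filename,transcript   (assumed format)
with open(src / "transcripts.csv", newline="") as f:
    for row in csv.reader(f):
        filename, transcript = row[0], row[1]
        shutil.copyfile(src / filename, dst / filename)
        (dst / filename).with_suffix(".txt").write_text(transcript.strip())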

I would still like to write the custom dataset tool but I think #447 is a more pressing matter since the toolbox is incompatible with Python 3.8 due to our reliance on Tensorflow 1.x.

@ghost
Author

ghost commented Jul 30, 2020

@Ori-Pixel There was a problem with my command and I fixed it. If you are following everything to the letter it should be:

python synthesizer_train.py vctkp240 dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100

Where the first arg vctkp240 describes the path to the model you are training (in this case, it tells python to look for the model in synthesizer/saved_models/logs-vctkp240), and the second arg is the path to the location containing train.txt, and the mels and embeds folders. Please share your results and feel free to ask for help if you get stuck.

@Ori-Pixel

Ori-Pixel commented Jul 30, 2020

@blue-fish Thanks. Yeah, I can see that it's saving to a new directory. I'll run it again with the correct params and post results.

Also, thanks for the preprocessing tips you gave to @Adam-Mortimer. I was not looking forward to custom labeling, but it doesn't seem that bad if I only have ~200 lines / ~34 minutes. I'm trying to make a fake (semi-Gaelic) accent video game character say some lines, so I'll probably scrape the audio files from the wiki site, slap them into a folder structure like the above with a simple script, and then run this single-speaker fine-tuning again. And for the accent, I think I can just find a semi-close one in the VCTK dataset (although a 10 GB download will take me a few days, sadly).

@ghost
Author

ghost commented Jul 30, 2020

@Ori-Pixel If you have a GPU you can quickly run a few experiments to see how far you can trim the dataset before the audio quality breaks down. Simply delete lines from train.txt and they won't be used.

One of my experiments involved re-recording some of the VCTK p240 utterances with a different voice. 5 minutes of mediocre data (80 utterances) still resulted in a half-decent model. If the labeling is extremely tedious you can try training a model on part of it while continuing to label.
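
One quick way to run that kind of ablation is to keep only the first N lines of train.txt before restarting training. A minimal sketch, using the paths from this thread:

from pathlib import Path

train_file = Path("dataset_p240/SV2TTS/synthesizer/train.txt")
keep = 80   # roughly 5 minutes of speech, per the experiment above

lines = train_file.read_text().splitlines(keepends=True)
train_file.with_name("train.txt.bak").write_text("".join(lines))   # keep a backup
train_file.write_text("".join(lines[:keep]))                       # trimmed training set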

I have preprocessed VCTK, if you can make your decision based on a single recording, request up to 3 speakers and I'll put them on dropbox for you. https://www.dropbox.com/s/6ve00tjjaab4aqj/VCTK_samples.zip?dl=0

@ghost
Author

ghost commented Jul 30, 2020

Oh, and just to be clear, you cannot train the voice and accent independently at this time. The accent is associated with the voice via the speaker embedding. After #447, we will work on #230 to add the Mozilla TTS implementation of GSTs. That should allow us to generalize accents to new voices.

@Ori-Pixel

@blue-fish

I have preprocessed VCTK, if you can make your decision based on a single recording, request up to 3 speakers and I'll put them on dropbox for you. https://www.dropbox.com/s/6ve00tjjaab4aqj/VCTK_samples.zip?dl=0

Is there a list of their speakers somewhere? I was only able to find the 10 GB file, with not even a magnet link or anything denoting samples or file structure. Realistically, anything Irish, Scottish, or Gaelic would work. I may also look into downloading it directly to Drive (if possible) and even possibly training there (as far as I'm aware you can mount the drive and run bash).

Oh, and just to be clear, you cannot train the voice and accent independently at this time. The accent is associated with the voice via the speaker embedding. After #447, we will work on #230 to add the Mozilla TTS implementation of GSTs. That should allow us to generalize accents to new voices.

Yeah, I just meant using a VCTK pretrained speaker that wasn't horribly inconsistent with my single speaker's accent and then fine-tuning with my custom labeled lines on top.

I also have a couple of idle GPUs in my machine, but I always run into venv issues with GPU training, so I'll just use Colab if I really need a GPU. Too bad downloading from a link to

@ghost
Author

ghost commented Jul 30, 2020

Is there a list of their speakers somewhere?

The zip file I uploaded includes speaker-log.txt (part of the full VCTK dataset), which lists speaker metadata such as:

ID    AGE  GENDER  ACCENTS  REGION            COMMENTS
p225  23   F       English  Southern England
p226  22   M       English  Surrey
p227  38   M       English  Cumbria
p228  22   F       English  Southern England

@Ori-Pixel

Ah, I see. I'll give it a look tomorrow along with the results and let you know then. Thanks again for being so active!

@Ori-Pixel

Ori-Pixel commented Jul 30, 2020

@blue-fish p261 is relatively close. If I could get that slice, that would be very helpful (my internet at my current house is sadly 1 MB/s).

I trained as per the instructions above. Sadly, I didn't get to see the console output, as my power went out after about an hour or so, but I did get this in the training logs, so I think this is as far as it trained.

[2020-07-30 01:36:31.676] Step 278202 [28.894 sec/step, loss=0.64379, avg_loss=0.64339]

Also, just to make sure I did the test correctly, this is the command I used:

python demo_toolbox.py -d H:\ttss\Real-Time-Voice-Cloning-master\dataset_p240

Where random seed = 1, enhanced vocoder output is checked, and the embedding was generated from p240_1.flac.

resulting audio: https://raw.githubusercontent.com/Ori-Pixel/files/master/welcome_to_toolbox_fine_tuned.flac

@ghost
Author

ghost commented Jul 30, 2020

@Ori-Pixel

Here is the dataset in the same format as p240 (embeds overwritten with the one corresponding to p261_001.flac): https://www.dropbox.com/s/o6fz2r6w56djwkf/dataset_p261.zip?dl=0

resulting audio: https://raw.githubusercontent.com/Ori-Pixel/files/master/welcome_to_toolbox_fine_tuned.flac

Your results sound American to me. Check that you are using the new synthesizer model, then try this text: Take a look at these pages for crooked creek drive.
And compare to my results for 200 steps: #437 (comment)

@Ori-Pixel

Ori-Pixel commented Jul 30, 2020

Check that you are using the new synthesizer model

Ah, I didn't have that drop down selected. My results are then this, with the same settings:

https://raw.githubusercontent.com/Ori-Pixel/files/master/take%20a%20look%20at%20these%20pages%20for%20crooked%20creek%20drive%20fine%20tuned.flac

I'm also following your comment above and trying to prepare my own dataset, but at first I got a "datasets root folder doesn't exist" error, so I made the folder and added my files, but when I run the preprocessing I get:

Arguments:
datasets_root:   datasets_root
out_dir:         datasets_root\SV2TTS\synthesizer
n_processes:     None
skip_existing:   False
hparams:
no_alignments:   False
datasets_name:   LibriTTS
subfolders:      train-clean-100
Using data from:
datasets_root\LibriTTS\train-clean-100
LibriTTS:   0%|                                                                            | 0/1 [00:00<?, ?speakers/s]2

gpu warnings here

LibriTTS: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.20s/speakers]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
  File "synthesizer_preprocess_audio.py", line 59, in <module>
    preprocess_dataset(**vars(args))
  File "H:\ttss\Real-Time-Voice-Cloning-master\synthesizer\preprocess.py", line 49, in preprocess_dataset
    print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence


Utterances have just the text that was spoken in them, so utterance-000.txt contains Let's have some fun, shall we...

edit: I assume I will need to go through the training docs and start by training the encoder?

@ghost
Author

ghost commented Jul 30, 2020

@Ori-Pixel You also need to add the --no_alignments option to use a non-LibriSpeech dataset that doesn't have an alignments file. I've also fixed the command in the instructions above. Sorry for leaving that out earlier.

python synthesizer_preprocess_audio.py datasets_root --datasets_name LibriTTS --subfolders train-clean-100 --no_alignments

Edit: If preprocessing completes without finding a wav file, we should remind the user to pass the --no_alignments flag. Or possibly default it to True if the datasets_name is not LibriSpeech.
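
A minimal sketch of that reminder as it might sit in synthesizer/preprocess.py (the hook point and variable names are assumptions):

def warn_if_empty(metadata, no_alignments):
    # Intended to run at the end of preprocess_dataset(), once every speaker
    # folder has been processed and before the summary statistics are printed.
    if len(metadata) == 0 and not no_alignments:
        print("No utterances were found. If your dataset has no alignment files "
              "(i.e. it is not LibriSpeech), re-run with the --no_alignments flag.")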

@Ori-Pixel

@blue-fish Okay, so I got it to train, and I can also train my own dataset for the synthesizer. Really thankful for the help. Here's a result from 200 steps of training if you're interested:

https://raw.githubusercontent.com/Ori-Pixel/files/master/crooked_creek_dw.flac

https://raw.githubusercontent.com/Ori-Pixel/files/master/biggest_oversight.flac

@ghost
Author

ghost commented Jul 31, 2020

@Ori-Pixel Nice! It's remarkable how much that voice comes through after 200 steps of finetuning. In my own experiments going up to 400 steps yields a noticeable improvement in the voice quality. More than 400 doesn't seem to help, though it doesn't hurt either.

Edit: You trained on CPU right? How long did it take?

@Ori-Pixel

Ori-Pixel commented Jul 31, 2020

@blue-fish I did train on CPU (autocorrect!!). (I always have issues with GPU setup; luckily I'm building a new PC when the 30xx cards drop with the new Zen 2 AMD CPUs.) After trying to train from 200 to 400 steps, it would seem that it takes ~25 s per step after 20 steps, so around 2 hours for 200 steps on an i5 4690K.

The next steps for me would be encoder/vocoder training, but I don't want to invest the compute power since I'm working on another NLP problem for my actual research (sentiment analysis). I'll let it run overnight again and this time see how far it gets :)

Edit: as @blue-fish said, it seems training to 400 steps made a large difference. Here's an example of the same voice as above, but with 400 steps of training on my own collected voice samples, continuing from the p261 setup:

original voice: https://raw.githubusercontent.com/Ori-Pixel/files/master/Vo_dark_willow_sylph_attack_14.mp3
200 steps: https://raw.githubusercontent.com/Ori-Pixel/files/master/biggest_oversight.flac
400 steps: https://raw.githubusercontent.com/Ori-Pixel/files/master/dark%20willow%20400.flac

@adfost

adfost commented Aug 3, 2020

@blue-fish I did exactly what you said. After over 10,000 steps with the synthesizer, I open the toolbox, type the text to convert into the box, and I get some unrelated, almost incomprehensible rambling as output.

@ghost
Author

ghost commented Aug 3, 2020

@adfost Which set of instructions are you following? LibriSpeech (#437 (comment)) or VCTKp240 (#437 (comment))?

Most likely, when you run synthesizer_train.py it cannot find the pretrained model so it starts training a new synthesizer model from scratch. Please make sure you copied the entire contents of synthesizer/saved_models/logs-pretrained to another "logs-XXXX" folder in the same location, and specify the name (XXXX) to synthesizer_train.py as the first argument.
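
Concretely, the setup looks something like this (a sketch; XXXX is whatever run name you choose, and <datasets_root> is your own path — on Windows, copy the folder in Explorer instead of using cp):

# Copy the pretrained synthesizer so finetuning resumes from it instead of starting fresh
cp -r synthesizer/saved_models/logs-pretrained synthesizer/saved_models/logs-XXXX

# Resume training from the copied checkpoint
python synthesizer_train.py XXXX <datasets_root>/SV2TTS/synthesizer --checkpoint_interval 100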

@tiomaldy

Can this be done for another language with a single speaker?

@ghost
Author

ghost commented Oct 7, 2021

I'm still stymied by the inability to create custom datasets from scratch. Are you still working on this "custom dataset" tool that you mention here?

For those recording their own utterances, this is a useful tool: https://github.com/MycroftAI/mimic-recording-studio

Another dataset recording tool: https://github.com/babua/TTSDatasetRecorder

@prince6635

Does anyone have the Dropbox links? They're invalid right now.

@maophp

maophp commented Jul 2, 2022

the "https ://www.dropbox.com/s/bf4ti3i1iczolq5/logs-singlespeaker.zip?dl=" lost, please fix it tks guys.

@samoliverschumacher

I've made public a repo with a workflow for creating a dataset to perform synthesizer fine tuning.

Not sure if this is the best place to let people know, but hopefully it helps someone.

This issue was closed.