Single speaker fine-tuning process and results #437
Pretrained synthesizer + 200 steps of training on VCTK p240 samples (0.34 hours of speech). Still using the original vocoder model. This is just a few minutes of CPU time for fine-tuning. It is remarkable that the synthesizer is already imparting the accent to the result. This is good news for anyone who is fine-tuning an accent: it should not take too long, even for multispeaker. I did notice a lot more gaps and sound artifacts than usual with the finetuned model (this result is cherry-picked). Is it because I did not hardcode all the samples to a single utterance embedding?
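For anyone unfamiliar with the "hardcoding" mentioned here: it means reusing the embedding of one reference utterance for every utterance in the dataset. A minimal sketch of how that could be done, assuming the SV2TTS folder layout produced by preprocessing (the paths and file names below are placeholders, not taken from this thread):

```python
# Minimal sketch: overwrite every utterance embedding with the one derived from a
# single reference utterance ("hardcoding" the speaker embedding).
# Paths and the file-naming pattern are assumptions, not from the original post.
from pathlib import Path
import shutil

embed_dir = Path("dataset_p240/SV2TTS/synthesizer/embeds")
reference = embed_dir / "embed-p240_001_mic1.npy"   # embedding of the reference utterance

for embed_file in embed_dir.glob("*.npy"):
    if embed_file != reference:
        shutil.copyfile(reference, embed_file)       # reuse the reference embedding
```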
Single-speaker finetuning using VCTK dataset: samples_vctkp240.zip

Here are some samples from the latest experiment. VCTK p240 is used to add 4.4k steps to the synthesizer, and 1.0k to the vocoder. Synthesized audios have filename

Synthesized utterances using speaker p240's hardcoded embedding (derived from p240_001_mic1.flac) show the success in finetuning to match the voice, including the accent. Samples made from speaker p260's embedding demonstrate how much quality is lost when finetuning a single-speaker model. In these samples, the synthesizer has far more impact on quality, though this result could be due to insufficient finetuning of the vocoder. Though the finetuned vocoder has only a slight advantage over the original for p240, it severely degrades voice cloning quality for p260.

Also compare to the samples for p240 and p260 in the Google SV2TTS paper: https://google.github.io/tacotron/publications/speaker_adaptation/

Replicating this experiment

Here is a preprocessed p240 dataset if you would like to repeat this experiment. The embeds for utterances 002-380 are overwritten with the one for 001, as the hardcoding makes for a more consistent result. Use the audio file

Directions:
Wow, that is amazing... I only asked your opinion and you actually did it! The difference is incredible. Now I just need to dumb down all you wrote to be able to reproduce it. Also, try `your_input_text.replace('hi', 'eye')`; it is a little cheat that I find gives better results currently.
@mbdash In the first post I included a Dropbox link that has fairly detailed instructions for the single-speaker LibriSpeech example. You can try that and ask if you have any trouble reproducing the results. If you want VCTK p240, I can make a zip file for you tomorrow. This was much easier and faster than expected. I am sharing the results to generate interest, so we can collaborate on how much training is needed, the best values of hparams, etc.
Thank you. Tonight I am trying to keep it simple and see if I can jam a regular "hand modeled" 3D head mesh into VOCA (Voice Operated Character Animation), another GitHub project.

Update: nope, it exploded.
Some general observations to share:
Also, I did another experiment and trained the synthesizer for about 5,000 additional steps on the entire VCTK dataset (trying to help out on #388). The accent still does not transfer for zero-shot cloning. I suspect the synthesizer needs to be trained from scratch if that is the goal.

P.S. @mbdash I updated the VCTK p240 post with a single-speaker dataset if you would like to try that out: #437 (comment)
Changing my mind on training from scratch: I think we just need to add an extra input parameter to the synthesizer which indicates the accent, or more accurately the dataset it is trained on. A simple implementation might be a single bit representing LibriSpeech or VCTK. Next, finetune the existing models on VCTK with the added parameter. Then, for inference, specify the dataset that you want the result to sound like. I'm at a loss as to how to implement this with the current set of models, but I think this repo will have clues: https://github.com/Tomiinek/Multilingual_Text_to_Speech

I'm all done with accent experiments for now, but I hope this is helpful to anyone who wants to continue this work.
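This accent/dataset flag was never implemented in the thread; the following is only a minimal sketch of the idea, assuming the flag is appended to the 256-dimensional speaker embedding before it is fed to the synthesizer (the function, dimensions, and dataset IDs are illustrative, not part of the repo):

```python
import numpy as np

def condition_embedding(speaker_embed: np.ndarray, dataset_id: int, num_datasets: int = 2) -> np.ndarray:
    """Append a one-hot dataset/accent flag to the speaker embedding.

    dataset_id: 0 = LibriSpeech, 1 = VCTK (hypothetical convention).
    The synthesizer's embedding input would need to grow by num_datasets units.
    """
    flag = np.zeros(num_datasets, dtype=speaker_embed.dtype)
    flag[dataset_id] = 1.0
    return np.concatenate([speaker_embed, flag])

# Example: condition a 256-dim embedding as "VCTK accent"
embed = np.random.rand(256).astype(np.float32)
conditioned = condition_embedding(embed, dataset_id=1)
assert conditioned.shape == (258,)
```

At inference time one would then pass the flag for the accent the result should sound like, which matches the "specify the dataset" step described above.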
Thank you for all your hard work on this repo - even as an almost complete newcomer to deep learning, I've been able to decipher some things, but I'm still stymied by the inability to create custom datasets from scratch. Are you still working on the "custom dataset" tool that you mention here?
@blue-fish Any reason why I'm getting the following error: "synthesizer_train.py: error: the following arguments are required: synthesizer_root"? I'm trying to run `synthesizer_train.py H:\ttss\Real-Time-Voice-Cloning-master\dataset_p240\SV2TTS\synthesizer --checkpoint_interval 100`, where the second argument is the folder that contains embeds, mels, and train.txt.

Never mind, I fixed it while writing this. The argument isn't --synthesizer_root like all of the other arguments, but a positional synthesizer_root. Also, the above testing instructions are thus wrong (or at least not working for me). The command should be:

(It at least bumped me to a DLL error; still working through that one.)
Hi @Adam-Mortimer. The custom dataset tool is still planned, but currently on hold as I've just started working on #447 (switching out the synthesizer for fatchord's Tacotron). #447 will be bigger than all of my existing pull requests combined if it ever gets finished. In other words, it's going to take quite some time. I started writing the custom dataset tool for a voice cloning experiment. I didn't get very far with the tool before I added LibriTTS support in #441, which made it much easier to create a dataset by putting your data in this kind of directory structure:
Where each
When this completes, your dataset is in the SV2TTS format and the subsequent preprocessing commands can be run as usual.

I would still like to write the custom dataset tool, but I think #447 is a more pressing matter since the toolbox is incompatible with Python 3.8 due to our reliance on Tensorflow 1.x.
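The directory tree from the original comment did not survive extraction. As a rough illustration of the kind of layout described (folder names are placeholders; the utterance-NNN.wav/.txt pairing follows the naming that appears later in this thread):

```
datasets_root/
└── MyDataset/                     # hypothetical dataset name
    └── speaker-001/               # one folder per speaker
        └── book-001/              # one folder per source or session
            ├── utterance-000.wav
            ├── utterance-000.txt  # plain-text transcript of utterance-000.wav
            ├── utterance-001.wav
            └── utterance-001.txt
```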
@Ori-Pixel There was a problem with my command and I fixed it. If you are following everything to the letter, it should be:
Where the first arg
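The command itself was lost from this comment. Based on the arguments discussed above, a plausible reconstruction is shown below; the run name `pretrained` is an assumption (chosen so that training resumes from the pretrained checkpoint), and the Windows path is the one Ori-Pixel used:

```
python synthesizer_train.py pretrained H:\ttss\Real-Time-Voice-Cloning-master\dataset_p240\SV2TTS\synthesizer --checkpoint_interval 100
```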
@blue-fish Thanks. Yeah, I can see that it's saving to a new directory; I'll run it again with the correct params and post results. Also, thanks for the preprocessing tips you gave to @Adam-Mortimer. I was not looking forward to custom labeling, but it doesn't seem that bad if I only have ~200 lines / ~34 minutes. I'm trying to make a fake (semi-Gaelic) accent video game character say some lines, so I'll probably scrape the audio files from the wiki site, slap them into a folder structure like the one above with a simple script, and then run this single-speaker fine-tuning again. As for the accent, I think I can just find a semi-close one in the VCTK dataset (although a 10 GB download will take me a few days, sadly).
@Ori-Pixel If you have a GPU you can quickly run a few experiments to see how far you can trim the dataset before the audio quality breaks down. Simply delete lines from train.txt and they won't be used, as sketched below. One of my experiments involved re-recording some of the VCTK p240 utterances with a different voice; 5 minutes of mediocre data (80 utterances) still resulted in a half-decent model. If the labeling is extremely tedious, you can try training a model on part of it while continuing to label. I have preprocessed VCTK. If you can make your decision based on a single recording, request up to 3 speakers and I'll put them on Dropbox for you: https://www.dropbox.com/s/6ve00tjjaab4aqj/VCTK_samples.zip?dl=0
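A minimal sketch of the train.txt trimming mentioned above, keeping only the first N utterances of the synthesizer metadata (the path is a placeholder for your own SV2TTS folder):

```python
# Keep only the first n_keep utterances; the remaining lines are simply not used
# for training. Path is a placeholder.
n_keep = 80
metadata = "dataset_p240/SV2TTS/synthesizer/train.txt"

with open(metadata) as f:
    lines = f.readlines()
with open(metadata, "w") as f:
    f.writelines(lines[:n_keep])
```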
Is there a list of their speakers somewhere? I was only able to find the 10 GB file, with not even a magnet link or anything denoting samples or file structure. Realistically, anything Irish, Scottish, or Gaelic would work. I may also look into downloading it directly to Drive (if possible) and even possibly training there (if possible -- as far as I'm aware you can mount the drive and run bash).
Yeah, I just meant using a VCTK-pretrained model that wasn't horribly inconsistent with my single speaker's accent and then fine-tuning with my custom labeled lines on top. I also have a couple of idle GPUs in my machine, but I always run into venv issues with GPU training, so I'll just use Colab if I really need a GPU. Too bad downloading from a link to
The zip file I uploaded includes
Ah, I see. I'll give it a look tomorrow along with the results and let you know then. Thanks again for being so active!
@blue-fish p261 is relatively close. If I could get that slice, that would be very helpful (my internet at my current house is sadly 1 MB/s). I trained as per the instructions above; sadly, I didn't get to see the console output as my power went out after about an hour or so, but I did get this in the training logs, so I think this is as far as it trained.
Also, just to make sure I did the test, this is the cmd I used:
Where random seed = 1, enhanced vocoder output is checked, and the embedding was from p240_1.flac.

Resulting audio: https://raw.githubusercontent.com/Ori-Pixel/files/master/welcome_to_toolbox_fine_tuned.flac
Here is the dataset in the same format as p240 (embeds overwritten with the one corresponding to p261_001.flac): https://www.dropbox.com/s/o6fz2r6w56djwkf/dataset_p261.zip?dl=0
Your results sound American to me. Check that you are using the new synthesizer model, then try this text:
Ah, I didn't have that dropdown selected. My results are then this, with the same settings:

I'm also taking your comment above and trying to train my own dataset, but at first I got a "dataset roots folder doesn't exist" error, so I made the folder and added my files. But when I go to train, I get:
gpu warnings here
Utterances have just the text that was spoken in them, so utterance-000.txt contains

Edit: I assume I will need to go through the training docs and start by training the encoder?
@Ori-Pixel You also need to add the
Edit: If preprocessing completes without finding a wav file, we should remind the user to pass the
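The flag name referred to in the last two comments was stripped during extraction. Assuming it is the `--no_alignments` option added with the LibriTTS support (an assumption, since the name did not survive), preprocessing a custom dataset laid out as sketched earlier would look roughly like:

```
# Hypothetical invocation; --no_alignments is an assumed flag name and <datasets_root>
# is a placeholder for your dataset roots folder.
python synthesizer_preprocess_audio.py <datasets_root> --no_alignments
python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer
```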
@blue-fish Okay, so I got it to train, and I can also train my own dataset for the synthesizer. Really thankful for the help. Here's a result from 200 steps of training if you're interested:

https://raw.githubusercontent.com/Ori-Pixel/files/master/crooked_creek_dw.flac
https://raw.githubusercontent.com/Ori-Pixel/files/master/biggest_oversight.flac
@Ori-Pixel Nice! It's remarkable how much that voice comes through after 200 steps of finetuning. In my own experiments, going up to 400 steps yields a noticeable improvement in voice quality. More than 400 doesn't seem to help, though it doesn't hurt either.

Edit: You trained on CPU, right? How long did it take?
@blue-fish I did train on CPU (autocorrect!!). I always have issues with GPU setup; luckily I'm building a new PC when the 30xx cards drop with the new Zen 2 AMD CPUs. After training from 200 to 400 steps, it seems to take ~25 s per step after the first 20 steps, so around 2 hours for 200 steps on an i5 4690K. The next steps for me would be encoder/vocoder training, but I don't want to invest the compute power since I'm working on another NLP problem for my actual research (sentiment analysis). I'll let it run overnight again and this time see how far it gets :)

Edit: As @blue-fish said, it seems training to 400 steps made a large difference. Here's an example of the same voice as above, but with 400 steps of training the p261 set on my own collected voice samples.

Original voice: https://raw.githubusercontent.com/Ori-Pixel/files/master/Vo_dark_willow_sylph_attack_14.mp3
@blue-fish I did exactly what you said. After over 10,000 steps with the synthesizer, I open the toolbox, type the text to convert into the box, and I get some unrelated text in an almost incomprehensible ramble.
@adfost Which set of instructions are you following? LibriSpeech (#437 (comment)) or VCTK p240 (#437 (comment))? Most likely, when you run
Does this also work for another language with a single speaker?
Another dataset recording tool: https://github.com/babua/TTSDatasetRecorder
Does anyone have the Dropbox links? They're invalid right now.
the "https ://www.dropbox.com/s/bf4ti3i1iczolq5/logs-singlespeaker.zip?dl=" lost, please fix it tks guys. |
I've made public a repo with a workflow for creating a dataset to perform synthesizer fine-tuning. Not sure if this is the best place to let people know, but hopefully it helps someone.
Summary
A relatively easy way to improve the quality of the toolbox output is to fine-tune the multispeaker pretrained models on a dataset of a single target speaker. Although it is no longer voice cloning, it is a shortcut for obtaining a single-speaker TTS model with less training data than training from scratch would need. This idea is not original, but a sample single-speaker model is presented here along with a process and data for replicating it.
The improvement in quality is obtained by taking the pretrained synthesizer model and training it for a few thousand steps on a single-speaker dataset. This amount of training can be done in less than a day on a CPU, and even faster with a GPU.
Procedure
Pretrained models and all files and commands needed to replicate this training can be found here: https://www.dropbox.com/s/bf4ti3i1iczolq5/logs-singlespeaker.zip?dl=0
Results
Download audio samples: samples.zip
These are generated with demo_toolbox.py and demonstrate the effect of synthesizer fine-tuning. "Pretrained" uses the original models, and "singlespeaker" uses the fine-tuned synthesizer model with the original vocoder model. I found the #432 changes helpful for benchmarking: all samples are generated with seed=1 and no trimming of silences. The single-speaker model is noticeably better, with fewer long gaps and artifacts for short utterances. However, gaps still occur sometimes: one example is "this is a big red apple." Output is also somewhat better with a fine-tuned vocoder model, though no samples with the new vocoder are shared at this time.
Discussion
This work helps to demonstrate the following points:
The major obstacle preventing single-speaker fine-tuning is the lack of a suitable tool for creating a custom dataset. The existing preprocessing scripts are suited to batch processing of organized, labeled datasets and are not helpful unless the target speaker is already part of a supported dataset. The preprocessing does not need to be fully automated, because a small dataset on the order of 100 utterances is sufficient for fine-tuning. I am going to write a tool that will allow users to manually select or record files to add to a custom dataset, and facilitate transcription (maybe using DeepSpeech). This tool will be hosted in a separate repository.
Acknowledgements