Parameters for dataset in the wild #58
Hey,
Hi Eliya, Thank you for your input. I have tried to train the politician dataset with noise_std=1.0, but I still couldn't get the resulting model to generate voices. I'm attaching the logs for the two training stages. main_politicians_step2.log

The first stage was done with the following command:
python train.py --expName politicians --noise 1 --seq-len 100 --nspk 4 --epochs 90 --data data/politicians/

The second stage was done with the command:
python train.py --expName politicians_step2 --noise 1 --seq-len 1000 --nspk 4 --epochs 90 --data data/politicians/ --checkpoint checkpoints/politicians/bestmodel.pth

Before training, I extracted features just like with the VCTK dataset:
python extract_feats.py --txt_dir path/to/your/txt-dir --wav_dir path/to/your/wav-dir

Is there something special that needs to be done for in-the-wild data to work? Does it need more epochs? Should seq-len be different, or should new parameters be introduced? I have also generated five other in-the-wild datasets from TV shows (Friends, The Office, South Park, SpongeBob) and made sure the samples are perfectly aligned/cropped. Yet none of the models generate voices after training. I am happy to share my datasets with you. I am only able to generate voices for new speakers if they are not in the wild (i.e. recorded in a noise-free environment).

Here are the voices generated from the in-the-wild politician experiment:

Here are the voices generated by the model trained on professional impersonators of Trump and Morgan Freeman:

Looking forward to your reply. Thank you
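For anyone debugging a similar setup, a quick sanity check on the extracted features can rule out scaling problems early. A minimal sketch, assuming extract_feats.py writes one .npz per utterance under numpy_features/ with an audio_features array, as in the released VCTK data (the path and key names here are assumptions; verify them against your own files):

```python
import glob
import numpy as np

# Assumed layout: data/politicians/numpy_features/*.npz, each holding an
# 'audio_features' array of shape (frames, feature_dims). Adjust as needed.
files = sorted(glob.glob('data/politicians/numpy_features/*.npz'))
assert files, 'no feature files found - check the path'

feats = [np.load(f)['audio_features'] for f in files[:200]]  # sample a subset
stacked = np.concatenate(feats, axis=0)

# Badly scaled or unnormalized features are a common reason a model trains
# (loss decreases) yet never produces intelligible speech.
print('frames: %d, dims: %d' % stacked.shape)
print('per-dim mean (first 5):', stacked.mean(axis=0)[:5])
print('per-dim std  (first 5):', stacked.std(axis=0)[:5])
lengths = [len(f) for f in feats]
print('utterance frames: min %d, median %d, max %d'
      % (min(lengths), int(np.median(lengths)), max(lengths)))
```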
Hey,
And then
Eliya, Thank you again. The training for the third step of the politicians dataset has finished. I attach the logs for all 3 steps of training. It seems weird that during the 3rd stage the train loss goes down to 0 while the test loss remains at around 28%, which indicates some kind of overfitting. I also attach the generated samples for 4 speakers. Some sounds are now produced, but they are unrelated to the given text. I don't think there's an issue with my dataset, as the phrases are very clearly spoken and segmented. It seems that the in-the-wild training simply doesn't succeed. Should the learning rate be modified? Thank you
Hi, This is from your log -
Thank you, Eliya. Do you have an idea why I am getting that error message? Does it mean I need to fine-tune the learning rate or other parameters? I've created a repository where I release my in-the-wild datasets. I'll also be releasing the code I used to fetch datasets from YouTube videos and align the automated captions. Hopefully more people can jump in and help us train an in-the-wild model for VoiceLoop, so that it becomes a baseline/benchmark dataset for future approaches in the field. Hopefully you guys can help us train it as well ;) If the method really works as well as the paper says, there's no reason to be shy! Here's the link to the repository: https://github.com/aomv/voiceloop-in-the-wild-experiments Thank you
@aomv Maybe you can add gradient clipping before the optimizer step.
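For reference, a minimal sketch of where such clipping would sit in a PyTorch training loop (the toy model, optimizer, and max_norm value below are placeholders for illustration, not VoiceLoop's actual train.py code):

```python
import torch
from torch import nn

# Toy stand-in for the model; the only point here is where the clipping
# call goes relative to backward() and step().
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 10), torch.randn(8, 1)
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Clip before the optimizer step; max_norm=1.0 is a guess to tune.
# (Older PyTorch versions name this clip_grad_norm, without the underscore.)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Clipping by norm, as above, rescales the whole gradient when an occasional noisy batch would otherwise blow up the update; clipping by value is a cruder alternative.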
I've tried training a model to output a Trump voice using data from https://github.com/aomv/voiceloop-in-the-wild-experiments and am also failing to get any results. Apart from some audio files containing background noise (crowd applause), most of the audio is clean and clear, and the transcriptions are accurate. I preprocessed the audio files per the instructions, including applying the step from #41 (comment) to all files. These are the commands used to train the model and then output an audio file (same as the readme suggests):
Attached are the audio output (garbage) and the training logs, which show no errors and report steadily declining train and test losses. I could train for more epochs, but if the results are nonexistent after 100 epochs, it seems unlikely that's the issue. I'm still unsure whether the problem is the data (probably most likely, though I can't pinpoint what exactly) or how I'm training the model. trump_voice.zip Any assistance appreciated.
Did you plot the learned attention? If the attention is not somewhat diagonal/low-variance, the output will sound like this. Basically, what I found while reproducing the results is that the model is very fragile to the input features. The original authors used the Merlin toolkit to extract the features, which does silence removal using Merlin's default learned DNN duration/acoustic model. Using a dB value as a heuristic silence-cutoff threshold did not work. Furthermore, Merlin normalizes the WORLD vocoder features, I guess to a standard deviation of 1. The documentation is a bit lacking, though, and I didn't take the time to dig through the code, so I can't tell you what they do exactly. To run the feature extraction, I would suggest taking a look at the Merlin build-a-voice tutorial up to step 06_train_acoustic_model, which should give you the extracted features. (The process is also a bit convoluted, but at least somewhat readable compared to the script linked in the readme.)
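A minimal sketch of such an attention plot, assuming you can collect the per-step attention weights into a (decoder_steps, encoder_steps) array during generation (the array below is random, purely for illustration; how you dump the real weights depends on the model code):

```python
import numpy as np
import matplotlib.pyplot as plt

# 'attention' should be a (decoder_steps, encoder_steps) array of weights
# collected while generating one utterance; we fake one here.
decoder_steps, encoder_steps = 400, 80
attention = np.random.dirichlet(np.ones(encoder_steps), size=decoder_steps)

plt.figure(figsize=(8, 4))
plt.imshow(attention.T, aspect='auto', origin='lower', interpolation='none')
plt.xlabel('decoder step (output frame)')
plt.ylabel('encoder step (phoneme)')
plt.colorbar(label='attention weight')
plt.title('Learned attention (should look roughly diagonal)')
plt.savefig('attention.png')
```

If the real plot shows a diffuse smear instead of a rough diagonal, the model never learned the text-to-audio alignment, and more epochs are unlikely to fix the output.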
I tried running the build-your-own-voice steps in Merlin (got up to step 5) but ran into all kinds of package version and incompatibility issues.
I am trying to prepare the VCTK dataset for all speakers but
Hi,
I was able to train on the VCTK dataset from scratch and replicate the results. I was also able to create new training data by recording clean voices and train successfully.
I am now trying to train voices in the wild but so far have not been successful. I have collected speeches from YouTube for Trump, Obama, Bush, and Hillary Clinton. I made sure the automatic transcription is accurate and the timings are perfectly aligned. I also made sure the samples are around 3 seconds long. I have 2000 samples for each speaker, for a total of 8000 samples, just like the in-the-wild experiment reported in the paper. I've used the same training parameters as those reported on GitHub for the VCTK dataset. Unfortunately, the model cannot generate voices.
Would you be able to share more details of how you trained the in-the-wild samples? Did you use a different noise argument or learning rate?
Thank you very much for your clarification.