This repository has been archived by the owner on May 28, 2019. It is now read-only.

Parameters for dataset in the wild #58

Open
aomv opened this issue Jun 26, 2018 · 12 comments


aomv commented Jun 26, 2018

Hi,

I was able to train on the VCTK dataset from scratch and replicate the reported results. I was also able to create new training data by recording clear voices and train on it successfully.

I am now trying to train on voices in the wild, but so far I have not been successful. I have collected speeches from YouTube for Trump, Obama, Bush, and Hillary Clinton. I made sure the automatic transcription is accurate and the timings are perfectly aligned, and that the samples are around 3 seconds long. I have 2,000 samples per speaker, for a total of 8,000 samples, just like the in-the-wild experiment reported in the paper. I used the same training parameters as the VCTK training parameters reported on GitHub. Unfortunately the model cannot generate voices.

Would you be able to share more details of how you trained on the samples in the wild? Did you use a different noise argument or learning rate?

Thank you very much for your clarification.


enk100 commented Jul 1, 2018

Hey,
Try training with noise_std=1.0 for all phases and keep the learning rate at its default value.


aomv commented Jul 7, 2018

Hi Eliya,

Thank you for your input.

I have tried to train the politician dataset with noise_std=1.0 but I still couldn't get the resulting model to generate voices.

I'm attaching the logs for the two training stages.

main_politicians_step2.log
main_politicians_step1.log

The first stage was done with the following command:

python train.py --expName politicians --noise 1 --seq-len 100 --nspk 4 --epochs 90 --data data/politicians/

The second stage was done with the command:

python train.py --expName politicians_step2 --noise 1 --seq-len 1000 --nspk 4 --epochs 90 --data data/politicians/ --checkpoint checkpoints/politicians/bestmodel.pth

Before training, I extracted features just as with the VCTK dataset:

python extract_feats.py --txt_dir path/to/your/txt-dir --wav_dir path/to/your/wav-dir

Is there something special that needs to be done for in-the-wild data to work? Does it need more epochs? Should seq-len be different, or should any new parameters be introduced?

I have also generated 5 other in-the-wild datasets from TV shows (Friends, The Office, South Park, SpongeBob) and made sure the samples are perfectly aligned and cropped. Yet none of the models generate voices after training. I am happy to share my datasets with you.

I am only able to generate voices for new speakers if they are not in the wild (i.e., recorded in a noise-free environment).

Here are the voices generated from the in-the-wild politician experiment:

generated-in-the-wild.zip

Here are the voices generated by the model trained on professional impersonators of Trump and Morgan Freeman:

impersonators.zip

Looking forward to your reply.

Thank you


enk100 commented Jul 9, 2018

Hey,
Please start with the normal training:

  1. Seqlen=100, noise=4
  2. Seqlen=1000, noise=2

And then:

  3. Seqlen=1000, noise=1

(A concrete command sketch for these three phases follows below.)
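
A possible realization of this schedule, reusing the flags from the commands earlier in the thread (the expName values and checkpoint paths are illustrative assumptions, not a confirmed recipe):

python train.py --expName politicians_step1 --noise 4 --seq-len 100 --nspk 4 --epochs 90 --data data/politicians/
python train.py --expName politicians_step2 --noise 2 --seq-len 1000 --nspk 4 --epochs 90 --data data/politicians/ --checkpoint checkpoints/politicians_step1/bestmodel.pth
python train.py --expName politicians_step3 --noise 1 --seq-len 1000 --nspk 4 --epochs 90 --data data/politicians/ --checkpoint checkpoints/politicians_step2/bestmodel.pth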


aomv commented Jul 13, 2018

Eliya,

Thank you again.

The training for the third step on the politicians dataset has finished. I attach the logs for all 3 steps of training. It seems odd that during the 3rd stage the training loss goes down to 0 while the test loss stays at around 28%, which indicates some kind of overfitting. I also attach the generated samples for the 4 speakers. Some sounds are now produced, but they are unrelated to the given text.

I don't think there's an issue with my dataset, as the phrases are very clearly spoken and segmented. It seems that in-the-wild training simply doesn't succeed.

Should the learning rate be modified?

Thank you

voices.zip
logs.zip


enk100 commented Jul 16, 2018

Hi,
Your gradient is exploding; this is the reason why it doesn't train.

This is from your log:
INFO - 07/13/18 10:39:05 - 1 day, 2:22:54 - Not a finite gradient or too big, ignoring.


aomv commented Jul 16, 2018

Thank you, Eliya.

Do you have an idea of why I am getting that error message? Does it mean I need to fine-tune the learning rate or other parameters?

I've created a repository where I release my in-the-wild datasets. I'll also be releasing the code I used to fetch the datasets from YouTube videos and align the automatic captions.

Hopefully more people can jump in and help us train an in-the-wild model for VoiceLoop so that it becomes a baseline/benchmark dataset for future approaches in the field.

Hopefully you guys can help us train it as well ;) If the method really works as well as the paper says there's no reason to be shy!

Here's the link to the repository: https://github.com/aomv/voiceloop-in-the-wild-experiments

Thank you


G-Wang commented Aug 8, 2018

@aomv Maybe you can add gradient clipping before the optim.step() line in the train.py script.
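
A minimal sketch of that change, assuming a standard PyTorch training loop; the names model, optimizer, and loss are placeholders for whatever train.py actually uses:

import torch

def step_with_clipping(model, loss, optimizer, max_norm=1.0):
    # one optimization step with gradient clipping: rescale all gradients so
    # their global L2 norm is at most max_norm before the parameter update
    optimizer.zero_grad()
    loss.backward()
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return total_norm  # pre-clipping norm, useful for spotting the spikes

Note that clipping only addresses the "too big" half of the skipped updates; a gradient that is already NaN would still call for a lower learning rate or a data fix.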


aomv commented Aug 15, 2018

Thanks for the suggestion, @G-Wang.
I'm not sure why the authors stopped replying.

@enk100 do you think @G-Wang's idea is a good strategy?

Are there any other pointers you could provide as to why the current code base for VoiceLoop doesn't work with data in the wild?

@wanshun123

I've tried training a model to output a Trump voice using data from https://github.com/aomv/voiceloop-in-the-wild-experiments and am also failing to get any results. Apart from some audio files containing background noise (crowd applause), most of the audio is clean and clear and the transcriptions are accurate. I preprocessed the audio files per the instructions, including doing the step from #41 (comment) for all files. These are the commands used to train the model and then generate an audio file (the same as the readme suggests):

sudo python2 train.py --noise 1 --expName trump_silenced_clean --seq-len 1600 --max-seq-len 1600 --data latest_features --nspk 1 --lr 1e-5 --epochs 10

sudo python2 train.py --noise 1 --expName trump_silenced_clean_training2 --seq-len 1600 --max-seq-len 1600 --data latest_features --nspk 1 --lr 1e-4 --checkpoint checkpoints/trump_silenced_clean/bestmodel.pth --epochs 90

sudo python2 generate.py  --text "I am extremely happy and excited to officially announce my candidacy to the president of the united states" --checkpoint checkpoints/trump_silenced_clean_training2/bestmodel.pth

Attached is the audio output (garbage) and the training logs, which show no errors and report a steadily declining train and test loss. I could train for more epochs, but if the results are nonexistent after 100 epochs it seems unlikely that's the issue. I'm still unsure whether the issue is the data (most likely, though I can't pinpoint what exactly) or how I'm training the model.

trump_voice.zip
training_part2.log
training_part1.log

Any assistance appreciated.

@pfriesch

Did you plot the learned attention? If the attention is not somewhat diagonal/low variance, the output will sound like this.
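
For reference, a minimal sketch of such a plot, assuming the per-step attention weights have been dumped to a NumPy file during generation (attn.npy and its decoder-steps-by-input-phonemes layout are assumptions, not something generate.py produces out of the box):

import matplotlib.pyplot as plt
import numpy as np

# attn[t, i] = attention weight on input phoneme i at decoder step t
attn = np.load('attn.npy')
plt.imshow(attn.T, aspect='auto', origin='lower', interpolation='none')
plt.xlabel('decoder step')
plt.ylabel('input phoneme')
plt.title('a healthy alignment is roughly diagonal with low variance')
plt.colorbar()
plt.savefig('attention.png', dpi=150)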

Basically, what I found while reproducing the results is that the model is very sensitive to the input features. The original authors used the Merlin toolkit to extract the features, which does silence removal using Merlin's default learned DNN duration/acoustic model. Using a dB value as a heuristic silence-cutoff threshold did not work. Furthermore, Merlin normalizes the WORLD vocoder features, I guess to a standard deviation of 1. The documentation is a bit lacking, though, and I didn't take the time to dig through the code, so I can't tell you exactly what they do.
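
As an illustration of the normalization part, here is a sketch of per-dimension standardization across a training set; this is one plausible reading of what Merlin does, not its confirmed recipe:

import numpy as np

def normalize_features(feats_list):
    # feats_list: one (frames, dims) array of WORLD vocoder features per utterance
    all_feats = np.concatenate(feats_list, axis=0)
    mean = all_feats.mean(axis=0)
    std = all_feats.std(axis=0)
    # standardize each dimension to zero mean and unit variance; keep mean/std
    # around, since generation has to invert this scaling before vocoding
    return [(f - mean) / (std + 1e-8) for f in feats_list], mean, std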

To run the feature extraction, I would suggest taking a look at the Merlin build-a-voice tutorial up to step 06_train_acoustic_model, which should give you the extracted features. (The process is also a bit convoluted, but at least somewhat readable compared to the script linked in the readme.)

@wanshun123

I tried running the build-your-own-voice steps in Merlin (got up to step 5) but ran into all kinds of package-version and incompatibility issues.

@SibtainRazaJamali

I am trying to prepare the VCTK dataset for all speakers, but this command does not work:

python extract_feats.py --txt_dir path/to/your/txt-dir --wav_dir path/to/your/wav-dir
