Parameters for dataset in the wild #58
Hey,
Hi Eliya, Thank you for your input. I have tried to train the politician dataset with noise_std=1.0, but I still couldn't get the resulting model to generate voices. I'm attaching the logs for the two training stages. main_politicians_step2.log

The first stage was done with the following command:
python train.py --expName politicians --noise 1 --seq-len 100 --nspk 4 --epochs 90 --data data/politicians/

The second stage was done with the command:
python train.py --expName politicians_step2 --noise 1 --seq-len 1000 --nspk 4 --epochs 90 --data data/politicians/ --checkpoint checkpoints/politicians/bestmodel.pth

Before training, I extracted features just like with the VCTK dataset:
python extract_feats.py --txt_dir path/to/your/txt-dir --wav_dir path/to/your/wav-dir

Is there something special that needs to be done for in-the-wild data to work? Does it need more epochs? Should seq-len be different, or should new parameters be introduced? I have also generated five other in-the-wild datasets from TV shows (Friends, The Office, South Park, SpongeBob) and made sure the samples are perfectly aligned/cropped. Yet none of the models generate voices after training. I am happy to share my datasets with you. I am only able to generate voices for new speakers if they are not in the wild (i.e. recorded in a noise-free environment).

Here are the voices generated from the in-the-wild politician experiment:

Here are the voices generated by the model trained on professional impersonators of Trump and Morgan Freeman:

Looking forward to your reply. Thank you
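For anyone debugging a similar setup, a quick sanity check on the extracted features can rule out scaling problems early. A minimal sketch, assuming extract_feats.py writes one .npz per utterance under numpy_features/ with an audio_features array, as in the released VCTK data (the path and key names here are assumptions; verify them against your own files):

```python
import glob
import numpy as np

# Assumed layout: data/politicians/numpy_features/*.npz, each holding an
# 'audio_features' array of shape (frames, feature_dims). Adjust as needed.
files = sorted(glob.glob('data/politicians/numpy_features/*.npz'))
assert files, 'no feature files found - check the path'

feats = [np.load(f)['audio_features'] for f in files[:200]]  # sample a subset
stacked = np.concatenate(feats, axis=0)

# Badly scaled or unnormalized features are a common reason a model trains
# (loss decreases) yet never produces intelligible speech.
print('frames: %d, dims: %d' % stacked.shape)
print('per-dim mean (first 5):', stacked.mean(axis=0)[:5])
print('per-dim std  (first 5):', stacked.std(axis=0)[:5])
lengths = [len(f) for f in feats]
print('utterance frames: min %d, median %d, max %d'
      % (min(lengths), int(np.median(lengths)), max(lengths)))
```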
Hey,
And then
Eliya, Thank you again. The training for the third step of the politicians dataset has finished. I attach the logs for all 3 steps of training. It seems weird that during the 3rd stage the train loss goes down to 0 while the test loss remains at around 28%, which indicates some kind of overfitting. I also attach the generated samples for 4 speakers. Some sounds are now produced, but they are unrelated to the given text. I don't think there's an issue with my dataset, as the phrases are very clearly spoken and segmented. It seems that the in-the-wild training simply doesn't succeed. Should the learning rate be modified? Thank you
Hi, This is from your log -
Thank you, Eliya. Do you have an idea why I am getting that error message? Does it mean I need to fine-tune the learning rate or other parameters? I've created a repository where I release my in-the-wild datasets. I'll also be releasing the code I used to fetch datasets from YouTube videos and align the automated captions. Hopefully more people can jump in and help us train an in-the-wild model for VoiceLoop, so that it becomes a baseline/benchmark dataset for future approaches in the field. Hopefully you guys can help us train it as well ;) If the method really works as well as the paper says, there's no reason to be shy! Here's the link to the repository: https://github.com/aomv/voiceloop-in-the-wild-experiments Thank you
@aomv Maybe you can add gradient clipping before the optimizer step.
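For reference, a minimal sketch of where such clipping would sit in a PyTorch training loop (the toy model, optimizer, and max_norm value below are placeholders for illustration, not VoiceLoop's actual train.py code):

```python
import torch
from torch import nn

# Toy stand-in for the model; the only point here is where the clipping
# call goes relative to backward() and step().
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 10), torch.randn(8, 1)
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Clip before the optimizer step; max_norm=1.0 is a guess to tune.
# (Older PyTorch versions name this clip_grad_norm, without the underscore.)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Clipping by norm, as above, rescales the whole gradient when an occasional noisy batch would otherwise blow up the update; clipping by value is a cruder alternative.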
I've tried training a model to output a Trump voice using data from https://github.com/aomv/voiceloop-in-the-wild-experiments and am also failing to get any results. Apart from some audio files containing background noise (crowd applause), most of the audio is clean and clear, and the transcriptions are accurate. I preprocessed the audio files per the instructions, including applying the step from #41 (comment) to all files. These are the commands used to train the model and then output an audio file (same as the readme suggests):
Attached are the audio output (garbage) and the training logs, which show no errors and report steadily declining train and test losses. I could train for more epochs, but if the results are nonexistent after 100 epochs, it seems unlikely that's the issue. I'm still unsure whether the problem is the data (probably most likely, though I can't pinpoint what exactly) or how I'm training the model. trump_voice.zip Any assistance appreciated.
Did you plot the learned attention? If the attention is not somewhat diagonal/low-variance, the output will sound like this. Basically, what I found while reproducing the results is that the model is very fragile to the input features. The original authors used the Merlin toolkit to extract the features, which does silence removal using Merlin's default learned DNN duration/acoustic model. Using a dB value as a heuristic silence-cutoff threshold did not work. Furthermore, Merlin normalizes the WORLD vocoder features, I guess to a standard deviation of 1. The documentation is a bit lacking, though, and I didn't take the time to dig through the code, so I can't tell you what they do exactly. To run the feature extraction, I would suggest taking a look at the Merlin build-a-voice tutorial up to step 06_train_acoustic_model, which should give you the extracted features. (The process is also a bit convoluted, but at least somewhat readable compared to the script linked in the readme.)
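A minimal sketch of such an attention plot, assuming you can collect the per-step attention weights into a (decoder_steps, encoder_steps) array during generation (the array below is random, purely for illustration; how you dump the real weights depends on the model code):

```python
import numpy as np
import matplotlib.pyplot as plt

# 'attention' should be a (decoder_steps, encoder_steps) array of weights
# collected while generating one utterance; we fake one here.
decoder_steps, encoder_steps = 400, 80
attention = np.random.dirichlet(np.ones(encoder_steps), size=decoder_steps)

plt.figure(figsize=(8, 4))
plt.imshow(attention.T, aspect='auto', origin='lower', interpolation='none')
plt.xlabel('decoder step (output frame)')
plt.ylabel('encoder step (phoneme)')
plt.colorbar(label='attention weight')
plt.title('Learned attention (should look roughly diagonal)')
plt.savefig('attention.png')
```

If the real plot shows a diffuse smear instead of a rough diagonal, the model never learned the text-to-audio alignment, and more epochs are unlikely to fix the output.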
I tried running the build-your-own-voice steps in Merlin (got up to step 5) but ran into all kinds of package version and incompatibility issues.
I am trying to prepare the VCTK dataset for all speakers but
Hi,
I was able to train on the VCTK dataset from scratch and replicate the results. I was also able to create new training data by recording clean voices and train successfully.
I am now trying to train voices in the wild but so far have not been successful. I have collected speeches from YouTube for Trump, Obama, Bush, and Hillary Clinton. I made sure the automatic transcription is accurate and the timings are perfectly aligned. I also made sure the samples are around 3 seconds long. I have 2000 samples for each speaker, for a total of 8000 samples, just like the in-the-wild experiment reported in the paper. I've used the same training parameters as those reported on GitHub for the VCTK dataset. Unfortunately, the model cannot generate voices.
Would you be able to share more details of how you trained the in-the-wild samples? Did you use a different noise argument or learning rate?
Thank you very much for your clarification.